Opened 13 years ago

Closed 12 years ago

#599 closed defect (fixed)

The Omega HTML parser resets contents if a further <body> tag is found

Reported by: Jean-Francois Dockes
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.11
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

In myhtmlparse.cc, around line 81, the Omega HTML handler resets the current content each time an opening <body> tag is found.

Some very malformed HTML files contain several opening <body> tags, and resetting on further occurrences loses content.

At least Firefox and Opera ignore further <body> tags. Incidentally they also just ignore closing </body> and </html> tags.

I noticed this through a reported Recoll issue (Recoll uses the Omega parser mostly unmodified) and have changed it locally.
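
To make the content loss concrete, here is a minimal sketch of the two policies. It is not the code from myhtmlparse.cc, nor the local Recoll change: the `Event` struct, the `collect_text` function, and the `reset_on_every_body` flag are hypothetical names invented for illustration, and the parser is reduced to a pre-tokenised list of opening tags and text runs.

```cpp
#include <string>
#include <vector>

// Hypothetical event type: either an opening tag name or a run of text.
struct Event {
    bool is_tag;        // true: value is a tag name; false: value is text
    std::string value;
};

// Hypothetical collector, NOT Omega's actual handler.
// reset_on_every_body == true models the behaviour described above:
// every opening <body> discards the text gathered so far.
// reset_on_every_body == false resets on the first <body> only, so any
// further <body> tags are ignored.
std::string collect_text(const std::vector<Event>& events,
                         bool reset_on_every_body) {
    std::string dump;
    bool seen_body = false;
    for (const Event& ev : events) {
        if (!ev.is_tag) {
            dump += ev.value;   // accumulate text content
            continue;
        }
        if (ev.value == "body") {
            if (reset_on_every_body || !seen_body) {
                dump.clear();   // drop everything collected so far
            }
            seen_body = true;
        }
    }
    return dump;
}
```

For the event sequence `<body>`, "first ", `<body>`, "second", the first policy returns only "second", while the second policy returns "first second".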

Attachments (1)

verybadhtml.html (646 bytes) - added by Jean-Francois Dockes 13 years ago.

Change History (5)

comment:1 by Olly Betts, 13 years ago

Milestone: 1.3.1
Status: new → assigned

Did you test if the browsers ignore content before the first <body> (if there is one)? I think the current rule is based on older browsers ignoring such content (around the Netscape 4 era), but didn't consider the case of multiple <body> tags.

by Jean-Francois Dockes, 13 years ago

Attachment: verybadhtml.html added

comment:2 by Jean-Francois Dockes, 13 years ago

Yes, I forgot to mention, text before the first <body> is also displayed. For text display purposes, it appears that <body>, </body>, </html> are mostly ignored.

Attaching my bad HTML test file. It originally comes from a purple log.

comment:3 by Olly Betts, 13 years ago

Milestone: 1.3.1 → 1.2.11

Chromium and even MSIE both seem to behave similarly.

Fixed in trunk r16609. Marking to consider for backporting for 1.2.11.

comment:4 by Olly Betts, 12 years ago

Resolution: fixed
Status: assigned → closed

Backported in r16619.
