Opened 13 years ago
Closed 12 years ago
#599 closed defect (fixed)
The Omega HTML parser resets contents if a further <body> tag is found
| Reported by: | Jean-Francois Dockes | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.2.11 |
| Component: | Omega | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
In myhtmlparse.cc, around line 81, the Omega HTML handler resets the current content each time an opening <body> tag is found.
Some very malformed HTML files contain several opening <body> tags, and resetting on the second and later occurrences discards the content collected so far.
At least Firefox and Opera ignore further <body> tags. Incidentally, they also just ignore closing </body> and </html> tags.
I noticed this through a reported Recoll issue (Recoll uses the Omega parser mostly unmodified) and have changed it locally.
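To make the difference concrete, here is a minimal sketch of the two behaviours. It is not the code from myhtmlparse.cc: the names `extract_text`, `BodyPolicy` and `dump` are made up for illustration, and the scanner is deliberately naive (it treats everything between '<' and '>' as a tag and ignores comments, entities, quoting and script/style handling).

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical policy switch, for illustration only (not part of Omega).
enum class BodyPolicy { ResetOnEvery, ResetOnFirst };

// Naive text extractor: copies everything outside '<...>' into `dump`
// and applies the chosen policy whenever an opening <body> tag is seen.
std::string extract_text(const std::string& html, BodyPolicy policy) {
    std::string dump;
    bool seen_body = false;
    std::string::size_type pos = 0;
    while (pos < html.size()) {
        std::string::size_type lt = html.find('<', pos);
        if (lt == std::string::npos) {
            // No more tags: the rest is text.
            dump.append(html, pos, std::string::npos);
            break;
        }
        dump.append(html, pos, lt - pos);
        std::string::size_type gt = html.find('>', lt);
        if (gt == std::string::npos) break;  // unterminated tag: stop here
        assert(gt > lt);

        // Lower-cased tag name, up to the first whitespace or the '>'.
        std::string name;
        for (std::string::size_type i = lt + 1; i < gt; ++i) {
            unsigned char ch = static_cast<unsigned char>(html[i]);
            if (std::isspace(ch)) break;
            name += static_cast<char>(std::tolower(ch));
        }

        if (name == "body") {
            // ResetOnEvery discards the text collected so far on each <body>;
            // ResetOnFirst only does so for the first one.
            if (policy == BodyPolicy::ResetOnEvery || !seen_body) dump.clear();
            seen_body = true;
        }
        pos = gt + 1;
    }
    return dump;
}
```

With the input `<html><body>one<body>two</body></html>`, the `ResetOnEvery` policy returns only `two`, while `ResetOnFirst` returns `onetwo`.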
Attachments (1)
Change History (5)
comment:1, 13 years ago
Milestone: | → 1.3.1 |
---|---|
Status: | new → assigned |
13 years ago
Attachment: | verybadhtml.html added |
---|
comment:2, 13 years ago
Yes, I forgot to mention: text before the first <body> is also displayed. For text display purposes, it appears that <body>, </body>, and </html> are mostly ignored.
I am attaching my bad HTML test file, which originally comes from a purple log.
comment:3, 13 years ago
Milestone: | 1.3.1 → 1.2.11 |
---|
Chromium and even MSIE both seem to behave similarly.
Fixed in trunk r16609. Marking for consideration for backporting to 1.2.11.
Did you test whether the browsers ignore content before the first <body> (if there is one)? I think the current rule is based on older browsers (around the Netscape 4 era) ignoring such content, but it didn't consider the case of multiple <body> tags.