Opened 16 years ago
Closed 16 years ago
#292 closed defect (fixed)
Incorrect translation of non-English HTML when the charset entry is after the title in head.
Reported by: | ruslan shevchenko | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.8 |
Component: | Omega | Version: | 1.0.7 |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
When the meta "http-equiv" element (where the charset is set) comes after the "title" element in an HTML document, the title entry in the index is incorrect.
A patch to fix this is attached.
Attachments (2)
Change History (12)
by , 16 years ago
Attachment: | omega-rssh-292.patch added |
---|
comment:1 by , 16 years ago
Description: | modified (diff) |
---|
The patch is incorrect. The default character set for HTML *is* ISO-8859-1, at least in this context (since we're trying to work with documents from a webserver's document tree). The HTTP/1.1 spec (RFC 2616) says:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
Your comment here misses the point:
(first 256 position of UTF-8 are the same, as in "ISO-8859-1")
While the character values are the same, the byte-level encodings are only the same for the first 128 positions, and it's the byte-level encoding which matters here.
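The distinction between character values and byte-level encodings can be checked directly. The following is a minimal sketch using Python's standard codecs (Python is used here only for illustration and is not part of Omega): it encodes every code point from U+0000 to U+00FF both ways and collects those whose bytes agree.

```python
# Compare the byte-level encodings of every code point U+0000..U+00FF.
same = [cp for cp in range(256)
        if chr(cp).encode("iso-8859-1") == chr(cp).encode("utf-8")]

# Only the ASCII range (0..127) is encoded identically.
print(len(same), max(same))

# U+00E9 has the same character value in both, but different bytes:
# one byte in ISO-8859-1, two bytes in UTF-8.
print(chr(0xE9).encode("iso-8859-1"))
print(chr(0xE9).encode("utf-8"))
```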
Please attach an example document which demonstrates the problem to save me having to try to create one from your description.
Also, what version are you using?
comment:4 by , 16 years ago
Replying to olly:
Sorry, I don't understand your comment.
Sorry, my fault.
Long description:
1 -- Suppose we have HTML with the http-equiv charset declaration below the title:
<html>
<title> Щось рідною мовою (Something in my native language) </title> <description content="Моя сторінка (My page)" </description> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</html>
When the charset is below the title, the HTML parser calls process_text on the title (htmlparse.cc line 218) with a value which is actually in windows-1251 but has been converted to UTF-8 as if it were ISO-8859-1 (i.e. totally incorrect).
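The effect described here can be reproduced outside Omega. This is a small sketch using Python's standard codecs, under the assumption that the file's bytes really are windows-1251; the variable names are illustrative only.

```python
title = "Щось рідною мовою"

# Bytes as stored in the file, which declares charset=windows-1251.
raw = title.encode("windows-1251")

# What a parser produces if it has not yet seen the charset declaration
# and treats the bytes as ISO-8859-1 before converting to UTF-8.
wrong = raw.decode("iso-8859-1")

# What the title should be once the declared charset is honoured.
right = raw.decode("windows-1251")

print(wrong)  # accented Latin characters instead of Cyrillic
print(right)
```

No bytes are lost by the wrong decoding (each byte still maps to exactly one character), but every Cyrillic letter is indexed as an unrelated Latin-1 character.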
To prevent this we must pass the original text, not yet converted to UTF-8, to myhtmlparser
(i.e. either move the conversion to UTF-8 into myhtmlparser, or change process_text to take two arguments [the UTF-8 text and the original text]).
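The general idea of keeping the text undecoded until the charset is known can be sketched as follows. This is not the change that was committed to Omega and not its API: `sniff_charset` and `extract_title` are hypothetical helpers, the regular expressions are deliberately simplistic, and ISO-8859-1 is assumed as the fallback as discussed in comment:1.

```python
import re

# Hypothetical helper, not part of Omega: look for a charset declared in a
# <meta ...> tag by scanning the raw, still undecoded bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    m = META_CHARSET.search(raw)
    return m.group(1).decode("ascii") if m else default

def extract_title(raw: bytes, default: str = "iso-8859-1") -> str:
    # Decode only after the whole document has been checked for a charset,
    # so a <meta> that follows <title> is still honoured.
    try:
        text = raw.decode(sniff_charset(raw, default), errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the default.
        text = raw.decode(default, errors="replace")
    m = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else ""

doc = ('<html><title> Щось рідною мовою </title>'
       '<meta http-equiv="Content-Type" '
       'content="text/html; charset=windows-1251"></html>'
       ).encode("windows-1251")

print(extract_title(doc))
```

Scanning the bytes first costs an extra pass over the document; the alternatives named above (converting inside the parser, or passing both the converted and the original text) avoid that pass at the price of a more complex parser interface.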
P.S. Regarding ISO-8859-1: here is the mapping: http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT It is the same from 00 to FF, i.e. for the first 256 symbols
(the first 128 are just ASCII).
comment:5 by , 16 years ago
Thanks for the example.
You didn't answer my question about which version of Xapian you are using here - is it 1.0.7 as in your other report?
comment:7 by , 16 years ago
Status: | new → assigned |
---|---|
Version: | → 1.0.7 |
Hmm, the example seems to have been converted to UTF-8 by pasting it. Can you attach the example instead please? Use the [Attach file] button.
Also, <description> isn't an HTML tag - did you mean <meta name="description" ...>?
comment:8 by , 16 years ago
> Also, <description> isn't an HTML tag - did you mean <meta name="description" ...>?
yes
by , 16 years ago
Attachment: | example.html added |
---|
comment:10 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported to 1.0 branch [11167].
patch to fix