Opened 16 years ago

Closed 16 years ago

#292 closed defect (fixed)

Incorrect conversion of non-English HTML when the charset entry comes after the title in head.

Reported by: ruslan shevchenko
Owned by: Olly Betts
Priority: normal
Milestone: 1.0.8
Component: Omega
Version: 1.0.7
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

When the meta "http-equiv" element (where the charset is set) comes after the "title" element in an HTML document, the title entry in the index is incorrect.

A patch to fix this is attached.

Attachments (2)

omega-rssh-292.patch (2.7 KB) - added by ruslan shevchenko 16 years ago.
Patch to fix
example.html (500 bytes) - added by ruslan shevchenko 16 years ago.
Example file that shows the #292 and #293 problems

Change History (12)

by ruslan shevchenko, 16 years ago

Attachment: omega-rssh-292.patch added

patch to fix

comment:1 by Olly Betts, 16 years ago

Description: modified (diff)

The patch is incorrect. The default character set for HTML *is* ISO-8859-1, at least in this context (since we're trying to work with documents from a webserver's document tree). The HTTP/1.1 spec (RFC 2616) says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

Your comment here misses the point:

(first 256 position of UTF-8 are the same, as in "ISO-8859-1")

While the character values are the same, the byte-level encodings are only the same for the first 128 positions, and it's the byte-level encoding which matters here.
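
To make the distinction concrete, here is a minimal sketch (Python, chosen only for brevity; it is not part of Omega) that compares the byte sequences each of the first 256 code points produces in the two encodings. The variable names are illustrative.

```python
# For every code point 0x00-0xFF, compare its ISO-8859-1 bytes with its UTF-8 bytes.
same_bytes = [
    cp for cp in range(256)
    if chr(cp).encode("iso-8859-1") == chr(cp).encode("utf-8")
]

# Only the ASCII range (0x00-0x7F) is stored identically in both encodings.
print(len(same_bytes), hex(max(same_bytes)))

# U+00E9 has the same character value in both, but different byte sequences.
latin1_bytes = "\u00e9".encode("iso-8859-1")  # a single byte
utf8_bytes = "\u00e9".encode("utf-8")         # two bytes
print(latin1_bytes, utf8_bytes)
```

So a byte such as 0xE9 read from a file means different things depending on which encoding is assumed, even though the character numbering agrees.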

Please attach an example document which demonstrates the problem to save me having to try to create one from your description.

Also, what version are you using?

comment:2 by ruslan shevchenko, 16 years ago

Hmm, in that case we need to keep the original title in UTF-8.

comment:3 by Olly Betts, 16 years ago

Sorry, I don't understand your comment.

in reply to:  3 comment:4 by ruslan shevchenko, 16 years ago

Replying to olly:

Sorry, I don't understand your comment.

Sorry, my fault.

Long description:

1. Suppose we have an HTML document in which the http-equiv meta tag carrying the charset comes after the title:

<html>
<title> Щось рідною мовою (Something in my native language) </title>
<description content="Моя сторінка (My page)" </description>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</html>

When the charset is declared after the title, htmlparser calls process_text on the title (htmlparse.cc line 218) with a value that is actually in windows-1251 but has been converted to UTF-8 as if it were ISO-8859-1 (i.e. totally incorrect).

To prevent this, we must pass myhtmlparser the original text, not yet converted to UTF-8.

(i.e. either move the conversion to UTF-8 into myhtmlparser, or change process_text to take two arguments: the UTF-8 text and the original text).
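
The effect described above can be reproduced outside Omega. The following is a rough Python sketch, not Omega's code and not necessarily how the eventual fix works: `sniff_charset` and `extract_title` are hypothetical helpers invented for this illustration, and the regular expressions are deliberately naive.

```python
import re

DEFAULT_CHARSET = "iso-8859-1"  # HTTP/1.1 default for text/* (RFC 2616)


def sniff_charset(raw: bytes) -> str:
    """Hypothetical helper: return the first charset=... value in the raw bytes."""
    m = re.search(rb"charset=([A-Za-z0-9_-]+)", raw)
    return m.group(1).decode("ascii").lower() if m else DEFAULT_CHARSET


def extract_title(raw: bytes) -> bytes:
    """Hypothetical helper: return the undecoded bytes between <title> tags."""
    m = re.search(rb"<title>(.*?)</title>", raw, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else b""


title_text = "Щось рідною мовою"
raw = (
    "<html><title> " + title_text + " </title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    "</html>"
).encode("windows-1251")

# Eager conversion: the title is decoded before the meta tag has been seen,
# so the default charset is applied to windows-1251 bytes.
eager = extract_title(raw).decode(DEFAULT_CHARSET)

# Deferred conversion: keep the title as raw bytes until the charset is known.
deferred = extract_title(raw).decode(sniff_charset(raw))

print(eager == title_text, deferred == title_text)
```

Keeping the title as undecoded bytes until the whole head has been scanned is one way to realise the "pass the original text" idea; passing both the converted and the original text would be another.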

P.S. About the difference between ISO-8859-1 and UTF-8: here is the mapping: http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT It's the same from 00 to FF, i.e. for the first 256 symbols

(the first 128 are just ASCII)

comment:5 by Olly Betts, 16 years ago

Thanks for the example.

You didn't answer my question about which version of Xapian you are using here - is it 1.0.7 as in your other report?

comment:6 by ruslan shevchenko, 16 years ago

yes, 1.0.7

comment:7 by Olly Betts, 16 years ago

Status: new → assigned
Version: 1.0.7

Hmm, the example seems to have been converted to UTF-8 by pasting it. Can you attach the example instead please? Use the [Attach file] button.

Also, <description> isn't an HTML tag - did you mean <meta name="description" ...> ?

comment:8 by ruslan shevchenko, 16 years ago

> Also, <description> isn't an HTML tag - did you mean <meta name="description" ...> ?

yes

by ruslan shevchenko, 16 years ago

Attachment: example.html added

Example file that shows the #292 and #293 problems

comment:9 by Olly Betts, 16 years ago

Milestone: 1.0.8

Fixed in trunk [11162].

comment:10 by Olly Betts, 16 years ago

Resolution: fixed
Status: assigned → closed

Backported to 1.0 branch [11167].