Ticket #292 (closed defect: fixed)

Opened 3 months ago

Last modified 3 months ago

Incorrect conversion of non-English HTML when the charset entry comes after the title in head.

Reported by: rssh
Owned by: olly
Priority: normal
Milestone: 1.0.8
Component: Omega
Version: 1.0.7
Severity: normal
Keywords:
Cc:
Blocked By:
Operating System: All
Blocking:

Description (last modified by olly)

When the meta "http-equiv" element (where the charset is set) comes after the "title" element in an HTML document, the title entry in the index is incorrect.

A patch to fix this is attached.

Attachments

omega-rssh-292.patch (2.7 kB) - added by rssh 3 months ago.
patch to fix
example.html (500 bytes) - added by rssh 3 months ago.
Example file which shows the #292 and #293 problems

Change History

Changed 3 months ago by rssh

patch to fix

  Changed 3 months ago by olly

  • description modified

The patch is incorrect. The default character set for HTML *is* ISO-8859-1, at least in this context (since we're trying to work with documents from a webserver's document tree). The HTTP/1.1 spec (RFC 2616) says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

Your comment here misses the point:

(first 256 position of UTF-8 are the same, as in "ISO-8859-1")

While the character values are the same, the byte-level encodings are only the same for the first 128 positions, and it's the byte-level encoding which matters here.
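To make the byte-level point concrete, here is a small illustrative check in Python (not Omega code; the helper name `same_bytes` is made up for this sketch). It compares how each of the first 256 code points is encoded in ISO-8859-1 and in UTF-8:

```python
def same_bytes(code_point: int) -> bool:
    """True if this code point has the same byte sequence in both encodings."""
    ch = chr(code_point)
    return ch.encode("iso-8859-1") == ch.encode("utf-8")

# Code points whose ISO-8859-1 and UTF-8 encodings are byte-for-byte identical.
identical = [cp for cp in range(256) if same_bytes(cp)]

# U+00E9 (e with acute accent) is one byte in ISO-8859-1 but two in UTF-8.
latin1_bytes = "\u00e9".encode("iso-8859-1")
utf8_bytes = "\u00e9".encode("utf-8")

print(len(identical), latin1_bytes, utf8_bytes)
```

Only the ASCII range ends up in `identical`; every code point from 0x80 to 0xFF takes two bytes in UTF-8.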

Please attach an example document which demonstrates the problem to save me having to try to create one from your description.

Also, what version are you using?

  Changed 3 months ago by rssh

Hmm, in that case we need to keep the original title in utf8.

follow-up: ↓ 4   Changed 3 months ago by olly

Sorry, I don't understand your comment.

in reply to: ↑ 3   Changed 3 months ago by rssh

Replying to olly:

Sorry, I don't understand your comment.

Sorry, my fault.

Long description:

1 -- Suppose we have HTML with the http-equiv charset declaration below the title:

<html>

<title> Щось рідною мовою (Something in my native language) </title> <description content="Моя сторінка (My page)" </description> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

</html>

When the charset is below the title, the HTML parser calls process_text on the title (htmlparse.cc line 218) with a value which is actually in windows-1251 but has been converted to UTF-8 as if it were ISO-8859-1 (i.e. totally incorrect).
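The effect can be reproduced outside Omega with a few lines of Python. This is only an illustration of the mis-decoding described above, assuming the file bytes really are windows-1251; the variable names are invented for the sketch:

```python
title = "Щось рідною мовою"

# Bytes as they would be stored in a windows-1251 encoded file.
raw = title.encode("windows-1251")

# Decoding with the default charset, before the meta tag has been seen.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the document actually declares.
right = raw.decode("windows-1251")

# This is what would end up in the index if the default is used.
indexed = wrong.encode("utf-8")

print(wrong)
print(right)
```

Every byte is accepted by ISO-8859-1, so no error is raised; the title is silently turned into accented Latin letters.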

To prevent this we must pass myhtmlparser the original text, not yet converted to UTF-8.

(i.e. either move the conversion to UTF-8 into myhtmlparser, or change process_text to take two arguments [the UTF-8 text and the original text]).
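A rough sketch of the idea in Python, to show the ordering only: keep the title as raw bytes and decode it once the charset is known. This is not how Omega's parser is written, and `extract_title`, its regular expressions, and the ISO-8859-1 default are all assumptions made for illustration:

```python
import re

def extract_title(raw: bytes, default_charset: str = "iso-8859-1") -> str:
    """Decode the <title> only after looking for a charset anywhere in the document."""
    # The charset name is ASCII, so it can be found in the undecoded bytes.
    found = re.search(rb"charset=[\"']?([A-Za-z0-9_.:-]+)", raw, re.I)
    charset = found.group(1).decode("ascii") if found else default_charset

    # Pull out the title as bytes, then decode with the charset found above.
    title = re.search(rb"<title[^>]*>(.*?)</title>", raw, re.I | re.S)
    return title.group(1).decode(charset).strip() if title else ""

doc = (
    "<html><title> Щось рідною мовою </title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    "</html>"
).encode("windows-1251")

print(extract_title(doc))
```

A real parser would also have to cope with unknown charset names (here `decode` would raise `LookupError`) and with markup that a regular expression cannot handle.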

P.S. About the difference between ISO-8859-1 and UTF-8: the mapping is at http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT and it is the same from 00 to FF, i.e. for the first 256 symbols

(the first 128 are just ASCII)

  Changed 3 months ago by olly

Thanks for the example.

You didn't answer my question about which version of Xapian you are using here - is it 1.0.7 as in your other report?

  Changed 3 months ago by rssh

yes, 1.0.7

  Changed 3 months ago by olly

  • status changed from new to assigned
  • version set to 1.0.7

Hmm, the example seems to have been converted to UTF-8 by pasting it. Can you attach the example instead please? Use the [Attach file] button.

Also, <description> isn't an HTML tag - did you mean <meta name="description" ...> ?

  Changed 3 months ago by rssh

> Also, <description> isn't an HTML tag - did you mean <meta name="description" ...> ?

yes

Changed 3 months ago by rssh

Example file which shows the #292 and #293 problems

  Changed 3 months ago by olly

  • milestone set to 1.0.8

Fixed in trunk [11162].

  Changed 3 months ago by olly

  • status changed from assigned to closed
  • resolution set to fixed

Backported to 1.0 branch [11167].
