Opened 13 years ago

Closed 9 years ago

Last modified 9 years ago

#550 closed enhancement (fixed)

Omega script enhancement: $prettyurl

Reported by: Charles
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.21
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

As discussed on xapian-discuss in this thread:

http://thread.gmane.org/gmane.comp.search.xapian.general/8777/focus=8840

Olly: It would be fairly easy to add a $prettyurl command, if we can decide exactly what it should do. For example, if you decode ?, #, and %, then affected URLs can't be cut and pasted and actually work, while if you decode byte values >= 0x80, then the encoding of filenames becomes an issue.

Charles: A design guideline could be "maximise human readability without causing any breakage" leading to prettifying all but ?, #, % and all >= 0x80 ... ?
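For illustration, with a hypothetical URL (not one from the thread), that guideline would turn

    http://example.org/Reports/2012%20Q3%20%28final%29.pdf

into the rather more readable

    http://example.org/Reports/2012 Q3 (final).pdf

while leaving ?, #, % and bytes >= 0x80 escaped.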

Change History (20)

comment:1 by Olly Betts, 13 years ago

Milestone: 1.3.0

Marking for 1.3.0 (then it'll probably get backported).

comment:2 by Olly Betts, 12 years ago

As I alluded to in the text you quote, decoding bytes >= 0x80 is problematic as we don't know for sure what the filename encoding is, and inserting random top-bit-set byte sequences into an HTML page labelled as UTF-8 isn't a great plan. Modern Linux distros seem to have converged on UTF-8, at least by default, but you can use other encodings, and other platforms may be different. Also, if you copy a file from a system with a different encoding, it may not match what is used locally, so checking LC_ALL, etc. doesn't really help.

I guess we could see if any sequences of bytes >= 0x80 are valid UTF-8 and decode them if so. Generating broken UTF-8 output is really bad, but risking showing the wrong characters (when a filename in some other encoding happens to also be valid UTF-8) isn't so bad when the alternative is showing unreadable hex codes.
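The check itself is fairly self-contained - a sketch of the sort of test meant here (illustrative C++ only, not code from Xapian):

    // Sketch only: return true if [p, end) is entirely valid UTF-8.
    // Rejects overlong forms, surrogates and code points above U+10FFFF.
    static bool valid_utf8(const unsigned char* p, const unsigned char* end) {
        while (p != end) {
            unsigned char c = *p++;
            if (c < 0x80) continue;                  // ASCII
            int extra;
            unsigned long min;
            if ((c & 0xe0) == 0xc0) { extra = 1; min = 0x80; }
            else if ((c & 0xf0) == 0xe0) { extra = 2; min = 0x800; }
            else if ((c & 0xf8) == 0xf0) { extra = 3; min = 0x10000; }
            else return false;                       // stray continuation or invalid lead byte
            unsigned long ch = c & (0x3f >> extra);
            for (int i = 0; i < extra; ++i) {
                if (p == end || (*p & 0xc0) != 0x80) return false;
                ch = (ch << 6) | (*p++ & 0x3f);
            }
            if (ch < min) return false;              // overlong encoding
            if (ch > 0x10ffff) return false;         // beyond Unicode range
            if (ch >= 0xd800 && ch <= 0xdfff) return false;  // UTF-16 surrogate
        }
        return true;
    }

If a %-escaped run decodes to bytes that pass a check like this, emitting the raw bytes into a UTF-8 page is safe; if not, the escapes stay as they are.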

So I think an initial version which just deals with bytes < 0x80 would be worthwhile as it would at least address the ugly escaping to some extent (and fully for English filenames).

Incidentally, referring to an email by giving the digest number isn't very useful - I don't subscribe to the digest version, and I don't know of any way to look at a previous digest in mailman, or find out what messages were in it. If the digest contains it, the message id of the original email is much more helpful.

comment:3 by Olly Betts, 12 years ago

Status: new → assigned

OK, r16161 on trunk adds an initial implementation of $prettyurl, along with a few unit tests of the function which does all the work, and alters the templates to use it where appropriate.
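For context, the way this ends up being used in a template is roughly like the following (illustrative only - the exact markup in the shipped templates differs, and $field{url} is just how templates typically fetch the URL field omindex stores):

    <a href="$html{$field{url}}">$html{$field{caption}}</a><br>
    <small>$html{$prettyurl{$field{url}}}</small>

The href keeps the fully percent-encoded URL so the link still works when clicked, while the visible copy of the URL goes through $prettyurl first.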

Currently it preserves anything < 0x20 or >= 0x7f or in this list: :/?#[]@!$&'()*+,;=%

The list is what the current RFC says are reserved, but I'm pretty sure that in practice these don't all actually need escaping, and since our primary aim here is appearance and wanting the URL to still work if cut and pasted is secondary, we should check how these actually work unescaped in practice. At least some will be context dependent, which makes this quite complicated to get right.
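To make the current rule concrete, here is a rough sketch of the shape of that logic (not the actual r16161 code; the helper here is made up):

    #include <cstring>
    #include <string>

    // Sketch, not the r16161 implementation.  Decode %XX except where the
    // decoded byte is a control character, is >= 0x7f, or is in the
    // "keep escaped" list from the comment above.
    static int hex_val(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    std::string pretty_url(const std::string& url) {
        static const char keep_escaped[] = ":/?#[]@!$&'()*+,;=%";
        std::string out;
        for (std::string::size_type i = 0; i != url.size(); ++i) {
            if (url[i] == '%' && i + 2 < url.size()) {
                int hi = hex_val(url[i + 1]), lo = hex_val(url[i + 2]);
                if (hi >= 0 && lo >= 0) {
                    unsigned char ch = hi * 16 + lo;
                    if (ch >= 0x20 && ch < 0x7f &&
                        std::strchr(keep_escaped, char(ch)) == nullptr) {
                        out += char(ch);
                        i += 2;
                        continue;
                    }
                }
            }
            out += url[i];
        }
        return out;
    }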

comment:4 by Olly Betts, 12 years ago

Backported r16161 for 1.2.8 in r16168.

comment:5 by Olly Betts, 12 years ago

Milestone: 1.3.0 → 1.3.x

comment:6 by Olly Betts, 10 years ago

As reported in #644, the biggest issue currently is for filenames with non-ASCII characters. If we either document that the output is UTF-8, or provide a way to tell omegascript that, then $prettyurl could check that top-bit-set characters match that encoding and, if they do (which they generally will), output them as the byte values. That will look nice and should generally cut and paste correctly (though we should check that with a few different browsers).

Of the other characters:

  • < 0x20 and 0x7F are control characters and won't display nicely unencoded anyway.
  • / can't appear in a filename except as a directory separator, so isn't relevant.
  • ? and # are definitely not safe.
  • % we have to leave escaped if it is followed by a hex digit, but we might be able to unescape it in other cases (this seems to work in Firefox; not tried in other browsers).
  • :[]@!$&'()*+,;= seem to work OK unencoded, in Firefox at least.

comment:7 by Olly Betts, 10 years ago

Milestone: 1.3.x → 1.3.3

comment:8 by Olly Betts, 10 years ago

I've set up a testcase for the other characters:

http://survex.com/~olly/550/

I've realised : is definitely not safe in general, as http:bad.html should get parsed as scheme http:. Experiments bear that out. Some cases (like a link to :.html) work in most browsers (but not all). It seems to be safe if there's an explicit scheme, which wouldn't be too hard to check, but I think this probably isn't worth context-sensitive handling.
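To spell that out: a relative link whose target is literally a file called "http:bad.html" has to be written escaped as

    http%3Abad.html

because if $prettyurl decoded the %3A, browsers would read the result

    http:bad.html

as scheme "http:" with relative path "bad.html", i.e. a different resource entirely.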

And % not followed by a valid pair of hex digits is rejected by Apache, and MSIE doesn't even allow clicking on it.

So the apparently-OK-in-practice list is: []@!$&'()*+,;=.

I've tested (or had tested) the cases on the page above with recent versions of iceweasel/firefox, chromium, w3m, lynx, links, and MSIE. Results for other browsers and for older versions welcome.

Last edited 9 years ago by Olly Betts

comment:9 by James Aylett, 10 years ago

A note on why []@ should be fine for us despite being gen-delims in the URI syntax (http://tools.ietf.org/html/rfc3986#section-2.2):

@ is used for authentication delimitation in common schemes, and [] for IPv6 (and future) literals. Both of these fall only in the authority section, which starts with // and so is very unlikely to actually be constructed via $prettyurl in its common use in Xapian.

I may be misconstruing what $prettyurl is for, in which case the above may be incorrect in some uses.

Incidentally, http:bad.html is valid URI syntax, but not a valid http absolute URI (which requires an authority), so I believe it is treated as a relative URI with an explicit scheme. (http:http:bad.html similarly.)

comment:10 by Olly Betts, 10 years ago

So in older versions, we didn't really do a proper job with URL encoding. That got fixed by doing what the latest RFC on the subject said, which is great for the links in the result page, but people also sometimes want to show the URL in the text, and the by-the-book encoding makes URLs much uglier than they were before.

Such URLs really ought to work if cut and pasted, but readability is also important - if a particular URL doesn't work in some ancient or obscure browser, that's probably acceptable.

So to address this, we added $prettyurl to take a URL and undo the percent-encoding where we're confident it isn't needed in practice. The URL might be full or relative, and could theoretically use any scheme, though in practice it's most likely to be http: or https:, so handling those well is particularly important.

So we do have to deal with an authority section, but we only need to worry about decoding, not encoding. None of []@ are valid in hostnames IIRC, but they could be seen in a username or password. Having those in search result links seems unlikely, but perhaps we should do some basic parsing of the URL and limit what we decode here.

I'm aware http:bad.html is valid - it just doesn't mean the same as http%3Abad.html (the "bad" is that it's bad to undo the percent encoding there). And http:http:bad.html was a test to see if an unencoded : works if there is an explicit scheme (which it seems to).

Probably the next step should actually be to try to handle top-bit-set characters. For these, I think we just need to make sure that they're valid for the character set the page is in, though I've not done any tests yet.

Incidentally, I also tested with the browser on my Android phone, and the results are in line with the other mainstream browsers I tried. I'm not sure what this browser is called (the "about" dialog just shows the user agent string, which seems to include the name of just about every web browser I can think of).

comment:11 by James Aylett, 10 years ago (in reply to comment:10)

Replying to olly:

> So we do have to deal with an authority section, but we only need to worry about decoding, not encoding. None of []@ are valid in hostnames IIRC, but they could be seen in a username or password. Having those in search result links seems unlikely, but perhaps we should do some basic parsing of the URL and limit what we decode here.

[] are only for IP literals, so always decoding them is probably safe as no one seems to use them for IPv6 anyway. However, if we were considering parsing the URL, we could probably follow the RFC more precisely, which has different reserved characters for different portions of the URL.

> I'm aware http:bad.html is valid - it just doesn't mean the same as http%3Abad.html (the "bad" is that it's bad to undo the percent encoding there). And http:http:bad.html was a test to see if an unencoded : works if there is an explicit scheme (which it seems to).

My understanding of http:http:bad.html is that it gets parsed as scheme=http:[relative-path=http:bad.html], with an empty authority and other pieces, because : doesn't need escaping in path segments (unless it's the first one and there's no scheme, which doesn't apply here). (The collected ABNF in RFC 3986 seems to actually spell this out, although it's considerably less clear if you read through the RFC from top to bottom. Sigh.)
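For reference, the relevant productions from the collected ABNF of RFC 3986 (section 3.3) are:

    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"
    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

So ":" is allowed in pchar (and hence in all later segments), but excluded from segment-nz-nc, which only matters for the first segment of a relative reference with no scheme.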

> Probably the next step should actually be to try to handle top-bit-set characters. For these, I think we just need to make sure that they're valid for the character set the page is in, though I've not done any tests yet.

There are also IRIs (RFC 3987) for going full Unicode, and IDNA (RFC 5890 et al) for internationalised domain names (in the authority). However, I suspect that these may conflict with, e.g., a page in ISO-8859-2 and a query string that has been encoded for the page (which will probably "just work"). It may be that we need one filter that interprets as UTF-8 and reverses IRI/IDNA escaping for prettifying, separate from one that can work in codepages.

> Incidentally, I also tested with the browser on my Android phone, and the results are in line with the other mainstream browsers I tried. I'm not sure what this browser is called (the "about" dialog just shows the user agent string, which seems to include the name of just about every web browser I can think of).

Android browser is a variant of Chrom[e|ium], I believe.

comment:12 by Olly Betts, 10 years ago

I think right now we should focus on undoing (some of) the encoding that we do to filenames in omindex, so IDNA isn't relevant (as that's for the hostname) and IRIs aren't relevant (as we don't encode using that currently).

Someone could run omindex with a baseurl with IDNA in the hostname and/or IRI encoding in the rest, but they shouldn't then be very surprised to see that same URL prefix as-is in the output of the omega CGI.

Also, handling this for UTF-8 encoded output is what really matters, since that's assumed in places which require an encoding in Xapian, and Omega's $html doesn't escape non-ASCII characters (only <>&").

comment:13 by James Aylett, 10 years ago

That all seems reasonable, yes. Is it worth, alongside this, being more explicit about where omega/tools expect UTF-8? On a quick glance I can't see anything in the RSTs at the moment, and it's probably worth something explicit that talks about how documents in different character sets work, and how non-ASCII filenames will be handled. (Although only the latter would be part of this ticket I'd think.)

comment:14 by Olly Betts, 10 years ago

We should discuss encodings explicitly in the docs (and I think we indeed don't currently).

The main issue is actually filenames, though for text/plain documents we correctly handle files with an explicit BOM, UTF-8, and also real-world cases of ISO-8859-1.

The ISO-8859-1 handling is because our UTF-8 decoder falls back to interpreting invalid UTF-8 sequences as ISO-8859-1 - that's technically invalid behaviour these days, but the security implications are very limited when parsing documents and queries, and changing it would break user code that relies on it (either deliberately or without realising it). The alternative would be to sniff the charset in advance to decide between UTF-8 and ISO-8859-1, then parse as whichever we sniffed, so we'd end up with much the same result, just with an extra pass over the text first.
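To make the fallback concrete, here is a sketch of the described behaviour (not Xapian's actual decoder, and it omits the overlong-form rejection a real one should do):

    #include <string>

    // Decode bytes as UTF-8, but treat any byte that doesn't start (or
    // complete) a valid UTF-8 sequence as ISO-8859-1, where the byte value
    // is the Unicode code point.
    std::u32string decode_lenient(const std::string& s) {
        std::u32string out;
        std::string::size_type i = 0, n = s.size();
        while (i < n) {
            unsigned char c = s[i];
            int extra = -1;
            char32_t ch = c;
            if (c < 0x80) { extra = 0; }
            else if ((c & 0xe0) == 0xc0) { extra = 1; ch = c & 0x1f; }
            else if ((c & 0xf0) == 0xe0) { extra = 2; ch = c & 0x0f; }
            else if ((c & 0xf8) == 0xf0) { extra = 3; ch = c & 0x07; }
            if (extra >= 0 && i + extra < n) {
                std::string::size_type j = i + 1;
                while (j <= i + extra &&
                       (static_cast<unsigned char>(s[j]) & 0xc0) == 0x80) {
                    ch = (ch << 6) | (static_cast<unsigned char>(s[j]) & 0x3f);
                    ++j;
                }
                if (j == i + extra + 1) {   // complete, valid sequence
                    out += ch;
                    i = j;
                    continue;
                }
            }
            out += char32_t(c);             // fall back to ISO-8859-1
            ++i;
        }
        return out;
    }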

There's also the issue of what the encoding of the output (via the templates) is - you'd struggle to make that anything but UTF-8 as things are currently, but we should say that somewhere.

And queries will also be expected to be UTF-8, which means you should set accept-charset="UTF-8" on search <form> tags (unless the page with the search form uses UTF-8 encoding itself). I think that's true even if you use HTTP POST, as I don't think Omega currently looks for a charset in the POST request (but presumably it can have one). GET seems a better option for searches though.
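For example (illustrative markup only; the CGI path and form fields depend on the local setup, though P is Omega's usual query parameter):

    <form action="/cgi-bin/omega" method="get" accept-charset="UTF-8">
      <input type="text" name="P">
      <input type="submit" value="Search">
    </form>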

comment:15 by James Aylett, 10 years ago

I think declaring that HTTP-level IO is expected to be UTF-8 is sufficient, at least for now. If anyone really needs something else, they can file a bug or help figure out how to make things more flexible.

Similarly, assuming UTF-8 for filenames seems like the right call: NTFS certainly does (according to this: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748(v=vs.85).aspx), and apparently most Linux distros will by default (there seems to be some disagreement in some libraries about whether it should follow the user's locale or not, but it sounds like they tend to all use UTF-8 locales out of the box to avoid such problems).

comment:16 by Olly Betts, 9 years ago

[6a77a1b] addresses the ASCII characters better.

Last edited 9 years ago by Olly Betts

comment:17 by Olly Betts, 9 years ago

Description: modified (diff)

comment:18 by Olly Betts, 9 years ago

[575eab1] adds a document about character encodings. Covers the basics, but could be expanded - it doesn't discuss filenames yet.

So left to do here:

  • decode valid %-encoded UTF-8 bytes in $prettyurl
  • discuss encodings of filenames in encoding.rst
  • look into adding accept-charset="UTF-8" to forms
  • look into charset on POST requests

comment:19 by Olly Betts, 9 years ago

Resolution: fixed
Status: assigned → closed

Decoding of valid %-encoded UTF-8 implemented in [6f34d49].

Discussion of filename encoding added in [cc09f52].

And discussion of encoding of form submissions added in [8945389].

It seems no charset is sent with POST requests (nor with GET requests).

comment:20 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.2.21

[08f47bae6e017616b3cd8d35fd8f7d544637ee94] updates the docs to say $prettyurl decodes UTF-8 from 1.2.21 too.

Backports for 1.2.21:

Last edited 9 years ago by Olly Betts