#737 closed enhancement (fixed)

Fix/improve $filters

Reported by:	Olly Betts	Owned by:	Olly Betts
Priority:	highest	Milestone:	1.5.0
Component:	Omega	Version:
Severity:	normal	Keywords:
Cc:		Blocked By:
Blocking:		Operating System:	All

Description

The current encoding of $filters has at least one bug (which was also present in the older encoding used in 1.2.x):

DOCIDORDER=A is the default, but produces an X in $filters, while DOCIDORDER=X is non-default but produces nothing in $filters. However, A and X currently behave identically, as DONT_CARE actually always results in ASCENDING order, so this doesn't seem worth changing anything for on its own. But if/when we change the encoding, we should address this.

And it could be more compact:

Every N term is prefixed by !, but only the first needs to be.
Every encoded string has at least ~~ after the character for DEFAULTOP, which isn't necessary.
The DEFAULTOP character could be omitted when using the default DEFAULTOP.
We could combine some/all of DEFAULTOP, DOCIDORDER and the existing SORTREVERSE/SORTAFTER characters - there are currently 2, 3 and 2*2 states, though more DEFAULTOP values are possible, and about 10+26*2+19=81 characters which don't need URL encoding, so we could support up to 6 DEFAULTOP values and encode all of these into one character which shouldn't need URL encoding.
We could encode value slot numbers using something like base64 and save bytes when slots > 9 are used (or perhaps encode all the slot numbers together such that they'd usually all fit in one byte).
Lists of B and N are sorted, so could easily be prefix-compressed - reducing the size when there are a lot of either, which is a case where keeping the size down matters most.
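
To check the arithmetic in the point about combining DEFAULTOP, DOCIDORDER and SORTREVERSE/SORTAFTER, here is a minimal sketch. The counts come from the text above; the function and constant names are made up for illustration.

```python
# Counts from the ticket text: 10 digits, 26*2 letters and 19 other
# characters which don't need URL encoding.
URL_SAFE_CHARS = 10 + 26 * 2 + 19  # 81

DOCIDORDER_STATES = 3    # three DOCIDORDER states
SORT_STATES = 2 * 2      # SORTREVERSE x SORTAFTER


def combined_states(defaultop_values: int) -> int:
    """Number of distinct states one combined character must represent."""
    return defaultop_values * DOCIDORDER_STATES * SORT_STATES


def max_defaultop_values(alphabet_size: int = URL_SAFE_CHARS) -> int:
    """Largest number of DEFAULTOP values that still fits in one character."""
    return alphabet_size // (DOCIDORDER_STATES * SORT_STATES)


if __name__ == "__main__":
    print(combined_states(2))      # states needed today
    print(max_defaultop_values())  # DEFAULTOP values one character can carry
```

With the current 2 DEFAULTOP values this needs 24 states, and 6 values need 72, which fits in 81; 7 values would need 84, which doesn't.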

The compactness matters as the length of a URL is limited, and using GET is common for search systems. A longer URL can also look uglier when pasted, etc.

Change History (5)

comment:1 by Olly Betts, 2 years ago

Status:	new → assigned

As well as building the filters string, we also build an old_filters string which is the value of FILTERS from Omega < 1.3.4. The newer encoding therefore first appeared in a stable release in 1.4.0 (released 2016-06-24), so we can reasonably drop support for the older one and instead have old_filters support what 1.4.x generates in FILTERS.

comment:2 by Olly Betts, 2 years ago

As a first step, dropped compatibility handling for Xapian 1.2.x xFILTERS encoding in d66c9e9b4d9f8456e6245d0fc1ee59f9e9c5a7d9.

comment:3 by Olly Betts, 2 years ago

Working on this. My WIP so far addresses the first 3 points (any START/END/SPAN filter is now encoded in the same way as date range filters from START.n etc. are), which gets rid of the ~~ when these aren't used. Additionally I've shortened the encoding of date range filters by a character or two in cases where SPAN/SPAN.n isn't used.

The DEFAULTOP character could be omitted when using the default DEFAULTOP.

We probably could, but it's a single character and omitting it entirely seems to complicate things.

We could combine some/all of DEFAULTOP, DOCIDORDER and the existing SORTREVERSE/SORTAFTER characters - there are currently 2, 3 and 2*2 states, though more DEFAULTOP values are possible, and about 10+26*2+19=81 characters which don't need URL encoding, so we could support up to 6 DEFAULTOP values and encode all of these into one character which shouldn't need URL encoding.

This seems a better approach and potentially saves more.

We could encode value slot numbers using something like base64 and save bytes when slots > 9 are used (or perhaps encode all the slot numbers together such that they'd usually all fit in one byte).

Not looked into this.

Lists of B and N are sorted, so could easily be prefix-compressed - reducing the size when there are a lot of either, which is a case where keeping the size down matters most.

Or this.

comment:4 by Olly Betts, 20 months ago

Priority:	normal → highest

We really should do this for 1.5.0.

comment:5 by Olly Betts, 15 months ago

Resolution:	→ fixed
Status:	assigned → closed

Finished off and committed to master as a709f04794725efd8d89d14d726c714ae0c7e7b9. Not suitable for backporting to 1.4.x.

We now encode slot numbers in $filters output with a base-64 like encoding. We need to handle the variable length somehow: each continuation byte is currently flagged by preceding it with a special byte (a space, since that encodes as a single byte (+) in a CGI parameter in a URL). So e.g. 65 -> 1 1, and this means slots 65 to 99 actually encode less compactly than before (but 10 to 64 more compactly). We could rejig to avoid this but it's very rare in my experience to use such large slot numbers.
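
For illustration, a rough sketch of a base-64 like integer encoding with space-flagged continuation bytes. The alphabet and the digit order (most significant first) are assumptions chosen so that 65 comes out as "1 1"; they are not taken from Omega's source, and the exact boundaries may differ from what Omega does.

```python
# Hypothetical 64-character alphabet (an assumption, not Omega's table).
ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-_"
)


def encode_slot(n: int) -> str:
    """Encode n in base 64, flagging each continuation digit with a space."""
    if n < 0:
        raise ValueError("slot numbers are non-negative")
    digits = []
    while True:
        digits.append(ALPHABET[n % 64])
        n //= 64
        if n == 0:
            break
    digits.reverse()  # most significant digit first
    return " ".join(digits)


def decode_slot(s: str) -> int:
    """Inverse of encode_slot()."""
    n = 0
    for ch in s.split(" "):
        n = n * 64 + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    for slot in (9, 10, 63, 65, 99):
        print(slot, repr(encode_slot(slot)))
```

In this sketch a single character covers values 0 to 63, and anything larger costs 3 bytes (digit, space, digit), which is where the loss for two-digit decimal slot numbers above that comes from.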

Filter terms are now prefix-compressed. Also instead of escaping ~ in the term and using ~ as a terminator we now store the length first (a bit like Pascal rather than C strings), using the base-64 like encoding to store the length (and the length of the prefix to reuse). Storing the length doesn't affect the encoding length at all unless terms contain ~ or the string to append to the reused portion is > 63 bytes long, but it's simpler to encode as we can just copy the term data rather than having to scan it for ~.
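
A minimal sketch of prefix compression with length-prefixed strings, to show the idea. Assumptions: the field order (reuse length, then suffix length, then suffix), the 64-character alphabet, and the restriction to lengths below 64 so that each length is a single character are all simplifications of mine, not Omega's actual format.

```python
# Hypothetical 64-character alphabet (an assumption, not Omega's table).
ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-_"
)


def _encode_len(n: int) -> str:
    # Simplification: this sketch only handles lengths that fit in one character.
    if not 0 <= n < 64:
        raise ValueError("sketch only handles lengths below 64")
    return ALPHABET[n]


def compress_terms(terms):
    """Prefix-compress a list of terms (sorted first, so neighbours share prefixes)."""
    out = []
    prev = ""
    for term in sorted(terms):
        reuse = 0
        limit = min(len(prev), len(term))
        while reuse < limit and prev[reuse] == term[reuse]:
            reuse += 1
        suffix = term[reuse:]
        # Length-prefixed, so the suffix is copied as-is with no escaping of ~.
        out.append(_encode_len(reuse) + _encode_len(len(suffix)) + suffix)
        prev = term
    return "".join(out)


def decompress_terms(data: str):
    """Inverse of compress_terms(); returns the terms in sorted order."""
    terms = []
    prev = ""
    i = 0
    while i < len(data):
        reuse = ALPHABET.index(data[i])
        length = ALPHABET.index(data[i + 1])
        term = prev[:reuse] + data[i + 2:i + 2 + length]
        terms.append(term)
        prev = term
        i += 2 + length
    return terms


if __name__ == "__main__":
    encoded = compress_terms(["Hexample.org", "Hexample.com", "Ka~b"])
    print(encoded)
    print(decompress_terms(encoded))
```

The second term only stores "org" plus two length characters, and the term containing ~ needs no special handling.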

DEFAULTOP, DOCIDORDER, SORTREVERSE and SORTAFTER are now encoded together into a single character.
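
One way to pack the four settings into one character is a mixed-radix code, sketched below. The value lists, their order and the output alphabet are assumptions for illustration (D as the third DOCIDORDER value in particular), so the characters produced won't match what Omega emits.

```python
import string

# Hypothetical output alphabet: 62 characters that never need URL encoding.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits

# Assumed value lists and ordering, for illustration only.
DEFAULTOPS = ("AND", "OR")
DOCIDORDERS = ("A", "D", "X")


def pack(defaultop: str, docidorder: str, sort_reverse: bool, sort_after: bool) -> str:
    """Combine the four settings into a single character."""
    code = DEFAULTOPS.index(defaultop)
    code = code * len(DOCIDORDERS) + DOCIDORDERS.index(docidorder)
    code = code * 2 + int(sort_reverse)
    code = code * 2 + int(sort_after)
    return ALPHABET[code]


def unpack(ch: str):
    """Inverse of pack()."""
    code = ALPHABET.index(ch)
    code, sort_after = divmod(code, 2)
    code, sort_reverse = divmod(code, 2)
    code, docidorder = divmod(code, len(DOCIDORDERS))
    return DEFAULTOPS[code], DOCIDORDERS[docidorder], bool(sort_reverse), bool(sort_after)


if __name__ == "__main__":
    ch = pack("OR", "X", True, False)
    print(ch, unpack(ch))
```

Adding a DEFAULTOP value only extends the most significant digit, so existing codes stay stable as long as the total stays within the alphabet.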

It also occurred to me we could hash the encoded filters if they're longer than a certain length. They'd then no longer be guaranteed unique, but it would help avoid exceeding URL length limits. However nobody has ever reported problems with hitting such limits, and the filter encoding we'll produce for the next release series will be more compact than currently, so I think let's worry about that if we ever get reports of it being an issue.
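
For the record, the hashing idea might look something like the sketch below. Nothing like this is implemented; the threshold, the choice of SHA-256 and the truncated digest length are all arbitrary assumptions.

```python
import hashlib

# Hypothetical values; nothing here is taken from Omega.
MAX_FILTERS_LEN = 64
DIGEST_CHARS = 32


def shorten_filters(encoded: str, limit: int = MAX_FILTERS_LEN) -> str:
    """Return encoded unchanged if short enough, otherwise a fixed-size hash of it."""
    if len(encoded) <= limit:
        return encoded
    digest = hashlib.sha256(encoded.encode("utf-8")).hexdigest()
    # Hex digits never need URL encoding; truncating trades uniqueness for size.
    return digest[:DIGEST_CHARS]


if __name__ == "__main__":
    print(shorten_filters("0CHexample.com"))
    print(shorten_filters("x" * 500))
```

This only works because the value is compared with the one from the previous page to spot changed filters and never needs to be decoded.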
