Opened 8 years ago
Closed 8 months ago
#737 closed enhancement (fixed)
Fix/improve $filters
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | highest | Milestone: | 1.5.0 |
Component: | Omega | Version: | |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
The current encoding of $filters has at least one bug (which was also present in the older encoding used in 1.2.x):
- DOCIDORDER=A is the default, but produces an X in $filters.
- DOCIDORDER=X is non-default but produces nothing in $filters.
Currently, however, A and X behave identically because DONT_CARE always results in ASCENDING order, so this doesn't seem worth changing anything for. But if/when we change the encoding, we should address this.
And it could be more compact:
- Every N term is prefixed by !, but only the first needs to be.
- Every encoded string has at least ~~ after the character for DEFAULTOP, which isn't necessary.
- The DEFAULTOP character could be omitted when using the default DEFAULTOP.
- We could combine some/all of DEFAULTOP, DOCIDORDER and the existing SORTREVERSE/SORTAFTER characters - there are currently 2, 3 and 2*2 states, though more DEFAULTOP values are possible, and about 10+26*2+19=81 characters which don't need URL encoding, so we could support up to 6 DEFAULTOP values and encode all of these into one character which shouldn't need URL encoding.
- We could encode value slot numbers using something like base64 and save bytes when slots > 9 are used (or perhaps encode all the slot numbers together such that they'd usually all fit in one byte).
- Lists of B and N are sorted, so could easily be prefix-compressed - reducing the size when there are a lot of either, which is a case where keeping the size down matters most.
The compactness matters as the length of a URL is limited, and using GET is common for search systems. A longer URL can also look uglier when pasted, etc.
Change History (5)
comment:1 by , 19 months ago
Status: | new → assigned |
---|
comment:2 by , 19 months ago
As a first step, dropped compatibility handling for Xapian 1.2.x xFILTERS encoding in d66c9e9b4d9f8456e6245d0fc1ee59f9e9c5a7d9.
comment:3 by , 19 months ago
Working on this. My WIP so far addresses the first 3 points (any START/END/SPAN filter is now encoded in the same way as date range filters from START.n, etc are) which gets rid of the ~~ when these aren't used. Additionally I've shortened the encoding of date range filters by a character or two in cases where SPAN/SPAN.n isn't used.
The DEFAULTOP character could be omitted when using the default DEFAULTOP.
We probably could, but it's a single character and omitting it entirely seems to complicate things.
We could combine some/all of DEFAULTOP, DOCIDORDER and the existing SORTREVERSE/SORTAFTER characters - there are currently 2, 3 and 2*2 states, though more DEFAULTOP values are possible, and about 10+26*2+19=81 characters which don't need URL encoding, so we could support up to 6 DEFAULTOP values and encode all of these into one character which shouldn't need URL encoding.
This seems a better approach and potentially saves more.
We could encode value slot numbers using something like base64 and save bytes when slots > 9 are used (or perhaps encode all the slot numbers together such that they'd usually all fit in one byte).
Not looked into this.
Lists of B and N are sorted, so could easily be prefix-compressed - reducing the size when there are a lot of either, which is a case where keeping the size down matters most.
Or this.
comment:5 by , 8 months ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Finished off and committed to master as a709f04794725efd8d89d14d726c714ae0c7e7b9. Not suitable for backporting to 1.4.x.
We now encode slot numbers in $filters output with a base-64 like encoding. We need to handle the variable length somehow: currently each continuation byte is flagged by preceding it with a special byte (a space, since that encodes as a single byte (+) in a CGI parameter in a URL). So e.g. 65 -> 1 1, and this means slots 65 to 99 actually encode less compactly than before (but 10 to 64 more compactly). We could rejig to avoid this but it's very rare in my experience to use such large slot numbers.
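To make the continuation scheme concrete, here is a minimal Python sketch of one way such an encoding could work. The 64-character alphabet and the most-significant-digit-first order are assumptions made for illustration (they happen to reproduce the 65 -> 1 1 example above); Omega's actual alphabet and digit order may differ.

```python
# Hypothetical 64-character alphabet; the one Omega really uses may differ.
ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-_"
)
CONTINUE = " "  # flags that another digit follows; becomes "+" in a URL


def encode_number(n):
    """Encode n most significant digit first, flagging each extra digit."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        digits.append(ALPHABET[n % 64])
        n //= 64
        if n == 0:
            break
    digits.reverse()
    return CONTINUE.join(digits)


def decode_number(s, pos=0):
    """Decode one number starting at s[pos]; return (value, next position)."""
    value = ALPHABET.index(s[pos])
    pos += 1
    # Each continuation digit is announced by a preceding flag byte.
    while pos + 1 < len(s) and s[pos] == CONTINUE:
        value = value * 64 + ALPHABET.index(s[pos + 1])
        pos += 2
    return value, pos
```

Under these assumptions slot 12 encodes as the single character C (one byte instead of two in decimal), while slot 65 encodes as "1 1" (three bytes instead of two), which is the trade-off described above.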
Filter terms are now prefix-compressed. Also instead of escaping ~ in the term and using ~ as a terminator we now store the length first (a bit like Pascal rather than C strings), using the base-64 like encoding to store the length (and the length of the prefix to reuse). Storing the length doesn't affect the encoding length at all unless terms contain ~ or the string to append to the reused portion is > 63 bytes long, but it's simpler to encode as we can just copy the term data rather than having to scan it for ~.
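As a rough illustration of length-prefixed prefix compression, the Python sketch below emits, for each term in a sorted list, the number of characters reused from the previous term, the number of characters to append, and then the appended characters themselves. The field order, the alphabet and the restriction to lengths below 64 (so each length is exactly one character) are simplifying assumptions, not a description of Omega's actual format; it also counts characters rather than bytes, which only coincide for ASCII terms.

```python
import os

# Hypothetical 64-character alphabet used for the one-character lengths.
ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-_"
)


def _length_char(n):
    """Encode a length in 0..63 as one character (sketch-only limit)."""
    if not 0 <= n < 64:
        raise ValueError("this sketch only handles lengths below 64")
    return ALPHABET[n]


def compress_terms(terms):
    """Prefix-compress terms, storing lengths up front (no terminator)."""
    out = []
    prev = ""
    for term in sorted(terms):
        # Reuse as much of the previous term as possible, capped at 63.
        reuse = min(len(os.path.commonprefix([prev, term])), 63)
        suffix = term[reuse:]
        out.append(_length_char(reuse) + _length_char(len(suffix)) + suffix)
        prev = term
    return "".join(out)


def decompress_terms(data):
    """Inverse of compress_terms; returns the terms in sorted order."""
    terms = []
    prev = ""
    pos = 0
    while pos < len(data):
        reuse = ALPHABET.index(data[pos])
        append = ALPHABET.index(data[pos + 1])
        suffix = data[pos + 2:pos + 2 + append]
        prev = prev[:reuse] + suffix
        terms.append(prev)
        pos += 2 + append
    return terms
```

Because the suffix is copied verbatim and its length is known in advance, a term containing ~ needs no escaping and the encoder never has to scan the term data.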
DEFAULTOP, DOCIDORDER, SORTREVERSE and SORTAFTER are now encoded together into a single character.
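One way to picture this is as a mixed-radix number mapped onto a character that needs no URL encoding. The Python sketch below is illustrative only: the ordering of the four fields, the use of small integers for each setting, and the 62-character alphanumeric alphabet are all assumptions, and the character actually produced by Omega may be computed differently.

```python
import string

# Hypothetical alphabet: 62 alphanumerics, none of which need URL encoding.
SAFE_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Number of states per field, in an assumed order:
# DEFAULTOP (2), DOCIDORDER (3), SORTREVERSE (2), SORTAFTER (2).
RADICES = (2, 3, 2, 2)


def pack(values, radices=RADICES):
    """Combine one small integer per field into a single character."""
    if len(values) != len(radices):
        raise ValueError("need exactly one value per field")
    index = 0
    for value, radix in zip(values, radices):
        if not 0 <= value < radix:
            raise ValueError("value out of range for its field")
        index = index * radix + value
    return SAFE_CHARS[index]


def unpack(ch, radices=RADICES):
    """Recover the per-field integers from a packed character."""
    index = SAFE_CHARS.index(ch)
    values = []
    for radix in reversed(radices):
        index, value = divmod(index, radix)
        values.append(value)
    if index:
        raise ValueError("character does not encode a valid state")
    return tuple(reversed(values))
```

With the current 2 * 3 * 2 * 2 = 24 states any alphabet of at least 24 characters suffices. The non-DEFAULTOP fields take 3 * 2 * 2 = 12 states, so the roughly 81 usable characters estimated in the description leave room for 81 // 12 = 6 DEFAULTOP values.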
It also occurred to me we could hash the encoded filters if they're longer than a certain length. They'd then no longer be guaranteed unique, but it would help avoid exceeding URL length limits. However nobody has ever reported problems with hitting such limits, and the filter encoding we'll produce for the next release series will be more compact than currently, so I think let's worry about that if we ever get reports of it being an issue.
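For reference, the hashing idea could look something like the following Python sketch. Nothing like this has been implemented; the threshold value, the choice of SHA-256 and the function name are all hypothetical.

```python
import hashlib

# Hypothetical threshold; it must exceed the 64-character hex digest length
# for hashing to ever make the value shorter.
MAX_FILTERS_LEN = 128


def shorten_filters(encoded, max_len=MAX_FILTERS_LEN):
    """Return encoded unchanged if short enough, else a fixed-size digest."""
    if len(encoded) <= max_len:
        return encoded
    # Only equality of two filter strings matters, so a digest is enough,
    # at the cost of a (very unlikely) collision between different filters.
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

The output is bounded at max_len characters whatever the number of filters, which is what would keep the URL within length limits.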
As well as building the filters string, we also build an old_filters string which is the value of FILTERS from Omega < 1.3.4. This means the first stable release it was in was 1.4.0, released 2016-06-24, so we can reasonably drop support for this and instead have old_filters supporting what 1.4.x generates in FILTERS.