Opened 15 years ago
Closed 9 years ago
#452 closed enhancement (fixed)
Add support for sorting documents missing a value to end
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.4 |
Component: | Library API | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
Currently, when performing a sort in ascending order based on a value slot, documents with no value stored in the slot will be placed highest (since "" sorts before any other value). However, this is often not desirable: in particular, if a numeric value such as price has been stored in the slot, and sorting is in ascending price order, it is likely that documents without a price stored shouldn't be ranked above the cheapest products.
A KeyMaker subclass can easily be written to support this, but it would be useful for there to be support in a standard KeyMaker (so that users of languages like Python could use it). I have a patch attached which allows MultiValueKeyMaker to be given a "default value" which is used in place of the empty value for documents with no value stored. This is slightly more flexible than simply sorting to the end; for example, the median price stored could be tracked, and used to sort documents with no price stored at this position, instead of simply sorting them to the end.
For backends which store an upper bound on the value in a slot, it is trivial to compute a value which sorts after all other values in the slot - just append a character to the upper bound.
The patch is fairly simple, and contains a few testcases. Exactly as implemented it is probably ABI incompatible (it extends an API method with a new parameter with a default argument), but it could be trivially adjusted to use an overloaded method instead for ABI compatibility if added during the 1.2.x series.
Attachments (1)
Change History (8)
by , 15 years ago
Attachment: | sortemptylast.patch added |
---|
comment:1 by , 15 years ago
The patch also changes the type in the vector in MultiValueKeyMaker, which is an incompatible ABI change, and not so trivial to fix.
Probably the best bet is what we did in an early 1.0.x release - put the new version in a "Foo" namespace and pull it into namespace Xapian with "using Foo::MultiValueKeyMaker;". The old version gets removed from the external header, but is still present in the library (with public visibility) for dynamic linking of binaries built against 1.2.0-1.2.x just before this change.
Care is also needed to stop SWIG exploding, so I'd copy how we did this in 1.0.x as that should have the wrinkles sorted out already.
comment:2 by , 12 years ago
Milestone: | 1.2.x → 1.3.2 |
---|
Let's get this applied in 1.3.x, so there's no worry about ABI changes. Marking for 1.3.2 for now.
comment:3 by , 11 years ago
Milestone: | 1.3.2 → 1.3.3 |
---|
comment:4 by , 10 years ago
Milestone: | 1.3.3 → 1.3.4 |
---|
comment:5 by , 9 years ago
Maybe it doesn't matter, but in general the patch doesn't actually allow what the title says - you'd have to set the default value to a string which sorts after any other string, but there isn't really such a string (a std::string
of maximum length consisting entirely of byte 255 would work, but that's too large to be a practical solution).
For cases on a more restricted domain (for example where sortable_serialise() is in use) it's fine, and that's even true for UTF-8 text as '\xff'
is not valid in UTF-8 - the string "\xff"
serves as "UTF-8 infinity".
comment:6 by , 9 years ago
For backends which store an upper bound on the value in a slot, it is trivial to compute a value which sorts after all other values in the slot - just append a character to the upper bound.
Ah, I missed this - indeed, we actually know an upper bound, so my logic breaks down.
comment:7 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Patch applied (with minor updates due to changes to the underlying code) in [cf1f9c66ca32594a246822218b71a3e01840466e/git].
And documented the get_value_upper_bound(slot) + "x"
trick in [87ea363f3ad4d01a51a92ded8205ebdb87f36142/git].
Patch to add option to sort empty values to last (or other position based on default parameter)