Opened 14 years ago

Closed 8 years ago

#452 closed enhancement (fixed)

Add support for sorting documents missing a value to end

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone: 1.3.4
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, when performing a sort in ascending order based on a value slot, documents with no value stored in the slot will be placed highest (since "" sorts before any other value). However, this is often not desirable: in particular, if a numeric value such as price has been stored in the slot, and sorting is in ascending price order, it is likely that documents without a price stored shouldn't be ranked above the cheapest products.

A KeyMaker subclass can easily be written to support this, but it would be useful for there to be support in a standard KeyMaker (so that users of languages like Python could use it). I have a patch attached which allows MultiValueKeyMaker to be given a "default value" which is used in place of the empty value for documents with no value stored. This is slightly more flexible than simply sorting to the end; for example, the median price stored could be tracked, and used to sort documents with no price stored at this position, instead of simply sorting them to the end.

For backends which store an upper bound on the value in a slot, it is trivial to compute a value which sorts after all other values in the slot - just append a character to the upper bound.

The patch is fairly simple, and contains a few testcases. Exactly as implemented it is probably ABI incompatible (it extends an API method with a new parameter with a default argument), but it could be trivially adjusted to use an overloaded method instead for ABI compatibility if added during the 1.2.x series.

Attachments (1)

sortemptylast.patch (5.4 KB ) - added by Richard Boulton 14 years ago.
Patch to add option to sort empty values to last (or other position based on default parameter)

Download all attachments as: .zip

Change History (8)

by Richard Boulton, 14 years ago

Attachment: sortemptylast.patch added

Patch to add option to sort empty values to last (or other position based on default parameter)

comment:1 by Olly Betts, 14 years ago

The patch also changes the type in the vector in MultiValueKeyMaker, which is an incompatible ABI change, and not so trivial to fix.

Probably the best bet is what we did in an early 1.0.x release - put the new version in a "Foo" namespace and pull it into namespace Xapian with "using Foo::MultiValueKeyMaker;". The old version gets removed from the external header, but is still present in the library (with public visibility) for dynamic linking of binaries built against 1.2.0-1.2.x just before this change.

Care is also needed to stop SWIG exploding, so I'd copy how we did this in 1.0.x as that should have the wrinkles sorted out already.

comment:2 by Olly Betts, 11 years ago

Milestone: 1.2.x1.3.2

Let's get this applied in 1.3.x, so there's no worry about ABI changes. Marking for 1.3.2 for now.

comment:3 by Olly Betts, 11 years ago

Milestone: 1.3.21.3.3

comment:4 by Olly Betts, 9 years ago

Milestone: 1.3.31.3.4

comment:5 by Olly Betts, 8 years ago

Maybe it doesn't matter, but in general the patch doesn't actually allow what the title says - you'd have to set the default value to a string which sorts after any other string, but there isn't really such a string (a std::string of maximum length consisting entirely of byte 255 would work, but that's too large to be a practical solution).

For cases on a more restricted domain (for example where sortable_serialise() is in use) it's fine, and that's even true for UTF-8 text as '\xff' is not valid in UTF-8 - the string "\xff" serves as "UTF-8 infinity".

comment:6 by Olly Betts, 8 years ago

For backends which store an upper bound on the value in a slot, it is trivial to compute a value which sorts after all other values in the slot - just append a character to the upper bound.

Ah, I missed this - indeed, we actually know an upper bound, so my logic breaks down.

comment:7 by Olly Betts, 8 years ago

Resolution: fixed
Status: newclosed

Patch applied (with minor updates due to changes to the underlying code) in [cf1f9c66ca32594a246822218b71a3e01840466e/git].

And documented the get_value_upper_bound(slot) + "x" trick in [87ea363f3ad4d01a51a92ded8205ebdb87f36142/git].

Note: See TracTickets for help on using tickets.