Opened 18 years ago

Closed 17 years ago

Last modified 17 years ago

#100 closed enhancement (released)

Allow custom sort orders

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone:
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: Olly Betts, Mark Hammond Blocked By:
Blocking: Operating System: All

Description

Xapian should allow sort orders to be implemented by subclassing a base "Match Comparison" class, which compares two MatchItems.

The attached patch implements this feature. It's a fairly invasive patch, because it is necessary to sort the Document object in the MatchItems used by the matcher, rather than a single sort key.

The patch works correctly in all cases, as far as I can tell, but still needs some work to improve performance in the case of remote databases. Previously, the mset items passed from the remote database contained the value of the sort-key in a member. However, for the new version, this member is replaced by a document object. At present, this document object is not passed when the mset items are passed, and instead will be requested when the document is accessed (which will probably happen when the items are being merged into the local mset). Instead, the document objects should be passed across with the mset item; or better, the partial document objects should be passed, to avoid needing to access the document data and termlist tables more than necessary.

This issue should probably be fixed before the patch is merged, to avoid a performance-regression. (It would be good to do some general benchmarking before merging this, anyway, for just that reason.)

Attachments (3)

matchcmp.patch (49.2 KB ) - added by Richard Boulton 18 years ago.
Initial patch to implement custom sort orders
customsortorders2.patch (49.0 KB ) - added by Richard Boulton 18 years ago.
Updated patch to implement custom sort orders
patch (45.4 KB ) - added by Richard Boulton 18 years ago.
Updated patch again, to apply to HEAD version 8312

Download all attachments as: .zip

Change History (18)

comment:1 by Richard Boulton, 18 years ago

Status: newassigned

comment:2 by Richard Boulton, 18 years ago

Blocked By: 107 added

Merging this patch involves modifying very performance sensitive code, so I'm marking it as depending on bug #107. We don't necessarily need a fully completed performance test suite to progress this, but do need at least the ability to benchmark search speeds.

comment:3 by Olly Betts, 18 years ago

Component: OtherLibrary API
op_sys: LinuxAll
rep_platform: PCAll
Severity: normalenhancement

I just took a look at this patch, wondering what the performance worries were, and I'm a bit suprised!

I don't think we need to make a Xapian::Document available here. I worry this is adding generality that just isn't required. The main feature which users currently actually seem to miss is being able to sort on "atoi(value)" rather than being forced to left-pad numbers stored as strings with zeros or spaces. All the other real-world uses I can think of are similar - collating orders for particular languages, etc.

So I imagined we'd just provide a user-subclassable functor which takes two strings and returns true or false. Sorting by both values and document weight would be possible by using "sort_by_value_then_relevance" or "sort_by_relevance_then_value" as with non-custom sort orders.

Perhaps it could be useful to allow sorting on a secondary value if the first value is the same for two documents, but I've never heard that asked about.

Performance worries aside, I'm a bit concerned this adds code complexity disproportionate to the benefit to users. I'm also worried it could make it more difficult to take advantage of the planned change to how values are stored, which I think has the potential to greatly improve performance of sort by value.

It's good that currently we know the value number for sorting in advance, since

we will be able to open the appropriate "ValueList". At least for MatchDecider we can defer applying the MatchDecider, but we can't do that for document ordering that I can see.

comment:4 by Olly Betts, 18 years ago

Cc: olly@… added

Here's a somewhat lateral thought (inspired by the Schwartzian transform idiom in Perl: http://en.wikipedia.org/wiki/Schwartzian_transform )

The sort function gets called a lot, so it's a performance sensitive area, so we want to minimise the work it does.

If m documents get considered for the M-set, we must compare each at least once, so there will be >= (m-1) comparisons. In reality it'll be a lot more, probably O(m.log(m)).

So what if instead of allowing the sort function to be specified, we provided a functor which allows the user to construct the key use for sorting? This functor would be called exactly m times. Then the matcher sorting mechanism would use these pre-constructed keys instead of the raw values.

So for a numerical sort, you could construct the sort key by doing:

sort_key = string(char(value.size())) + value;

We could provide helper methods for this and other common encodings (like reversing the sort order).

This would fit nicely with the remote backend since we just need to pass these built sort keys across instead of the sort values.

If this seems a workable plan, it would be easy to prototype - we could just hack the current sorting code to use it and see how performance compares.

by Richard Boulton, 18 years ago

Attachment: matchcmp.patch added

Initial patch to implement custom sort orders

comment:5 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

comment:6 by Mark Hammond, 18 years ago

Cc: mhammond@… added

comment:7 by Olly Betts, 18 years ago

FWIW, someone has now asked about sorting by more than one value (on xapian-discuss IIRC, but I don't have a reference to hand).

comment:8 by Richard Boulton, 18 years ago

Blocking: 120 added

Taking that user into account, I've now come across three instances of users needing to be able to sort by multiple keys: the others being the original users I implemented this for, and a customer I'm about to implement various things for.

I think, therefore, that a patch which allows multiple sort keys to work is justified - but it doesn't have to be implemented in the way this patch does. On the other hand, I believe that some parts of this patch simplify the code, rather than increasing its complexity.

Maybe a patch which implemented a fixed, but more flexible, sorting mechanism would be more useful: for example, something which specified a list of values to use for the sort, and the comparison method for each value (ie, numeric/date/string), and the comparision direction (ie, ascending, descending).

This wouldn't let us do really fun things like geolocation sorting, though (where the comparison function would calculate the distance (stored as latitude and longitude, in a value) from a point, and sort on that distance).

Regarding the Schwartzian transform idea - I don't think that the comparison function will usually be the expensive part - it'll be getting the data for the comparison function that costs. It could be a gain for the remote database, though, because the only the transformed value would need to be passed across the network connection, rather than all values in use. It might be interesting to investigate it. Again - coming back to the geolocation idea as one of the more complicated sort orders possible - a geolocation sort would involve a fairly complicated calculation (particularly if we want to cope with large distances, so have to take the curvature of the earth into account), and the Schwartzian transform method would allow this calculation to be performed only once.

Anyway, I'll be looking into this soon, so I'm marking it as for the 1.0 series so it won't get forgotten. (It may not be appropriate for 1.0 due to API breaking issues, but we can cross that bridge when we have an implementation agreed.)

by Richard Boulton, 18 years ago

Attachment: patch added

Updated patch to implement custom sort orders

comment:9 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

comment:10 by Olly Betts, 17 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: assignednew

I'm going to be looking at this, so assigning to myself.

comment:11 by Olly Betts, 17 years ago

Status: newassigned

comment:12 by Olly Betts, 17 years ago

Blocked By: 107 removed
Blocking: 200 added; 120 removed

Apart from some missing feature tests, this is now in SVN HEAD.

Marking as blocking 1.0.5 rather than 1.0.x.

Removing block from the performance testing suite as the approach taken won't affect how sorting by a single value works.

comment:13 by Olly Betts, 17 years ago

attachments.isobsolete: 01

comment:14 by Olly Betts, 17 years ago

Resolution: fixed
Status: assignedclosed

Now fixed in SVN HEAD.

comment:15 by Olly Betts, 17 years ago

Resolution: fixedreleased

comment:16 by Olly Betts, 17 years ago

Blocking: 200 removed
Operating System: All
Note: See TracTickets for help on using tickets.