Ticket #100 (closed enhancement: released)

Opened 2 years ago

Last modified 13 months ago

Allow custom sort orders

Reported by: richard Owned by: olly
Priority: normal Milestone:
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: olly, mhammond Blocked By:
Operating System: All Blocking:

Description

Xapian should allow sort orders to be implemented by subclassing a base "Match Comparison" class, which compares two MatchItems?.

The attached patch implements this feature. It's a fairly invasive patch, because it is necessary to sort the Document object in the MatchItems? used by the matcher, rather than a single sort key.

The patch works correctly in all cases, as far as I can tell, but still needs some work to improve performance in the case of remote databases. Previously, the mset items passed from the remote database contained the value of the sort-key in a member. However, for the new version, this member is replaced by a document object. At present, this document object is not passed when the mset items are passed, and instead will be requested when the document is accessed (which will probably happen when the items are being merged into the local mset). Instead, the document objects should be passed across with the mset item; or better, the partial document objects should be passed, to avoid needing to access the document data and termlist tables more than necessary.

This issue should probably be fixed before the patch is merged, to avoid a performance-regression. (It would be good to do some general benchmarking before merging this, anyway, for just that reason.)

Attachments

matchcmp.patch (49.2 kB) - added by richard 22 months ago.
Initial patch to implement custom sort orders
customsortorders2.patch (49.0 kB) - added by richard 21 months ago.
Updated patch to implement custom sort orders
patch (45.4 kB) - added by richard 21 months ago.
Updated patch again, to apply to HEAD version 8312

Change History

Changed 2 years ago by richard

  • status changed from new to assigned

Changed 2 years ago by richard

  • blockedby set to 107

Merging this patch involves modifying very performance sensitive code, so I'm marking it as depending on bug #107. We don't necessarily need a fully completed performance test suite to progress this, but do need at least the ability to benchmark search speeds.

Changed 2 years ago by olly

  • rep_platform changed from PC to All
  • component changed from Other to Library API
  • op_sys changed from Linux to All
  • severity changed from normal to enhancement

I just took a look at this patch, wondering what the performance worries were, and I'm a bit suprised!

I don't think we need to make a Xapian::Document available here. I worry this is adding generality that just isn't required. The main feature which users currently actually seem to miss is being able to sort on "atoi(value)" rather than being forced to left-pad numbers stored as strings with zeros or spaces. All the other real-world uses I can think of are similar - collating orders for particular languages, etc.

So I imagined we'd just provide a user-subclassable functor which takes two strings and returns true or false. Sorting by both values and document weight would be possible by using "sort_by_value_then_relevance" or "sort_by_relevance_then_value" as with non-custom sort orders.

Perhaps it could be useful to allow sorting on a secondary value if the first value is the same for two documents, but I've never heard that asked about.

Performance worries aside, I'm a bit concerned this adds code complexity disproportionate to the benefit to users. I'm also worried it could make it more difficult to take advantage of the planned change to how values are stored, which I think has the potential to greatly improve performance of sort by value.

It's good that currently we know the value number for sorting in advance, since

we will be able to open the appropriate "ValueList?". At least for MatchDecider? we can defer applying the MatchDecider?, but we can't do that for document ordering that I can see.

Changed 2 years ago by olly

  • cc olly@… added

Here's a somewhat lateral thought (inspired by the Schwartzian transform idiom in Perl: http://en.wikipedia.org/wiki/Schwartzian_transform )

The sort function gets called a lot, so it's a performance sensitive area, so we want to minimise the work it does.

If m documents get considered for the M-set, we must compare each at least once, so there will be >= (m-1) comparisons. In reality it'll be a lot more, probably O(m.log(m)).

So what if instead of allowing the sort function to be specified, we provided a functor which allows the user to construct the key use for sorting? This functor would be called exactly m times. Then the matcher sorting mechanism would use these pre-constructed keys instead of the raw values.

So for a numerical sort, you could construct the sort key by doing:

sort_key = string(char(value.size())) + value;

We could provide helper methods for this and other common encodings (like reversing the sort order).

This would fit nicely with the remote backend since we just need to pass these built sort keys across instead of the sort values.

If this seems a workable plan, it would be easy to prototype - we could just hack the current sorting code to use it and see how performance compares.

Changed 22 months ago by richard

Changed 22 months ago by richard

  • attachments.isobsolete changed from 0 to 1

Changed 22 months ago by richard

Initial patch to implement custom sort orders

Changed 22 months ago by mhammond

  • cc mhammond@… added

Changed 21 months ago by olly

FWIW, someone has now asked about sorting by more than one value (on xapian-discuss IIRC, but I don't have a reference to hand).

Changed 21 months ago by richard

  • blocking set to 120

Taking that user into account, I've now come across three instances of users needing to be able to sort by multiple keys: the others being the original users I implemented this for, and a customer I'm about to implement various things for.

I think, therefore, that a patch which allows multiple sort keys to work is justified - but it doesn't have to be implemented in the way this patch does. On the other hand, I believe that some parts of this patch simplify the code, rather than increasing its complexity.

Maybe a patch which implemented a fixed, but more flexible, sorting mechanism would be more useful: for example, something which specified a list of values to use for the sort, and the comparison method for each value (ie, numeric/date/string), and the comparision direction (ie, ascending, descending).

This wouldn't let us do really fun things like geolocation sorting, though (where the comparison function would calculate the distance (stored as latitude and longitude, in a value) from a point, and sort on that distance).

Regarding the Schwartzian transform idea - I don't think that the comparison function will usually be the expensive part - it'll be getting the data for the comparison function that costs. It could be a gain for the remote database, though, because the only the transformed value would need to be passed across the network connection, rather than all values in use. It might be interesting to investigate it. Again - coming back to the geolocation idea as one of the more complicated sort orders possible - a geolocation sort would involve a fairly complicated calculation (particularly if we want to cope with large distances, so have to take the curvature of the earth into account), and the Schwartzian transform method would allow this calculation to be performed only once.

Anyway, I'll be looking into this soon, so I'm marking it as for the 1.0 series so it won't get forgotten. (It may not be appropriate for 1.0 due to API breaking issues, but we can cross that bridge when we have an implementation agreed.)

Changed 21 months ago by richard

Updated patch to implement custom sort orders

Changed 21 months ago by richard

  • attachments.isobsolete changed from 0 to 1

Changed 15 months ago by olly

  • owner changed from richard to olly
  • status changed from assigned to new

I'm going to be looking at this, so assigning to myself.

Changed 14 months ago by olly

  • status changed from new to assigned

Changed 13 months ago by olly

  • blocking changed from 120 to 200
  • blockedby deleted

Apart from some missing feature tests, this is now in SVN HEAD.

Marking as blocking 1.0.5 rather than 1.0.x.

Removing block from the performance testing suite as the approach taken won't affect how sorting by a single value works.

Changed 13 months ago by olly

  • attachments.isobsolete changed from 0 to 1

Changed 13 months ago by olly

  • status changed from assigned to closed
  • resolution set to fixed

Now fixed in SVN HEAD.

Changed 13 months ago by olly

  • resolution changed from fixed to released

Changed 13 months ago by olly

  • blocking deleted

Changed 13 months ago by trac

  • platform set to All
Note: See TracTickets for help on using tickets.