#100 closed enhancement (released)
Allow custom sort orders
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Library API | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Olly Betts, Mark Hammond | Blocked By: | |
Blocking: | Operating System: | All |
Description
Xapian should allow sort orders to be implemented by subclassing a base "Match Comparison" class, which compares two MatchItems.
The attached patch implements this feature. It's a fairly invasive patch, because it is necessary to sort the Document object in the MatchItems used by the matcher, rather than a single sort key.
The patch works correctly in all cases, as far as I can tell, but still needs some work to improve performance in the case of remote databases. Previously, the mset items passed from the remote database contained the value of the sort-key in a member. However, for the new version, this member is replaced by a document object. At present, this document object is not passed when the mset items are passed, and instead will be requested when the document is accessed (which will probably happen when the items are being merged into the local mset). Instead, the document objects should be passed across with the mset item; or better, the partial document objects should be passed, to avoid needing to access the document data and termlist tables more than necessary.
This issue should probably be fixed before the patch is merged, to avoid a performance-regression. (It would be good to do some general benchmarking before merging this, anyway, for just that reason.)
Attachments (3)
Change History (18)
comment:1 by , 18 years ago
Status: | new → assigned |
---|
comment:2 by , 18 years ago
Blocked By: | 107 added |
---|
comment:3 by , 18 years ago
Component: | Other → Library API |
---|---|
op_sys: | Linux → All |
rep_platform: | PC → All |
Severity: | normal → enhancement |
I just took a look at this patch, wondering what the performance worries were, and I'm a bit suprised!
I don't think we need to make a Xapian::Document available here. I worry this is adding generality that just isn't required. The main feature which users currently actually seem to miss is being able to sort on "atoi(value)" rather than being forced to left-pad numbers stored as strings with zeros or spaces. All the other real-world uses I can think of are similar - collating orders for particular languages, etc.
So I imagined we'd just provide a user-subclassable functor which takes two strings and returns true or false. Sorting by both values and document weight would be possible by using "sort_by_value_then_relevance" or "sort_by_relevance_then_value" as with non-custom sort orders.
Perhaps it could be useful to allow sorting on a secondary value if the first value is the same for two documents, but I've never heard that asked about.
Performance worries aside, I'm a bit concerned this adds code complexity disproportionate to the benefit to users. I'm also worried it could make it more difficult to take advantage of the planned change to how values are stored, which I think has the potential to greatly improve performance of sort by value.
It's good that currently we know the value number for sorting in advance, since
we will be able to open the appropriate "ValueList". At least for MatchDecider we can defer applying the MatchDecider, but we can't do that for document ordering that I can see.
comment:4 by , 18 years ago
Cc: | added |
---|
Here's a somewhat lateral thought (inspired by the Schwartzian transform idiom in Perl: http://en.wikipedia.org/wiki/Schwartzian_transform )
The sort function gets called a lot, so it's a performance sensitive area, so we want to minimise the work it does.
If m documents get considered for the M-set, we must compare each at least once, so there will be >= (m-1) comparisons. In reality it'll be a lot more, probably O(m.log(m)).
So what if instead of allowing the sort function to be specified, we provided a functor which allows the user to construct the key use for sorting? This functor would be called exactly m times. Then the matcher sorting mechanism would use these pre-constructed keys instead of the raw values.
So for a numerical sort, you could construct the sort key by doing:
sort_key = string(char(value.size())) + value;
We could provide helper methods for this and other common encodings (like reversing the sort order).
This would fit nicely with the remote backend since we just need to pass these built sort keys across instead of the sort values.
If this seems a workable plan, it would be easy to prototype - we could just hack the current sorting code to use it and see how performance compares.
comment:5 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:6 by , 18 years ago
Cc: | added |
---|
comment:7 by , 18 years ago
FWIW, someone has now asked about sorting by more than one value (on xapian-discuss IIRC, but I don't have a reference to hand).
comment:8 by , 18 years ago
Blocking: | 120 added |
---|
Taking that user into account, I've now come across three instances of users needing to be able to sort by multiple keys: the others being the original users I implemented this for, and a customer I'm about to implement various things for.
I think, therefore, that a patch which allows multiple sort keys to work is justified - but it doesn't have to be implemented in the way this patch does. On the other hand, I believe that some parts of this patch simplify the code, rather than increasing its complexity.
Maybe a patch which implemented a fixed, but more flexible, sorting mechanism would be more useful: for example, something which specified a list of values to use for the sort, and the comparison method for each value (ie, numeric/date/string), and the comparision direction (ie, ascending, descending).
This wouldn't let us do really fun things like geolocation sorting, though (where the comparison function would calculate the distance (stored as latitude and longitude, in a value) from a point, and sort on that distance).
Regarding the Schwartzian transform idea - I don't think that the comparison function will usually be the expensive part - it'll be getting the data for the comparison function that costs. It could be a gain for the remote database, though, because the only the transformed value would need to be passed across the network connection, rather than all values in use. It might be interesting to investigate it. Again - coming back to the geolocation idea as one of the more complicated sort orders possible - a geolocation sort would involve a fairly complicated calculation (particularly if we want to cope with large distances, so have to take the curvature of the earth into account), and the Schwartzian transform method would allow this calculation to be performed only once.
Anyway, I'll be looking into this soon, so I'm marking it as for the 1.0 series so it won't get forgotten. (It may not be appropriate for 1.0 due to API breaking issues, but we can cross that bridge when we have an implementation agreed.)
comment:9 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:10 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
I'm going to be looking at this, so assigning to myself.
comment:11 by , 17 years ago
Status: | new → assigned |
---|
comment:12 by , 17 years ago
Blocked By: | 107 removed |
---|---|
Blocking: | 200 added; 120 removed |
Apart from some missing feature tests, this is now in SVN HEAD.
Marking as blocking 1.0.5 rather than 1.0.x.
Removing block from the performance testing suite as the approach taken won't affect how sorting by a single value works.
comment:13 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:15 by , 17 years ago
Resolution: | fixed → released |
---|
comment:16 by , 17 years ago
Blocking: | 200 removed |
---|---|
Operating System: | → All |
Merging this patch involves modifying very performance sensitive code, so I'm marking it as depending on bug #107. We don't necessarily need a fully completed performance test suite to progress this, but do need at least the ability to benchmark search speeds.