Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#311 closed defect (fixed)

set_sort_by_value has incorrect sense for 'ascending'

Reported by: Tom Owned by: Olly Betts
Priority: normal Milestone: 1.0.11
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Demo in python:

import xapian
db = xapian.WritableDatabase('test.db', xapian.DB_CREATE_OR_OVERWRITE)
db.begin_transaction()
doc = xapian.Document()
doc.add_term('foo')
doc.add_value(0, '101')
db.add_document(doc)
doc.add_value(0, '102')
db.add_document(doc)
doc.add_value(0, '103')
db.add_document(doc)
db.commit_transaction()

enq = xapian.Enquire(db)
enq.set_query(xapian.Query('foo'))

print 'asc=True:'
enq.set_sort_by_value(0, True)
for hit in enq.get_mset(0, 3):
    print hit.docid
    
print
print 'asc=False:'
enq.set_sort_by_value(0, False)
for hit in enq.get_mset(0, 3):
    print hit.docid

Change History (16)

comment:1 by Richard Boulton, 15 years ago

Component: MatcherDocumentation

For reference, that demo gives the following output for me:

asc=True:
3
2
1

asc=False:
1
2
3

I'm guessing you expected the order of the results in each case to be the opposite of that.

The problem is a question of what is meant by ascending. One interpretation could be that as you iterate through the result set of a search in ascending order, the values should ascend. Another interpretation could be that ascending order means that documents which have higher values are considered "better" (ie, get ranked higher). Xapian uses the latter interpretation.

Personally, I think this is a mistake, and it would be less confusing to users if iterating through the result set of an ascending search resulted in the values ascending (and that the default sort order was by descending relevance). However, changing this at this stage would break all existing applications. What we do need to do is improve the documentation to make it clear what Xapian means by "ascending" (so I've changed the component of this bug accordingly).

(For reference, in my xappy API, I define ascending in the opposite way to xapian, since I believe this results in less confusion for users.)

comment:2 by Richard Boulton, 15 years ago

Milestone: 1.1.0

Perhaps the easiest documentation fix would be to change the parameter name from "ascending" to "descending", and to change the parameter documentation from:

ascending 	If true, documents values which sort higher by string compare are better. If false, the sort order is reversed. (default true) 

to

descending 	If true, documents will be returned such that those with values which sort higher by string compare are returned with higher rank (so, as the result set is iterated in forward order, the value will descend).  If false, the sort order is reversed. (default true)

Note that currently, the meaning of "ascending" used by Xapian::Enquire::set_docid_order is that as the mset is iterated in forward order, the docid increases. This is the opposite meaning to that used by set_sort_by_value() (and related methods), so resolving this inconsistency would be good.

comment:3 by Olly Betts, 15 years ago

Hmm, this does seem backwards. Shame you didn't point it out rather than just making xappy different (an inconsistency which is itself potentially confusing...)

We obviously can't just change what the ascending parameter means here or we'll drive existing users mad.

Changing the parameter name isn't a bad option. It could be "reversed" rather than "descending" which might be less confusing (and deals better with any arguments about what "ascending" means here, though I think the current interpretation is odd at best).

Another (for source-level compatibility at least) would be to make this use the same enum as docid order, and provide an overloaded form taking bool for backward compatibility, and perhaps mark that as deprecated. The DONT_CARE option doesn't make sense here, and the type name is docid_order, but it would make code using the API more explicit than passing true or false does.

I'm not sure I have a favoured option currently. I think I need to mull it over.

comment:4 by Richard Boulton, 15 years ago

(I was surprised myself to find I hadn't made a bug report about this.)

Of the options discussed so far, I think just changing the parameter name is the least intrusive in the short term, and I'd favour it.

Longer term, providing an overloaded form using the enum might be a good plan, with the aim of deprecating the existing bool form, since it should lead to more readable code. However, I'd be slightly concerned about whether the overloading implementions for all the languages we provide bindings for can distinguish between bool and int parameters successfully.

comment:5 by Olly Betts, 15 years ago

Status: newassigned

Looking at this again, I think I favour just changing the parameter name to "reversed". It certainly would be a pain to create an API which was hard to wrap sanely, and having to do something if we're passed the DONT_CARE value is ugly.

Any further thoughts?

comment:6 by Richard Boulton, 15 years ago

Changing the parameter to reversed is my preferred solution, too.

comment:7 by Olly Betts, 15 years ago

Hmm, sadly "reversed" doesn't really work, as the default value is true. That seems like it would be very confusing - we're saying the search is reversed by default.

So I guess "descending" is the best option unless someone has a better idea...

comment:8 by Richard Boulton, 15 years ago

Ah. I change my vote to "descending", then. :)

comment:9 by Olly Betts, 15 years ago

We could go for "reversed", split the function into two overloaded forms rather than having a default parameter, and deprecate the single argument form. In the more distant future (2.0, say) we could consider adding a default parameter value of reversed=false. I think the end result is probably the best API here. The API deprecation is an annoyance for users so we shouldn't do this lightly, though sorting isn't used by everyone and the change required is at least simple and the updated code will work on existing releases.

The other sane alternative I can see is to go for "descending", and keep the default value of "true". This would make the default sort order (if you don't specify the direction):

  1. zebra
  2. yacht
  3. x-ray
  4. ...

or for numbers:

  1. 9
  2. 8
  3. 7
  4. ...

which I think isn't so natural.

Of these, I think the first is my current preference. Thoughts? Any other options anyone can see?

comment:10 by Richard Boulton, 15 years ago

I quite like the idea of deprecating the single parameter version - possibly even permanently. It seems reasonable to require the direction of a sort to be specified, and I'm not sure it's ever going to be completely obvious which direction the default direction should be.

However, for the same reason, I'm not very keen on the name "reversed"; if it's not clear what the default direction should be, it's not clear which way it is when reversed, either.

I'd like to go for changing the name to "descending" immediately (since it's really a bug that it's named ascending), and deprecating the single-argument form. In 2.0, if we add a default argument back, we can add it back with a value of false (this only works if there are several release series between 1.1 and 2.0, so that it is reasonable to expect all applications to have changed). I don't think adding the default parameter back is not a terribly important step, though.

Alternatively, we could deprecate all the set_sort_by_* functions which currently take an "ascending" parameter, and make a parallel set of functions with the signature which we actually want. "set_sorting_by_value_then_relevance(sort_key, descending=false)" for example. I'm not particularly keen on this, but mention it as another idea for consideration...

comment:11 by Olly Betts, 15 years ago

Component: DocumentationLibrary API

In comment:6 you preferred reversed though!

I think there's definite "prior art" as to a standard default sort direction (ascending) and for describing the opposite as "reverse". For example, the UNIX sort program defaults to ascending with "-r" for reverse (the GNU version even has "--reverse" as the long form of "-r"); C++'s std::sort defaults to ascending; Perl's sort() function does too; Python's sorted function and sort method do (and the sort method has an optional "reverse" parameter); PHP's sort() is ascending by default (it offers an "rsort" function to sort in reverse order). Ascending is also the order used in dictionaries and book indices.

In fact, I think the only source of confusion here really is the current parameter name and its default.

In 2.0, if we add a default argument back, we can add it back with a value of false (this only works if there are several release series between 1.1 and 2.0, so that it is reasonable to expect all applications to have changed).

Yes, that's why I said "more distant future". If 2.0 ends up immediately following 1.2, that's probably too soon, though at least the major version bump suggests incompatible changes.

I don't think adding the default parameter back is not a terribly important step, though.

I'm guessing the "not" isn't intended there.

But I believe there is natural default here, and I've not been able to come up with any existing sorting API which doesn't have a default order, other than C's qsort() which is probably just because C doesn't have default arguments. Even there, if you make the comparison function just call strcmp(), you get ascending order.

"set_sorting_by_value_then_relevance(sort_key, descending=false)" for example. I'm not particularly keen on this, but mention it as another idea for consideration...

But then you can't write code which work on both 1.0.x and 1.2.x without either using the deprecated form (and getting a warning) or having both versions and switching with the preprocessor, which is clumsy and doesn't work for the bindings.

Marking as "API" again, since we're intending to change code rather than add warnings to the documentation about the current odd API.

comment:12 by Richard Boulton, 15 years ago

I've managed to get myself thoroughly confused here.

I think the first option in comment:9 is quite acceptable.

comment:13 by Olly Betts, 15 years ago

Resolution: fixed
Status: assignedclosed

OK, fixed in trunk r11913.

comment:14 by Olly Betts, 15 years ago

Milestone: 1.1.01.0.11

Backported the parameter rename for 1.0.11 in r12091, and added a note recommending not to use the default value.

comment:15 by Daniel Ménard, 15 years ago

Just a note to point out that the logic seems to be the opposite for MultiValueSorter:

MultiValueSorter::add(Xapian::valueno valno, bool forward=true); Enquire::set_sort_by_key (Xapian::Sorter *sorter, bool reverse=true);

(Sorry to comment on a closed ticket, it was not worth a new issue)

comment:16 by Olly Betts, 15 years ago

Hmm, well spotted.

The default direction is good for !MultiValueSorter::add(), but the parameter name is annoyingly inconsistent.

Not sure I see a good fix other than renaming the class and/or method.

I think we probably do want a new ticket for this - I'll open one.

Note: See TracTickets for help on using tickets.