Opened 16 years ago

Closed 14 years ago

#245 closed defect (fixed)

All-stopword queries with two or more terms should ignore stopword list

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.2.3
Component: QueryParser Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Currently, if a single word query is parsed, and that word is a stopword, the stopwording is ignored. However, if a multiple word query is parsed, and all words are stopwords, the stopwording is applied (resulting in an empty query).

If all the words in the query are stopwords, I think it may make sense to ignore the stopwording. However, even if we decide to apply the stopwording in this case, we should be consistent in our behaviour.

Some examples, in python:

>>> import xapian
>>> s=xapian.SimpleStopper()
>>> s.add('foo')
>>> s.add('bar')
>>> qp=xapian.QueryParser()
>>> qp.set_stopper(s)
>>> str(qp.parse_query('foo'))
'Xapian::Query(foo:(pos=1))'
>>> str(qp.parse_query('foo foo'))
'Xapian::Query()'
>>> str(qp.parse_query('foo bar'))
'Xapian::Query()'

Either the first parse_query() call should return Xapian::Query(), or the later ones should return non-empty queries.

Attachments (2)

qp-allstop.patch (3.0 KB ) - added by Olly Betts 16 years ago.
Simple approach to fixing this
bug-245-fix.patch (10.8 KB ) - added by Olly Betts 14 years ago.
More sophisticated fix, but a bit ugly

Download all attachments as: .zip

Change History (11)

comment:1 by Olly Betts, 16 years ago

Status: newassigned
Summary: Inconsistent behaviour if all words in the query are stopwords.All-stopword queries with two or more terms should ignore stopword list

The bug is that all-stopword queries with two or more terms aren't handled specially - the single term case was easy to handle, while the multi-term case seemed trickier for some reason, and I only handled the easy one so far.

There's actually a FIXME comment about this in queryparser.lemony which probably shows where the fix should go:

FIXME what if E && E->empty() (all terms are stopwords)?

comment:2 by Olly Betts, 16 years ago

Owner: changed from New Bugs to Olly Betts
Status: assignednew

by Olly Betts, 16 years ago

Attachment: qp-allstop.patch added

Simple approach to fixing this

comment:3 by Olly Betts, 16 years ago

Operating System: All
Status: newassigned

comment:5 by Olly Betts, 14 years ago

Description: modified (diff)
Milestone: 1.2.x

Marking for 1.2.x.

comment:6 by Olly Betts, 14 years ago

Milestone: 1.2.x1.2.3

The approach in the patch seems reasonable to me, so I'm going to take another look at this and if I don't see a better fix, I'll update and apply it for 1.2.3.

comment:7 by Olly Betts, 14 years ago

There's a case the patch doesn't handle: the AND an

This doesn't work currently either, so the patch is an improvement, but it would be better to solve this whole issue in one go if we can.

comment:8 by Olly Betts, 14 years ago

Actually the AND an works - the failing testcase is: the AND an a

Which is pretty unlikely to be hit in practice unless the stopword list is rather long, but the error message is confusing.

by Olly Betts, 14 years ago

Attachment: bug-245-fix.patch added

More sophisticated fix, but a bit ugly

comment:9 by Olly Betts, 14 years ago

New patch seems to fix all the cases I've checked without breaking more, but feels rather clumsy in places. It's bedtime now, but I'll have a look when I'm more awake and see if it can be done more elegantly.

comment:10 by Olly Betts, 14 years ago

Description: modified (diff)
Resolution: fixed
Status: assignedclosed

I committed the patch (with comment improvements) in trunk r14845. Maybe there's a neater way, but it does at least work.

I think this probably isn't worth backporting to 1.0 at this point - I've not seen any feedback from end users on this (unless that was what triggered your report?)

Note: See TracTickets for help on using tickets.