Opened 9 years ago

Last modified 3 years ago

#508 assigned enhancement

OP_PHRASE should directly support non-leaf subqueries

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 1.4.x
Component: Matcher Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Currently Xapian does support OP_PHRASE with non-leaf subqueries, like this:

A OP_PHRASE (B OP_OR C)

But it handles with by converting it to:

(A OP_PHRASE B) OP_OR (A OP_PHRASE C)

This works, but it would be more efficient for OP_OR, etc to support merging of the positional information of their subqueries so that OP_PHRASE could work directly with non-leaf subqueries.

Marking as 1.2.x since this is essentially an internal implementation change so there aren't API or ABI compatibility concerns.

Attachments (1)

xapian-orpositionlist.patch (8.5 KB ) - added by Olly Betts 4 years ago.
Patch to add OrPositionList

Download all attachments as: .zip

Change History (18)

comment:1 by Olly Betts, 9 years ago

Description: modified (diff)

comment:2 by Olly Betts, 8 years ago

Milestone: 1.2.x1.3.1
Status: newassigned

comment:3 by Olly Betts, 7 years ago

Milestone: 1.3.11.3.2

It's time we got 1.3.1 out.

comment:4 by Olly Betts, 5 years ago

Milestone: 1.3.21.3.3

Also well past time we got 1.3.2 out.

comment:5 by Olly Betts, 5 years ago

Milestone: 1.3.31.3.4

comment:6 by Olly Betts, 4 years ago

As noted above, this doesn't involve an ABI change.

However, IIRC 1.3.x no longer supports even the distribution of the operator indicated in the description, so this is probably currently a regression compared to 1.2.x.

comment:7 by Olly Betts, 4 years ago

Milestone: 1.3.41.3.x

comment:8 by Olly Betts, 4 years ago

However, IIRC 1.3.x no longer supports even the distribution of the operator indicated in the description

Tested and confirmed.

For OP_OR, we can have an OrPositionList which merges the positions lists (much like OrPostList) , but there's not an obvious matching interpretation for all the operators - e.g. (a AND b) NEAR c can't really require a and b occur at the same position, as generally only one term occurs at each position. Expanding this to (a NEAR c) AND (b NEAR c) as 1.2.x does makes sense (though if we're clever we create the equivalent (a AND b AND c) plus two positional filters which can get hoisted).

Last edited 4 years ago by Olly Betts (previous) (diff)

comment:9 by Olly Betts, 4 years ago

Version: SVN trunkgit master

Looks like currently this case segfaults! Reproducible with:

q = xapian.Query(xapian.Query.OP_PHRASE, [xapian.Query(xapian.Query.OP_OR, "a", "b"), "c"])
db = xapian.Database("../../xapian-core/tests/.chert/db=etext")
enq = xapian.Enquire(db)
enq.set_query(q)
m = enq.get_mset(0, 10)

It looks like the issue here is that a OR b decays to a (since b isn't a term in that database), and the OrPostList gets deleted, but the hoisted phrase check still uses the OrPostList...

by Olly Betts, 4 years ago

Attachment: xapian-orpositionlist.patch added

Patch to add OrPositionList

comment:10 by Olly Betts, 4 years ago

The attached patch adds OrPositionList, which seems to work aside from the same issue if the OR decays.

comment:11 by Olly Betts, 4 years ago

I noticed that expanding (a AND_NOT b) NEAR c to (a NEAR c) AND_NOT b seems more natural than expanding it to (a NEAR c) AND_NOT (b NEAR c).

comment:12 by Olly Betts, 4 years ago

Milestone: 1.3.x1.4.x

[0a01a2f6f309bd1477233937f580639b144e167b/git] addresses the segfault by checking for and rejecting non-leaf subqueries of OP_PHRASE and OP_NEAR with UnimplementedError.

Such cases should be supported, but it's not worth holding up 1.4.0 for now the segfault is addressed.

comment:13 by Olly Betts, 3 years ago

This branch supports OP_OR with 2 subqueries under OP_NEAR and OP_PHRASE:

https://github.com/ojwb/xapian/tree/orpositionlist

The 2 subquery restriction needs a bit of work on the new OrPositionList class to fix - the patch attached here used a series of pairwise OrPositionList objects, but the new patch needs a single one instead - the class needs reworking to handle that.

comment:14 by Olly Betts, 3 years ago

Milestone: 1.4.x1.4.3

I've pushed a commit to the above branch which handles any sub-tree using only OP_OR as a subquery of OP_NEAR and OP_PHRASE.

Marking to consider for next 1.4.x release.

comment:15 by Olly Betts, 3 years ago

JFD mentioned the cases of OP_PHRASE and OP_NEAR used in subqueries, which have an obvious meaning (approximately they occur where they match at the location of each term for that match). Currently the positional check exits as soon as it finds a match in a given document - when used in a subquery it would need to (potentially) keep going and find all matches.

comment:17 by Olly Betts, 3 years ago

Milestone: 1.4.31.4.x

The rest of this ticket isn't for 1.4.3.

Note: See TracTickets for help on using tickets.