Opened 14 years ago
Last modified 21 months ago
#508 assigned enhancement
OP_PHRASE should directly support non-leaf subqueries
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Currently Xapian does support OP_PHRASE with non-leaf subqueries, like this:
A OP_PHRASE (B OP_OR C)
But it handles with by converting it to:
(A OP_PHRASE B) OP_OR (A OP_PHRASE C)
This works, but it would be more efficient for OP_OR, etc to support merging of the positional information of their subqueries so that OP_PHRASE could work directly with non-leaf subqueries.
Marking as 1.2.x since this is essentially an internal implementation change so there aren't API or ABI compatibility concerns.
Attachments (1)
Change History (19)
comment:1 by , 14 years ago
Description: | modified (diff) |
---|
comment:2 by , 13 years ago
Milestone: | 1.2.x → 1.3.1 |
---|---|
Status: | new → assigned |
comment:3 by , 12 years ago
Milestone: | 1.3.1 → 1.3.2 |
---|
comment:5 by , 10 years ago
Milestone: | 1.3.3 → 1.3.4 |
---|
comment:6 by , 9 years ago
As noted above, this doesn't involve an ABI change.
However, IIRC 1.3.x no longer supports even the distribution of the operator indicated in the description, so this is probably currently a regression compared to 1.2.x.
comment:7 by , 9 years ago
Milestone: | 1.3.4 → 1.3.x |
---|
comment:8 by , 9 years ago
However, IIRC 1.3.x no longer supports even the distribution of the operator indicated in the description
Tested and confirmed.
For OP_OR
, we can have an OrPositionList
which merges the positions lists (much like OrPostList
) , but there's not an obvious matching interpretation for all the operators - e.g. (a AND b) NEAR c
can't really require a
and b
occur at the same position, as generally only one term occurs at each position. Expanding this to (a NEAR c) AND (b NEAR c)
as 1.2.x does makes sense (though if we're clever we create the equivalent (a AND b AND c)
plus two positional filters which can get hoisted).
comment:9 by , 9 years ago
Version: | SVN trunk → git master |
---|
Looks like currently this case segfaults! Reproducible with:
q = xapian.Query(xapian.Query.OP_PHRASE, [xapian.Query(xapian.Query.OP_OR, "a", "b"), "c"]) db = xapian.Database("../../xapian-core/tests/.chert/db=etext") enq = xapian.Enquire(db) enq.set_query(q) m = enq.get_mset(0, 10)
It looks like the issue here is that a OR b
decays to a
(since b
isn't a term in that database), and the OrPostList
gets deleted, but the hoisted phrase check still uses the OrPostList
...
comment:10 by , 9 years ago
The attached patch adds OrPositionList
, which seems to work aside from the same issue if the OR
decays.
comment:11 by , 9 years ago
I noticed that expanding (a AND_NOT b) NEAR c
to (a NEAR c) AND_NOT b
seems more natural than expanding it to (a NEAR c) AND_NOT (b NEAR c)
.
comment:12 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|
[0a01a2f6f309bd1477233937f580639b144e167b/git] addresses the segfault by checking for and rejecting non-leaf subqueries of OP_PHRASE
and OP_NEAR
with UnimplementedError
.
Such cases should be supported, but it's not worth holding up 1.4.0 for now the segfault is addressed.
comment:13 by , 8 years ago
This branch supports OP_OR with 2 subqueries under OP_NEAR and OP_PHRASE:
https://github.com/ojwb/xapian/tree/orpositionlist
The 2 subquery restriction needs a bit of work on the new OrPositionList
class to fix - the patch attached here used a series of pairwise OrPositionList
objects, but the new patch needs a single one instead - the class needs reworking to handle that.
comment:14 by , 8 years ago
Milestone: | 1.4.x → 1.4.3 |
---|
I've pushed a commit to the above branch which handles any sub-tree using only OP_OR as a subquery of OP_NEAR and OP_PHRASE.
Marking to consider for next 1.4.x release.
comment:15 by , 8 years ago
JFD mentioned the cases of OP_PHRASE
and OP_NEAR
used in subqueries, which have an obvious meaning (approximately they occur where they match at the location of each term for that match). Currently the positional check exits as soon as it finds a match in a given document - when used in a subquery it would need to (potentially) keep going and find all matches.
comment:16 by , 8 years ago
OP_OR handling branch applied to master in dba350ba15b7eaf315d5270ac3eca4deb40f4b93 and 55e12a7a0b68b49b209945caa708eaab64dfb7fa, and those backported for 1.4.3 as 8ba3f148713fe0dbb693a0f1f57907c05cbde581 and 076d877224200ee70452fad104a61c0acc63df78.
comment:18 by , 21 months ago
Milestone: | 1.4.x → 2.0.0 |
---|
It's time we got 1.3.1 out.