Opened 17 years ago

Last modified 22 months ago

#224 new enhancement

Supply and optimise more OP_VALUE_ comparison operators

Reported by: Richard Boulton Owned by: visvesh
Priority: low Milestone: 2.0.0
Component: Matcher Version: git master
Severity: minor Keywords: GoodFirstBug
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

We currently supply OP_VALUE_GE and OP_VALUE_LE. It might be useful to also supply OP_VALUE_GT and OP_VALUE_LT, and possibly OP_VALUE_EQ.

Also, we currently supply a range operator which can be represented in terms of two comparison operators: ie, OP_VALUE_RANGE(1, a, b) is equivalent to: (OP_VALUE_GE(1, a) AND OP_VALUE_LE(1, b)). It would be nice to teach the query optimiser about this equivalence, so that a range specified as two comparison operators would be transformed into an efficient single ValueRangePostList internally, instead of the combination of two Value*Cmp*PostLists. Also, the ValueRangePostList could be made more flexible by adding flags to tell it whether each end of the range is inclusive or not, to allow other combinations of comparison operator to be represented as ranges.

Change History (15)

comment:1 by Richard Boulton, 17 years ago

Operating System: All
Status: newassigned

comment:2 by Olly Betts, 17 years ago

I think OP_VALUE_EQ would be ill-advised. If you want to find things matching an exact category, boolean terms are a lot more efficient, so OP_VALUE_EQ is just a tempting trap to disappoint users - for example:

http://article.gmane.org/gmane.comp.search.xapian.general/5481

If you really really know you want to do this (e.g. small collection, storing values for other reasons, smallest database size really matters) then you can just use OP_VALUE_RANGE with the same start and end.

comment:4 by Olly Betts, 17 years ago

Description: modified (diff)
Owner: changed from New Bugs to Olly Betts
Status: assignednew

comment:5 by Olly Betts, 14 years ago

Description: modified (diff)
Milestone: 1.2.x

Implementation can easily be upwardly API and ABI compatible.

comment:6 by Olly Betts, 12 years ago

Milestone: 1.2.x1.3.x

1.3.x material now, especially as the query internals have been rewritten in trunk.

comment:7 by Olly Betts, 10 years ago

Milestone: 1.3.x1.4.x

This isn't worth holding up 1.4.0 for - we haven't had user requests for such operators, and they could be added cleanly in a stable point release.

comment:8 by Olly Betts, 5 years ago

Keywords: GoodFirstBug added
Version: SVN trunkgit master

comment:9 by Olly Betts, 5 years ago

Type: defectenhancement

comment:10 by visvesh, 5 years ago

I am trying to do the first subpart, i.e. implement OP_VALUE_GT and OP_VALUE_LT. The component is labeled as "Matcher", but according to the API, I am supposed to make modifications to query.h in include directory and few more files in the api directory viz. query.cc, queryinternal.h, queryinternal.cc, and possibly some other files also. Why is the module labeled as "Matcher" instead of "Query"? Also I am unable to figure out the data type of data member internal in query object. Does its data type depend on the constructor called?

Last edited 5 years ago by visvesh (previous) (diff)

comment:11 by Olly Betts, 5 years ago

Owner: changed from Olly Betts to visvesh

Why is the module labeled as "Matcher" instead of "Query"?

The "Component" field in trac is really just one tool trac provides to try to impose a bit of order on the ticket pile, so I wouldn't read too much into what it's set to. It's often the case that a ticket doesn't actually fit completely into a single component. Free-form tagging via the "Keywords" field would allow multiple areas of the code to be specified, but you then get issues like someone using "matching" instead of "matcher" - with "Component" there's a pre-defined list of values to choose from. (One can also set a different default owner for a new ticket based on the component the submitter specifies, but currently we don't make use of that.)

The part you've chosen to look at does involve adding new operators to Query objects, but you'll also need to modify files under matcher/ to actually implement it.

We don't actually have a "Query" component in trac. "Library API" would cover that part, but I think overall the things discussed in this ticket are more about the matcher part than the API part.

Also I am unable to figure out the data type of data member internal in query object. Does its data type depend on the constructor called?

The internal member is a (reference-counted) pointer to a Xapian::Query::Internal object, but that's an abstract base class and it's concrete type will be a subclass of that, depending on the query operator/leaf subquery type (e.g. QueryTerm for a term, QueryAnd for OP_AND).

comment:12 by Boda, 5 years ago

Hey, I would like to work on this

comment:13 by visvesh, 4 years ago

Since the first subpart has been done, I am trying to do the second subpart. My understanding of the problem is as follows: we are optimising the case: Xapian::Query(Xapian::Query::OP_AND, q1, q2) where q1 and q2 are of type OP_VALUE_*CMP* Since Query.internal in this case is of type QueryAnd, when we call query.internal->postlist (line 206 in localSubMatch.cc) we create an AndContext Object (line 1542 in queryinternal.cc). This recursively adds all the postlists that need to be “anded” together and then merges them and returns a single postlist. If the task is to replace the compound query (mentioned above) with a single range query, then I don’t see how query optimiser can achieve because it does not prevent postlist_sub_and_like(the recursive function) from adding the two OP_VALUE_*CMP* queries to the AndContext Object.

comment:14 by Olly Betts, 4 years ago

The query optimiser would need to detect when there are two OP_VALUE_* subqueries on the same value slot in the same AndContext and create a single PostList object for the pair (instead of one for each as currently happens.)

It probably ought to gather up all the subqueries on the same slot even if there are more than two, and eliminate any that are redundant - e.g. in x >= 3 AND x > 2 AND x <= 4 we can ignore x > 2 as it's ensured by x >= 3.

(That's "query optimiser" as in the parts of the code that optimises queries - it wouldn't necessarily be done in the QueryOptimiser class itself, though it might well be.)

comment:15 by visvesh, 4 years ago

In order to get the Postlist's type, slot, begin, and end we would have to parse the string returned by get_description(), so I'm thinking of implementing a parsing function in QueryOptimiser, this would require the sstream standard library. Once that's done I can write another function in QueryOptimiser that does the actual optimisation, and it will use parser function as a helper. Does this approach seem fine?

comment:16 by Olly Betts, 22 months ago

Milestone: 1.4.x2.0.0

In order to get the Postlist's type, slot, begin, and end we would have to parse the string returned by get_description()

No, we shouldn't be parsing descriptions for such things.

Also, this optimisation really should be done before we've created the PostList objects for the OP_VALUE_... operators - it should look at the query internal objects (which will be of classes such as QueryValueLE).

This can be added without an incompatible change to the public ABI so not a blocker for the next release series.

Note: See TracTickets for help on using tickets.