Opened 17 years ago
Last modified 20 months ago
#224 new enhancement
Supply and optimise more OP_VALUE_ comparison operators
Reported by: | Richard Boulton | Owned by: | visvesh |
---|---|---|---|
Priority: | low | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | minor | Keywords: | GoodFirstBug |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
We currently supply OP_VALUE_GE and OP_VALUE_LE. It might be useful to also supply OP_VALUE_GT and OP_VALUE_LT, and possibly OP_VALUE_EQ.
Also, we currently supply a range operator which can be represented in terms of two comparison operators: ie, OP_VALUE_RANGE(1, a, b) is equivalent to: (OP_VALUE_GE(1, a) AND OP_VALUE_LE(1, b)). It would be nice to teach the query optimiser about this equivalence, so that a range specified as two comparison operators would be transformed into an efficient single ValueRangePostList internally, instead of the combination of two Value*Cmp*PostLists. Also, the ValueRangePostList could be made more flexible by adding flags to tell it whether each end of the range is inclusive or not, to allow other combinations of comparison operator to be represented as ranges.
Change History (15)
comment:1 by , 17 years ago
Operating System: | → All |
---|---|
Status: | new → assigned |
comment:2 by , 17 years ago
comment:4 by , 17 years ago
Description: | modified (diff) |
---|---|
Owner: | changed from | to
Status: | assigned → new |
comment:5 by , 14 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.2.x |
Implementation can easily be upwardly API and ABI compatible.
comment:6 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
1.3.x material now, especially as the query internals have been rewritten in trunk.
comment:7 by , 10 years ago
Milestone: | 1.3.x → 1.4.x |
---|
This isn't worth holding up 1.4.0 for - we haven't had user requests for such operators, and they could be added cleanly in a stable point release.
comment:8 by , 5 years ago
Keywords: | GoodFirstBug added |
---|---|
Version: | SVN trunk → git master |
comment:9 by , 5 years ago
Type: | defect → enhancement |
---|
comment:10 by , 5 years ago
I am trying to do the first subpart, i.e. implement OP_VALUE_GT and OP_VALUE_LT. The component is labeled as "Matcher", but according to the API, I am supposed to make modifications to query.h in include
directory and few more files in the api
directory viz. query.cc
, queryinternal.h
, queryinternal.cc
, and possibly some other files also. Why is the module labeled as "Matcher" instead of "Query"? Also I am unable to figure out the data type of data member internal in query
object. Does its data type depend on the constructor called?
comment:11 by , 5 years ago
Owner: | changed from | to
---|
Why is the module labeled as "Matcher" instead of "Query"?
The "Component" field in trac is really just one tool trac provides to try to impose a bit of order on the ticket pile, so I wouldn't read too much into what it's set to. It's often the case that a ticket doesn't actually fit completely into a single component. Free-form tagging via the "Keywords" field would allow multiple areas of the code to be specified, but you then get issues like someone using "matching" instead of "matcher" - with "Component" there's a pre-defined list of values to choose from. (One can also set a different default owner for a new ticket based on the component the submitter specifies, but currently we don't make use of that.)
The part you've chosen to look at does involve adding new operators to Query objects, but you'll also need to modify files under matcher/
to actually implement it.
We don't actually have a "Query" component in trac. "Library API" would cover that part, but I think overall the things discussed in this ticket are more about the matcher part than the API part.
Also I am unable to figure out the data type of data member internal in query object. Does its data type depend on the constructor called?
The internal member is a (reference-counted) pointer to a Xapian::Query::Internal
object, but that's an abstract base class and it's concrete type will be a subclass of that, depending on the query operator/leaf subquery type (e.g. QueryTerm
for a term, QueryAnd
for OP_AND
).
comment:13 by , 4 years ago
Since the first subpart has been done, I am trying to do the second subpart. My understanding of the problem is as follows: we are optimising the case: Xapian::Query(Xapian::Query::OP_AND, q1, q2) where q1 and q2 are of type OP_VALUE_*CMP* Since Query.internal in this case is of type QueryAnd, when we call query.internal->postlist (line 206 in localSubMatch.cc) we create an AndContext Object (line 1542 in queryinternal.cc). This recursively adds all the postlists that need to be “anded” together and then merges them and returns a single postlist. If the task is to replace the compound query (mentioned above) with a single range query, then I don’t see how query optimiser can achieve because it does not prevent postlist_sub_and_like(the recursive function) from adding the two OP_VALUE_*CMP* queries to the AndContext Object.
comment:14 by , 4 years ago
The query optimiser would need to detect when there are two OP_VALUE_*
subqueries on the same value slot in the same AndContext
and create a single PostList
object for the pair (instead of one for each as currently happens.)
It probably ought to gather up all the subqueries on the same slot even if there are more than two, and eliminate any that are redundant - e.g. in x >= 3 AND x > 2 AND x <= 4
we can ignore x > 2
as it's ensured by x >= 3
.
(That's "query optimiser" as in the parts of the code that optimises queries - it wouldn't necessarily be done in the QueryOptimiser
class itself, though it might well be.)
comment:15 by , 4 years ago
In order to get the Postlist's type, slot, begin, and end we would have to parse the string returned by get_description(), so I'm thinking of implementing a parsing function in QueryOptimiser, this would require the sstream standard library. Once that's done I can write another function in QueryOptimiser that does the actual optimisation, and it will use parser function as a helper. Does this approach seem fine?
comment:16 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|
In order to get the Postlist's type, slot, begin, and end we would have to parse the string returned by get_description()
No, we shouldn't be parsing descriptions for such things.
Also, this optimisation really should be done before we've created the PostList
objects for the OP_VALUE_...
operators - it should look at the query internal objects (which will be of classes such as QueryValueLE
).
This can be added without an incompatible change to the public ABI so not a blocker for the next release series.
I think OP_VALUE_EQ would be ill-advised. If you want to find things matching an exact category, boolean terms are a lot more efficient, so OP_VALUE_EQ is just a tempting trap to disappoint users - for example:
http://article.gmane.org/gmane.comp.search.xapian.general/5481
If you really really know you want to do this (e.g. small collection, storing values for other reasons, smallest database size really matters) then you can just use OP_VALUE_RANGE with the same start and end.