Opened 20 years ago
Last modified 5 years ago
#48 assigned enhancement
Support term ranges
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Library API | Version: | git master |
Severity: | minor | Keywords: | GoodFirstBug |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Extend OP_WILDCARD
(or add a new operator) to support zip..zoo
style ranges of terms to wildcard over.
Possible API:
Query(op, const string&, const string&)
The implementation would be like OP_WILDCARD
but simpler because there's no need to check wildcard patterns, just create an allterms iterator, skip_to(range_start)
then iterate until we hit reach range_end
(or the end iterator.
Original description:
Provide explicit support for range searches, such as "RangePostList" - combine a sequence of adjacent terms...
Change History (18)
comment:1 by , 20 years ago
Severity: | blocker → normal |
---|---|
Status: | new → assigned |
comment:2 by , 18 years ago
Cc: | added |
---|---|
Component: | other → Backend-Flint |
comment:3 by , 18 years ago
Cc: | added; removed |
---|---|
Owner: | changed from | to
Status: | assigned → new |
I'd not really thought about what the API would look like, but your suggested API sounds good.
Perhaps there should be a variant where the end of the range is exclusive? Then right truncation could map to this new Query OP, which would potentially allow us to optimise it in the backend.
comment:4 by , 18 years ago
Status: | new → assigned |
---|
comment:5 by , 18 years ago
op_sys: | other → All |
---|---|
rep_platform: | Other → All |
Severity: | normal → enhancement |
Version: | other → SVN HEAD |
Did you come up with a patch?
comment:6 by , 18 years ago
No, I never got round to making a patch - I ended up finding another way to do what I needed. Feel free to reassign the bug if you like.
comment:7 by , 18 years ago
Component: | Backend-Flint → Library API |
---|
I've just committed a related (and perhaps more generally useful) new feature - OP_VALUE_RANGE - which allows restricting a query to documents which have a *document value* between two specified limits. Currently both limits are inclusive.
comment:8 by , 18 years ago
Operating System: | → All |
---|---|
Priority: | high → low |
comment:9 by , 17 years ago
After using OP_VALUE_RANGE for a while, I think it covers many of the cases that fixing this bug would help with (and also several others). It's also better to just access the value, rather than look through an arbitrarily large number of terms at search time.
The remaining use case I see for a RangePostList is for supporting efficient truncation wildcards (or the equivalent partial query term matches). An initial implementation could just build up a set of sub postlists for every term in the database covered by the range, but it might be possible to optimise this later (perhaps by storing special terms to cover a specific groups of sub-terms).
I've experimented manually with storing "acceleration" terms with an I prefix for the first character of every term: so the word "patch" would produce the terms "Ip" and "patch". Then, a RangePostList for the range "octopus" .. "queen" would search for the combination of <all terms starting with "o" sorting after "octopus"> "Ip" <all terms starting with "q" before "queen">. In other words, we'd remove the need to combine all "p*" terms at search time, by combining them at index time.
For this to work usefully, though, we'd need to work out which groups of terms to combine at index time, rather than just grouping by the first letter of a term. This could either be done in a way independent of term distribution (so that we know which terms to generate when parsing the first document), or we could do a pass after the main indexing step to add these "acceleration" terms to the database. The latter approach would allow an even distribution of "acceleration" terms; perhaps it would be appropriate to have an acceleration term for every prefix of a term which matches more than, say, 10 terms.
At search time, the RangePostList doesn't need to know how the acceleration terms are distributed - it just tests for the existence of an acceleration term for each prefix it is using; if it finds one, it assumes that that covers all terms with that prefix; if it doesn't, it iterates through the alltermlist to expand that prefix.
comment:10 by , 17 years ago
Some search systems actually allow search for zit..zoom which acts like z* but only for terms between the two limits. I'm unconvinced it's particularly useful though - wildcarding offers a "poor man's stemming", and also allows searching when you are unsure of the spelling (or fear the documents may not have the correct spelling), but I don't really see what practical use a more general term range really has. So it really only seems to have ticklist value as a user visible feature.
However, allowing a wildcard to be represented as a Query object might help us to optimise it better - it's worth looking at at some point. If we do that, then we can probably offer a general term range for little extra work.
Generating "I" terms based on query logs might be the best approach - there's certainly something to be said for doing that for phrases (e.g. e-mail would probably get its own term) and if we're going to offer query log analysis guided indexing at some point, the more we can get out of it the better.
comment:12 by , 15 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.3.0 |
Marking for 1.3.0 for now.
comment:13 by , 15 years ago
Thinking about API briefly, I think Query(op, const string&, const string&)
is saner than taking two subqueries which are terms - we don't want two wqf values (and if we want one, having an optional 4th parameter seems sanest).
The end of the range might also want to be inclusive (zip..zoo
) or exclusive (zo*
).
comment:14 by , 13 years ago
Milestone: | 1.3.0 → 1.3.x |
---|
comment:15 by , 11 years ago
Cc: | removed |
---|---|
Milestone: | 1.3.x → 1.3.3 |
Owner: | changed from | to
Priority: | low → normal |
Status: | assigned → new |
I'm working on delaying expansion of wildcards to when get_mset() is called, by allowing them to be explicitly represented as a Query object, which essentially covers this ticket.
comment:16 by , 10 years ago
Status: | new → assigned |
---|
comment:17 by , 10 years ago
Description: | modified (diff) |
---|---|
Milestone: | 1.3.3 → 1.3.x |
Summary: | RangePostList → Support term ranges |
[9888fdbb8c785949d20e34fa013744c135c97084] implements OP_WILDCARD which lazily expands, so the tree of PostList objects is created directly rather than having an intermediate tree of Query objects.
I think that covers the zo*
case above - literally adding a RangePostList class would need backend support to be worthwhile I think, and at least with the current backends that's not going to be more efficient that the solution I've implemented, but it would need support adding to every backend.
Allowing a non-prefixed range of terms (e.g. zip..zoo
) isn't currently supported, but is probably worth doing at some point.
comment:19 by , 5 years ago
Description: | modified (diff) |
---|---|
Keywords: | GoodFirstBug added |
Milestone: | 1.4.x |
Priority: | normal → low |
Version: | SVN trunk → git master |
I don't recall any user requests for term ranges, so I don't think this is 1.4.x material now. Some other systems do support such things, so presumably they're useful to somebody, and this shouldn't be a lot of work to implement.
I could do with having this implemented soon, so will look at putting a patch for this together in the next two weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.
I imagine the interface to this would be a new query operator, which takes two subqueries which must be terms, and returns all documents matched by any term in the range defined by (and including) those terms.