Opened 20 years ago

Last modified 4 years ago

#48 assigned enhancement

Support term ranges

Reported by: Olly Betts Owned by: Olly Betts
Priority: low Milestone:
Component: Library API Version: git master
Severity: minor Keywords: GoodFirstBug
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Extend OP_WILDCARD (or add a new operator) to support zip..zoo style ranges of terms to wildcard over.

Possible API:

Query(op, const string&, const string&)

The implementation would be like OP_WILDCARD but simpler because there's no need to check wildcard patterns, just create an allterms iterator, skip_to(range_start) then iterate until we hit reach range_end (or the end iterator.


Original description:

Provide explicit support for range searches, such as "RangePostList" - combine a sequence of adjacent terms...

Change History (18)

comment:1 by Olly Betts, 20 years ago

Severity: blockernormal
Status: newassigned

comment:2 by Richard Boulton, 18 years ago

Cc: richard@… added
Component: otherBackend-Flint

I could do with having this implemented soon, so will look at putting a patch for this together in the next two weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.

I imagine the interface to this would be a new query operator, which takes two subqueries which must be terms, and returns all documents matched by any term in the range defined by (and including) those terms.

comment:3 by Olly Betts, 18 years ago

Cc: olly@… added; richard@… removed
Owner: changed from Olly Betts to Richard Boulton
Status: assignednew

I'd not really thought about what the API would look like, but your suggested API sounds good.

Perhaps there should be a variant where the end of the range is exclusive? Then right truncation could map to this new Query OP, which would potentially allow us to optimise it in the backend.

comment:4 by Richard Boulton, 18 years ago

Status: newassigned

comment:5 by Olly Betts, 17 years ago

op_sys: otherAll
rep_platform: OtherAll
Severity: normalenhancement
Version: otherSVN HEAD

Did you come up with a patch?

comment:6 by Richard Boulton, 17 years ago

No, I never got round to making a patch - I ended up finding another way to do what I needed. Feel free to reassign the bug if you like.

comment:7 by Olly Betts, 17 years ago

Component: Backend-FlintLibrary API

I've just committed a related (and perhaps more generally useful) new feature - OP_VALUE_RANGE - which allows restricting a query to documents which have a *document value* between two specified limits. Currently both limits are inclusive.

comment:8 by Olly Betts, 17 years ago

Operating System: All
Priority: highlow

comment:9 by Richard Boulton, 17 years ago

After using OP_VALUE_RANGE for a while, I think it covers many of the cases that fixing this bug would help with (and also several others). It's also better to just access the value, rather than look through an arbitrarily large number of terms at search time.

The remaining use case I see for a RangePostList is for supporting efficient truncation wildcards (or the equivalent partial query term matches). An initial implementation could just build up a set of sub postlists for every term in the database covered by the range, but it might be possible to optimise this later (perhaps by storing special terms to cover a specific groups of sub-terms).

I've experimented manually with storing "acceleration" terms with an I prefix for the first character of every term: so the word "patch" would produce the terms "Ip" and "patch". Then, a RangePostList for the range "octopus" .. "queen" would search for the combination of <all terms starting with "o" sorting after "octopus"> "Ip" <all terms starting with "q" before "queen">. In other words, we'd remove the need to combine all "p*" terms at search time, by combining them at index time.

For this to work usefully, though, we'd need to work out which groups of terms to combine at index time, rather than just grouping by the first letter of a term. This could either be done in a way independent of term distribution (so that we know which terms to generate when parsing the first document), or we could do a pass after the main indexing step to add these "acceleration" terms to the database. The latter approach would allow an even distribution of "acceleration" terms; perhaps it would be appropriate to have an acceleration term for every prefix of a term which matches more than, say, 10 terms.

At search time, the RangePostList doesn't need to know how the acceleration terms are distributed - it just tests for the existence of an acceleration term for each prefix it is using; if it finds one, it assumes that that covers all terms with that prefix; if it doesn't, it iterates through the alltermlist to expand that prefix.

comment:10 by Olly Betts, 17 years ago

Some search systems actually allow search for zit..zoom which acts like z* but only for terms between the two limits. I'm unconvinced it's particularly useful though - wildcarding offers a "poor man's stemming", and also allows searching when you are unsure of the spelling (or fear the documents may not have the correct spelling), but I don't really see what practical use a more general term range really has. So it really only seems to have ticklist value as a user visible feature.

However, allowing a wildcard to be represented as a Query object might help us to optimise it better - it's worth looking at at some point. If we do that, then we can probably offer a general term range for little extra work.

Generating "I" terms based on query logs might be the best approach - there's certainly something to be said for doing that for phrases (e.g. e-mail would probably get its own term) and if we're going to offer query log analysis guided indexing at some point, the more we can get out of it the better.

comment:12 by Olly Betts, 14 years ago

Description: modified (diff)
Milestone: 1.3.0

Marking for 1.3.0 for now.

comment:13 by Olly Betts, 14 years ago

Thinking about API briefly, I think Query(op, const string&, const string&) is saner than taking two subqueries which are terms - we don't want two wqf values (and if we want one, having an optional 4th parameter seems sanest).

The end of the range might also want to be inclusive (zip..zoo) or exclusive (zo*).

comment:14 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

comment:15 by Olly Betts, 10 years ago

Cc: Olly Betts removed
Milestone: 1.3.x1.3.3
Owner: changed from Richard Boulton to Olly Betts
Priority: lownormal
Status: assignednew

I'm working on delaying expansion of wildcards to when get_mset() is called, by allowing them to be explicitly represented as a Query object, which essentially covers this ticket.

comment:16 by Olly Betts, 9 years ago

Status: newassigned

comment:17 by Olly Betts, 9 years ago

Description: modified (diff)
Milestone: 1.3.31.3.x
Summary: RangePostListSupport term ranges

[9888fdbb8c785949d20e34fa013744c135c97084] implements OP_WILDCARD which lazily expands, so the tree of PostList objects is created directly rather than having an intermediate tree of Query objects.

I think that covers the zo* case above - literally adding a RangePostList class would need backend support to be worthwhile I think, and at least with the current backends that's not going to be more efficient that the solution I've implemented, but it would need support adding to every backend.

Allowing a non-prefixed range of terms (e.g. zip..zoo) isn't currently supported, but is probably worth doing at some point.

Last edited 8 years ago by Olly Betts (previous) (diff)

comment:18 by Olly Betts, 9 years ago

Milestone: 1.3.x1.4.x

This isn't worth holding up 1.4.0 for.

comment:19 by Olly Betts, 4 years ago

Description: modified (diff)
Keywords: GoodFirstBug added
Milestone: 1.4.x
Priority: normallow
Version: SVN trunkgit master

I don't recall any user requests for term ranges, so I don't think this is 1.4.x material now. Some other systems do support such things, so presumably they're useful to somebody, and this shouldn't be a lot of work to implement.

Note: See TracTickets for help on using tickets.