Ticket #48 (assigned enhancement)

Opened 4 years ago

Last modified 15 months ago

RangePostList

Reported by: olly Owned by: richard
Priority: low Milestone:
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: olly Blocked By:
Operating System: All Blocking:

Description

Provide explicit support for range searches, such as "RangePostList?" - combine a sequence of adjacent terms...

Change History

Changed 4 years ago by olly

  • status changed from new to assigned
  • severity changed from blocker to normal

Changed 2 years ago by richard

  • cc richard@… added
  • component changed from other to Backend-Flint

I could do with having this implemented soon, so will look at putting a patch for this together in the next two weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.

I imagine the interface to this would be a new query operator, which takes two subqueries which must be terms, and returns all documents matched by any term in the range defined by (and including) those terms.

Changed 2 years ago by olly

  • cc olly@… added; richard@… removed
  • owner changed from olly to richard
  • status changed from assigned to new

I'd not really thought about what the API would look like, but your suggested API sounds good.

Perhaps there should be a variant where the end of the range is exclusive? Then right truncation could map to this new Query OP, which would potentially allow us to optimise it in the backend.

Changed 2 years ago by richard

  • status changed from new to assigned

Changed 22 months ago by olly

  • rep_platform changed from Other to All
  • version changed from other to SVN HEAD
  • op_sys changed from other to All
  • severity changed from normal to enhancement

Did you come up with a patch?

Changed 22 months ago by richard

No, I never got round to making a patch - I ended up finding another way to do what I needed. Feel free to reassign the bug if you like.

Changed 21 months ago by olly

  • component changed from Backend-Flint to Library API

I've just committed a related (and perhaps more generally useful) new feature - OP_VALUE_RANGE - which allows restricting a query to documents which have a *document value* between two specified limits. Currently both limits are inclusive.

Changed 20 months ago by olly

  • priority changed from high to low

Changed 20 months ago by trac

  • platform set to All

Changed 15 months ago by richard

After using OP_VALUE_RANGE for a while, I think it covers many of the cases that fixing this bug would help with (and also several others). It's also better to just access the value, rather than look through an arbitrarily large number of terms at search time.

The remaining use case I see for a RangePostList? is for supporting efficient truncation wildcards (or the equivalent partial query term matches). An initial implementation could just build up a set of sub postlists for every term in the database covered by the range, but it might be possible to optimise this later (perhaps by storing special terms to cover a specific groups of sub-terms).

I've experimented manually with storing "acceleration" terms with an I prefix for the first character of every term: so the word "patch" would produce the terms "Ip" and "patch". Then, a RangePostList? for the range "octopus" .. "queen" would search for the combination of <all terms starting with "o" sorting after "octopus"> "Ip" <all terms starting with "q" before "queen">. In other words, we'd remove the need to combine all "p*" terms at search time, by combining them at index time.

For this to work usefully, though, we'd need to work out which groups of terms to combine at index time, rather than just grouping by the first letter of a term. This could either be done in a way independent of term distribution (so that we know which terms to generate when parsing the first document), or we could do a pass after the main indexing step to add these "acceleration" terms to the database. The latter approach would allow an even distribution of "acceleration" terms; perhaps it would be appropriate to have an acceleration term for every prefix of a term which matches more than, say, 10 terms.

At search time, the RangePostList? doesn't need to know how the acceleration terms are distributed - it just tests for the existence of an acceleration term for each prefix it is using; if it finds one, it assumes that that covers all terms with that prefix; if it doesn't, it iterates through the alltermlist to expand that prefix.

Changed 15 months ago by olly

Some search systems actually allow search for zit..zoom which acts like z* but only for terms between the two limits. I'm unconvinced it's particularly useful though - wildcarding offers a "poor man's stemming", and also allows searching when you are unsure of the spelling (or fear the documents may not have the correct spelling), but I don't really see what practical use a more general term range really has. So it really only seems to have ticklist value as a user visible feature.

However, allowing a wildcard to be represented as a Query object might help us to optimise it better - it's worth looking at at some point. If we do that, then we can probably offer a general term range for little extra work.

Generating "I" terms based on query logs might be the best approach - there's certainly something to be said for doing that for phrases (e.g. e-mail would probably get its own term) and if we're going to offer query log analysis guided indexing at some point, the more we can get out of it the better.

Note: See TracTickets for help on using tickets.