#93 closed enhancement (released)
QueryParser: allow mapping a field to multiple term prefixes
Reported by: | Daniel Ménard | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | QueryParser | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Olly Betts, Richard Boulton | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Xapian's QueryParser class already supports mapping multiple fields to the same prefix.
Supporting the opposite (mapping a single field to multiple prefixes) would be a useful enhancement.
As an example, it would make it possible to offer the user a single 'artist/title/track' search box on a music database.
URL: http://thread.gmane.org/gmane.comp.search.xapian.general/3268
Change History (12)
comment:1 by , 18 years ago
Component: | Other → Library API |
---|---|
Status: | new → assigned |
comment:2 by , 18 years ago
comment:3 by , 18 years ago
I don't think we should deprecate the default_prefix parameter - it's not a crime to allow things to be done in different ways, and this is an obvious and simple interface to the functionality (qp.add_prefix("", "blah") is logical, but obscure).
Having the default prefix as a parameter also easily allows the user to call the same QueryParser object several times with different default prefixes without having to mess around changing the prefix settings each time.
comment:4 by , 18 years ago
Cc: | added |
---|
comment:5 by , 18 years ago
The ability to set multiple prefixes for the "" field is something that would be very useful for us.
However, I think that it differs a bit from (or is a generalization of?) what I was originally asking for.
I can't speak at the "implementation" level, but perhaps giving some higher-level details about what I was asking for will help.
We have a database which contains bibliographical records with fields like authors, organizations, titles (in many languages), dates, periodical, keywords (in many languages), abstract and many others.
The user should be able to search on any field (e.g. author:smith and date:2007), so each field is indexed with a specific prefix (XAUTHORS:smith, XDATE:2007 and so on), but she can also search on the whole "record" without specifying any field (e.g. smith 2007). For now, we duplicate all terms and postings, adding both the prefixed term and the unprefixed one to the index (smith, XAUTHORS:smith, 2007, XDATE:2007 and so on).
This is where Richard's proposal is interesting: being able to specify which fields are queried by unprefixed query terms would allow me to cut down the size of the database.
Additionally, the "admin" of the database can define "pseudo indices" to "facilitate" user search. For example, he should be able to specify that:
ti=main title+subtitle+title translation
au=main authors+secondary authors+corporate authors
kwd=english descriptors+french descriptors+free keywords
body=ti+au+kwd+abstract
...
so that a query like "au:bdsp and kwd:health" would search in the fields specified.
We can't handle this at indexing time, because it would mean duplicating the postings again and again (for example a word appearing in the main title would give the terms "word", "XMAINTITLE:word", "XTI:word", "XBODY:word", etc.) and also because we need the freedom to change these aliases without having to re-index the whole database. As an example, the mappings could be user-chosen: a regular user of PubMed (a popular health database) would be happy if we set up our database with the same mappings as those he is used to (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip).
For now, to handle this, we "tweak" the user query, doing string replacements (using the OR operator) before submitting the query to Xapian. This is not very satisfying: it can lead to huge queries (though Xapian "eats" them very well!) and can easily generate bad queries.
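To make the workaround concrete, here is a rough Python sketch of that kind of query rewriting. It is not our actual code and it does not use any Xapian API: the alias table, the field names and the helpers `resolve` and `expand_query` are all made up for illustration, and the regular expression only handles simple `field:word` tokens.

```python
import re

# Hypothetical alias table: each pseudo-index maps to real fields or to
# other aliases. None of these names come from a real database.
ALIASES = {
    "ti": ["maintitle", "subtitle", "titletrans"],
    "au": ["mainauthors", "secauthors", "corpauthors"],
    "body": ["ti", "au", "abstract"],
}


def resolve(field, seen=frozenset()):
    """Flatten an alias into the ordered list of real fields it covers."""
    if field in seen:
        raise ValueError(f"alias cycle at {field!r}")
    if field not in ALIASES:
        return [field]  # already a real field
    out = []
    for target in ALIASES[field]:
        for real in resolve(target, seen | {field}):
            if real not in out:
                out.append(real)
    return out


def expand_query(query):
    """Rewrite each alias:word token as an OR group over the real fields."""
    def repl(match):
        field, word = match.group(1), match.group(2)
        fields = resolve(field)
        if fields == [field]:
            return match.group(0)  # not an alias, leave the token alone
        return "(" + " OR ".join(f"{f}:{word}" for f in fields) + ")"

    return re.sub(r"\b(\w+):(\w+)", repl, query)
```

Even in this toy form the weakness is visible: a nested alias such as `body` multiplies every query word by the number of real fields, and anything the regular expression does not anticipate (phrases, brackets, wildcards) is likely to produce a malformed query.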
If Xapian were able to map a single field to multiple prefixes, that would be a far better way to do what we need.
PS: where can I read about the "synonym" operator Richard mentioned? It seems I missed the thread where it was discussed.
comment:6 by , 18 years ago
Re: synonyms - I'm not sure it's been discussed publicly, or at least, not for a very long time. Bug #50 is a long-standing wishlist item about it, and I remember discussing the idea with Olly back when we shared an office in 2000 or 2001.
The idea is just that there would be an operator which worked similarly to OR, but returned statistics (and thus weights) as if all the terms involved were actually stored in the database as a single term. This ought, theoretically, to give better weightings in situations where the terms are actually different aspects of a meaningful "meta-term". I'm not sure when, or if, we'll get round to implementing it: OR works well enough for a lot of situations.
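A toy illustration of the difference in statistics, as a minimal Python sketch. Nothing here is Xapian code: the postings, the collection size and the `idf` formula are assumptions picked only to show the effect, and `synonym_stats` is a hypothetical helper.

```python
import math

# Toy postings (assumed data): term -> {doc_id: within-document frequency}
POSTINGS = {
    "XAUTHORS:smith": {1: 1, 2: 1},
    "XTITLE:smith": {2: 2, 3: 1},
}
N_DOCS = 10  # assumed collection size


def synonym_stats(terms):
    """Merge postings as if all the terms had been indexed as one term."""
    merged = {}
    for term in terms:
        for doc_id, wdf in POSTINGS.get(term, {}).items():
            merged[doc_id] = merged.get(doc_id, 0) + wdf
    return merged


def idf(doc_freq):
    """A deliberately simple inverse document frequency, for illustration."""
    return math.log(N_DOCS / doc_freq)


merged = synonym_stats(["XAUTHORS:smith", "XTITLE:smith"])
separate_idfs = [idf(len(docs)) for docs in POSTINGS.values()]
merged_idf = idf(len(merged))
```

With OR, each term is weighted from its own document frequency (2 in both cases here). Treated as a single "meta-term", the document frequency is the size of the union (3), so the merged term is weighted as slightly more common, and document 2 gets a combined within-document frequency of 3.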
comment:7 by , 17 years ago
Cc: | added |
---|---|
Owner: | changed from | to
Status: | assigned → new |
I'm taking a look at this now, so assigning to me.
comment:8 by , 17 years ago
Component: | Library API → QueryParser |
---|---|
Status: | new → assigned |
Summary: | [Wishlist bug] QueryParser : mapping a field to multiple prefixes → QueryParser: allow mapping a field to multiple term prefixes |
Version: | 0.9.6 → SVN HEAD |
Adjust component, version, summary.
comment:9 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
This is now implemented in SVN HEAD. (You need to use a new form of add_prefix(), which takes a third argument - the existing behaviour of add_prefix(field, prefix) and add_boolean_prefix() hasn't been changed for backward compatibility reasons.)
My suggested extension, allowing unprefixed terms to be mapped so that the default field can use multiple prefixes, has also been implemented. As Olly suggested, if a value is specified for the default_prefix parameter as well, this is simply added to the list of default prefixes.
comment:10 by , 17 years ago
Blocking: | 200 added |
---|
Postponing this change to 1.0.4, so marking to block that.
comment:12 by , 17 years ago
Blocking: | 200 removed |
---|---|
Operating System: | → All |
A natural extension of this might be to allow _unprefixed_ fields to be mapped to multiple prefixes.
Currently, unprefixed fields are always mapped to a single prefix (specified by the default_prefix parameter in parse_query).
I would envisage this working such that:
unprefixed query words would become two terms; one with an "A" prefix, and one without.
It might then be appropriate to deprecate the default_prefix parameter supplied to parse_query. Alternatively / meanwhile, backwards compatible handling could be implemented by initially considering the list of prefixes for "" to be empty, and by (temporarily) appending the default_prefix parameter to this list while parsing the query.
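The backwards-compatible handling described above could be modelled roughly as follows. This is only a Python sketch of the intended semantics, not the QueryParser implementation or its real signatures: the class `PrefixMap`, the method `terms_for` and the "XT" prefix are hypothetical.

```python
class PrefixMap:
    """Toy model of a field -> list-of-prefixes mapping (not Xapian's API)."""

    def __init__(self):
        self._prefixes = {}

    def add_prefix(self, field, prefix):
        # Repeated calls for the same field accumulate prefixes.
        prefixes = self._prefixes.setdefault(field, [])
        if prefix not in prefixes:
            prefixes.append(prefix)

    def terms_for(self, field, word, default_prefix=None):
        # Work on a copy so default_prefix only applies to this one call.
        prefixes = list(self._prefixes.get(field, []))
        if field == "" and default_prefix is not None:
            if default_prefix not in prefixes:
                prefixes.append(default_prefix)
        if field == "" and not prefixes:
            prefixes = [""]  # nothing configured: plain unprefixed term
        return [prefix + word for prefix in prefixes]


pm = PrefixMap()
pm.add_prefix("", "A")  # unprefixed words also search the "A" prefix
pm.add_prefix("", "")   # ... and are still looked up without any prefix
```

In this model `pm.terms_for("", "smith")` yields the two terms described above, one with the "A" prefix and one without, while passing `default_prefix="XT"` adds a third term for that call only, leaving the stored list unchanged.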
It might be useful to be able to set the operator joining the multiple terms produced in this way: currently, I suspect that "OR" is the only useful such operator, but when it is implemented, SYNONYM might also be a good operator to add.
The ability to set multiple prefixes for the "" field would be useful for situations where data is indexed in a set of fields (eg, author and title), but it is also desirable to be able to search across all these fields. Currently this can only be implemented either by indexing all the data twice, once with prefixes and once without, or by parsing the whole query multiple times and then ORring the results together (which I'm not convinced will always give the same result).
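The trade-off between indexing everything twice and expanding unprefixed words at query time can be seen in a small Python model. It only models boolean matching over a dictionary, says nothing about weights, and uses no Xapian calls; the documents, the "A" and "XT" prefixes and both helper functions are assumptions made for the example.

```python
# Assumed sample data and prefixes, for illustration only.
DOCS = {
    1: {"author": "smith", "title": "health survey"},
    2: {"author": "jones", "title": "smith report"},
}
PREFIX = {"author": "A", "title": "XT"}


def build_index(docs, duplicate_unprefixed):
    """Map each term to the set of documents containing it."""
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for word in text.split():
                index.setdefault(PREFIX[field] + word, set()).add(doc_id)
                if duplicate_unprefixed:
                    # Current approach: index every word a second time.
                    index.setdefault(word, set()).add(doc_id)
    return index


def search_all_fields(index, word, prefixes):
    """OR together the postings of the word under each given prefix."""
    hits = set()
    for prefix in prefixes:
        hits |= index.get(prefix + word, set())
    return hits


doubled = build_index(DOCS, duplicate_unprefixed=True)
lean = build_index(DOCS, duplicate_unprefixed=False)
```

Searching `doubled` for the unprefixed term and searching `lean` with the prefix list `["A", "XT"]` match the same documents for "smith", but the lean index holds 6 distinct terms against 11 for the doubled one.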