Opened 15 years ago

Closed 13 years ago

Last modified 13 years ago

#442 closed enhancement (wontfix)

Add support for mapping field names to differing term prefixes across multiple databases

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Library API
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Overview

If multiple databases use differing schemes for mapping field names to term prefixes, it's currently not possible to form a search across them.

I'd like to be able to perform a mapping from field name to the correct prefix for a sub-database at the point at which the posting lists for a query are opened, instead of when building the query, in order to allow me to do such searches.

Rationale

When implementing higher level abstractions on top of Xapian (eg, Xappy), it is very common to want to introduce the concept of data split into fields. The current recommended way to do this is to use a different prefix for each field.

One problem with this approach is that, if the abstraction is to hide the implementation from users, it is necessary to allocate prefixes automatically. There are two approaches: either use the field name as the prefix, with some appropriate escaping mechanism, or generate and store a mapping from fieldname to prefix (either the first time that a fieldname is used, or in a preliminary "schema" generation step).

Using the field name as the prefix currently causes significant database bloat, particularly for long field names. One benchmark: on an example large database (containing 47 fields with descriptive field names averaging 9.2 characters in length), moving from Xappy's allocated prefixes to the full field names increases the postlist table size by 3.0%, the position table size by 7.6% and the termlist size by 10.7% (with chert).

For Xappy, we instead generate a mapping from field name to a short (usually 2 character) prefix, and store the mapping in metadata keys. This seems to work well, but has a major drawback: it is no longer possible to search across multiple databases unless they have an identical mapping for all the fields involved in a search.
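To make the drawback concrete, here is a minimal sketch of first-use prefix allocation. It is not Xappy's actual code: `FakeDatabase` is a stand-in that only mimics metadata storage, and the `_F<fieldname>` key layout and the "X" plus one letter prefix scheme are assumptions made for illustration.

```python
import string


class FakeDatabase:
    """Hypothetical stand-in for a database; it only stores user metadata."""

    def __init__(self):
        self._metadata = {}

    def get_metadata(self, key):
        # Return an empty value when the key has never been set.
        return self._metadata.get(key, "")

    def set_metadata(self, key, value):
        self._metadata[key] = value

    def metadata_keys(self, prefix=""):
        return sorted(k for k in self._metadata if k.startswith(prefix))


def prefix_for_field(db, fieldname):
    """Return the stored prefix for fieldname, allocating one on first use."""
    key = "_F" + fieldname  # assumed key naming, for illustration only
    prefix = db.get_metadata(key)
    if prefix:
        return prefix
    used = {db.get_metadata(k) for k in db.metadata_keys("_F")}
    # Assumed scheme: "X" followed by one uppercase letter (XA, XB, ...).
    for letter in string.ascii_uppercase:
        candidate = "X" + letter
        if candidate not in used:
            db.set_metadata(key, candidate)
            return candidate
    raise RuntimeError("ran out of two-character prefixes")


# Two databases that first see the same fields in a different order.
db_a, db_b = FakeDatabase(), FakeDatabase()
for name in ("title", "author"):
    prefix_for_field(db_a, name)
for name in ("author", "title"):
    prefix_for_field(db_b, name)
```

Here "title" ends up as XA in `db_a` but XB in `db_b`, so a query term built with one database's prefix names a different field in the other.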

One solution to this problem may be provided by future database backends; it's possible that avoiding storing common prefixes in btree blocks will avoid the problem sufficiently. However, at least one full fieldname would still need to be stored in each btree block, and I'm not convinced that this would remove the problem fully.

Another approach would be to rework the Xapian API so that in all places where a term is supplied to the API, a (fieldname, term) pair is supplied instead, and have Xapian perform all the mapping from fieldnames to prefixes fully internally. It might be possible to do this in a reasonably backwards compatible manner, by making the current versions of the methods store the terms in a field with name "". The hard bit to do in a backwards compatible manner would be working out what to return from the APIs which currently return terms as single strings.

Suggested solution

I think it's undesirable to specify a fixed scheme for performing a mapping from fieldnames to prefixes, since it won't be appropriate for all users of Xapian (even the concept of "fields" isn't always appropriate).

Instead, I suggest adding the ability to register a functor to be called when generating a leaf posting list from a leaf query. This functor would be passed the database and the term from the query, and would return a term. It would be able to use metadata stored in the sub-database to convert the term appropriately for that sub-database.

A default functor could be defined which recognised a specific format for terms (possibly a 0-byte separating the fieldname from the value for that field, to allow any other character in fieldnames), and looked up the fieldname. The default functor could use a standard set of metadata keys to look up fieldnames: eg "_F<fieldname>". (This would introduce the convention that a _ prefix marks metadata "internal" to Xapian, with "F" (for field) leaving room to store additional internally-used-but-publicly-visible metadata in future if desired.)
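A rough sketch of what such a default functor might look like, written against a stand-in database object rather than the real Xapian API. The names `FakeDatabase` and `default_term_mapper` are hypothetical, and returning `None` for a field the sub-database does not know about is an assumption; the proposal does not say what should happen in that case.

```python
class FakeDatabase:
    """Hypothetical stand-in for a sub-database; it only holds metadata."""

    def __init__(self, metadata):
        self._metadata = dict(metadata)

    def get_metadata(self, key):
        # Return an empty value when the key is unset.
        return self._metadata.get(key, "")


def default_term_mapper(db, term):
    """Map a 'fieldname NUL value' query term to this sub-database's term."""
    fieldname, separator, value = term.partition("\0")
    if not separator:
        # No 0-byte present: treat it as an ordinary term and pass it through.
        return term
    prefix = db.get_metadata("_F" + fieldname)
    if not prefix:
        # Assumption: an unknown field means no postings in this sub-database.
        return None
    return prefix + value


# The same leaf query term resolves differently per sub-database.
db_a = FakeDatabase({"_Ftitle": "XA", "_Fauthor": "XB"})
db_b = FakeDatabase({"_Ftitle": "XB", "_Fauthor": "XA"})
query_term = "title\0xapian"
term_a = default_term_mapper(db_a, query_term)
term_b = default_term_mapper(db_b, query_term)
```

With these stand-ins, `term_a` is "XAxapian" and `term_b` is "XBxapian", which is the per-sub-database conversion the functor is meant to provide when the leaf posting list is opened.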

The functor could either be specified by passing it to the Enquire class (like a weighting object), or could be attached to the Query in some way (potentially allowing different mappings for different parts of the query tree).

Issues

  • To make this work for remote searches, the functor would need to be serialisable, and be registered with the Registry object. This would limit the use of user-defined functors in place of the built-in one.
  • We'd probably want to add support to the TermGenerator for generating terms with the appropriate prefix, based on a fieldname and the prefix metadata values stored in a xapian database (possibly the TermGenerator would be able to take the same functor as used by Enquire/Query).
  • The return values for get_matching_terms() would be confusing to users, since they'd hold the converted terms. We'd probably need to have some way to convert a prefix-term back to the original field-term (or, at least, to a field-term which would map to the prefix-term: mapping to the original term wouldn't be possible in general since two fieldnames could map to the same prefix).
  • This solution generally seems rather overly complex (both in implementation and API). However, it's the best I've come up with so far which doesn't involve massive re-working of the Xapian API, or relying on as-yet unimplemented and unproven database storage improvements.
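The reverse-mapping problem raised for get_matching_terms() can be sketched as follows. This is an illustration only: `fields_for_term` is a hypothetical helper, and it assumes fixed-length prefixes so that no prefix is itself the start of another.

```python
def fields_for_term(field_to_prefix, term):
    """Return every (fieldname, value) pair that would map to term."""
    matches = []
    for fieldname, prefix in sorted(field_to_prefix.items()):
        if term.startswith(prefix):
            matches.append((fieldname, term[len(prefix):]))
    return matches


# Hypothetical mapping in which two fieldnames share one prefix.
mapping = {"title": "XA", "heading": "XA", "author": "XB"}
candidates = fields_for_term(mapping, "XAxapian")
unique = fields_for_term(mapping, "XBsmith")
```

For "XAxapian" both "heading" and "title" come back, so the original field-term cannot be recovered, only some field-term that maps to the same prefix-term.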

A related issue to this is that it would be nice to support field-specific weighting schemes (eg, BM25-F), which need Xapian to support a wider model of fields: in particular, implementing these would require Xapian to store "document length" values specific to fields. Perhaps rather than following the approach advocated above, this means that it would be worth making the effort to add explicit support for fields to all the Xapian APIs which work with terms.

Change History (3)

comment:1 by Olly Betts, 15 years ago

Yuck!

Brass should eliminate almost all of the overhead from long prefixes (you don't actually even need to store the prefix once per block necessarily - you can potentially reuse the prefix from the dividing keys in the parent block), so adding a complex and ugly mechanism for this in 1.2.x only to deprecate it in 1.3.0 seems unwise. I think we should at least see what improvements this can deliver first.
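For readers unfamiliar with the idea, the general technique is front coding of sorted keys: each key stores only the length it shares with the previous key plus the remaining suffix. The sketch below is a generic illustration, not Brass's on-disk format, and it does not model reusing the prefix from the dividing keys in the parent block.

```python
def front_code(sorted_keys):
    """Encode each key as (shared_prefix_length, remaining_suffix)."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(previous), len(key))
        while shared < limit and previous[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys from (shared_length, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Keys using a long, descriptive field name as the prefix.
keys = ["description:apple", "description:apricot", "description:banana"]
encoded = front_code(keys)
raw_size = sum(len(k) for k in keys)
stored_suffix_size = sum(len(suffix) for _, suffix in encoded)
```

In this example the long "description:" prefix is stored in full only once, so the suffix bytes drop from 54 to 28 (ignoring the small cost of the shared-length counts).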

comment:2 by Olly Betts, 13 years ago

Resolution: wontfix
Status: new → closed

This still seems like it would be opening a can of worms, even 2 years later. There are likely to be a lot of awkward issues to resolve - e.g. I can see the ordering of converted terms being interesting.
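One possible reading of the ordering concern, sketched under assumptions (the 0-byte field-term format from the description and two hypothetical per-database mappings): terms that are sorted as field-terms need not stay sorted once converted, so anything that relies on sorted term order across sub-databases would need care.

```python
def convert(field_to_prefix, field_term):
    """Convert 'fieldname NUL value' to '<prefix>value' for one database."""
    fieldname, _, value = field_term.partition("\0")
    return field_to_prefix[fieldname] + value


# Field-terms in sorted order.
field_terms = sorted(["author\0smith", "title\0xapian"])

# Hypothetical mappings for two sub-databases.
db_a = {"author": "XA", "title": "XB"}
db_b = {"author": "XB", "title": "XA"}

converted_a = [convert(db_a, t) for t in field_terms]
converted_b = [convert(db_b, t) for t in field_terms]
```

Here `converted_a` is still in sorted order, while `converted_b` ("XBsmith" before "XAxapian") is not.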

Although we don't have the B-tree key prefix compression completed yet, I think that's a better way to expend implementation effort - it also would reduce database size significantly in general, and that would add complexity only to the lower levels of the backend code, rather than touching a lot of areas of the code, and it wouldn't require any user API additions.

Yes, that's "relying on as-yet unimplemented and unproven database storage improvements", but then the approach proposed here is relying on as-yet unimplemented and unproven changes to how Xapian maps terms to keys in databases.

If we're going to decouple the field and the term rather than adding a prefix like we currently do, let's do it properly and make the appropriate API changes. This probably won't be fully backwardly compatible, but we can make this change in Xapian 2.0.

So I'm going to mark this idea as "won't fix".

comment:3 by Olly Betts, 13 years ago

Milestone: 1.2.x