Opened 17 years ago

Closed 17 years ago

Last modified 16 years ago

#143 closed defect (released)

Add ability to store metadata in databases

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Library API
Version: SVN trunk
Severity: normal
Keywords:
Cc: Olly Betts
Blocked By:
Blocking: #120
Operating System: All

Description

I've recently come across several cases where I need to store small amounts of information in the database which relate to the database as a whole, rather than to individual documents.

Two particular examples of this are:

  1. I want to be able to store a mapping from field name to prefix (and also to a description of the indexing method used for the field, e.g. the stemmers used). This would be used at search time to set up the query parser appropriately.

  2. I want to be able to store a monotonically increasing version number in the database each time it is modified. This is to be used for tying a set of search results to the appropriate version of a parallel RDBMS which holds the full document data. (The internal revision numbers of the flint tables could be used for this purpose, but aren't exposed, and it would probably be a bad design decision to expose them because not all backends should need to use revision numbers.)

Currently, I am using a text file stored in the database directory for the first of these (but this doesn't work for remote databases, and the text file isn't protected by the transaction process). I have no good solution for the second of these.

I have implemented a patch which adds this functionality to Xapian HEAD (for flint databases only, at present). The intention is that this is only used for small amounts of metadata, so a whole new btree for the metadata seems overkill.

Therefore, I have stored the metadata in the special entry in the postlist table (which already stored metainfo for the total document length and the last docid). This involves a change to the database format (but I can't see any way to implement this without requiring such a change).

I pondered for a while whether a single flat piece of metadata or a key-tag store is appropriate: I believe that a key-tag store is the best interface, because it allows multiple layers built on top of Xapian to store their metadata without colliding (as long as keys are chosen appropriately).

Note that if a large amount of metadata is needed, the metadata stored in xapian could be used as a pointer to the full metadata, so I don't think the recommendation to store only small amounts of metadata will be a problem in practice.

Attachments (4)

xapian-metadata.patch (32.8 KB ) - added by Richard Boulton 17 years ago.
Implementation of database metadata
oldpatch (8.5 KB ) - added by Richard Boulton 17 years ago.
Pre-patch, needs to be reapplied before patch 77.
smallpatch (1.5 KB ) - added by Richard Boulton 17 years ago.
Small patch to change database format to allow metadata to be added later
flint_alltermslist_robustness.patch (634 bytes ) - added by Richard Boulton 17 years ago.
Patch to make alltermslist ignore extra items at start of postlist table


Change History (21)

by Richard Boulton, 17 years ago

Attachment: xapian-metadata.patch added

Implementation of database metadata

comment:1 by Richard Boulton, 17 years ago

Status: new → assigned

by Richard Boulton, 17 years ago

Attachment: oldpatch added

Pre-patch, needs to be reapplied before patch 77.

by Richard Boulton, 17 years ago

Attachment: smallpatch added

Small patch to change database format to allow metadata to be added later

comment:2 by Olly Betts, 17 years ago

Cc: olly@… added

[cut and paste from my mail to the list, which I just realised should be here really]

The minimal patch itself seems safe, but I think that the approach is suboptimal. I only had a quick look, but the full patch seems to be serialising a load of (key,tag) pairs into a blob of data which gets tacked on the end of the metainfo tag. So we'll need to fetch it every time we open the database, whether it's wanted or not. My point is that we have a handy Btree manager, whose entire purpose is to store (key,tag) pairs! Wouldn't it be better to store this versioned user data using that? Then if someone wants to store multi-KB (or even multi-MB) tags, it's really no problem. And they can efficiently retrieve one piece of data without having to fetch all the others. We should be able to slot such data into "impossible" keys in the postlist table I think.

comment:3 by Richard Boulton, 17 years ago

If we can store the metadata entries in the posting list table as a continuous set of tags (so they don't separate blocks of postings), that might be an ideal solution... but I couldn't see how to do this at the time I wrote the initial patch.

On closer inspection, it looks like any key which begins with a zero byte, followed by any byte other than 0 or 0xff, is an invalid key for a posting entry. Also, the only key which could start with two zero bytes is a key for a posting for a null term, which isn't allowed. So, using keys which start with a double zero byte would be a perfectly good way to store the data.

So, I'll rework the patch at some point to do that, and there is no need for the small patch to be applied for this reason.

This means that the database version number won't need to be bumped to accommodate the change. However (supposing the metadata patch was applied in 1.0.1), I believe a 1.0.0 FlintAllTermsList iterator would report a DatabaseCorruptError for a database built with 1.0.1 due to not understanding the metadata keys. I'll look at implementing a small patch to FlintAllTermsListIterator, which would make 1.0.0 simply ignore the keys it doesn't understand, which will make compatibility much easier.

I suppose it might still be a good idea to apply the small patch so that we can store extra internal metainfo data without breaking compatibility (if we find we need to store some additional internal database statistic). I can imagine a couple of such pieces of information, but can't see a likely need for them, so let's leave that kind of change until we're changing the metainfo format anyway (which may be never).

comment:4 by Richard Boulton, 17 years ago

After a quick look at flint_alltermslist.h, the FIXME comment at line 71 makes me think that maybe we should be storing the total number of terms in the database in the metainfo - so maybe applying the small patch would be a good idea after all... (Trying to implement this FIXME would be too likely to introduce bugs at this stage, but it would be nice to be able to implement it without breaking database compatibility.)

For posterity, the comment is: FIXME : this can badly overestimate (it could be several times too large) - perhaps keep track of the number of terms (or number of extra chunks) and store this along with the last_docid and total_doclen? (Mind you, not sure this value is ever used, but we should really make the exact number of terms available somewhere in the API).

by Richard Boulton, 17 years ago

Attachment: flint_alltermslist_robustness.patch added

Patch to make alltermslist ignore extra items at start of postlist table

comment:5 by Richard Boulton, 17 years ago

For the record, I currently think there's a good case for applying patch #84 "Patch to make alltermslist ignore extra items at start of postlist table" now, to avoid upwards compatibility problems in future.

Patch #79 "Small patch to change database format to allow metadata to be added later" might be a good idea too, but I'm not really fussed about it, and it's not needed for this bug.

comment:6 by Olly Betts, 17 years ago

Yes, what I had in mind was starting these keys with \x00<something>. I'd suggest we impose a sane limit on the user key size (say 200 bytes) since we may need to use extra bytes to encode them in a future database format.

But I suspect it won't just be AllTermsIterator which will fall over these extra keys. For example, I bet xapian-compact and xapian-check will too, and perhaps even normal use of PostLists.

Rather than trying to patch all these cases now, I think it's cleaner to just bump the database version. A current database will be valid under the new version (it will just have no user meta-data set), so we can easily transparently read 1.0.0 databases in this 1.0.x, and even auto-upgrade them on a write (just rewrite the database version if user metadata is set).

There are some other statistics which would be useful for implementing DfR weighting schemes (and probably also for implementing better bounds for BM25Weight), but I was thinking we can use, say, \x00\xc0<key> for user metadata and then we can use \x00\x01, \x00\x02, etc. for any extra statistics, grouping them as seems useful. Then all the "database statistics" should be in the first leaf block of the Btree.
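The key layout suggested here can be sketched roughly as below. This is only an illustration of the scheme discussed in this comment, not the code that was eventually committed: the names `make_user_metadata_key`, `is_user_metadata_key` and `MAX_USER_KEY_LEN`, and the choice to throw `std::invalid_argument`, are assumptions. The 200-byte limit is the figure floated above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Suggested upper bound on the user key length (figure taken from the comment).
const std::size_t MAX_USER_KEY_LEN = 200;

// Hypothetical helper: build the postlist-table key for a user metadata
// item by prefixing the user's key with the two bytes \x00 \xc0.
std::string make_user_metadata_key(const std::string& user_key) {
    if (user_key.size() > MAX_USER_KEY_LEN) {
        throw std::invalid_argument("metadata key too long");
    }
    // Explicit length is needed because the prefix contains a zero byte.
    std::string key("\x00\xc0", 2);
    key += user_key;
    return key;
}

// Hypothetical helper: true if a table key lies in the user metadata range.
bool is_user_metadata_key(const std::string& table_key) {
    return table_key.size() >= 2 &&
           table_key[0] == '\x00' && table_key[1] == '\xc0';
}
```

Because every such key shares the \x00 prefix, these entries would sort before the posting entries for ordinary terms, which is what keeps them grouped at the start of the table.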

comment:7 by Richard Boulton, 17 years ago

Blocking: 120 added

Okay, that all sounds very sane. I'll mark this as for the 1.0 series, and I'm entirely happy that there's no need to change anything now. :)

Thanks for the response.

comment:8 by Richard Boulton, 17 years ago

One question about your suggested implementation a couple of comments above: you suggest using \x00\xc0<key> for the keys for the user-metadata: why \xc0 in particular? Is there anything special about that value, or is it just far enough above \x00 to allow plenty of space for system metadata to expand into, and far enough below \xff to allow plenty of space for new user-metadata type stuff?

I hope to look into updating the patch to work as you suggested in the next couple of weeks, though I think this bug is probably not a candidate for 1.0.1 however un-buggy 1.0.0 turns out to be.

comment:9 by Richard Boulton, 17 years ago

Just to note here, Olly commented in some other forum (email or real-life - can't remember which) that the thinking behind use of \xc0 was as I suggested in the previous comment - just to allow plenty of room for more system meta-data values, whilst allowing some room for more user meta-data values to be added (though neither of us can think why such values might be needed, at present).

comment:10 by Olly Betts, 17 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: assigned → new

I'm looking at storing extra information for DfR, so I'll try to sort this out while I'm at it.

comment:11 by Olly Betts, 17 years ago

attachments.isobsolete: 0 → 1

comment:12 by Olly Betts, 17 years ago

attachments.isobsolete: 0 → 1

comment:13 by Olly Betts, 17 years ago

(From update of attachment 84) Attachment #84 has a bug I believe - terms which start with a zero byte will have postlist table keys starting \x00\x00 so won't be iterated over with this patch applied...

comment:14 by Olly Betts, 17 years ago

Nope, that's wrong - it is OK since zero bytes encode as \x00\xff and not \x00\x00 as I was thinking!
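A small sketch may make this clearer. It assumes only what this ticket states: a zero byte inside a term is written as \x00\xff in the postlist key, so a key for a non-empty term cannot begin \x00\x00. The function names are hypothetical, and real flint keys carry more than the escaped term, which is left out here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: escape a term for use at the start of a postlist key,
// writing each zero byte as the pair \x00 \xff (as described in this ticket).
std::string escape_term(const std::string& term) {
    std::string out;
    for (char ch : term) {
        out += ch;
        if (ch == '\0') out += '\xff';
    }
    return out;
}

// True if a key starts with two zero bytes, i.e. lies in the range that
// escaped non-empty terms can never occupy.
bool starts_with_double_zero(const std::string& key) {
    return key.size() >= 2 && key[0] == '\0' && key[1] == '\0';
}
```

Under these assumptions, even a term consisting of two zero bytes escapes to \x00\xff\x00\xff, so skipping keys that begin \x00\x00 does not skip any term.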

comment:15 by Olly Betts, 17 years ago

attachments.isobsolete: 0 → 1

comment:16 by Olly Betts, 17 years ago

Resolution: → fixed
Status: new → closed

This feature is now implemented in SVN HEAD.

I made an API adjustment (thoughts welcome):

  • get_metadata() just returns an empty string rather than throwing a new exception.
  • set_metadata() with an empty value deletes the item.
  • I've removed delete_metadata() as now superfluous, though it could be readded as an inline alias if we decide it is still useful to have (but it's much harder to remove a method once released than add one).
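The adjusted semantics can be modelled with a plain in-memory map. This is a toy model of the behaviour listed above, not the Xapian implementation: the class `MetadataStore`, its `size()` method and its use of `std::map` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy in-memory model of the adjusted semantics (not the Xapian code).
class MetadataStore {
    std::map<std::string, std::string> items;

  public:
    // A missing key yields an empty string rather than an exception.
    std::string get_metadata(const std::string& key) const {
        auto it = items.find(key);
        return it == items.end() ? std::string() : it->second;
    }

    // Setting an empty value removes the entry, so no separate
    // delete_metadata() is needed.
    void set_metadata(const std::string& key, const std::string& value) {
        if (value.empty()) {
            items.erase(key);
        } else {
            items[key] = value;
        }
    }

    // Number of stored entries (illustrative only).
    std::size_t size() const { return items.size(); }
};
```

One consequence of this design is that "never set" and "set to empty" are indistinguishable to the caller, which is what makes the separate delete method redundant.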

Now that we are storing each metadata item as a Btree entry, it's feasible to use them to store per-document, per-term, or even per-posting metadata. But if get_metadata() threw an exception for a missing key and your application only had metadata for a small number of cases, you would have to choose between catching an exception whenever nothing is stored (which can be quite expensive at run time), or storing a key with an empty tag for all the cases with no metadata (which is quite a disk space and VM overhead).

I'm not sure if empty metadata keys are useful to users, and they could be problematic for a backend (e.g. if we put metadata in its own table, which wouldn't be too unreasonable now that we have optional tables, we'd have to adjust keys somehow to allow an empty key - perhaps empty key -> "\x00" and any key consisting entirely of zero bytes would get an extra zero byte appended, which would even preserve the sort order, but is still a faff if empty keys aren't actually useful) so I've added a documentation note to suggest avoiding making use of them for the time being.
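The key adjustment floated in the parenthesis above could look something like this. It is a sketch of that hypothetical scheme only (nothing in this ticket says it was implemented), and the function name `map_metadata_key` is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical mapping so that no stored key is ever empty: a key made up
// entirely of zero bytes (including the empty key) gains one extra zero
// byte; every other key is stored unchanged.
std::string map_metadata_key(const std::string& key) {
    if (key.find_first_not_of('\0') == std::string::npos) {
        return key + '\0';
    }
    return key;
}
```

The mapping is reversible (strip one zero byte from any stored key that is all zero bytes), and since an all-zero key still compares lower than any key with a non-zero byte at or before its end, byte-wise ordering between keys is unchanged.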

comment:17 by Olly Betts, 17 years ago

Operating System: All
Resolution: fixed → released

Fixed in 1.0.3
