#143 closed defect (released)
Add ability to store metadata in databases
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Library API | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Olly Betts | Blocked By: | |
Blocking: | #120 | Operating System: | All |
Description
I've recently come across several cases where I need to store small amounts of information in the database which relate to the database as a whole, rather than to individual documents.
Two particular examples of this are:
- I want to be able to store a mapping from field name to prefix (and also to a description of the indexing method used for the field, e.g. the stemmers used). This would be used at search time to set up the query parser appropriately.
- I want to be able to store a monotonically increasing version number in the database each time it is modified. This is to be used for tying a set of search results to the appropriate version of a parallel RDBMS which holds the full document data. (The internal revision numbers of the flint tables could be used for this purpose, but aren't exposed, and it would probably be a bad design decision to expose them because not all backends should need to use revision numbers.)
Currently, I am using a text file stored in the database directory for the first of these (but this doesn't work for remote databases, and the text file isn't protected by the transaction process). I have no good solution for the second of these.
I have implemented a patch which adds this functionality to Xapian HEAD (for flint databases only, at present). The intention is that this is only used for small amounts of metadata, so a whole new btree for the metadata seems overkill.
Therefore, I have stored the metadata in the special entry in the postlist table (which already stores metainfo for the total document length and the last docid). This involves a change to the database format (but I can't see any way to implement this without requiring such a change).
I pondered for a while whether a single flat piece of metadata or a key-tag store is appropriate: I believe that a key-tag store is the best interface, because it allows multiple layers built on top of Xapian to store their metadata without colliding (as long as keys are chosen appropriately).
Note that if a large amount of metadata is needed, the metadata stored in xapian could be used as a pointer to the full metadata, so I don't think the recommendation to store only small amounts of metadata will be a problem in practice.
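To make the two use cases above concrete, here is a minimal sketch of how they might sit on top of a key-tag store. It is not the patch: `MetadataStore` is a hypothetical in-memory stand-in for the proposed store, and the key names (`indexer.field.`, `sync.version`) are assumed naming schemes chosen only to show how separate layers can avoid colliding.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the proposed per-database key-tag store.
using MetadataStore = std::map<std::string, std::string>;

// Assumed naming scheme: each layer owns a distinct key prefix.
const std::string kFieldKeyPrefix = "indexer.field.";
const std::string kVersionKey = "sync.version";

// Record which term prefix a field was indexed with.
void set_field_prefix(MetadataStore& meta, const std::string& field,
                      const std::string& term_prefix) {
    assert(!field.empty());
    meta[kFieldKeyPrefix + field] = term_prefix;
}

// Look up a field's term prefix; an empty string means "not recorded".
std::string get_field_prefix(const MetadataStore& meta,
                             const std::string& field) {
    auto it = meta.find(kFieldKeyPrefix + field);
    return it == meta.end() ? std::string() : it->second;
}

// Increment the stored version number (starting from 0 if absent)
// and return the new value.
unsigned long bump_version(MetadataStore& meta) {
    auto it = meta.find(kVersionKey);
    unsigned long version = (it == meta.end()) ? 0 : std::stoul(it->second);
    ++version;
    meta[kVersionKey] = std::to_string(version);
    return version;
}
```

In a real database the version bump would happen inside the same transaction as the document changes, which is exactly what the text-file workaround cannot offer.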
Attachments (4)
Change History (21)
by , 18 years ago
Attachment: | xapian-metadata.patch added |
---|
comment:1 by , 18 years ago
Status: | new → assigned |
---|
by , 18 years ago
Attachment: | smallpatch added |
---|
Small patch to change database format to allow metadata to be added later
comment:2 by , 18 years ago
Cc: | added |
---|
[cut and paste from my mail to the list, which I just realised should be here really]
The minimal patch itself seems safe, but I think that the approach is suboptimal. I only had a quick look, but the full patch seems to be serialising a load of (key,tag) pairs into a blob of data which gets tacked on the end of the metainfo tag. So we'll need to fetch it every time we open the database, whether it's wanted or not. My point is that we have a handy Btree manager, whose entire purpose is to store (key,tag) pairs! Wouldn't it be better to store this versioned user data using that? Then if someone wants to store multi-KB (or even multi-MB) tags, it's really no problem. And they can efficiently retrieve one piece of data without having to fetch all the others. We should be able to slot such data into "impossible" keys in the postlist table I think.
comment:3 by , 18 years ago
If we can store the metadata entries in the posting list table as a continuous set of tags (so they don't separate blocks of postings), that might be an ideal solution... but I couldn't see how to do this at the time I wrote the initial patch.
On closer inspection, it looks like any key which begins with a zero byte, and then any byte other than 0 or 0xff, is an invalid key for a posting entry. Also, the only key which could start with two zero bytes is a key for a posting for a null term, which isn't allowed. So, using keys which start with a double zero byte would be a perfectly good way to store the data.
So, I'll rework the patch at some point to do that, and there is no need for the small patch to be applied for this reason.
This means that the database version number won't need to be bumped to accommodate the change. However, (supposing the metadata patch was applied in 1.0.1), I believe a 1.0.0 FlintAllTermsList iterator would report a DatabaseCorruptError for a database built with 1.0.1 due to not understanding the metadata keys. I'll look at implementing a small patch to FlintAllTermsListIterator, which would make 1.0.0 simply ignore the keys it doesn't understand, which will make compatibility much easier.
I suppose it might still be a good idea to apply the small patch so that we can store extra internal metainfo data without breaking compatibility (if we find we need to store some additional internal database statistic). I can imagine a couple of such pieces of information, but can't see a likely need for them, so let's leave that kind of change until we're changing the metainfo format anyway (which may be never).
comment:4 by , 18 years ago
After a quick look at flint_alltermslist.h, the FIXME comment at line 71 makes me think that maybe we should be storing the total number of terms in the database in the metainfo - so maybe applying the small patch would be a good idea after all... (Trying to implement this FIXME would be too likely to introduce bugs at this stage, but it would be nice to be able to implement it without breaking database compatibility.)
For posterity, the comment is: FIXME : this can badly overestimate (it could be several times too large) - perhaps keep track of the number of terms (or number of extra chunks) and store this along with the last_docid and total_doclen? (Mind you, not sure this value is ever used, but we should really make the exact number of terms available somewhere in the API).
by , 18 years ago
Attachment: | flint_alltermslist_robustness.patch added |
---|
Patch to make alltermslist ignore extra items at start of postlist table
comment:5 by , 18 years ago
For the record, I currently think there's a good case for applying patch #84 "Patch to make alltermslist ignore extra items at start of postlist table" now, to avoid upwards compatibility problems in future.
Patch #79 "Small patch to change database format to allow metadata to be added later" might be a good idea too, but I'm not really fussed about it, and it's not needed for this bug.
comment:6 by , 18 years ago
Yes, what I had in mind was starting these keys with \x00<something>. I'd suggest we impose a sane limit on the user key size (say 200 bytes) since we may need to use extra bytes to encode them in a future database format.
But I suspect it won't just be AllTermsIterator which will fall over these extra keys. For example, I bet xapian-compact and xapian-check will too, and perhaps even normal use of PostLists.
Rather than trying to patch all these cases now, I think it's cleaner to just bump the database version. A current database will be valid under the new version (it will just have no user meta-data set), so we can easily transparently read 1.0.0 databases in this 1.0.x, and even auto-upgrade them on a write (just rewrite the database version if user metadata is set).
There are some other statistics which would be useful for implementing DfR weighting schemes (and probably also for implementing better bounds for BM25Weight too) but I was thinking we can use, say, \x00\xc0<key> for user metadata and then we can use \x00\x01, \x00\x02, etc. for any extra statistics, grouping them as seems useful. Then all the "database statistics" should be in the first leaf block of the Btree.
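As a rough illustration of why \x00\xc0<key> cannot clash with posting keys, the sketch below models only the two rules mentioned in this ticket: a zero byte inside a term is written as \x00\xff, and user metadata keys get a \x00\xc0 prefix. The function names are hypothetical, and real flint postlist keys carry more structure than this simplified model.

```cpp
#include <cassert>
#include <string>

// Simplified model of term escaping: each zero byte in the term is
// followed by 0xff, so an escaped term never contains \x00\xc0 at its start.
std::string escape_term(const std::string& term) {
    assert(!term.empty());  // the null term is not allowed
    std::string out;
    for (char c : term) {
        out += c;
        if (c == '\0') out += '\xff';
    }
    return out;
}

// Build the table key for a user metadata entry: \x00 \xc0 <user key>.
std::string user_metadata_key(const std::string& user_key) {
    std::string out;
    out += '\0';
    out += '\xc0';
    out += user_key;
    return out;
}

// True if a table key lies in the user metadata range.
bool is_user_metadata_key(const std::string& key) {
    return key.size() >= 2 && key[0] == '\0' && key[1] == '\xc0';
}
```

Under this model every escaped term starts either with a non-zero byte or with \x00\xff, so all user metadata keys sort ahead of every term key in byte order, which is what keeps them together at the start of the table.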
comment:7 by , 18 years ago
Blocking: | 120 added |
---|
Okay, that all sounds very sane. I'll mark this as for the 1.0 series, and I'm entirely happy that there's no need to change anything now. :)
Thanks for the response.
comment:8 by , 18 years ago
One question about your suggested implementation a couple of comments above: you suggest using \x00\xc0<key> for the keys for the user-metadata: why \xc0 in particular? Is there anything special about that value, or is it just far enough above \x00 to allow plenty of space for system metadata to expand into, and far enough below \xff to allow plenty of space for new user-metadata type stuff?
I hope to look into updating the patch to work as you suggested in the next couple of weeks, though I think this bug is probably not a candidate for 1.0.1 however un-buggy 1.0.0 turns out to be.
comment:9 by , 18 years ago
Just to note here, Olly commented in some other forum (email or real-life - can't remember which) that the thinking behind use of \xc0 was as I suggested in the previous comment - just to allow plenty of room for more system meta-data values, whilst allowing some room for more user meta-data values to be added (though neither of us can think why such values might be needed, at present).
comment:10 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
I'm looking at storing extra information for DfR, so I'll try to sort this out while I'm at it.
comment:11 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:12 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:13 by , 17 years ago
(From update of attachment 84) Attachment #84 has a bug I believe - terms which start with a zero byte will have postlist table keys starting \x00\x00 so won't be iterated over with this patch applied...
comment:14 by , 17 years ago
Nope, that's wrong - it is OK since zero bytes encode as \x00\xff and not \x00\x00 as I was thinking!
comment:15 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:16 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
This feature is now implemented in SVN HEAD.
I made an API adjustment (thoughts welcome):
- get_metadata() just returns an empty string rather than throwing a new exception.
- set_metadata() with an empty value deletes the item.
- I've removed delete_metadata() as now superfluous, though it could be re-added as an inline alias if we decide it is still useful to have (but it's much harder to remove a method once released than add one).
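The adjusted semantics can be summarised with a small model. This is only a sketch: `ToyMetadata` is a hypothetical in-memory class, not the Xapian implementation, and it borrows the method names from the list above purely to show the behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical in-memory model of the adjusted metadata API.
class ToyMetadata {
    std::map<std::string, std::string> entries_;

  public:
    // Returns the stored value, or an empty string if the key is unset.
    std::string get_metadata(const std::string& key) const {
        auto it = entries_.find(key);
        return it == entries_.end() ? std::string() : it->second;
    }

    // Stores the value; an empty value deletes the entry instead.
    void set_metadata(const std::string& key, const std::string& value) {
        if (value.empty()) {
            entries_.erase(key);
        } else {
            entries_[key] = value;
        }
        // Either way, a read now yields exactly what was written.
        assert(get_metadata(key) == value);
    }

    // What an inline alias for the removed method could look like.
    void delete_metadata(const std::string& key) {
        set_metadata(key, std::string());
    }

    // Number of entries actually occupying space.
    std::size_t size() const { return entries_.size(); }
};
```

One consequence of this design is that "unset" and "set to empty" are indistinguishable to the caller, so no entry is ever stored just to record an empty value.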
Now that we are storing each metadata item as a Btree entry, it's feasible to use them to store per-document, per-term, or even per-posting metadata, but if your application only has metadata for a small number of cases, you would have to choose between having to catch an exception when there's nothing to store (which can be quite expensive in run time), or storing a key with an empty tag for all the cases with no metadata (which is quite a disk space and VM overhead).
I'm not sure if empty metadata keys are useful to users, and they could be problematic for a backend (e.g. if we put metadata in its own table, which wouldn't be too unreasonable now that we have optional tables, we'd have to adjust keys somehow to allow an empty key - perhaps empty key -> "\x00" and any key consisting entirely of zero bytes would get an extra zero byte appended, which would even preserve the sort order, but is still a faff if empty keys aren't actually useful) so I've added a documentation note to suggest avoiding making use of them for the time being.
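For reference, the key adjustment floated above (empty key -> "\x00", with any all-zero key gaining one extra zero byte) could look like the following sketch. The function names are hypothetical and nothing like this is implemented; it only shows that the mapping is reversible and keeps byte order.

```cpp
#include <cassert>
#include <string>

// Map a user key to a table key that is never empty: the empty key and
// any key made only of zero bytes get one extra zero byte appended.
std::string encode_table_key(const std::string& user_key) {
    // npos here means "empty, or nothing but zero bytes".
    bool all_zero = user_key.find_first_not_of('\0') == std::string::npos;
    std::string out = user_key;
    if (all_zero) out += '\0';
    assert(!out.empty());
    return out;
}

// Reverse the mapping: strip the extra zero byte from all-zero keys.
std::string decode_table_key(const std::string& table_key) {
    assert(!table_key.empty());
    bool all_zero = table_key.find_first_not_of('\0') == std::string::npos;
    return all_zero ? table_key.substr(0, table_key.size() - 1) : table_key;
}
```

Keys containing any non-zero byte pass through unchanged, so only the rare all-zero keys pay for the scheme.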
Implementation of database metadata