Opened 17 years ago

Last modified 13 months ago

#198 new defect

Add support for multiple values in each value slot in a Document.

Reported by: Richard Boulton Owned by: Not currently assigned
Priority: high Milestone: 2.0.0
Component: Backend-Honey Version: git master
Severity: normal Keywords:
Cc: Olly Betts, sascha-web-trac.xapian.org@… Blocked By:
Blocking: #199 Operating System: All

Description (last modified by Richard Boulton)

Currently, the value stored in a slot in a Document is a single string. It would sometimes be useful to be able to store multiple strings in the slot. For example, when using a value slot to store the set of facets that a document is relevant to, a given document may be relevant to multiple values. Also, if storing the set of tags matching a document, for use when generating a tag cloud, we want to be able to store multiple tags for each document.

However, we also need to preserve the existing API, and ensure that database formats are compatible.

Some discussion from IRC follows:

Richard Boulton: Do you think we could convert values as stored in databases
currently to allow multiple values, without breaking backwards compatibility?
ojwb: probably
ojwb: if only by checking the flint version
Richard Boulton: Hmm - if an old flint version is used to create a database, and
insert some values, it could be hard to then modify that database with a new
version of flint.
Richard Boulton: Unless we have a pass through the whole database to rewrite the
values.
ojwb: well, you could just disable the ability to add multiple values
Richard Boulton: Oh, no, its easy to store this.
Richard Boulton: Each value entry consists of a list of "valueno, entry" items.
Richard Boulton: (serialised, of course)
ojwb: or start the newly encoded ones in a way which is invalid
ojwb: oh, just duplicate?
Richard Boulton: Yep, there doesn't seem to be any thing to stop that.
ojwb: so the only question is if it's actually desirable!
Richard Boulton: They're kept in sorted order.
Richard Boulton: And the existing get_value() just returns the first of a
particular valueno found.
ojwb: that's nice then
Richard Boulton: So it would even be backwards compatible for reading purposes.
 (Old versions of xapian just wouldn't see the duplicate values)
ojwb: rewriting would mess up a document with multiple values, wouldn't it?
Richard Boulton: Not entirely.
Richard Boulton: add_value adds on the values at the end of the list.
Richard Boulton: without checking them.
ojwb: but aren't the unserialised into a map in Xapian::Document?
Richard Boulton: Oh.  Ah.
Richard Boulton: Yes, so getting a document out and then inserting it again
would lose the duplicates.
Richard Boulton: But that's a pretty nice way to degrade.
ojwb: yeah, it's not too bad
Richard Boulton: It would be nicer if we'd named Document::add_value()  as
document::set_value()
Richard Boulton: We can't change the behaviour of add_value() now, though: I
suppose we could add Document::append_value()
Richard Boulton: And leave Document::remove_value() as removing all values with
a given number.
Richard Boulton: Document::get_value() would return the first value for a given
valueno.
Richard Boulton: And we could add Document::get_values() which gets a list of
all the values for a given valueno.
Richard Boulton: Hmm - I wonder if the list of values for each valueno should be
kept in insertion order.  Or sorted in some way (binary sort, I would think).
ojwb: It shouldn't sort them I think
Richard Boulton: I think just in insertion order.
Richard Boulton: *snap*
ojwb: because you want a "primary version"
Richard Boulton: That's true.
ojwb: which is used for sorting, etc
ojwb: I'm not completely sure this is a good plan, but it seems to have merit
Richard Boulton: Yes.  That's the main thing I was unhappy about
StringListSerialiser for - you couldn't sensibly sort on the resulting values.

Attachments (1)

multiple_values.patch (13.1 KB ) - added by Richard Boulton 15 years ago.
Patch from 2007 which I found lying around - no idea if it's of much use

Download all attachments as: .zip

Change History (26)

comment:1 by Richard Boulton, 17 years ago

Blocking: 199 added

comment:2 by Olly Betts, 17 years ago

Just noting that this doesn't necessarily block bug#199, as we could address that by just disabling features for 1.0.3. I'm not saying we should do that, but I'd certainly prefer not to delay 1.0.3 too much, and we should get back to ~monthly releases now that summer is over so 1.0.4 shouldn't be far away.

comment:3 by Richard Boulton, 17 years ago

Cc: olly@… added
Owner: changed from New Bugs to Richard Boulton

comment:4 by Richard Boulton, 17 years ago

Operating System: All
Status: newassigned

I've looked at this a little. The obvious way to implement this inside the Document object is to store the values in a multimap instead of a map. I think this is just about doable, but it turns out we require a slightly stronger specification of the behaviour than the current standard. Specifically, we require that we can add an item to the multimap, and be certain that it will appear after any pre-existing items with the same key. However, the current standard doesn't specify the ordering of items added to a multimap with the same key. On the other hand, the next standard is likely to require that an item inserted in a multimap when no hint position is supplied is added after all existing items with the same key, and all current public implementations seem to do that (according to discussion on the standards mailing list).

Therefore, I think using multimap, with an appropriate test in internaltest to check that the compiler is behaving as we require, should be fine.

comment:5 by Olly Betts, 17 years ago

I don't see how you can reliably test for this - even if you test inserting a billion values into a multimap and find them ordered by insertion order, there may be a case you haven't tickled where they aren't. There probably isn't, but we can't really rely on "probably".

It seems the only safe test is for the standard version the compiler supports being high enough (once it's updated, assuming this change is included) or for versions of particular compilers which document that they implement multimap like this.

Testing in internaltest is bad, since users won't automatically run it and could easily end up with a build which behaves incorrectly with multiple valued values. I think this would need to be a configure test really. And what do we do if this test fails (or when cross-compiling)? It seems some fall-back approach is needed until this is both required by the C++ standard and that version of the standard is supported by all compilers we want to support building with.

comment:6 by Richard Boulton, 17 years ago

Fair comments. I'll take a look and see if there's an alternative implementation of multimap that we can rely on that we can use (perhaps the version from STLPort can be used in isolation).

Alternatively, we could use a map of vectors to emulate a multimap, but that could be rather wasteful of memory.

comment:7 by Olly Betts, 17 years ago

If there's a suitable portable implementation, that may be the best solution. And it allows a gradual transition if/when C++ requires this extra ordering condition from multimap (or implementations document the work this way).

Or we could just keep the current map for "primary" values, and add a second structure (e.g. map<vector> or perhaps better map<list>) to track any extra ones. That means that if they aren't used, there's only a small fixed overhead per Document object. Iterating values becomes more complication though.

Another option is to encode multiple values into the string "tag" in the current map, e.g. <encoded length>value1<encoded length>value2 - again iteration is more complex, and there's overhead even if it's not used (though probably only 1 byte per value).

comment:8 by Olly Betts, 16 years ago

I just did a bit of digging, and this has been changed in the next revision of the standard:

http://www.comeaucomputing.com/iso/lwg-defects.html#233

And this list shows that issue "resulted in code changes to the [libstdc++-v3] library":

http://gcc.gnu.org/onlinedocs/libstdc++/ext/howto.html

So GCC should be suitable for our use, at least after these changes went in (it's not clear from the above when that was, or if it has always actually implemented the semantics we want and the code change was for some other detail of issue #233, or perhaps just changes to comments).

comment:10 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1.0

Marking this for 1.1.0, since it may involve API changes (though probably could be done without).

comment:11 by Olly Betts, 16 years ago

Component: Backend-FlintBackend-Chert

We'd implement this for chert rather than flint now.

comment:12 by Olly Betts, 15 years ago

Milestone: 1.1.01.1.1

I think this can be done without incompatible API changes, and probably should be to avoid breaking a lot of existing code, so moving to milestone:1.1.1

comment:13 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.4

Triaging milestone:1.1.1 bugs.

comment:14 by Olly Betts, 15 years ago

Priority: normalhigh

Blocks merging the facet part the matchspy stuff - if it wasn't for that, I'd punt on this.

comment:15 by Richard Boulton, 15 years ago

I've been thinking about this over the last few days, and I'm coming to think that this is a feature that we shouldn't actually support. Instead, we should provide convenient classes to allow serialising multiple values into a string, Sorter (or "KeyMaker") subclasses to use the first of these serialised values for sorting, and posting sources to read the first item.

Two reasons for this:

1) A generic "multiple values in a slot" encoding is going to entail some overhead over a custom encoding in some cases. For example, if we're encoding multiple numeric values, we can use the encode_length() encoding to store self-terminating representations, but if we're storing arbitrary data, we need to store a length part explicitly before the data.

2) Though the flint backend could currently store multiple values without a change to the way the values are stored, the chert backend uses separate value streams for each value, and I think we'd want to store only the first value in the stream, and then store the subsequent values in a secondary lookup. Either this, or encode multiple values into the stream. Either of these approaches seems awkward, and likely to involve quite a bit of code and several corner cases. Moving the problem to be handled by external processors seems like a much safer solution.

comment:16 by Richard Boulton, 15 years ago

Incidentally, of course, we currently have the SortableSerialise class (in the matchspy branch) for serialising arbitrary strings together. This could do with some clean up, but is roughly the sort of thing I was thinking of as an example when I said "provide convenient classes to allow serialising multiple values into a string".

comment:17 by Olly Betts, 15 years ago

Um, I think you mean StringListSerialiser...

While you could use a Sorter to get out the first value for sorting (and probably for collapsing too eventually), that's a lot of extra virtual method calls, and worse, a lot of extra data to fetch that you didn't want.

I'm not completely sold on the space overhead argument - it's misleading to claim that encode_length() doesn't store the length - that's what the top bit of each byte is used for, so it really incurs a 1 bit in 7 overhead to do so (on average, roughly, since it rounds up the value's length to a multiple of 7 or 8 bits in each case). And we could store multivalues by just concatenating the non-primary values and storing the split points using interpolative coding, which would probably be similar to 1 bit in 7. It can certainly do better in some cases - e.g. if you're storing evenly distributed values from 0 to 255, then encode_length() would need 1.5 * N bytes, but the interpolative approach would need N (no overhead, since interpolative coding a full range takes no space!)

And if you have to store the primary value in another slot to avoid the I/O overhead when you want to sort on it, that will cancel out any saving you made here and then some.

Also you don't have to use multi-values if you think you can encode your stuff into a single value more efficiently than the multi-value implementation.

So I think I still like this idea more than supplying serialisation helpers.

I think the conclusion when we discussed this before is that the non-primary values wouldn't go in the stream, so you'd need to store the secondary values. That certainly seems to make most sense to me. Sure that's extra code, but then StringListSerialiser is extra code too. It might be a bit more complex to implement this internally, but multi-values seem a cleaner external API.

comment:18 by Richard Boulton, 15 years ago

Oops, yes, StringListSerialiser.

You'd only need to use a Sorter for a value slot that you've stored multiple values in - I think wanting to sort on a field with multiple values is going to be quite rare (but could be wrong!).

If you want to separate the storage of the first value from the subsequent values, you could do that by using one slot to store the first value (as a plain entry), and a second slot to hold the subsequent values (as a StringListSerialised set of entries).

I think the difference is about where the code complexity sits - I'm not sure that there's anything which needs to sit inside the innards of Xapian, so I'm thinking that we shouldn't make them more complex than we need to.

Also, there's a question of implementation time - we have a StringListSerialiser implementation already that works (though could probably do with some API tidying), but we'd have to implement the multiple-value-in-a-slot storage mechaism, for each backend, which would be quite a bit of work.

Perhaps we're going in circles a bit - we should probably discuss this on IRC instead.

by Richard Boulton, 15 years ago

Attachment: multiple_values.patch added

Patch from 2007 which I found lying around - no idea if it's of much use

comment:19 by Olly Betts, 15 years ago

Milestone: 1.1.41.3.0

Bumping to try to get us to a release.

comment:20 by Sascha Silbe, 14 years ago

Cc: sascha-web-trac.xapian.org@… added

comment:21 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

comment:22 by laurentm, 12 years ago

Hi,

Any news for this ticket? I see that olly changed the milestone to 1.3.x. Does anyone have an approximate release date?

Thanks, Laurent

comment:23 by Olly Betts, 11 years ago

Hi Laurent - sorry, your question seems to have been missed.

Nobody is currently working on this, so there isn't really a date I can give you. I'd certainly like to address it in the 1.3.x series (for a 1.4.0 stable release), but there are a lot of other isuses I'd also like to address, and we don't want to push 1.4.0 too far into the future.

comment:24 by Olly Betts, 9 years ago

Milestone: 1.3.x1.4.x

Nobody's working on this, and while there's interest in it, I don't think we have a clear design. It's also possible to use multi-valued facets already, you just need to unpack them in your own matchspy class. So I think it isn't worth holding up 1.4.0 for.

comment:25 by Olly Betts, 5 years ago

Component: Backend-ChertBackend-Honey
Owner: changed from Richard Boulton to Not currently assigned
Status: assignednew
Version: SVN trunkgit master

comment:26 by Olly Betts, 13 months ago

Milestone: 1.4.x2.0.0
Note: See TracTickets for help on using tickets.