Ticket #198 (assigned defect)

Opened 16 months ago

Last modified 3 months ago

Add support for multiple values in each value slot in a Document.

Reported by: richard Owned by: richard
Priority: normal Milestone: 1.1.0
Component: Backend-Chert Version: SVN trunk
Severity: normal Keywords:
Cc: olly Blocked By:
Operating System: All Blocking: #199

Description (last modified by richard) (diff)

Currently, the value stored in a slot in a Document is a single string. It would sometimes be useful to be able to store multiple strings in the slot. For example, when using a value slot to store the set of facets that a document is relevant to, a given document may be relevant to multiple values. Also, if storing the set of tags matching a document, for use when generating a tag cloud, we want to be able to store multiple tags for each document.

However, we also need to preserve the existing API, and ensure that database formats are compatible.

Some discussion from IRC follows:

Richard Boulton: Do you think we could convert values as stored in databases
currently to allow multiple values, without breaking backwards compatibility?
ojwb: probably
ojwb: if only by checking the flint version
Richard Boulton: Hmm - if an old flint version is used to create a database, and
insert some values, it could be hard to then modify that database with a new
version of flint.
Richard Boulton: Unless we have a pass through the whole database to rewrite the
values.
ojwb: well, you could just disable the ability to add multiple values
Richard Boulton: Oh, no, its easy to store this.
Richard Boulton: Each value entry consists of a list of "valueno, entry" items.
Richard Boulton: (serialised, of course)
ojwb: or start the newly encoded ones in a way which is invalid
ojwb: oh, just duplicate?
Richard Boulton: Yep, there doesn't seem to be any thing to stop that.
ojwb: so the only question is if it's actually desirable!
Richard Boulton: They're kept in sorted order.
Richard Boulton: And the existing get_value() just returns the first of a
particular valueno found.
ojwb: that's nice then
Richard Boulton: So it would even be backwards compatible for reading purposes.
 (Old versions of xapian just wouldn't see the duplicate values)
ojwb: rewriting would mess up a document with multiple values, wouldn't it?
Richard Boulton: Not entirely.
Richard Boulton: add_value adds on the values at the end of the list.
Richard Boulton: without checking them.
ojwb: but aren't the unserialised into a map in Xapian::Document?
Richard Boulton: Oh.  Ah.
Richard Boulton: Yes, so getting a document out and then inserting it again
would lose the duplicates.
Richard Boulton: But that's a pretty nice way to degrade.
ojwb: yeah, it's not too bad
Richard Boulton: It would be nicer if we'd named Document::add_value()  as
document::set_value()
Richard Boulton: We can't change the behaviour of add_value() now, though: I
suppose we could add Document::append_value()
Richard Boulton: And leave Document::remove_value() as removing all values with
a given number.
Richard Boulton: Document::get_value() would return the first value for a given
valueno.
Richard Boulton: And we could add Document::get_values() which gets a list of
all the values for a given valueno.
Richard Boulton: Hmm - I wonder if the list of values for each valueno should be
kept in insertion order.  Or sorted in some way (binary sort, I would think).
ojwb: It shouldn't sort them I think
Richard Boulton: I think just in insertion order.
Richard Boulton: *snap*
ojwb: because you want a "primary version"
Richard Boulton: That's true.
ojwb: which is used for sorting, etc
ojwb: I'm not completely sure this is a good plan, but it seems to have merit
Richard Boulton: Yes.  That's the main thing I was unhappy about
StringListSerialiser for - you couldn't sensibly sort on the resulting values.

Change History

Changed 16 months ago by richard

  • blocking set to 199

Changed 16 months ago by olly

Just noting that this doesn't necessarily block bug#199, as we could address that by just disabling features for 1.0.3. I'm not saying we should do that, but I'd certainly prefer not to delay 1.0.3 too much, and we should get back to ~monthly releases now that summer is over so 1.0.4 shouldn't be far away.

Changed 16 months ago by richard

  • cc olly@… added
  • owner changed from newbugs to richard

Changed 15 months ago by richard

  • status changed from new to assigned

I've looked at this a little. The obvious way to implement this inside the Document object is to store the values in a multimap instead of a map. I think this is just about doable, but it turns out we require a slightly stronger specification of the behaviour than the current standard. Specifically, we require that we can add an item to the multimap, and be certain that it will appear after any pre-existing items with the same key. However, the current standard doesn't specify the ordering of items added to a multimap with the same key. On the other hand, the next standard is likely to require that an item inserted in a multimap when no hint position is supplied is added after all existing items with the same key, and all current public implementations seem to do that (according to discussion on the standards mailing list).

Therefore, I think using multimap, with an appropriate test in internaltest to check that the compiler is behaving as we require, should be fine.

Changed 15 months ago by trac

  • platform set to All

Changed 15 months ago by olly

I don't see how you can reliably test for this - even if you test inserting a billion values into a multimap and find them ordered by insertion order, there may be a case you haven't tickled where they aren't. There probably isn't, but we can't really rely on "probably".

It seems the only safe test is for the standard version the compiler supports being high enough (once it's updated, assuming this change is included) or for versions of particular compilers which document that they implement multimap like this.

Testing in internaltest is bad, since users won't automatically run it and could easily end up with a build which behaves incorrectly with multiple valued values. I think this would need to be a configure test really. And what do we do if this test fails (or when cross-compiling)? It seems some fall-back approach is needed until this is both required by the C++ standard and that version of the standard is supported by all compilers we want to support building with.

Changed 15 months ago by richard

Fair comments. I'll take a look and see if there's an alternative implementation of multimap that we can rely on that we can use (perhaps the version from STLPort can be used in isolation).

Alternatively, we could use a map of vectors to emulate a multimap, but that could be rather wasteful of memory.

Changed 15 months ago by olly

If there's a suitable portable implementation, that may be the best solution. And it allows a gradual transition if/when C++ requires this extra ordering condition from multimap (or implementations document the work this way).

Or we could just keep the current map for "primary" values, and add a second structure (e.g. map<vector> or perhaps better map<list>) to track any extra ones. That means that if they aren't used, there's only a small fixed overhead per Document object. Iterating values becomes more complication though.

Another option is to encode multiple values into the string "tag" in the current map, e.g. <encoded length>value1<encoded length>value2 - again iteration is more complex, and there's overhead even if it's not used (though probably only 1 byte per value).

Changed 14 months ago by olly

I just did a bit of digging, and this has been changed in the next revision of the standard:

http://www.comeaucomputing.com/iso/lwg-defects.html#233

And this list shows that issue "resulted in code changes to the [libstdc++-v3] library":

http://gcc.gnu.org/onlinedocs/libstdc++/ext/howto.html

So GCC should be suitable for our use, at least after these changes went in (it's not clear from the above when that was, or if it has always actually implemented the semantics we want and the code change was for some other detail of issue #233, or perhaps just changes to comments).

Changed 9 months ago by richard

  • description modified (diff)
  • milestone set to 1.1.0

Marking this for 1.1.0, since it may involve API changes (though probably could be done without).

Changed 3 months ago by olly

  • component changed from Backend-Flint to Backend-Chert

We'd implement this for chert rather than flint now.

Note: See TracTickets for help on using tickets.