Opened 20 years ago

Closed 21 months ago

Last modified 21 months ago

#53 closed enhancement (fixed)

Xapian::Fields

Reported by: Olly Betts
Owned by: Olly Betts
Priority: lowest
Milestone: 1.4.23
Component: Library API
Version: SVN trunk
Severity: minor
Keywords:
Cc: Richard Boulton
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Implement a Xapian::Fields class to serialise/unserialise name=value pairs to/from Document data field.

Change History (13)

comment:1 by Olly Betts, 20 years ago

Severity: blocker → normal
Status: new → assigned

comment:2 by Olly Betts, 18 years ago

Component: other → Library API
op_sys: other → All
rep_platform: Other → All
Severity: normal → enhancement
Version: other → SVN HEAD

comment:3 by Olly Betts, 18 years ago

Priority: high → lowest

comment:4 by Richard Boulton, 18 years ago

Cc: richard@… added
Operating System: All

comment:5 by Olly Betts, 18 years ago

Some further thoughts, before I forget them:

Rather than a separate class, I think adding Document methods such as set_field(tag, value), get_field(tag) and clear_fields() makes most sense. Non-existent fields should probably read as empty rather than throwing an exception. That is consistent with values, and avoids throwing exceptions in the common case of optional fields (exception handling has overhead everywhere, but exceptions are particularly hard to handle in some of the bindings, PHP4 for example).
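To make the proposed semantics concrete, here is a minimal sketch using a stand-in class. The class name FieldDoc, the std::map storage and the field_count() helper are assumptions for illustration only; this is not Xapian::Document, and nothing like it exists in the library. It treats an empty value as "not set", which follows from missing fields reading as empty.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a document with field support. This is NOT the
// Xapian::Document API, just an illustration of the proposed semantics.
class FieldDoc {
    std::map<std::string, std::string> fields;

  public:
    // Setting a field to the empty string removes it, so an empty field
    // and a missing field are indistinguishable to the caller.
    void set_field(const std::string& name, const std::string& value) {
        if (value.empty()) {
            fields.erase(name);
        } else {
            fields[name] = value;
        }
    }

    // A field which was never set reads as empty instead of throwing.
    std::string get_field(const std::string& name) const {
        auto it = fields.find(name);
        return it == fields.end() ? std::string() : it->second;
    }

    // Remove every field from the document.
    void clear_fields() { fields.clear(); }

    // Number of non-empty fields currently stored (illustrative helper).
    std::size_t field_count() const { return fields.size(); }
};
```

With this shape, code reading an optional field needs no try/catch: it just checks whether get_field() returned an empty string.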

I think we should choose a very compact storage scheme for the built-in field support (rather than XML or JSON or ...) because if people want a particular standard format, they can use an external library to serialise/unserialise it, whereas if they want a very compact format, there aren't many options.

We can probably make use of the meta-information stuff to store the field names once and just use numbers for them in the document data.
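As a rough sketch of that idea: field names live once in a per-database table, and each document's data holds only (field number, length, value) triples. Everything here is an assumption for illustration, including the FieldNames struct, the function names and the base-128 variable-length integers; it is not an existing Xapian format.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical per-database table mapping field names to small numbers.
// In the idea above this would be kept in the database metadata.
struct FieldNames {
    std::vector<std::string> names;

    // Return the number for `name`, allocating a new one if needed.
    unsigned id_for(const std::string& name) {
        for (unsigned i = 0; i < names.size(); ++i) {
            if (names[i] == name) return i;
        }
        names.push_back(name);
        return static_cast<unsigned>(names.size() - 1);
    }
};

// Append `n` as a base-128 integer: 7 bits per byte, low bits first,
// with the high bit set on every byte except the last.
void append_varint(std::string& out, std::size_t n) {
    while (n >= 0x80) {
        out += static_cast<char>((n & 0x7f) | 0x80);
        n >>= 7;
    }
    out += static_cast<char>(n);
}

// Read a base-128 integer from `in` starting at `pos`, advancing `pos`.
std::size_t read_varint(const std::string& in, std::size_t& pos) {
    std::size_t n = 0;
    unsigned shift = 0;
    while (true) {
        if (pos >= in.size()) throw std::runtime_error("truncated field data");
        if (shift >= sizeof(std::size_t) * 8) throw std::runtime_error("number too large");
        unsigned char ch = static_cast<unsigned char>(in[pos++]);
        n |= static_cast<std::size_t>(ch & 0x7f) << shift;
        if (!(ch & 0x80)) return n;
        shift += 7;
    }
}

// Encode fields as repeated (field number, value length, value bytes).
std::string serialise_fields(const std::map<std::string, std::string>& fields,
                             FieldNames& table) {
    std::string out;
    for (const auto& f : fields) {
        append_varint(out, table.id_for(f.first));
        append_varint(out, f.second.size());
        out += f.second;
    }
    return out;
}

// Decode data produced by serialise_fields() using the same name table.
std::map<std::string, std::string> unserialise_fields(const std::string& data,
                                                      const FieldNames& table) {
    std::map<std::string, std::string> fields;
    std::size_t pos = 0;
    while (pos < data.size()) {
        std::size_t id = read_varint(data, pos);
        std::size_t len = read_varint(data, pos);
        if (id >= table.names.size() || len > data.size() - pos)
            throw std::runtime_error("corrupt field data");
        fields[table.names[id]] = data.substr(pos, len);
        pos += len;
    }
    return fields;
}
```

For short values the overhead is two bytes per field, however long the field name is. The cost is that the document data cannot be decoded without the name table.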

comment:6 by Richard Boulton, 18 years ago

There would need to be some way to iterate through the set of fields in a document, as well as the set_field, get_field and clear_field methods.

Also, if we're going to say that non-existent fields read as empty, wouldn't set_field("foo", "") be the same as clear_field("foo")? In which case, we shouldn't have a clear_field() method.

It will need careful documentation to ensure that it is clear that access to the document data is a "lower level" interface than the set_field and get_field methods, and that the two interfaces aren't intended to be used on the same document; users should consider the document data held in a document which has been created with set_field() methods to be an opaque blob.

Using the meta-information to store the field names makes a lot of sense to me.

comment:7 by Olly Betts, 18 years ago

doc.clear_field("foo") would indeed just be the same as doc.set_field("foo", "") which is why I didn't include it - I actually said "clear_fields()" (plural and taking no arguments). That would probably actually be the same as set_data(""), but I think it's helpful to provide it explicitly with the obvious name rather than forcing the user to mix "field" and "data" calls.

I'm not sure a way to iterate through fields is vital. Handy for tools like delve admittedly, but I don't think it's useful for user applications (unless they're abusing fields somehow).

One downside of not explicitly storing fieldnames in the document data is that xapian-compact can't easily merge databases which use fields - it potentially has to rewrite all the document data for every entry from all but one of the databases. Perhaps that kills that idea...
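To illustrate the merge problem, suppose each database keeps its own table of field names, with a field's number being its index in that table. The sketch below merges a second table into the first and reports whether any number changes. The function names and the vector-based table are hypothetical, not anything xapian-compact does today.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Merge the field names in `other` into `merged`. The result maps each
// field number used by `other` to the number that name has in `merged`.
std::vector<std::size_t> merge_field_names(std::vector<std::string>& merged,
                                           const std::vector<std::string>& other) {
    std::vector<std::size_t> remap;
    remap.reserve(other.size());
    for (const auto& name : other) {
        auto it = std::find(merged.begin(), merged.end(), name);
        if (it == merged.end()) {
            // Name not seen before: it gets the next free number.
            merged.push_back(name);
            remap.push_back(merged.size() - 1);
        } else {
            remap.push_back(static_cast<std::size_t>(it - merged.begin()));
        }
    }
    return remap;
}

// True if any field number changes, in which case the document data of
// every entry from the `other` database would need rewriting.
bool needs_rewrite(const std::vector<std::size_t>& remap) {
    for (std::size_t i = 0; i < remap.size(); ++i) {
        if (remap[i] != i) return true;
    }
    return false;
}
```

Only when both databases happened to allocate numbers in the same order can the document data be copied through unchanged.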

comment:8 by Richard Boulton, 18 years ago

Ah - "clear_fields" makes a lot more sense, sorry.

I was thinking of tools like "copy_database()" when suggesting that it should be possible to iterate through the fields (though tools like delve would certainly want to be able to do so, too). The iteration wouldn't necessarily need to be a full-blown class like the other Xapian iterators - indeed, simply having a method which returned a list of field names might be perfectly appropriate.
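A sketch of that simpler option, with a method returning the field names and a generic copy routine built on it. SketchDoc, field_names() and copy_fields() are hypothetical names used for illustration; none of them are part of Xapian.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical document stand-in; NOT the Xapian::Document API.
class SketchDoc {
    std::map<std::string, std::string> fields;

  public:
    void set_field(const std::string& name, const std::string& value) {
        if (value.empty()) {
            fields.erase(name);
        } else {
            fields[name] = value;
        }
    }

    std::string get_field(const std::string& name) const {
        auto it = fields.find(name);
        return it == fields.end() ? std::string() : it->second;
    }

    // Instead of an iterator class, return the names of all fields which
    // are set (in sorted order, because std::map keeps keys sorted).
    std::vector<std::string> field_names() const {
        std::vector<std::string> names;
        names.reserve(fields.size());
        for (const auto& f : fields) names.push_back(f.first);
        return names;
    }
};

// Copy every field between documents using only the public interface,
// which is what a generic copying tool would be limited to.
void copy_fields(const SketchDoc& from, SketchDoc& to) {
    for (const auto& name : from.field_names()) {
        to.set_field(name, from.get_field(name));
    }
}
```

Without something like field_names(), a copying tool would have to know every field name in advance or fall back to treating the document data as an opaque blob.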

It would be a shame if we had to store the field names in every entry. Perhaps there's some way we can seed zlib with the field names so that they get compressed well... (I know that we can seed zlib - I'm just not quite sure how we could seed it since we need to know the seed before we add the first document, at present).

comment:10 by Olly Betts, 15 years ago

Description: modified (diff)
Milestone: 1.2.x

Marking for consideration in 1.2.x, since this would be an API addition.

comment:11 by Olly Betts, 12 years ago

Milestone: 1.2.x → 1.3.x

This isn't 1.2.x material now.

comment:12 by Olly Betts, 10 years ago

Milestone: 1.3.x → 1.4.x

We've too much on the 1.3.x milestone, so I think this needs repegging.

comment:13 by Olly Betts, 21 months ago

Milestone: 1.4.x → 1.5.0
Resolution: fixed
Status: assigned → closed

> I think we should choose a very compact storage scheme for the built-in field support (rather than XML or JSON or ...) because if people want a particular standard format, they can use an external library to serialise/unserialise it, whereas if they want a very compact format, there aren't many options.

Since then protocol buffers has emerged as a good option for this. It uses a schema, so you don't end up storing the field names explicitly in every document, and it's also extensible - you can add new fields without invalidating existing data.

I think we should just suggest protocol buffers as a good option in the documentation. It doesn't seem feasible to fully integrate support: from what I've seen, code is typically generated from the schema, so the encoding and decoding really need to be handled in the application code, which I've done in [b02984d631043ad04b3720a7869def0b813b5047]. I'll also adjust the "Getting Started" guide to recommend it.
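As a rough illustration of handling the encoding in application code, the helpers below store a serialised message as the document data and read it back. They are hypothetical and are written as templates so the sketch does not depend on any particular generated class. Doc is assumed to offer set_data() and get_data() as Xapian::Document does, and Msg is assumed to offer SerializeAsString() and ParseFromString() as classes generated by protoc do.

```cpp
#include <string>

// Serialise `msg` and store the bytes as the document data.
// Msg is assumed to be a protoc-generated message class (or anything
// else providing SerializeAsString()).
template <class Doc, class Msg>
void store_message(Doc& doc, const Msg& msg) {
    doc.set_data(msg.SerializeAsString());
}

// Parse the document data back into `msg`. Returns false if the data
// could not be parsed as this message type.
template <class Doc, class Msg>
bool load_message(const Doc& doc, Msg& msg) {
    return msg.ParseFromString(doc.get_data());
}
```

The library only ever sees an opaque string, so the schema can gain new fields without any change on the Xapian side.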

comment:14 by Olly Betts, 21 months ago

Milestone: 1.5.0 → 1.4.23