Ticket #53 (assigned enhancement)

Opened 4 years ago

Last modified 19 months ago

Xapian::Fields

Reported by: olly
Owned by: olly
Priority: lowest
Milestone:
Component: Library API
Version: SVN trunk
Severity: minor
Keywords:
Cc: richard
Blocked By:
Operating System: All
Blocking:

Description

Implement a Xapian::Fields class to serialise/unserialise name=value pairs to/from Document data field.

Change History

Changed 4 years ago by olly

  • status changed from new to assigned
  • severity changed from blocker to normal

Changed 23 months ago by olly

  • rep_platform changed from Other to All
  • version changed from other to SVN HEAD
  • component changed from other to Library API
  • op_sys changed from other to All
  • severity changed from normal to enhancement

Changed 20 months ago by olly

  • priority changed from high to lowest

Changed 20 months ago by richard

  • cc richard@… added

Changed 20 months ago by trac

  • platform set to All

Changed 19 months ago by olly

Some further thoughts, before I forget them:

Rather than a separate class, I think adding Document methods such as set_field(tag, value), get_field(tag) and clear_fields() makes the most sense. Non-existent fields should probably read as empty rather than throwing an exception, which is consistent with values, and avoids throwing exceptions in the common case of having optional fields (there's overhead in exception handling everywhere, but exceptions are particularly hard to handle in some of the bindings - PHP4 for example).
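
As a rough illustration of the methods proposed above, here is a minimal sketch. `FieldDoc` is a hypothetical stand-in class, not part of the real Xapian::Document API, and the in-memory map is an assumption made only to keep the example short.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the proposed Document field methods.
// Nothing here is existing Xapian API.
class FieldDoc {
    std::map<std::string, std::string> fields_;

  public:
    // Store a value under a field name, replacing any previous value.
    void set_field(const std::string& name, const std::string& value) {
        fields_[name] = value;
    }

    // A field which was never set reads as the empty string instead of
    // throwing, so optional fields need no exception handling.
    std::string get_field(const std::string& name) const {
        auto it = fields_.find(name);
        return it == fields_.end() ? std::string() : it->second;
    }

    // Remove every field in one call.
    void clear_fields() { fields_.clear(); }
};
```

With this shape, a caller can read an optional field unconditionally and simply test the result for emptiness.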

I think we should choose a very compact storage scheme for the built-in field support (rather than XML or JSON or ...) because if people want a particular standard format, they can use an external library to serialise/unserialise it, whereas if they want a very compact format, there aren't many options.
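
One possible compact scheme, sketched here purely as an assumption since the ticket does not specify a format, is to store each name and value as a length-prefixed string with the length written as a base-128 varint. The function names `encode_fields` and `decode_fields` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

using Fields = std::map<std::string, std::string>;

// Append n as a base-128 varint: 7 bits per byte, low bits first, with the
// high bit set on every byte except the last.
void append_varint(std::string& out, std::size_t n) {
    while (n >= 0x80) {
        out += static_cast<char>((n & 0x7f) | 0x80);
        n >>= 7;
    }
    out += static_cast<char>(n);
}

// Read a varint starting at pos, advancing pos past it.
std::size_t read_varint(const std::string& in, std::size_t& pos) {
    std::size_t n = 0;
    for (unsigned shift = 0; shift < sizeof(std::size_t) * 8; shift += 7) {
        if (pos >= in.size()) throw std::runtime_error("truncated varint");
        unsigned char byte = static_cast<unsigned char>(in[pos++]);
        n |= static_cast<std::size_t>(byte & 0x7f) << shift;
        if (!(byte & 0x80)) return n;
    }
    throw std::runtime_error("varint too long");
}

// Read a length-prefixed string starting at pos, advancing pos past it.
std::string read_string(const std::string& in, std::size_t& pos) {
    std::size_t len = read_varint(in, pos);
    if (len > in.size() - pos) throw std::runtime_error("truncated string");
    std::string s = in.substr(pos, len);
    pos += len;
    return s;
}

// Serialise fields as repeated (name length, name, value length, value).
std::string encode_fields(const Fields& fields) {
    std::string out;
    for (const auto& entry : fields) {
        append_varint(out, entry.first.size());
        out += entry.first;
        append_varint(out, entry.second.size());
        out += entry.second;
    }
    return out;
}

// Inverse of encode_fields; throws std::runtime_error on malformed input.
Fields decode_fields(const std::string& data) {
    Fields fields;
    std::size_t pos = 0;
    while (pos < data.size()) {
        std::string name = read_string(data, pos);
        fields[name] = read_string(data, pos);
    }
    return fields;
}
```

In this sketch the overhead is one byte per string for lengths under 128, and a document with no fields encodes to an empty string.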

We can probably make use of the meta-information stuff to store the field names once and just use numbers for them in the document data.
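
A minimal sketch of that idea follows. `FieldNameTable` and `to_numbered` are hypothetical names, and how the table would actually be persisted in the database's meta-information is left out as an open question.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-database table of field names. The idea is that it is
// stored once per database, while documents refer to names by number.
class FieldNameTable {
    std::vector<std::string> names_;
    std::unordered_map<std::string, unsigned> ids_;

  public:
    // Return the number for a name, assigning the next free one if new.
    unsigned intern(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        unsigned id = static_cast<unsigned>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Look up the name for a number; throws std::out_of_range if unknown.
    const std::string& name_of(unsigned id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }
};

using NamedFields = std::vector<std::pair<std::string, std::string>>;
using NumberedFields = std::vector<std::pair<unsigned, std::string>>;

// Replace each field name with its number, so only numbers and values
// would need to be written into the document data.
NumberedFields to_numbered(FieldNameTable& table, const NamedFields& fields) {
    NumberedFields out;
    out.reserve(fields.size());
    for (const auto& field : fields) {
        out.emplace_back(table.intern(field.first), field.second);
    }
    return out;
}
```

Every document sharing a field name then pays only for a small number rather than for the name itself.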

Changed 19 months ago by richard

There would need to be some way to iterate through the set of fields in a document, as well as the set_field, get_field and clear_field methods.

Also, if we're going to say that non-existent fields read as empty, wouldn't set_field("foo", "") be the same as clear_field("foo")? In which case, we shouldn't have a clear_field() method.

It will need careful documentation to ensure that it is clear that access to the document data is a "lower level" interface than the set_field and get_field methods, and that the two interfaces aren't intended to be used on the same document; users should consider the document data held in a document which has been created with set_field() methods to be an opaque blob.

Using the meta-information to store the field names makes a lot of sense to me.

Changed 19 months ago by olly

doc.clear_field("foo") would indeed just be the same as doc.set_field("foo", "") which is why I didn't include it - I actually said "clear_fields()" (plural and taking no arguments). That would probably actually be the same as set_data(""), but I think it's helpful to provide it explicitly with the obvious name rather than forcing the user to mix "field" and "data" calls.

I'm not sure a way to iterate through fields is vital. Handy for tools like delve admittedly, but I don't think it's useful for user applications (unless they're abusing fields somehow).

One downside of not explicitly storing fieldnames in the document data is that xapian-compact can't easily merge databases which use fields - it potentially has to rewrite all the document data for every entry from all but one of the databases. Perhaps that kills that idea...
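
To make the merge cost concrete, here is a sketch assuming each database keeps its own name table with numbers assigned in order of first use. All names are hypothetical and this is not how xapian-compact works today.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using NameTable = std::vector<std::string>;  // index is the field number
using NumberedFields = std::vector<std::pair<unsigned, std::string>>;

// Merge the names from src into dst. Returns a remap table where
// remap[number in src] gives the corresponding number in dst.
std::vector<unsigned> merge_name_tables(NameTable& dst, const NameTable& src) {
    std::vector<unsigned> remap;
    remap.reserve(src.size());
    for (const std::string& name : src) {
        auto it = std::find(dst.begin(), dst.end(), name);
        if (it == dst.end()) {
            dst.push_back(name);
            it = dst.end() - 1;  // re-fetch: push_back may invalidate iterators
        }
        remap.push_back(static_cast<unsigned>(it - dst.begin()));
    }
    return remap;
}

// Rewrite one document's field numbers using the remap table. This has to
// run for every document from every database except the first.
void renumber(NumberedFields& doc, const std::vector<unsigned>& remap) {
    for (auto& field : doc) {
        field.first = remap.at(field.first);
    }
}
```

The table merge itself is cheap; the expense is that `renumber` touches the data of every document whose database did not supply the base table.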

Changed 19 months ago by richard

Ah - "clear_fields" makes a lot more sense, sorry.

I was thinking of tools like "copy_database()" when suggesting that it should be possible to iterate through the fields (though tools like delve would certainly want to be able to do so, too). The iteration wouldn't necessarily need to be a full-blown class like the other Xapian iterators - indeed, simply having a method which returned a list of field names might be perfectly appropriate.
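
A sketch of the simpler option, a function returning the list of field names, might look like this. The `Fields` map and both function names are assumptions for illustration, not proposed Xapian signatures.

```cpp
#include <map>
#include <string>
#include <vector>

using Fields = std::map<std::string, std::string>;

// Return the names of all fields that are set, with no iterator class.
std::vector<std::string> field_names(const Fields& fields) {
    std::vector<std::string> names;
    names.reserve(fields.size());
    for (const auto& entry : fields) {
        names.push_back(entry.first);
    }
    return names;
}

// Copy fields using only the name list plus per-name lookups, which is
// all a copying tool would need from such an interface.
Fields copy_fields(const Fields& src) {
    Fields dst;
    for (const std::string& name : field_names(src)) {
        dst[name] = src.at(name);
    }
    return dst;
}
```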

It would be a shame if we had to store the field names in every entry. Perhaps there's some way we can seed zlib with the field names so that they get compressed well... (I know that we can seed zlib - I'm just not quite sure how we could seed it since we need to know the seed before we add the first document, at present).
