#53 closed enhancement (fixed)
Xapian::Fields
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | lowest | Milestone: | 1.4.23 |
Component: | Library API | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Richard Boulton | Blocked By: | |
Blocking: | | Operating System: | All |
Description (last modified by )
Implement a Xapian::Fields class to serialise/unserialise name=value pairs to/from Document data field.
Change History (13)
comment:1 by , 20 years ago
Severity: | blocker → normal |
---|---|
Status: | new → assigned |
comment:2 by , 18 years ago
Component: | other → Library API |
---|---|
op_sys: | other → All |
rep_platform: | Other → All |
Severity: | normal → enhancement |
Version: | other → SVN HEAD |
comment:3 by , 18 years ago
Priority: | high → lowest |
---|
comment:4 by , 18 years ago
Cc: | added |
---|---|
Operating System: | → All |
comment:5 by , 18 years ago
comment:6 by , 18 years ago
There would need to be some way to iterate through the set of fields in a document, as well as the set_field, get_field and clear_field methods.
Also, if we're going to say that non-existent fields read as empty, wouldn't set_field("foo", "") be the same as clear_field("foo")? In which case, we shouldn't have a clear_field() method.
It will need careful documentation to ensure that it is clear that access to the document data is a "lower level" interface than the set_field and get_field methods, and that the two interfaces aren't intended to be used on the same document; users should consider the document data held in a document which has been created with set_field() methods to be an opaque blob.
Using the meta-information to store the field names makes a lot of sense to me.
comment:7 by , 18 years ago
doc.clear_field("foo") would indeed just be the same as doc.set_field("foo", "") which is why I didn't include it - I actually said "clear_fields()" (plural and taking no arguments). That would probably actually be the same as set_data(""), but I think it's helpful to provide it explicitly with the obvious name rather than forcing the user to mix "field" and "data" calls.
I'm not sure a way to iterate through fields is vital. Handy for tools like delve admittedly, but I don't think it's useful for user applications (unless they're abusing fields somehow).
One downside of not explicitly storing fieldnames in the document data is that xapian-compact can't easily merge databases which use fields - it potentially has to rewrite all the document data for every entry from all but one of the databases. Perhaps that kills that idea...
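To make the merge problem concrete, here is a minimal sketch. It assumes each database keeps its own name-to-number table and that document data holds only the numbers. The helper names (`merge_field_tables`, `rewrite_document`) and the dict-based document representation are hypothetical, purely for illustration, and not anything xapian-compact provides.

```python
def merge_field_tables(tables):
    """Build one merged name->id table, plus an old-id->new-id map per source."""
    merged = {}
    remaps = []
    for table in tables:
        remap = {}
        for name, old_id in table.items():
            # Reuse the existing number if the name is already known,
            # otherwise allocate the next free one.
            new_id = merged.setdefault(name, len(merged))
            remap[old_id] = new_id
        remaps.append(remap)
    return merged, remaps


def needs_rewrite(remap):
    """True if documents from this source must have their data rewritten."""
    return any(old_id != new_id for old_id, new_id in remap.items())


def rewrite_document(fields_by_id, remap):
    """Translate one document's field numbers into the merged numbering."""
    return {remap[old_id]: value for old_id, value in fields_by_id.items()}


# Two databases which numbered their fields independently.
db_a = {"title": 0, "author": 1}
db_b = {"author": 0, "sample": 1}

merged, (remap_a, remap_b) = merge_field_tables([db_a, db_b])

# A document from the second database: author=Olly, sample=some text.
doc_from_b = {0: "Olly", 1: "some text"}
rewritten = rewrite_document(doc_from_b, remap_b)
```

Here the first database keeps its numbering, but every document from the second one has to be rewritten, which is the cost described above.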
comment:8 by , 18 years ago
Ah - "clear_fields" makes a lot more sense, sorry.
I was thinking of tools like "copy_database()" when suggesting that it should be possible to iterate through the fields (though tools like delve would certainly want to be able to do so, too). The iteration wouldn't necessarily need to be a full-blown class like the other Xapian iterators - indeed, simply having a method which returned a list of field names might be perfectly appropriate.
It would be a shame if we had to store the field names in every entry. Perhaps there's some way we can seed zlib with the field names so that they get compressed well... (I know that we can seed zlib - I'm just not quite sure how we could seed it since we need to know the seed before we add the first document, at present).
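A rough sketch of what seeding could look like, using the preset dictionary support which Python's standard `zlib` module exposes via the `zdict` argument. The field names and the `name=value` layout of the document data are made up for the example, and this doesn't address the open question above of how the seed would be known before the first document is added.

```python
import zlib

# Hypothetical field names; in practice these would have to be fixed up front.
FIELD_NAMES = [b"content_type", b"modification_time", b"author_name", b"description"]

# The seed (preset dictionary) is just the text we expect to see repeated.
SEED = b"".join(name + b"=" for name in FIELD_NAMES)


def compress(data, zdict=None):
    """Deflate data, optionally seeding the compressor with a dictionary."""
    comp = zlib.compressobj() if zdict is None else zlib.compressobj(zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob, zdict=None):
    """Inflate data; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


data = (b"content_type=text/html\n"
        b"modification_time=1700000000\n"
        b"author_name=Olly\n"
        b"description=Example page")

plain = compress(data)
seeded = compress(data, SEED)

print(len(data), len(plain), len(seeded))
```

The seeded output should come out smaller because each field name becomes a back-reference into the dictionary rather than a run of literals, but the dictionary then has to be available, unchanged, to every reader.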
comment:10 by , 15 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.2.x |
Marking for consideration in 1.2.x, since this would be an API addition.
comment:12 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|
We've too much on the 1.3.x milestone, so I think this needs repegging.
comment:13 by , 20 months ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
I think we should choose a very compact storage scheme for the built-in field support (rather than XML or JSON or ...) because if people want a particular standard format, they can use an external library to serialise/unserialise it, whereas if they want a very compact format, there aren't many options.
Since then protocol buffers has emerged as a good option for this. This uses a schema so you don't end up storing the field names used explicitly in every document, and it's also extensible - you can add new fields without invalidating existing data.
I think we should just suggest protocol buffers as a good option in the documentation, which I've done in [b02984d631043ad04b3720a7869def0b813b5047]. It doesn't seem feasible to fully integrate support: from what I've seen, code is typically generated from the schema, so the encoding and decoding really need to be handled in the application code. I'll also adjust the "Getting Started" guide to recommend it.
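To illustrate why a schema avoids storing field names, here is a hand-rolled sketch of the protocol buffers wire encoding for length-delimited (string/bytes) fields only. This is for illustration; a real application would use the code generated from its `.proto` schema rather than anything like this. The `SCHEMA` mapping and the function names are hypothetical.

```python
# Hypothetical schema: the names live here (in the application), not in the data.
SCHEMA = {1: "title", 2: "author"}


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, new position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string_fields(fields):
    """Encode {field_number: bytes} using wire type 2 (length-delimited)."""
    out = bytearray()
    for number in sorted(fields):
        value = fields[number]
        out += encode_varint((number << 3) | 2)  # key = field number + wire type
        out += encode_varint(len(value))
        out += value
    return bytes(out)


def decode_string_fields(data):
    """Decode data produced by encode_string_fields back to a dict."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(data, pos)
        fields[number] = data[pos:pos + length]
        pos += length
    return fields


blob = encode_string_fields({1: b"Hello", 2: b"testing"})
decoded = decode_string_fields(blob)
```

Each field costs its value plus a couple of bytes of overhead, and a reader which doesn't know a given field number can still skip over it using the length, which is what makes adding new fields safe.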
comment:14 by , 20 months ago
Milestone: | 1.5.0 → 1.4.23 |
---|
Some further thoughts, before I forget them:
Rather than a separate class, I think adding Document methods such as: set_field(tag, value), get_field(tag), clear_fields() makes most sense. Non-existent fields should probably read as empty rather than throwing an exception, which is consistent with values, and avoids throwing exceptions in the common case of having optional fields (there's overhead in exception handling everywhere, but exceptions are particularly hard to handle in some of the bindings - PHP4 for example).
I think we should choose a very compact storage scheme for the built-in field support (rather than XML or JSON or ...) because if people want a particular standard format, they can use an external library to serialise/unserialise it, whereas if they want a very compact format, there aren't many options.
We can probably make use of the meta-information stuff to store the field names once and just use numbers for them in the document data.
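A toy model of the above, to pin down the semantics rather than to propose an implementation. `FieldNames` stands in for a table kept once in the database metadata, and `FieldDocument` stands in for a Document; both classes, the `field_names()` helper and the byte layout (1-byte field number, 2-byte length, then the value) are assumptions made up for this sketch, not Xapian API.

```python
import struct


class FieldNames:
    """Name <-> number table, stored once rather than in every document."""

    def __init__(self):
        self.ids = {}
        self.names = []

    def id_for(self, name):
        """Return the number for name, allocating one on first use."""
        if name not in self.ids:
            if len(self.names) >= 256:
                raise ValueError("toy format only allows 256 field names")
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]


class FieldDocument:
    """Document whose data is an opaque blob of (number, value) entries."""

    def __init__(self, names, data=b""):
        self.names = names
        self.data = data

    def _decode(self):
        fields = {}
        pos = 0
        while pos < len(self.data):
            field_id, length = struct.unpack_from(">BH", self.data, pos)
            pos += 3
            fields[field_id] = self.data[pos:pos + length]
            pos += length
        return fields

    def _encode(self, fields):
        self.data = b"".join(
            struct.pack(">BH", field_id, len(value)) + value
            for field_id, value in sorted(fields.items())
        )

    def set_field(self, name, value):
        """Set a field; setting it to empty removes it."""
        fields = self._decode()
        if value:
            fields[self.names.id_for(name)] = value
        else:
            fields.pop(self.names.ids.get(name), None)
        self._encode(fields)

    def get_field(self, name):
        """Non-existent fields read as empty rather than raising."""
        field_id = self.names.ids.get(name)
        if field_id is None:
            return b""
        return self._decode().get(field_id, b"")

    def clear_fields(self):
        """Equivalent to setting the document data to empty."""
        self.data = b""

    def field_names(self):
        """List the names of the fields present in this document."""
        return [self.names.names[i] for i in sorted(self._decode())]


names = FieldNames()
doc = FieldDocument(names)
doc.set_field("title", b"Hello")
doc.set_field("author", b"Olly")
```

With this layout the document data for the example is 15 bytes and contains neither "title" nor "author", and `doc.get_field("missing")` just returns an empty value.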