Ticket #250 (new enhancement)

Opened 9 months ago

Last modified 3 weeks ago

replace_document should make minimal changes to database file

Reported by: richard
Owned by: olly
Priority: normal
Milestone: 1.1.0
Component: Backend-Chert
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Operating System: All
Blocking:

Description

Currently, replace_document simply removes any document with the same document ID and then inserts the new document.

If replace_document is being used to update an existing document slightly, this does far more work than necessary, and it also makes changesets for replication a lot larger than they need to be. Ideally, in this case, replace_document would make the minimal set of modifications to the database to bring the document into the new state.
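As a rough illustration of what "minimal set of modifications" could mean for the term list alone, the sketch below diffs two term-to-wdf mappings. It is a toy in plain Python, not Xapian code: the function name diff_terms and the dict representation are assumptions made for the example, and positions, values and document data are ignored.

```python
def diff_terms(old_terms, new_terms):
    """Compare two {term: wdf} mappings (hypothetical representation).

    Returns (to_remove, to_add, to_update):
      to_remove - terms present only in the old document
      to_add    - terms present only in the new document, with their wdf
      to_update - terms present in both whose wdf differs, with the new wdf
    Terms that are identical in both documents appear in none of the three.
    """
    to_remove = sorted(t for t in old_terms if t not in new_terms)
    to_add = {t: w for t, w in new_terms.items() if t not in old_terms}
    to_update = {t: w for t, w in new_terms.items()
                 if t in old_terms and old_terms[t] != w}
    return to_remove, to_add, to_update


old = {"apple": 2, "banana": 1, "cherry": 3}
new = {"apple": 2, "banana": 4, "date": 1}
removed, added, updated = diff_terms(old, new)
print(removed, added, updated)
```

In this example only three posting list entries would need touching; the unchanged "apple" entry is left alone, whereas delete-then-insert would rewrite it too.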

There are two cases: (1) the document supplied to replace_document() is a new document, and (2) the document supplied to replace_document() is a modified version of the document being replaced, obtained from the database by get_document(). The second case offers some scope for using the lazy update aspects of Document to avoid having to check parts of the document for changes. This should be used if possible, but in general this information won't be available, so I think it's more important to fix the case where the new Document has been created from scratch (perhaps from a database export).
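To make the difference between the two cases concrete, here is a toy sketch of a document that records which of its parts were modified after loading. The class TrackedDocument, its from_database flag and parts_to_check() are all hypothetical and do not reflect how Xapian's Document is implemented internally; the method names merely echo the public API.

```python
class TrackedDocument:
    """Toy document that remembers which parts changed since it was loaded."""

    PARTS = ("data", "terms", "values")

    def __init__(self, data="", terms=None, values=None, from_database=False):
        self._data = data
        self._terms = dict(terms or {})
        self._values = dict(values or {})
        # A document built from scratch has no baseline to compare against,
        # so every part has to be treated as potentially changed.
        self._dirty = set() if from_database else set(self.PARTS)

    def set_data(self, data):
        self._data = data
        self._dirty.add("data")

    def add_term(self, term, wdf_inc=1):
        self._terms[term] = self._terms.get(term, 0) + wdf_inc
        self._dirty.add("terms")

    def add_value(self, slot, value):
        self._values[slot] = value
        self._dirty.add("values")

    def parts_to_check(self):
        """Parts a replace operation would still need to compare or rewrite."""
        return sorted(self._dirty)


# Case 2: fetched from the database, then modified slightly.
loaded = TrackedDocument("body", {"apple": 1}, {0: "x"}, from_database=True)
loaded.add_value(0, "y")

# Case 1: built from scratch, e.g. from a database export.
fresh = TrackedDocument("body", {"apple": 1}, {0: "y"})

print(loaded.parts_to_check(), fresh.parts_to_check())
```

For the loaded document only the values need examining, while the freshly built one gives no such hint and every part would have to be compared against what is stored.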

It may be necessary to have an additional flag somewhere to enable or disable this behaviour - if the application knows that replace_document() is not being used for incremental updates, but rather for complete changes of documents (or for indexing with user-specified IDs), we don't want to cause too much of a performance hit. On the other hand, I think that replace_document() already has to read the termlist in order to remove the current entries from the posting list, so it might be possible to improve at least that part without causing any extra database reads. Comparing the existing document data is liable to be expensive if the data is large - though all the blocks of data need to be accessed anyway, there's a significant cost in building the blocks up. Maybe this could be avoided by adding a method to the btree layer to do the replace, which could work at the individual chunk level - though this code would be fiddly.

One complication is that the replace_document(term) form may remove multiple old documents, rather than a single one. It would be reasonable to skip checking for the minimal set of changes to the documents in this case (or to just delete all but one of the old documents (the first, or last?), and then check for the modifications).
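The "delete all but one, then check for modifications" option could look roughly like the sketch below. It works on a toy in-memory index (a dict mapping docid to a term-to-wdf dict), not on a real database; replace_by_term is a hypothetical helper, and keeping the lowest docid is just one of the two choices raised above, picked here as an assumption.

```python
def replace_by_term(index, unique_term, new_terms):
    """Replace every document indexed by unique_term in a toy index.

    index maps docid -> {term: wdf}. Returns (docid, writes), where writes
    counts the term entries actually touched on the surviving document.
    """
    matches = sorted(d for d, terms in index.items() if unique_term in terms)
    if not matches:
        # Nothing to replace: behave like a plain add with a fresh docid.
        docid = max(index, default=0) + 1
        index[docid] = dict(new_terms)
        return docid, len(new_terms)

    # Assumption: keep the first (lowest) docid and drop the other matches.
    keep, extras = matches[0], matches[1:]
    for docid in extras:
        del index[docid]

    # Apply only the differences to the surviving document.
    old_terms = index[keep]
    writes = 0
    for term in list(old_terms):
        if term not in new_terms:
            del old_terms[term]
            writes += 1
    for term, wdf in new_terms.items():
        if old_terms.get(term) != wdf:
            old_terms[term] = wdf
            writes += 1
    return keep, writes


index = {
    1: {"Qdoc7": 1, "apple": 2},
    2: {"other": 1},
    3: {"Qdoc7": 1, "pear": 1},
}
docid, writes = replace_by_term(index, "Qdoc7", {"Qdoc7": 1, "apple": 2, "fig": 1})
print(docid, writes, index)
```

Here document 3 is removed outright, while document 1 only gains the one new term instead of having all of its entries rewritten.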

Marking this as milestone 1.1.0, since it may involve an API modification - we should aim to work out if it does by 1.1.0, though actual implementation may well be deferred to a later milestone.

Change History

Changed 9 months ago by olly

  • owner changed from newbugs to olly

Changed 3 months ago by olly

  • component changed from Backend-Flint to Backend-Chert

We'd implement this for chert (at least initially - it might be reasonable to backport the changes to flint once they're stable I guess...)

Changed 3 weeks ago by richard

It occurs to me that, if we found that the replace_document() method was significantly slower with these modifications in the case where the old document was entirely different to the new document, we could just advise users to do delete_document() and then add_document() in this situation. This is essentially what the current implementation does, so would have the same performance as the current implementation.

Therefore, it seems reasonable to just implement checks for modifications in replace_document(), then profile, and document as appropriate. There should therefore be no need to change the API of replace_document() to make it apply only the modifications.
