Ticket #250 (closed enhancement: fixed)
replace_document should make minimal changes to database file
| Reported by: | richard | Owned by: | richard |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.0.18 |
| Component: | Backend-Chert | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | ingmar@…, daniel.menard@… | Blocked By: | |
| Operating System: | All | Blocking: | |
Description
Currently, replace_document simply removes any document with the same document ID and then inserts the new document.
If replace_document is being used to make a small update to an existing document, this is a lot of unnecessary work, and it also makes changesets for replication much larger than they need to be. Ideally, in this case, replace_document would make the minimal set of modifications to the database needed to bring the document into the new state.
There are two cases: (1) the document supplied to replace_document() is a new document, and (2) the document supplied to replace_document() is a modified version of the document being replaced, obtained from the database by get_document(). The second case offers some scope for using the lazy update aspects of Document to avoid having to check parts of the document for changes. This should be used if possible, but in general this information won't be available. I think it's more important to fix the case where the new Document has been created from scratch (perhaps from a database export).
It may be necessary to have an additional flag somewhere to enable or disable this behaviour: if the application knows that replace_document() is not being used for incremental updates, but rather for complete changes of documents (or for indexing with user-specified IDs), we don't want to cause too much of a performance hit. On the other hand, I think that replace_document() already has to read the termlist in order to remove the current entries from the posting lists, so it might be possible to improve at least that part without causing any extra database reads. Comparing the existing document data is liable to be expensive if the data is large: although all the blocks of data need to be accessed anyway, there's a significant cost in building the blocks up. Maybe this could be avoided by adding a method to the btree layer to do the replace, which could work at the individual chunk level, though this code would be fiddly.
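To illustrate the termlist part of this idea, here is a minimal sketch in Python rather than in the backend's own code. It assumes a simplified in-memory model where a termlist is just a mapping from term to wdf; the name `diff_termlists` and the data layout are hypothetical and are not part of the Xapian API. Since the old termlist is read anyway, comparing it with the new one gives the posting list entries that actually need touching.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a termlist maps each term to its wdf
# (within-document frequency). Positions are ignored in this sketch.
TermList = Dict[str, int]


def diff_termlists(
    old: TermList, new: TermList
) -> Tuple[List[str], Dict[str, int], Dict[str, int]]:
    """Return the posting list changes needed to go from old to new.

    removed: terms whose posting entry for this document must be deleted
    added:   terms needing a new posting entry, with their wdf
    changed: terms present in both whose wdf differs, with the new wdf
    Terms with an identical wdf in both termlists need no change at all.
    """
    removed = sorted(t for t in old if t not in new)
    added = {t: wdf for t, wdf in new.items() if t not in old}
    changed = {t: wdf for t, wdf in new.items() if t in old and old[t] != wdf}
    return removed, added, changed


# A slightly modified document: "apple" is untouched, so its posting
# list would not need to be modified.
old_terms = {"apple": 2, "banana": 1, "cherry": 3}
new_terms = {"apple": 2, "banana": 4, "damson": 1}
removed, added, changed = diff_termlists(old_terms, new_terms)
```

With a remove-then-insert approach, all of the old and new terms would have their posting lists modified; in this sketch only "banana", "cherry" and "damson" would.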
One complication is that the replace_document(term) form may remove multiple old documents, rather than a single one. It would be reasonable to skip checking for the minimal set of changes to the documents in this case, or to just delete all but one of the old documents (the first, or the last?) and then check that one for modifications.
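The second option could be sketched roughly as follows. This is a hypothetical Python helper, not Xapian code: `plan_replace_by_term` and its `keep_last` parameter are invented for illustration, and the list of matching document IDs is assumed to have been read from the term's posting list already.

```python
from typing import List, Optional, Tuple


def plan_replace_by_term(
    matching_docids: List[int], keep_last: bool = False
) -> Tuple[Optional[int], List[int]]:
    """Split the documents indexed by a term into one to update and the rest.

    Returns (kept, to_delete). The kept document would then be compared
    with the new document to find the minimal changes; the others are
    simply deleted. If nothing matches, kept is None and the new
    document would be added as a fresh document.
    """
    if not matching_docids:
        return None, []
    ordered = sorted(matching_docids)
    kept = ordered[-1] if keep_last else ordered[0]
    to_delete = [docid for docid in ordered if docid != kept]
    return kept, to_delete


# Three old documents share the unique-ID term; keep the first one.
kept, to_delete = plan_replace_by_term([12, 7, 30])
```

Whether to keep the first or the last match is left open here, as in the description above; the flag only makes the choice explicit.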
Marking this as milestone 1.1.0, since it may involve an API modification - we should aim to work out if it does by 1.1.0, though actual implementation may well be deferred to a later milestone.

