Opened 17 years ago
Closed 15 years ago
#250 closed enhancement (fixed)
replace_document should make minimal changes to database file
| Reported by: | Richard Boulton | Owned by: | Richard Boulton |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.0.18 |
| Component: | Backend-Chert | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | ingmar@…, daniel.menard@… | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
Currently, replace_document simply removes any document with the same document ID and then inserts the new document.
If replace_document is being used to update an existing document slightly, this is a lot of work (and, also, will make changesets for replication a lot larger than they need to be). Ideally, in this case, replace_document would make the minimal set of modifications to the database to bring the document into the new state.
There are two cases: (1) where the document supplied to replace_document() is a new document, and (2) where the document supplied to replace_document() is a modified version of the document being replaced, obtained from the database by get_document(). The second of these cases offers some scope for using the lazy update aspects of Document to avoid having to check parts of the document for changes. This should be used if possible, but in general this information won't be available - I think it's more important to fix the case where the new Document has been created from scratch (perhaps from a database export).
It may be necessary to have an additional flag somewhere to enable or disable this behaviour - if the application knows that replace_document() is not being used for incremental updates, but rather for complete replacement of documents (or for indexing with user-specified IDs), we don't want to cause too much of a performance hit. On the other hand, I think that replace_document() already has to read the termlist in order to remove the current entries from the posting lists, so it might be possible to improve at least that part without causing any extra database reads. Comparing the existing document data is liable to be expensive if the data is large - though all the blocks of data need to be accessed anyway, there's a significant cost in building them up. Maybe this could be avoided by adding a method to the btree layer to do the replace, which could work at the individual chunk level - though this code would be fiddly.
One complication is that the replace_document(term) form may remove multiple old documents, rather than a single one. It would be reasonable to skip checking for the minimal set of changes to the documents in this case (or just delete all but one (the first, or last?) of the old documents, and then check for the modifications).
Marking this as milestone 1.1.0, since it may involve an API modification - we should aim to work out if it does by 1.1.0, though actual implementation may well be deferred to a later milestone.
Attachments (4)
Change History (23)
comment:1 by , 17 years ago
Owner: changed
comment:2 by , 16 years ago
Component: Backend-Flint → Backend-Chert
comment:3 by , 16 years ago
It occurs to me that, if we found that the replace_document() method was significantly slower with these modifications in the case where the old document was entirely different to the new document, we could just advise users to do delete_document() and then add_document() in this situation. This is essentially what the current implementation does, so would have the same performance as the current implementation.
Therefore, it seems reasonable to just implement checks for modifications in replace_document(), then profile, and document as appropriate. There should therefore be no need to change the API of replace_document() to make it apply only the modifications.
comment:4 by , 16 years ago
Just to note that the merge of the lazyupdate branch in r11431 partly addresses case 2.
comment:5 by , 16 years ago
Milestone: 1.1.0 → 1.1.1
As comment:3 says, it's unlikely this will need an incompatible API change. Also this is pretty core API stuff, used by almost anyone writing an updating indexer, so I think an incompatible API change would need very strong justification. So bumping the milestone.
comment:7 by , 15 years ago
Milestone: 1.1.4 → 1.2.0
My feeling is we've picked the low-hanging fruit from this tree, and while there are probably still useful gains to be had, the work isn't a trivial tweak we can squeeze in before 1.2.0. And this should just be an internal change, so can be made in 1.2.x.
comment:8 by , 15 years ago
I was thinking about this - as Richard says, we need to read the old termlist anyway, so it would be easy to iterate through the old and new termlists at the same time, which would avoid the rather large overhead of updating posting lists needlessly. The additional cost when this isn't needed is unlikely to be an issue, since we need to do the same work anyway, apart from the comparison. There may also be cache issues from the increased working set size of doing both at once, but I'd be surprised if those matter.
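The simultaneous walk over old and new termlists can be sketched as a classic sorted-merge. This is an illustrative standalone example (the names `TermDiff` and `diff_termlists` are invented here, not Xapian API), assuming both termlists are sorted, as Xapian termlists are:

```cpp
#include <string>
#include <vector>

// Walk two sorted term lists in step, classifying each term as removed,
// added, or common.  Postlist updates then only need to be issued for
// the removed and added terms, not for every term in the document.
struct TermDiff {
    std::vector<std::string> removed, added, common;
};

TermDiff diff_termlists(const std::vector<std::string>& old_terms,
                        const std::vector<std::string>& new_terms) {
    TermDiff d;
    std::size_t i = 0, j = 0;
    while (i < old_terms.size() && j < new_terms.size()) {
        if (old_terms[i] < new_terms[j]) {
            d.removed.push_back(old_terms[i++]);   // only in old document
        } else if (new_terms[j] < old_terms[i]) {
            d.added.push_back(new_terms[j++]);     // only in new document
        } else {
            d.common.push_back(old_terms[i]);      // present in both
            ++i;
            ++j;
        }
    }
    while (i < old_terms.size()) d.removed.push_back(old_terms[i++]);
    while (j < new_terms.size()) d.added.push_back(new_terms[j++]);
    return d;
}
```

A real implementation would also compare wdf and position data for the common terms before deciding a posting is unchanged.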
The other cases could be handled at the Btree level - e.g. a "replace()" method which is like "add()" but if the key is already set, checks if the tag value is the same. The big advantage of doing it at this level is that we can probably spot many cases where the tag value must have changed without actually having to unpack the old tag - e.g. if the size is different (for non-compressed tags which aren't split at least).
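The proposed Btree-level "replace()" might behave roughly as below. This is only a sketch with invented names (`replace_tag` is not a Xapian method), using a `std::map` as a stand-in for the Btree table; the point is the cheap size comparison before any byte-wise check:

```cpp
#include <map>
#include <string>

// Like "add()", but if the key is already present, first check whether
// the stored tag is identical and skip the write if so.  Differing tag
// sizes imply differing tags, so that case needs no byte comparison at
// all (valid for uncompressed tags which aren't split, as noted above).
// Returns true if the stored value was actually modified.
bool replace_tag(std::map<std::string, std::string>& table,
                 const std::string& key, const std::string& tag) {
    auto it = table.find(key);
    if (it != table.end()) {
        // Cheap test first: only compare bytes when the sizes match.
        if (it->second.size() == tag.size() && it->second == tag)
            return false;  // unchanged, no write needed
        it->second = tag;
        return true;
    }
    table.emplace(key, tag);
    return true;
}
```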
comment:9 by , 15 years ago
This performance bug has been hitting us pretty hard in notmuch, since notmuch often adds or removes a single term (a tag) from a document with many terms (every term from the content of an email message).
A user just submitted a patch (against the flint backend) which we've measured as providing a 5-6x speedup to notmuch tagging operations:
http://notmuchmail.org/pipermail/notmuch/2009/000886.html
-Carl
comment:10 by , 15 years ago
I've attached a copy of Kan-Ru Chen's patch: whitespace and indentation fixes only, relative to the version on the mailing list.
by , 15 years ago
Attachment: patch_kan_ru_2.patch added
Slightly refactored version of Kan-Ru's patch.
comment:11 by , 15 years ago
Owner: changed
Status: new → assigned
Generally, this patch looks good, and passes the testsuite for me (I've not done any timings for it, though). I've refactored the patch a little, to make it a bit more readable:
- I've pulled the comparison of positionlists out into a separate function, and called that from the one place in the code where the value is wanted, rather than calculating it in advance.
- It's a bit more conventional to use -1, 0 and 1 as values for cmp, rather than -2, 0 and 2, so I've changed it to do that.
I'd also rename "termlist" to orig_termlist (or old_termlist), and "term" to "new_termiter" - makes the code a little more readable IMO.
There are some other potential improvements in this area in Xapian, and the code currently in Xapian could do with some serious refactoring to make it more readable and easier to maintain and modify.
In particular, one improvement we'd like to do is adding a flag to Xapian::Document to indicate if the positions have been modified at all, and using this flag to avoid needing to compare the old positionlist values with the new ones if they haven't been.
After discussion with Olly, trying to work out how to best get this functionality integrated with trunk, and able to be easily and safely backported to the 1.0.x series, I'm going to try and do the following:
- Add a flag to Xapian::Document to keep track of modifications to the positions. This should be a pretty non-invasive patch, so we'd like to apply it first for ease of backporting.
- Do minimal refactoring of replace_document() to make it more readable.
- Apply Kan-Ru Chen's patch, or something along those lines.
- Possibly do further refactoring of replace_document().
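The first step above, tracking position modifications with a flag on the document, might look roughly like this. `DocumentSketch` is an invented stand-in, not Xapian's actual Document class:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration of the proposed flag: any call which can
// change position data sets positions_modified_, so replace_document()
// can skip comparing or rewriting position lists while the flag is
// still false.
class DocumentSketch {
    std::map<std::string, std::vector<unsigned>> positions_;
    bool positions_modified_ = false;

public:
    void add_posting(const std::string& term, unsigned pos) {
        positions_[term].push_back(pos);
        positions_modified_ = true;
    }
    void remove_posting(const std::string& term, unsigned pos) {
        auto& v = positions_[term];
        v.erase(std::remove(v.begin(), v.end(), pos), v.end());
        positions_modified_ = true;
    }
    bool positions_modified() const { return positions_modified_; }
};
```

A document freshly loaded by get_document() and modified only in ways that don't touch positions (e.g. set_data()) would keep the flag false.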
by , 15 years ago
Attachment: lazy_positionmods.patch added
Patch to track whether positionlists might be different
comment:12 by , 15 years ago
Attached a patch which does the first of these: it tracks whether any of the positions stored in a document have been modified, and doesn't bother rewriting them if none have been. With Kan-Ru Chen's patch, this would become a check done before we bother comparing the position lists.
Seems to work (ie, passes the current testsuite), but the next thing we need to do is add more tests to check that it works. The testsuite is weak in coverage in this area. We need some tests which add documents to a database, then modify them in various patterns, and then check that the position information is all correct.
comment:13 by , 15 years ago
I haven't studied what any of these patches do in detail. And I only know a little about what position data will be in a notmuch database.
Most of the terms in the database are the content of the mail, and for these we use Xapian::TermGenerator::index_text. That generates positions, right? If it didn't, we couldn't do phrase searches, right? (And we definitely can.)
I carefully tested the master branch of Xapian (svn commit 13730), then with the lazy_positions.patch on top, and then with patch_kan_ru_2.patch instead. Here are the results I got for identical "notmuch tag" operations. Each operation here does a read/modify/write of 8933 documents, and the modification is the addition or removal of one or zero terms (most commonly one, not zero).
I flushed the Linux page cache (echo 3 > /proc/sys/vm/drop_caches) between the different Xapian versions being tested here, (but not between the +foo and -foo commands for each).
Xapian master

$ time ./notmuch tag +foo tag:sent
real 3m26.163s  user 2m48.031s  sys 0m5.136s

$ time ./notmuch tag -foo tag:sent
real 3m11.516s  user 2m42.254s  sys 0m4.304s

With lazy_positions.patch

$ time ./notmuch tag +foo tag:sent
real 3m22.187s  user 2m44.450s  sys 0m4.864s

$ time ./notmuch tag -foo tag:sent
real 2m56.191s  user 2m36.234s  sys 0m3.808s

With patch_kan_ru_2.patch

$ time ./notmuch tag +foo tag:sent
real 0m43.120s  user 0m34.206s  sys 0m1.068s

$ time ./notmuch tag -foo tag:sent
real 0m29.294s  user 0m27.790s  sys 0m0.392s
Perhaps the above is helpful.
-Carl
comment:14 by , 15 years ago
Cc: added
comment:15 by , 15 years ago
For those following this bug with interest - I'm going to spend some time over the next few days writing a load of tests of replace_document() to ensure we've got pretty full coverage of it. Then, we can see about getting these patches applied.
by , 15 years ago
Attachment: patch_kan_ru_4.patch added
Updated version of the patch, which applies to SVN head, and passes all the newly added tests.
comment:16 by , 15 years ago
Just to note - with the current testsuite, and patch_kan_ru_4.patch applied, we have nearly full code coverage for the FlintWritableDatabase::replace_document() method, with the only exceptions being a couple of error reporting lines, and the automatic flush for postlist changes.
comment:17 by , 15 years ago
I've applied a patch for this (based originally on Kan-Ru's patch, but largely rewritten) to trunk in r13808.
One point to note - for Flint, if the document length has changed, we need to update all the postings for that document, because the document length is stored in the posting entries. This isn't an issue with Chert or Brass, which don't store the document length in posting entries. For Flint, if the only changes are adding or removing of terms with 0 wdf, or more generally if the wdf change sums to 0, only those postings which have actually been changed will be applied, and the speedups should be visible.
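The wdf arithmetic behind this can be shown with a toy helper (invented here, not Xapian code): the Flint document length is the sum of the wdf of the document's terms, so a term with wdf 0 never changes it.

```cpp
#include <map>
#include <string>

// Toy illustration: in Flint the document length is the sum of wdf over
// the document's terms, and a copy of it is stored in every posting
// entry.  Adding or removing a term whose wdf is 0 (such as a boolean
// "tag" term) leaves this sum unchanged, so only the postings for the
// affected term itself need rewriting.
unsigned doc_length(const std::map<std::string, unsigned>& wdf_by_term) {
    unsigned len = 0;
    for (const auto& entry : wdf_by_term)
        len += entry.second;
    return len;
}
```

This is why notmuch-style tagging, which manipulates zero-wdf terms, benefits even on Flint.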
This largely fixes this ticket in trunk, though it would be good to make minimal changes to the value slots in Chert and Brass, in a similar manner to this.
comment:18 by , 15 years ago
Cc: added
comment:19 by , 15 years ago
Milestone: 1.2.x → 1.0.18
Resolution: → fixed
Status: assigned → closed
Backported all the relevant changesets to 1.0 in r13849. Testing in real world applications is most welcome.
I think we should look at whether it's better to encode the new positions and compare the encoded form with the old encoded form (rather than decode the old form, iterate it and the new form to compare, and if they differ encode the new form) as I suspect that's going to be faster for the "unchanged" case, and it clearly is going to be for the "changed" case. I've filed #428 for this.
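The encode-and-compare idea might be sketched like this (toy variable-byte delta encoding with invented names; Xapian's real serialisation differs), assuming position lists are stored in ascending order:

```cpp
#include <string>
#include <vector>

// Encode an ascending position list as variable-byte-coded deltas.
std::string encode_positions(const std::vector<unsigned>& positions) {
    std::string out;
    unsigned prev = 0;
    for (unsigned p : positions) {
        unsigned delta = p - prev;
        // Low 7 bits per byte, high bit set on continuation bytes.
        while (delta >= 128) {
            out.push_back(char((delta & 127) | 128));
            delta >>= 7;
        }
        out.push_back(char(delta));
        prev = p;
    }
    return out;
}

// Compare the stored encoded form with the freshly encoded new list,
// avoiding any decoding of the old form.  If they differ, the freshly
// encoded form is exactly what needs to be written back.
bool positions_unchanged(const std::string& stored,
                         const std::vector<unsigned>& new_positions) {
    return stored == encode_positions(new_positions);
}
```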
I also wonder if the Document object should track the changes made, rather than copy the old data and modify it. Currently adding a new term to a document is unnecessarily O(number of terms indexing that document). The current optimisation is still useful in the case of reindexing an unmodified (or slightly modified) document from the original source. I've filed #429 for this.
So closing this ticket now.
We'd implement this for chert (at least initially; it might be reasonable to backport the changes to flint once they're stable, I guess...)