Opened 16 years ago

Closed 14 years ago

#250 closed enhancement (fixed)

replace_document should make minimal changes to database file

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone: 1.0.18
Component: Backend-Chert Version: SVN trunk
Severity: normal Keywords:
Cc: ingmar@…, daniel.menard@… Blocked By:
Blocking: Operating System: All

Description

Currently, replace_document simply removes any document with the same document ID and then inserts the new document.

If replace_document is being used to update an existing document slightly, this is a lot of work (and, also, will make changesets for replication a lot larger than they need to be). Ideally, in this case, replace_document would make the minimal set of modifications to the database to bring the document into the new state.

There are two cases - 1, where the document supplied to replace_document() is a new document, and 2, where the document supplied to replace_document() is a modified version of the document being replaced, obtained from the database by get_document(). The second of these cases offer some scope for using the lazy update aspects of Document to avoid having to check parts of the document for changes. This should be used if possible, but in general, this information won't be available - I think it's more important to fix the case where the new Document has been created from scratch (perhaps from a database export).

It may be necessary to have an additional flag somewhere to enable or disable this behaviour - if the application knows that replace document is not being used for incremental updates, but rather for complete changes of documents (or for indexing with user-specified IDs), we don't want to cause too much of a performance hit. On the other hand, I think that replace_document already has to read the termlist in order to remove the current entries from the posting list, so it might be possible to improve at least that part without causing any extra database reads. Comparing the existing document data is liable to be expensive if the data is large - though all the blocks of data need to be accessed anyway, there's a significant cost in building the blocks up. Maybe this could be avoided by adding a method to the btree layer to do the replace, which could work at the individual chunk level - though this code would be fiddly.

One complication is that the replace_document(term) form may remove multipl old documents, rather than a single one. It would be reasonable to skip checking for the minimal set of changes to the documents in this case (or, just delete all but one (the first, or last?) of the old documents, and then check for the modifications).

Marking this as milestone 1.1.0, since it may involve an API modification - we should aim to work out if it does by 1.1.0, though actual implementation may well be deferred to a later milestone.

Attachments (4)

patch_kan_ru.patch (7.3 KB ) - added by Richard Boulton 14 years ago.
Cleaned up version of Kan-Ru Chen's patch
patch_kan_ru_2.patch (7.9 KB ) - added by Richard Boulton 14 years ago.
Slightly refactored version of the Kan-Ru's patch
lazy_positionmods.patch (3.0 KB ) - added by Richard Boulton 14 years ago.
Patch to track whether positionlists might be different
patch_kan_ru_4.patch (8.9 KB ) - added by Richard Boulton 14 years ago.
Updated version of the patch, which applies to SVN head, and passes all the newly added tests.

Download all attachments as: .zip

Change History (23)

comment:1 by Olly Betts, 16 years ago

Owner: changed from New Bugs to Olly Betts

comment:2 by Olly Betts, 16 years ago

Component: Backend-FlintBackend-Chert

We'd implement this for chert (at least initially - it might be reasonable to backport the changes to flint once they're stable I guess...)

comment:3 by Richard Boulton, 15 years ago

It occurs to me that, if we found that the replace_document() method was significantly slower with these modifications in the case where the old document was entirely different to the new document, we could just advise users to do delete_document() and then add_document() in this situation. This is essentially what the current implementation does, so would have the same performance as the current implementation.

Therefore, it seems reasonable to just implement checks for modifications in replace_document(), then profile, and document as appropriate. There should therefore be no need to change the API of replace_document() to make it apply only the modifications.

comment:4 by Olly Betts, 15 years ago

Just to note that the merge of the lazyupdate branch in r11431 partly addresses case 2.

comment:5 by Olly Betts, 15 years ago

Milestone: 1.1.01.1.1

As comment:3 says, it's unlikely this will need an incompatible API change. Also this is pretty core API stuff, used by almost anyone writing an updating indexer, so I think an incompatible API change would need very strong justification. So bumping the milestone.

comment:6 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.4

Triaging milestone:1.1.1 bugs.

comment:7 by Olly Betts, 15 years ago

Milestone: 1.1.41.2.0

My feeling is we've picked the low-hanging fruit from this tree, and while there are probably still useful gains to be had, the work isn't a trivial tweak we can squeeze in before 1.2.0. And this should just be an internal change, so can be made in 1.2.x.

comment:8 by Olly Betts, 14 years ago

I was thinking about this - as Richard says, we need to read the old termlist anyway, so it would be easy to iterate through the old and new at the same time which would avoid the rather large overhead of updating posting lists needlessly. The additional cost when not needed is unlikely to be an issue since we need to do the same work anyway, except for comparing. There may also be cache issues from increased working set size from doing both at once, but I'd be surprised if those matter.

The other cases could be handled at the Btree level - e.g. a "replace()" method which is like "add()" but if the key is already set, checks if the tag value is the same. The big advantage of doing it at this level is that we can probably spot many cases where the tag value must have changed without actually having to unpack the old tag - e.g. if the size is different (for non-compressed tags which aren't split at least).

comment:9 by Carl Worth, 14 years ago

This performance bug has been hitting us pretty hard in notmuch (since notmuch often adds or removes a single term (a tag) from a document with many terms (every term from the content of an email message).

A user just submitted a patch (against the flint backend) which we've measured as providing a 5-6x speedup to notmuch tagging operations:

http://notmuchmail.org/pipermail/notmuch/2009/000886.html

-Carl

by Richard Boulton, 14 years ago

Attachment: patch_kan_ru.patch added

Cleaned up version of Kan-Ru Chen's patch

comment:10 by Richard Boulton, 14 years ago

I've attached a copy of Kan-Ru Chen's patch: whitespace and indenting fixes only from the version on the mailing list.

by Richard Boulton, 14 years ago

Attachment: patch_kan_ru_2.patch added

Slightly refactored version of the Kan-Ru's patch

comment:11 by Richard Boulton, 14 years ago

Owner: changed from Olly Betts to Richard Boulton
Status: newassigned

Generally, this patch looks good, and passes the testsuite for me (I've not done any timings for it, though). I've refactored the patch a little, to make it a bit more readable:

  • I've pulled the comparison of positionlists out into a separate function, and called

that from the one place in the code where the value is wanted, rather than calculating it in advance.

  • It's a bit more conventional to use -1, 0 and 1 as values for cmp, rather than

-2, 0 and 2, so I've changed it to do that.

I'd also rename "termlist" to orig_termlist (or old_termlist), and "term" to "new_termiter" - makes the code a little more readable IMO.

There are some other potential improvements in this area in Xapian, and the code currently in Xapian could do with some serious refactoring to make it more readable and easier to maintain and modify.

In particular, one improvement we'd like to do is adding a flag to Xapian::Document to indicate if the positions have been modified at all, and using this flag to avoid needing to compare the old positionlist values with the new ones if they haven't been.

After discussion with Olly, trying to work out how to best get this functionality integrated with trunk, and able to be easily and safely backported to the 1.0.x series, I'm going to try and do the following:

  • Add a flag to Xapian::Document to keep track of modifications to the positions. This should be a pretty non-invasive patch, so we'd like to apply it first for ease of backporting.
  • Do minimal refactoring of replace_document() to make it more readable.
  • Apply Kan-Ru Chen's patch, or something along those lines.
  • Possibly do further refactoring of replace_document().

by Richard Boulton, 14 years ago

Attachment: lazy_positionmods.patch added

Patch to track whether positionlists might be different

comment:12 by Richard Boulton, 14 years ago

Attached a patch which does the first of these: it tracks whether any of the positions stored in a document have been modified, and doesn't bother rewriting them if none have been. With Kan-Ru Chen's patch, this would become a check done before we bother comparing the position lists.

Seems to work (ie, passes the current testsuite), but the next thing we need to do is add more tests to check that it works. The testsuite is weak in coverage in this area. We need some tests which add documents to a database, then modify them in various patterns, and then check that the position information is all correct.

comment:13 by Carl Worth, 14 years ago

I haven't studied what any of these patches do in detail. And I only know a little about what position data will be in a notmuch database.

Most of the terms in the database are the content of the mail and for these we use Xapian::TermGenerator::index_text. That generates positions right? If if didn't we couldn't do phrase searches, right? (And we definitely can.)

I carefully tested the master branch of Xapian (svn commit 13730), then with the lazy_positions.patch on top and then with patch_kan_ru_2.patch instead. Here are the results I got for an identical "notmuch tag" operations. Each operation here does a read/modify/write of 8933 documents, and the modification is the addition or removal of one or zero terms (most commonly one, not zero).

I flushed the Linux page cache (echo 3 > /proc/sys/vm/drop_caches) between the different Xapian versions being tested here, (but not between the +foo and -foo commands for each).

Xapian master


$ time ./notmuch tag +foo tag:sent real 3m26.163s user 2m48.031s sys 0m5.136s

$ time ./notmuch tag -foo tag:sent real 3m11.516s user 2m42.254s sys 0m4.304s

With lazy_positions.patch


$ time ./notmuch tag +foo tag:sent real 3m22.187s user 2m44.450s sys 0m4.864s

$ time ./notmuch tag -foo tag:sent real 2m56.191s user 2m36.234s sys 0m3.808s

With patch_kan_ru_2.patch


$ time ./notmuch tag +foo tag:sent real 0m43.120s user 0m34.206s sys 0m1.068s

$ time ./notmuch tag -foo tag:sent real 0m29.294s user 0m27.790s sys 0m0.392s

Perhaps the above is helpful.

-Carl

comment:14 by Ingmar Vanhassel, 14 years ago

Cc: ingmar@… added

comment:15 by Richard Boulton, 14 years ago

For those following this bug with interest - I'm going to spend some time over the next few days writing a load of tests of replace_document() to ensure we've got pretty full coverage of it. Then, we can see about getting these patches applied.

by Richard Boulton, 14 years ago

Attachment: patch_kan_ru_4.patch added

Updated version of the patch, which applies to SVN head, and passes all the newly added tests.

comment:16 by Richard Boulton, 14 years ago

Just to note - with the current testsuite, and patch_kan_ru_4.patch applied, we have nearly full code coverage for the FlintWritableDatabase::replace_document() method, with the only exceptions being a couple of error reporting lines, and the automatic flush for postlist changes.

comment:17 by Richard Boulton, 14 years ago

I've applied a patch for this (based originally on Kan-Ru's patch, but largely rewritten) to trunk in r13808.

One point to note - for Flint, if the document length has changed, we need to update all the postings for that document, because the document length is stored in the posting entries. This isn't an issue with Chert or Brass, which don't store the document length in posting entries. For Flint, if the only changes are adding or removing of terms with 0 wdf, or more generally if the wdf change sums to 0, only those postings which have actually been changed will be applied, and the speedups should be visible.

This largely fixes this ticket in trunk, though it would be good to make minimal changes to the value slots in Chert and Brass, in a similar manner to this.

comment:18 by Daniel Ménard, 14 years ago

Cc: daniel.menard@… added

comment:19 by Olly Betts, 14 years ago

Milestone: 1.2.x1.0.18
Resolution: fixed
Status: assignedclosed

Backported all the relevant changesets to 1.0 in r13849. Testing in real world applications is most welcome.

I think we should look at whether it's better to encode the new positions and compare the encoded form with the old encoded form (rather than decode the old form, iterate it and the new form to compare, and if they differ encode the new form) as I suspect that's going to be faster for the "unchanged" case, and it clearly is going to be for the "changed" case. I've filed #428 for this.

I also wonder if the Document object should track the changes made, rather than copy the old data and modify it. Currently adding a new term to a document is unnecessarily O(number of terms indexing that document). The current optimisation is still useful in the case of reindexing an unmodified (or slightly modified) document from the original source. I've filed #429 for this.

So closing this ticket now.

Note: See TracTickets for help on using tickets.