Opened 7 years ago

Last modified 12 months ago

#742 new defect

Xapian should provide a way to securely remove a document from the database

Reported by: Daniel Kahn Gillmor Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Backend-Honey Version:
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

currently, if i remove a document from a xapian index, the indexed terms remain in the db, but are marked as part of the freelist.

This means that removal of a document is "insecure" in the sense that if someone gained access to the index after message deletion, they could recover information about the document by inspecting the contents of the freelist.

There may be other traces of a document that are retained in the index as well: for example, on IRC, olly mentioned:

oh, there's one awkward thing in the backend stuff -- dividing keys get created in the branch levels based on the leaf level keys around where the block is split

Some of these fixes may be easier to do than others.

For example, it might be pretty easy to zero blocks when they're returned to the freelist, but it might be harder to deal with the dividing keys. It's still worth fixing the easy parts, even if some harder challenges remain.

Another way to think about the problem is one of "index reproducibility" -- if an index contains exactly the same set of documents as another index, a byte-for-byte identical data store on disk is the ideal. Any divergence from that ideal leaks some information about documents that have been added to the database in the past, and then subseqently removed.

It's possible that any of these fixes incur a cost that some people are reluctant to pay (e.g. they're not concerned about the confidentiality of any of their indexed documents, or they're confident in the long-term confidentiality of the index itself for other reasons). So it seems likely that the feature needs to be optional. Whether the choice of feature is opt-in or opt-out; and whether the choice is made done on a per-deletion basis, or a per-database basis, or a per-xapian-session basis, i don't know.

I'm happy to review API proposals if that'd be useful.

Change History (4)

comment:1 by Olly Betts, 7 years ago

I think there are five parts to this, in approximately increasing difficulty (and also roughly decreasing benefit - the first three items are both pretty key though):

  • When a block is compacted to put all the freespace together, we should zero out the freespace after the compact as it may contain traces of the items in the block. I'm not sure if this is something to always do, but it's clearly not horrendously expensive. [GlassTable::compact()]
  • Blocks are copy on write. When an item is removed but the block isn't emptied, a copy of that block is made, and the directory at the start of the block is updated to remove the pointer to the block. We should also zero out the space it occupied. I'm not sure if this is something to always do, but it's clearly not horrendously expensive. [GlassTable::delete_leaf_item()] and [GlassTable::delete_branch_item()]
  • When we copy on write, the old version of the block is added to the freelist as being "freed by revision $new_revision". Such blocks need to be zeroed out at some point, if they've not been reused. However, we can't just zero out when added to the freelist - at that point $new_revision hasn't been committed yet, and even after commit, readers may be using the old revision. This is related to DatabaseModifiedError, and the plan to support MVCC to help avoid that. I think this may need a "secure commit" option which goes through blocks on the freelist and zeros them out (except that it needs to set the revision in the zeroed blocks to a new revision). This seems like something that we can't really just always do, since there seems to need to be some choice as to when.
  • When an item at the start or end of a leaf block is deleted, the dividing key ought to be recomputed, as it may leak some information (at least for tables where the key is computed from a term rather than just a document id). I think it doesn't need recomputing if other items in the block are deleted, because even though the dividing key might have originally have been computed based on those, it could also have been computed from the key which got inserted between that item and the dividing key. Recomputing dividing keys may be worthwhile to do sometimes in general - it's something I've wondered before. It might be worth doing anyway - when a block is COWed, the parent block also needs to be COWed to update the pointer to the child (but the parent COW is typically done once for multiple updates to its children). If you want to experiment to see what the dividing keys reveal, then this command will show all of them for a table: xapian-check somedb/postlist f|grep ' --> \[' (postlist and position are the interesting ones).
  • As you say, there may be more subtle information leaks via the "shape" of the DB. That's hardest to deal with, except by simply compacting the database (running compact after each commit would actually address all your concerns, though it's quite a big hammer). But running compact regularly would make a lot of sense as it would limit how long any leaked information could persist for. Some of my plans for the next backend would probably help here too (I'm thinking that mass updates would get applied in a LSMT-like manner - that doesn't given an identical DB for a given doc set, but it should reduce the variance significantly).

I think this has to be a per Database setting, which can only be enabled at creation or right after compaction (or perhaps enabling it goes through and zeros out stuff, etc but that's probably comparable work to compacting). I don't see how this can work on a per-deletion basis or a per-session basis - if you've not been zeroing from creation, then there could be copies of the sensitive data all over the place.

Last edited 7 years ago by Olly Betts (previous) (diff)

comment:2 by Olly Betts, 7 years ago

I think this has to be a per Database setting, which can only be enabled at creation or right after compaction (or perhaps enabling it goes through and zeros out stuff, etc but that's probably comparable work to compacting).

Or you can enable it at any point, but the semantics in that case are that it'll only reliably protect data added after you enabled it.

comment:3 by Olly Betts, 6 years ago

Component: OtherBackend-Honey
Milestone: 1.5.0

Some of my plans for the next backend would probably help here too (I'm thinking that mass updates would get applied in a ​LSMT-like manner - that doesn't given an identical DB for a given doc set, but it should reduce the variance significantly).

The next backend ("honey") is progressing, and it looks like it will help a lot with concerns about lingering traces of removed documents - it gets us very close to database reproducibility (if the index for the table gets fully regenerated, we should be entirely reproducible; I'm not entirely sure about the details of incrementally updating that index).

Instead of a B-tree, it uses (though this isn't all finished yet) a sorted string table structure, with multiple levels which are merged together with a rolling merge. That essentially means that for no extra work we get a limited lifetime for removed documents living on - it's only until the rolling merge process next gets to them, because the SSTable is just all the (key,value) pairs in the table serialised contiguously in ascending key order. And the rolling merge would probably just build a fresh index for its merged output as it goes.

In this design, xapian-compact would tell the rolling merge to continue until there's a single SSTable, which should then be clean of removed information, so for notmuch that would provide a way to clean up after handling some sensitive emails. I'd guess this would be significantly cheaper than compacting a glass database too.

So I think I'm unlikely to try to address this for glass at this point, but instead will review this once honey is fully functional. I'm still happy to review patches to improve the situation for glass if anyone wants to work on that.

comment:4 by Olly Betts, 12 months ago

Milestone: 1.5.02.0.0

Retargetting as the new plan is to get a new release series out, without finishing honey. It'll probably not really change when the release series with honey finished will come out, but it'll mean the stable release series is much closer to the development version.

Note: See TracTickets for help on using tickets.