Opened 8 years ago
Last modified 21 months ago
#742 new defect
Xapian should provide a way to securely remove a document from the database
Reported by: | Daniel Kahn Gillmor | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Backend-Honey | Version: | |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
currently, if i remove a document from a xapian index, the indexed terms remain in the db, but are marked as part of the freelist.
This means that removal of a document is "insecure" in the sense that if someone gained access to the index after message deletion, they could recover information about the document by inspecting the contents of the freelist.
There may be other traces of a document that are retained in the index as well: for example, on IRC, olly mentioned:
oh, there's one awkward thing in the backend stuff -- dividing keys get created in the branch levels based on the leaf level keys around where the block is split
Some of these fixes may be easier to do than others.
For example, it might be pretty easy to zero blocks when they're returned to the freelist, but it might be harder to deal with the dividing keys. It's still worth fixing the easy parts, even if some harder challenges remain.
Another way to think about the problem is one of "index reproducibility" -- if an index contains exactly the same set of documents as another index, a byte-for-byte identical data store on disk is the ideal. Any divergence from that ideal leaks some information about documents that have been added to the database in the past, and then subseqently removed.
It's possible that any of these fixes incur a cost that some people are reluctant to pay (e.g. they're not concerned about the confidentiality of any of their indexed documents, or they're confident in the long-term confidentiality of the index itself for other reasons). So it seems likely that the feature needs to be optional. Whether the choice of feature is opt-in or opt-out; and whether the choice is made done on a per-deletion basis, or a per-database basis, or a per-xapian-session basis, i don't know.
I'm happy to review API proposals if that'd be useful.
Change History (4)
comment:2 by , 8 years ago
I think this has to be a per Database setting, which can only be enabled at creation or right after compaction (or perhaps enabling it goes through and zeros out stuff, etc but that's probably comparable work to compacting).
Or you can enable it at any point, but the semantics in that case are that it'll only reliably protect data added after you enabled it.
comment:3 by , 7 years ago
Component: | Other → Backend-Honey |
---|---|
Milestone: | → 1.5.0 |
Some of my plans for the next backend would probably help here too (I'm thinking that mass updates would get applied in a LSMT-like manner - that doesn't given an identical DB for a given doc set, but it should reduce the variance significantly).
The next backend ("honey") is progressing, and it looks like it will help a lot with concerns about lingering traces of removed documents - it gets us very close to database reproducibility (if the index for the table gets fully regenerated, we should be entirely reproducible; I'm not entirely sure about the details of incrementally updating that index).
Instead of a B-tree, it uses (though this isn't all finished yet) a sorted string table structure, with multiple levels which are merged together with a rolling merge. That essentially means that for no extra work we get a limited lifetime for removed documents living on - it's only until the rolling merge process next gets to them, because the SSTable is just all the (key,value) pairs in the table serialised contiguously in ascending key order. And the rolling merge would probably just build a fresh index for its merged output as it goes.
In this design, xapian-compact would tell the rolling merge to continue until there's a single SSTable, which should then be clean of removed information, so for notmuch that would provide a way to clean up after handling some sensitive emails. I'd guess this would be significantly cheaper than compacting a glass database too.
So I think I'm unlikely to try to address this for glass at this point, but instead will review this once honey is fully functional. I'm still happy to review patches to improve the situation for glass if anyone wants to work on that.
comment:4 by , 21 months ago
Milestone: | 1.5.0 → 2.0.0 |
---|
Retargetting as the new plan is to get a new release series out, without finishing honey. It'll probably not really change when the release series with honey finished will come out, but it'll mean the stable release series is much closer to the development version.
I think there are five parts to this, in approximately increasing difficulty (and also roughly decreasing benefit - the first three items are both pretty key though):
GlassTable::compact()
]GlassTable::delete_leaf_item()
] and [GlassTable::delete_branch_item()
]DatabaseModifiedError
, and the plan to support MVCC to help avoid that. I think this may need a "secure commit" option which goes through blocks on the freelist and zeros them out (except that it needs to set the revision in the zeroed blocks to a new revision). This seems like something that we can't really just always do, since there seems to need to be some choice as to when.xapian-check somedb/postlist f|grep ' --> \['
(postlist
andposition
are the interesting ones).I think this has to be a per Database setting, which can only be enabled at creation or right after compaction (or perhaps enabling it goes through and zeros out stuff, etc but that's probably comparable work to compacting). I don't see how this can work on a per-deletion basis or a per-session basis - if you've not been zeroing from creation, then there could be copies of the sensitive data all over the place.