Opened 14 years ago
Last modified 20 months ago
#500 assigned defect
Shorter max length for terms that contain zero bytes
Reported by: | Versmisse David | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Backend-Glass | Version: | 1.2.0 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Hello,
We can store terms with a maximum length of 245 bytes. But apparently, if a term contains '\0', each zero byte is counted twice.
I don't know if the problem is due to the xapian binding (python) or to the xapian core.
This small snippet of python can reproduce the problem:
    from xapian import WritableDatabase, DB_CREATE, Document

    db = WritableDatabase('test_db', DB_CREATE)
    doc = Document()
    doc.add_posting('\x00' * 200, 1)
    db.add_document(doc)
    db.flush()
With another character ('\x01', '\x02', ...) this code works without any problem.
Thank you in advance for your answer.
- Versmisse.
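The arithmetic behind the report can be sketched as follows. This takes the description at face value: a 245-byte limit, with every zero byte counted twice. The helper names `effective_term_length` and `fits` are hypothetical and not part of the xapian bindings, and the exact threshold enforced by a given backend may differ.

```python
MAX_TERM_BYTES = 245  # limit quoted in the report (assumption: applies to the counted length)


def effective_term_length(term: bytes) -> int:
    """Length of the term if every zero byte is counted twice."""
    return len(term) + term.count(b"\x00")


def fits(term: bytes) -> bool:
    """True if the term stays within the limit under the doubling rule."""
    return effective_term_length(term) <= MAX_TERM_BYTES


print(fits(b"\x00" * 200))  # False: counted as 400 bytes
print(fits(b"\x01" * 200))  # True: counted as 200 bytes
```

Under these assumptions the 200-byte term of zero bytes in the snippet is rejected, while the same length of any other byte is accepted.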
Change History (7)
comment:1 by , 14 years ago
Component: | Xapian-bindings (Python) → Backend-Brass |
---|---|
Owner: | changed from | to
comment:3 by , 14 years ago
Milestone: | → 1.3.0 |
---|---|
Status: | new → assigned |
Setting milestone - this should get resolved in branch in the 1.3.x development series.
comment:4 by , 13 years ago
Milestone: | 1.3.0 → 1.3.x |
---|
comment:5 by , 10 years ago
Component: | Backend-Brass → Backend-Glass |
---|
comment:6 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|---|
Summary: | Problem with terms that contain the '\0' character. → Shorter max length for terms that contain zero bytes |
Sadly I don't think we're going to manage to get this done for 1.4.0.
comment:7 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|
This is still present.
The original plan for addressing this was to have a custom per-table key comparison function rather than using a byte-string compare. That's proved awkward to do as it prevents us storing key deltas, which saves a lot of space (honey implements this), or at least it prevents the obvious approach from working - I guess we could perhaps have a custom per-table key delta function (or probably set of functions), but this seems like a lot of complexity for a corner case.
Perhaps we need to come up with another way to address this. Allowing longer keys also creates more complexity as we'd need to allow two bytes for key size in some cases (or have an extra byte overhead on every key length). We could just declare that terms containing zero bytes aren't supported, which would side-step the problem - I can't really see a good use-case for it, but outlawing it seems clumsy.
Maybe we just need to make sure it's clearly documented so people wanting to use zero bytes are warned up front and can decide if the reduction in supported term length is a problem or not.
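To illustrate why key deltas depend on byte-string ordering, here is a minimal sketch. It assumes a "key delta" means storing, for each key, the number of leading bytes shared with the previous key plus the remaining suffix. The functions `delta_encode` and `delta_decode` are hypothetical and are not taken from honey or any other backend.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def delta_encode(sorted_keys):
    """Turn keys (already in byte order) into (shared_len, suffix) pairs."""
    deltas, prev = [], b""
    for key in sorted_keys:
        shared = shared_prefix_len(prev, key)
        deltas.append((shared, key[shared:]))
        prev = key
    return deltas


def delta_decode(deltas):
    """Rebuild the original keys from (shared_len, suffix) pairs."""
    keys, prev = [], b""
    for shared, suffix in deltas:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys


keys = sorted([b"apple", b"applet", b"apply", b"banana"])
deltas = delta_encode(keys)
assert delta_decode(deltas) == keys
stored = sum(len(suffix) for _, suffix in deltas)  # suffix bytes actually kept
```

The saving comes from neighbouring keys sharing leading bytes, which plain byte-string ordering tends to produce. A custom comparison function could place keys with little in common next to each other, which is one way to read the difficulty described above.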
This is a known issue for the quartz, flint, and chert database backends - internally a zero byte has to be encoded as two bytes to get the keys to sort in the desired order.
We can't address this for these backends without breaking compatibility with existing databases (which this issue certainly doesn't justify doing), but we're already planning to eliminate this restriction for brass, so setting the component to that.
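The reason a zero byte ends up as two bytes can be shown with a small sketch. It assumes an escaping scheme in which each zero byte is written as `\x00\xff` and the term is ended by a single `\x00`, so that a term followed by a document id still sorts correctly as a plain byte string. The names `encode_term` and `make_key`, and the 4-byte docid layout, are illustrative assumptions rather than the backend's actual key format.

```python
def encode_term(term: bytes) -> bytes:
    """Escape zero bytes so that encoded keys sort like the original terms."""
    # Assumed scheme: each b"\x00" becomes b"\x00\xff", then a single
    # b"\x00" terminator marks the end of the term.
    return term.replace(b"\x00", b"\x00\xff") + b"\x00"


def make_key(term: bytes, docid: int) -> bytes:
    """Build a hypothetical key: escaped term followed by a 4-byte docid."""
    # Keeping the first docid byte below 0xff stops it from being confused
    # with the escape byte when two keys are compared.
    assert 0 <= docid < 0xFF000000
    return encode_term(term) + docid.to_bytes(4, "big")


def naive_key(term: bytes, docid: int) -> bytes:
    """Term and docid simply concatenated, with no escaping or terminator."""
    return term + docid.to_bytes(4, "big")


entries = [(b"ab", 7), (b"a", 0x63000000), (b"a\x00", 1), (b"a\x00b", 2)]

# Byte order of the escaped keys matches (term, docid) order.
by_key = sorted(entries, key=lambda e: make_key(*e))
assert by_key == sorted(entries)

# Plain concatenation does not: the docid bytes of b"a" sort after b"ab".
by_naive = sorted(entries, key=lambda e: naive_key(*e))
assert by_naive != sorted(entries)
```

With a scheme like this, the terminator has to be distinguishable from a zero byte inside the term, and the escape byte is what costs the extra byte of key space per zero byte.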