Opened 14 years ago

Last modified 12 months ago

#500 assigned defect

Shorter max length for terms that contain zero bytes

Reported by: Versmisse David
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Backend-Glass
Version: 1.2.0
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Hello,

We can store terms with a maximum length of 245 bytes. But apparently, if a term contains '\0' bytes, each of them is counted twice.

I don't know if the problem is due to the xapian binding (python) or to the xapian core.

This small snippet of python can reproduce the problem:

from xapian import WritableDatabase, DB_CREATE, Document
db = WritableDatabase('test_db', DB_CREATE)
doc = Document()
# 200 zero bytes: below the 245-byte limit, yet this term is rejected
doc.add_posting('\x00' * 200, 1)
db.add_document(doc)
db.flush()

With another character such as '\x01' or '\x02', this code works without problem.

Thank you in advance for your answer.

D. Versmisse.

Change History (7)

comment:1 by Olly Betts, 14 years ago

Component: Xapian-bindings (Python) → Backend-Brass
Owner: changed from Richard Boulton to Olly Betts

This is a known issue for the quartz, flint, and chert database backends - internally a zero byte has to be encoded as two bytes to get the keys to sort in the desired order.
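To illustrate why an escape is needed at all, here is a minimal sketch of one order-preserving scheme. The exact byte values (0x00 escaped as 0x00 0xFF, a bare 0x00 as terminator), the helper names, and the one-byte suffix standing in for whatever follows the term in a real key are all assumptions for illustration, not a description of the backends' actual on-disk format.

```python
from itertools import product


def encode_term(term: bytes) -> bytes:
    """Escape zero bytes and append a terminator (assumed scheme)."""
    # Each 0x00 inside the term becomes 0x00 0xFF; a bare 0x00 marks the end.
    return term.replace(b"\x00", b"\x00\xff") + b"\x00"


def naive_key(term: bytes, suffix: bytes) -> bytes:
    # Plain concatenation: the term/suffix boundary is ambiguous.
    return term + suffix


def escaped_key(term: bytes, suffix: bytes) -> bytes:
    # Escaped and terminated term, followed by the suffix.
    return encode_term(term) + suffix


terms = [b"a", b"a\x00", b"a\x00b", b"ab", b"b"]
suffixes = [b"\x01", b"\x02"]  # hypothetical stand-in for the rest of the key
entries = list(product(terms, suffixes))

# Desired order: by term first, then by suffix.
wanted = sorted(entries)
naive_order = sorted(entries, key=lambda e: naive_key(*e))
escaped_order = sorted(entries, key=lambda e: escaped_key(*e))

print("naive concatenation keeps the order:", naive_order == wanted)
print("escaped keys keep the order:", escaped_order == wanted)
```

With plain concatenation, the key for term 'a\0' with suffix 1 sorts before the key for term 'a' with suffix 2, so a byte-string compare no longer groups entries by term. The escaped form avoids that, at the cost of one extra byte per zero byte in the term.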

We can't address this for these backends without breaking compatibility with existing databases (which this issue certainly doesn't justify doing), but we're already planning to eliminate this restriction for brass, so setting the component to that.

comment:2 by Versmisse David, 14 years ago

OK, thank you for your quick answer. D. Versmisse.

comment:3 by Olly Betts, 13 years ago

Milestone: 1.3.0
Status: new → assigned

Setting milestone - this should get resolved in the 1.3.x development series.

comment:4 by Olly Betts, 12 years ago

Milestone: 1.3.0 → 1.3.x

comment:5 by Olly Betts, 9 years ago

Component: Backend-Brass → Backend-Glass

comment:6 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.4.x
Summary: Problem with terms that contain the '\0' character. → Shorter max length for terms that contain zero bytes

Sadly I don't think we're going to manage to get this done for 1.4.0.

comment:7 by Olly Betts, 12 months ago

Milestone: 1.4.x → 2.0.0

This is still present.

The original plan for addressing this was to have a custom per-table key comparison function rather than using a byte-string compare. That has proved awkward, as it prevents us from storing key deltas, which save a lot of space (honey implements this). Or at least it prevents the obvious approach from working: I guess we could perhaps have a custom per-table key delta function (or probably a set of functions), but this seems like a lot of complexity for a corner case.
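For readers unfamiliar with key deltas, the following is a rough sketch of the general idea: each key is stored as the number of leading bytes it shares with the previous key plus the remaining tail. The function names and the tuple representation are hypothetical, and this is not honey's actual format.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def delta_encode(sorted_keys):
    """Turn byte-sorted keys into (shared_prefix_length, tail) pairs."""
    prev = b""
    deltas = []
    for key in sorted_keys:
        shared = shared_prefix_len(prev, key)
        deltas.append((shared, key[shared:]))
        prev = key
    return deltas


def delta_decode(deltas):
    """Rebuild the full keys from the (shared_prefix_length, tail) pairs."""
    prev = b""
    keys = []
    for shared, tail in deltas:
        prev = prev[:shared] + tail
        keys.append(prev)
    return keys


keys = sorted([b"search", b"searched", b"searcher", b"searching"])
deltas = delta_encode(keys)
raw_bytes = sum(len(k) for k in keys)
tail_bytes = sum(len(tail) for _, tail in deltas)
print(deltas)
print(raw_bytes, "key bytes stored as", tail_bytes, "tail bytes")
```

The saving comes from neighbouring keys sharing long prefixes, which is what sorting by plain byte order tends to produce.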

Perhaps we need to come up with another way to address this. Allowing longer keys also creates more complexity as we'd need to allow two bytes for key size in some cases (or have an extra byte overhead on every key length). We could just declare that terms containing zero bytes aren't supported, which would side-step the problem - I can't really see a good use-case for it, but outlawing it seems clumsy.

Maybe we just need to make sure it's clearly documented so people wanting to use zero bytes are warned up front and can decide if the reduction in supported term length is a problem or not.
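As a sketch of what such a warning amounts to in practice, an application could check terms before indexing them. The 245-byte limit is the figure quoted in the description; the rule that each zero byte simply counts as two bytes towards it is taken from this ticket and assumed to be the whole story, and the helper names are hypothetical.

```python
MAX_TERM_BYTES = 245  # limit quoted in the ticket description


def stored_length(term: bytes) -> int:
    """Length of the term with every zero byte counted twice (assumed rule)."""
    return len(term) + term.count(b"\x00")


def fits_term_limit(term: bytes, limit: int = MAX_TERM_BYTES) -> bool:
    """Return True if the term is expected to fit within the limit."""
    return stored_length(term) <= limit


for sample in (b"\x01" * 200, b"\x00" * 200, b"\x00" * 122, b"\x00" * 123):
    print(len(sample), "bytes,", sample.count(b"\x00"), "zero bytes ->",
          "ok" if fits_term_limit(sample) else "too long")
```

Under this assumption, a term made up only of zero bytes is limited to 122 bytes, which is consistent with the 200-byte term from the description being rejected.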
