Opened 17 months ago
Last modified 14 months ago
#822 new defect
Honey format tweaks
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Backend-Honey | Version: | |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
The encoding of spelling "tail" and "bookend" term lists could be improved.
In honey, the spelling data encoding relies on the last 2 bytes (for tail) or the last byte (for bookend) of each entry being fixed and recoverable from the key. However, we still store a reuse byte for the first entry. This could reuse up to two bytes, but usually won't save any and takes a byte to store, so overall it costs us slightly under one byte per tail term list and per bookend term list.

The number of such lists is less than twice the number of spelling targets (typically significantly less, since many words have the same last two bytes or the same first and last byte), so it's not a vast saving. For example, the largest spelling data table I have to hand is from recoll, which has 494633 spelling targets but only 1617 bookends and 1802 tails, so the saving there would be at most 3419 bytes.

Supporting this also complicates decoding, though, because it is possible for the reuse and the tail to overlap (we weren't handling this situation correctly until 99873ea22f22e8cb99d4f1db2d6591c2f725afa8), so we really should sort it out at some point.
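The upper bound on the saving can be estimated from a word list alone, by counting how many distinct tail and bookend term lists it would produce. The following is only a rough sketch of that arithmetic, not honey's actual code: the function name `reuse_byte_overhead` is hypothetical, a tail key is assumed to be the last two bytes of a word and a bookend key its first and last byte (as described above), and words shorter than two bytes are skipped as a simplifying assumption.

```python
def reuse_byte_overhead(words):
    """Return (tail_lists, bookend_lists, max_bytes_saved) for a word list.

    Hypothetical helper: assumes one redundant reuse byte per tail term
    list and per bookend term list, so the total is an upper bound.
    """
    encoded = [w.encode("utf-8") for w in words]
    # Assumption: words shorter than two bytes get no tail/bookend entry.
    encoded = [b for b in encoded if len(b) >= 2]

    # One tail term list per distinct pair of final bytes.
    tails = {b[-2:] for b in encoded}
    # One bookend term list per distinct (first byte, last byte) pair.
    bookends = {(b[0], b[-1]) for b in encoded}

    return len(tails), len(bookends), len(tails) + len(bookends)


if __name__ == "__main__":
    sample = ["hello", "jello", "cello", "hero", "world"]
    tails, bookends, saved = reuse_byte_overhead(sample)
    print(f"{tails} tails + {bookends} bookends = at most {saved} bytes saved")
```

For the recoll table mentioned above, the same sum is 1802 tails + 1617 bookends = 3419 bytes at most.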
Decision: leave as-is for 1.5.x, change once we start development after the next stable release.