The structure of blocks is broadly similar to Quartz. The key differences are:

  • We never leave slack space between items in the block. The main benefit is a simpler structure - we don't need to store max_free, total_free, etc. Also we don't need to explicitly store the length of a tag - we know where the item ends because that's where the next item starts. Some extra copying of memory is required, but this is bounded above by O(n) (we never copy more than BLOCKSIZE bytes per block).
  • We might not store the block level - we only need to know if a block is at the leaf level or not (which we use to provide a faster version of next() and prev() when a database has revision 1). So we can just store an "is_leaf" flag which only requires a bit not a whole byte (but then 7 / 8 / 8192 is a tiny saving, so perhaps this is worth storing the level as it allows a little bit of consistency checking). But note it's going to be annoying if we need the level bit/byte in each block if we implement the idea of using whole blocks to store large tags...
  • The key counter and "out of" values can be squeezed into a single byte if we also store long tags in whole blocks. Then we only need at most 7(ish) chunks as items. If 8 chunks is enough, we have a spare bit we can use to flag "tag compressed with gzip" and another to flag "tag starts with a list of blocks full of tag data".
  • Each block header contains a byte value which indicates how many common prefix bytes have been removed from all keys in the block (those common prefix bytes can be found from the key value which lead to us reading this block)
  • Quartz has 2 base files _per table_. This is a lot of files to keep creating and deleting, plus we have to jump through hoops to ensure we open a consistent set of tables. We don't have a bitmap to store in the base file, so instead let's just have 2 base files _per database_. Then atomically applying a new revision becomes a much simpler operation!
  • Consider merging blocks on deletion, like B+ trees are supposed to (see Knuth)
  • Consider the B* tree feature where we rotate entries into the next or previous sibling block to avoid splitting. This allows the tree to achieve a higher percentage block utilisation, which is likely to be a good thing. If it's a tradeoff, then we could make it a tunable feature...
Last modified 12 years ago Last modified on 07/01/08 02:48:01