Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#491 closed defect (invalid)

calling replace_document() twice doubles the size of the database

Reported by: maad
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Other
Version:
Severity: major
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Calling replace_document() twice with the same key doubles the size of the database on disk (unexpected). However, a search returns only one document (expected), and subsequent replace_document() calls do not change the database size (expected). This code reproduces it:

<?php

include "xapian.php";

$value="thisistest";
myreplace($value);
passthru("ls -l replace.db");
myreplace($value);
passthru("ls -l replace.db");
myreplace($value);
passthru("ls -l replace.db");


function myreplace($value)
{
  try 
  {
    // Open the database for update, creating a new database if necessary.
    $database = new XapianWritableDatabase("replace.db", Xapian::DB_CREATE_OR_OPEN);

    $indexer = new XapianTermGenerator();
    $stemmer = new XapianStem("english");
    $indexer->set_stemmer($stemmer);

    $doc = new XapianDocument();

    // Unique ID term; replace_document() below matches on this term.
    $doc->add_term("Q$value");

    $indexer->set_document($doc);
    $indexer->index_text($value);

    //$database->add_document($doc);
    $database->replace_document("Q$value", $doc);

    // Release the database handle.
    $database = null;
  } 
  catch (Exception $e) 
  {
    print $e->getMessage() . "\n";
    exit(1);
  }
}

?>

Output:

total 84
-rw-r--r-- 1 alex alex    0 2010-06-28 20:30 flintlock
-rw-r--r-- 1 alex alex   12 2010-06-28 20:30 iamflint
-rw-r--r-- 1 alex alex   13 2010-06-28 20:30 position.baseA
-rw-r--r-- 1 alex alex   14 2010-06-28 20:30 position.baseB
-rw-r--r-- 1 alex alex 8192 2010-06-28 20:30 position.DB
-rw-r--r-- 1 alex alex   13 2010-06-28 20:30 postlist.baseA
-rw-r--r-- 1 alex alex   14 2010-06-28 20:30 postlist.baseB
-rw-r--r-- 1 alex alex 8192 2010-06-28 20:30 postlist.DB
-rw-r--r-- 1 alex alex   13 2010-06-28 20:30 record.baseA
-rw-r--r-- 1 alex alex   14 2010-06-28 20:30 record.baseB
-rw-r--r-- 1 alex alex 8192 2010-06-28 20:30 record.DB
-rw-r--r-- 1 alex alex   13 2010-06-28 20:30 termlist.baseA
-rw-r--r-- 1 alex alex   14 2010-06-28 20:30 termlist.baseB
-rw-r--r-- 1 alex alex 8192 2010-06-28 20:30 termlist.DB
-rw-r--r-- 1 alex alex   13 2010-06-28 20:30 value.baseA
-rw-r--r-- 1 alex alex   14 2010-06-28 20:30 value.baseB
-rw-r--r-- 1 alex alex 8192 2010-06-28 20:30 value.DB
total 124
-rw-r--r-- 1 alex alex     0 2010-06-28 20:30 flintlock
-rw-r--r-- 1 alex alex    12 2010-06-28 20:30 iamflint
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 position.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 position.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 position.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 postlist.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 postlist.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 postlist.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 record.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 record.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 record.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 termlist.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 termlist.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 termlist.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 value.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 value.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 value.DB
total 124
-rw-r--r-- 1 alex alex     0 2010-06-28 20:30 flintlock
-rw-r--r-- 1 alex alex    12 2010-06-28 20:30 iamflint
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 position.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 position.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 position.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 postlist.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 postlist.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 postlist.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 record.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 record.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 record.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 termlist.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 termlist.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 termlist.DB
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 value.baseA
-rw-r--r-- 1 alex alex    14 2010-06-28 20:30 value.baseB
-rw-r--r-- 1 alex alex 16384 2010-06-28 20:30 value.DB

Change History (3)

comment:1 by Olly Betts, 14 years ago

Resolution: invalid
Status: new → closed

This isn't a bug, just how the Btree files work.

Modifications are made by copying modified blocks so that we can roll back changes. So after the second operation, each table contains 2 blocks, one of them unused. Subsequent repetitions reuse the spare block. If you want to eliminate the unused space, xapian-compact will make a copy of the table with unused space minimised.
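
The block reuse described in this comment can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Xapian's code: the ToyTable class, its methods and simulate_commits() are hypothetical names, each table is assumed to fit in a single 8192-byte block (as in the listing in the description), and every commit is assumed to write exactly one block.

```php
<?php
// Toy model only: NOT Xapian's implementation. All names are hypothetical.
const BLOCK_SIZE = 8192; // matches the 8192-byte tables in the listing

final class ToyTable
{
    private int $blocks = 0;    // blocks allocated in the file so far
    private ?int $live = null;  // block holding the last committed revision
    private array $free = [];   // blocks no committed revision still needs

    // Commit a change: write the new revision into a block other than the
    // live one (so the old revision survives until the commit completes),
    // then release the previously live block for reuse.
    public function commit(): void
    {
        $target = array_pop($this->free); // null when nothing is free
        if ($target === null) {
            $target = $this->blocks++;    // grow the file by one block
        }
        if ($this->live !== null) {
            $this->free[] = $this->live;  // old revision is now spare
        }
        $this->live = $target;
    }

    public function sizeInBytes(): int
    {
        return $this->blocks * BLOCK_SIZE;
    }
}

// Run $n commits on a fresh table and record the file size after each one.
function simulate_commits(int $n): array
{
    $table = new ToyTable();
    $sizes = [];
    for ($i = 0; $i < $n; $i++) {
        $table->commit();
        $sizes[] = $table->sizeInBytes();
    }
    return $sizes;
}

foreach (simulate_commits(3) as $i => $size) {
    echo "after commit ", $i + 1, ": ", $size, " bytes\n";
}
// after commit 1: 8192 bytes
// after commit 2: 16384 bytes
// after commit 3: 16384 bytes
```

Under these assumptions the model shows the same 8192, 16384, 16384 pattern as the .DB files in the description: the second commit has no spare block and must grow the file, while later commits alternate between the two existing blocks.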

comment:2 by maad, 14 years ago

Wouldn't it be possible to check whether the data is exactly the same and then do nothing? It's kinda awkward behavior, at least when the record does not change at all.

comment:3 by Olly Betts, 14 years ago

We do actually check that. But to commit the table, we need a new root block. So if you replace a million documents with themselves, you should see the table sizes increase by a single block (8KB by default). So it only "doubles the size" if your database tables are only 8KB to start with.
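
To put numbers on this, the relative growth from one extra root block can be worked out directly. This is a minimal sketch: growth_factor() is a hypothetical helper, and the 8192-byte default is the block size mentioned in this comment.

```php
<?php
// Hypothetical helper: factor by which a table file grows when a commit
// adds exactly one extra block to it.
function growth_factor(int $tableBytes, int $blockSize = 8192): float
{
    return ($tableBytes + $blockSize) / $tableBytes;
}

// A one-block table doubles in size...
printf("%.4f\n", growth_factor(8192));          // 2.0000
// ...while a 1000-block table grows by only 0.1%.
printf("%.4f\n", growth_factor(1000 * 8192));   // 1.0010
```

The absolute cost is the same single block in both cases; only the proportion changes with the starting size of the table.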

The new Btree manager I'm working on (see browser:branches/brass-btree for the code) reuses the existing root block in this case.

It would be nice to reuse the root block in the current backends too, but when I looked briefly at trying to avoid committing unmodified tables some years ago, the obvious simple change didn't work, so I think something somewhere assumes that the revision number stored in the root block is that of the latest revision.

If you want to work on a patch to achieve this for the current chert backend, feel free - chert_table.cc and chert_database.cc are the files to look at. If the patch looks sane and the test coverage for it is good, I would support applying it.
