Opened 13 years ago

Closed 11 years ago

Last modified 11 years ago

#278 closed defect (fixed)

When changesets are being generated, old changesets aren't cleaned up

Reported by: Richard Boulton Owned by: Dan
Priority: low Milestone: 1.2.5
Component: Backend-Chert Version: SVN trunk
Severity: normal Keywords:
Cc: dcolish@… Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Currently, changesets are generated when the "XAPIAN_MAX_CHANGESETS" environment variable is set to a non-empty value. However, they are never removed. Whenever a changeset is generated, the number of changesets around should be checked, and old changesets should be removed if too many old changesets exist.

Alternatively (or as well), a different criteria might be useful for the changesets: it might be useful to be able to set an absolute limit on the total size of the changesets, or perhaps, a limit on the total size of the changesets as a proportion of the total database size.

Attachments (3)

remove_max_changesets.diff (2.0 KB ) - added by Dan 11 years ago.
max_changesets.diff (11.4 KB ) - added by Dan 11 years ago.
Updated diff with tests
max_changesets_post_review.diff (16.6 KB ) - added by Dan 11 years ago.

Download all attachments as: .zip

Change History (22)

comment:1 by Richard Boulton, 13 years ago

Status: newassigned

comment:2 by Richard Boulton, 13 years ago

Component: OtherBackend-Chert

(Setting component to Backend-Chert, though it also affects flint.)

comment:3 by Olly Betts, 13 years ago

Description: modified (diff)

Fixed to not generate a changeset for the first revision in trunk r11069 (description updated).

In order to efficiently implement keeping a maximum number of changesets and/or maximum size of changesets, I think we want to store the lowest changeset number somewhere. Then we can calculate current_revision - lowest_changeset_number and compare with maximum_number to implement that. And if we also store the total size (as number of blocks would make most sense probably), we can then delete the lowest numbered changeset files and decrease this total appropriately until it is within the specified limit. We can also age by date by checking the timestamps of the lowest numbered changeset files.

(We could establish the lowest changeset number in O(log(current_revision)) by binary chop, but that seems unnecessary work...)

comment:4 by Olly Betts, 13 years ago

Blocking: 227 removed

(In #227) This is done except for the two blockers, and a bug just to track two others is just clutter so unmarking the blockage and closing.

comment:5 by Olly Betts, 13 years ago


This doesn't require API (or even ABI) changes to fix AFAICS - just a file or something noting the lowest changeset number which we could easily handle the absence of by scanning the directory contents or a binary chop. So -> milestone:1.1.1

comment:6 by Olly Betts, 12 years ago


Triaging milestone:1.1.1 bugs.

comment:7 by Olly Betts, 12 years ago


comment:8 by Olly Betts, 12 years ago

Priority: normallow

This is a bit of a usability obstacle, but the user could add a cronjob to clean up old ones or something, so it's not a blocker, and it doesn't break API or ABI. So priority low.

comment:9 by Olly Betts, 12 years ago


Bumping to stay on schedule.

by Dan, 11 years ago

Attachment: remove_max_changesets.diff added

comment:10 by Dan, 11 years ago

Updated diff uses dbstats for brass backend. For chert and flint it will use the more naive cleanup method. Also I move where the clean up happens to make sure we dont remove files if there is an exception when creating the changeset.

comment:11 by Olly Betts, 11 years ago

I applied a similar change to factor out MAX_OPEN_RETRIES in r15377.

Otherwise this mostly looks good, though I didn't try it yet. A few comments:

  • If the changeset file is successfully unlinked, then oldest_changeset isn't updated so I think this will loop forever after successfully removing an old changeset. I think you just want to move the line which increments it down a line so it is outside the if statement.
  • I would keep the total document length last (if only because otherwise the comment about why it is last is confusing).
  • There's no need for the str() here (or similar cases) - this is iostream style so numeric types can just be written directly:
LOGLINE(DB, "Removed changeset " << str(oldest_changeset)); 
  • Using '{ }' rather than ';' for an empty loop body is a bit more visible, so we prefer that (I've actually documented this convention in r15379).
  • There's whitespace added to a blank line in xapian-core/backends/brass/brass_database.h - we try to avoid adding trailing whitespace and other whitespace irregularities (some existing code has it, which we've resisted wholesale fixing as svn blame regards it as a significant change).
  • The indent of the conditional #include in was intentional (for more complex cases, it's easier to read with the indent).

comment:12 by Dan, 11 years ago

Cc: dcolish@… added

comment:13 by Dan, 11 years ago

Owner: changed from Richard Boulton to Dan
Status: assignednew

comment:14 by Dan, 11 years ago

Status: newassigned

by Dan, 11 years ago

Attachment: max_changesets.diff added

Updated diff with tests

comment:15 by Olly Betts, 11 years ago

Looks good to me, but a few nits:


Comment typo - "being" -> "begin".

Another case where we might delete multiple changeset files is if the value of max_changesets is reduced by the user (for the comment).

I think this will try to delete changeset0 the first time, which is fairly harmless but easy to avoid.

stop_changeset should be brass_revision_number_t not unsigned.

I'd just call the methods on the stats member directly rather than adding another two private methods to BrassDatabase which only get called from one place.

chert and flint:

"changset" -> "changeset" in comment.


A simpler (and clearer) check for the changeset not existing is:

    TEST(!file_exists(masterpath + "/changes1");

Also might be good to test that if you reduce max_changesets then the intervening ones also get removed (e.g. set max to 4, commit 5 changes, check that the first one got removed, reduce max to 2, commit a change, and check that the last 2 changesets exist but the 3 before that don't).

comment:16 by Richard Boulton, 11 years ago

I've just gone through the patch myself, and would echo Olly's comments. Given that the algorithm in brass is new (and a little bit complex) it would also be reassuring to have a test for the situation where max_changesets is increased again. ie:

  • set max to 2
  • commit 3 changes
  • check that only the last 2 changesets exist
  • set max to 3
  • commit another 2 changesets
  • check that only the last 3 changesets exist

This is looking good, though.

comment:17 by Dan, 11 years ago

Ok, I think I got all the changes suggested. One thing that I didn't see an issue with was the 0th changeset, so I left that out.

comment:18 by Dan, 11 years ago

Resolution: fixed
Status: assignedclosed

Landed patch at revision 15403.

comment:19 by Olly Betts, 11 years ago

Milestone: 1.2.x1.2.5
Note: See TracTickets for help on using tickets.