Opened 10 years ago

Closed 9 years ago

#666 closed enhancement (fixed)

Implement single-file format for glass backend

Reported by: Will Greenberg
Owned by: Olly Betts
Priority: normal
Milestone: 1.3.4
Component: Backend-Glass
Version: SVN trunk
Severity: normal
Keywords:
Cc: cosimo.cecchi@…, kelson@…, philip.chimento@…
Blocked By:
Blocking:
Operating System: All

Description

In some cases, it is advantageous to be able to ship a Xapian database as a single, self-contained file. Glass could have an optional construction parameter which allows reading/writing a database in this single-file format.

Related mailing list thread: http://comments.gmane.org/gmane.comp.search.xapian.general/9664

Change History (16)

comment:1 by Cosimo Cecchi, 10 years ago

Cc: cosimo.cecchi@… added

comment:2 by Kelson, 10 years ago

Cc: kelson@… added

comment:3 by Philip Chimento, 10 years ago

Cc: philip.chimento@… added

comment:4 by Olly Betts, 9 years ago

Owner: set to Olly Betts
Status: new → assigned

I have a working implementation of this for glass (read-only currently, packed up with a script) - see git branch single-file-glass. I'm still working on it, but it's at the stage where people could usefully try it out.

comment:5 by Kelson, 9 years ago

@Olly

That's really good news! I have checked out your code and compiled it. I have also integrated it with the Kiwix indexing code and... nothing has changed: it looks like I still create a "chert" database (a directory with many files, one of which is called 'iamchert'). I'm probably doing something wrong; maybe I should change the way I create the Xapian DB? Currently I do:

Xapian::WritableDatabase(indexPath, Xapian::DB_CREATE_OR_OVERWRITE);

comment:6 by James Aylett, 9 years ago

@Kelson — there's a comment at the top of the source to `pack-single-file-glass-db`, which should help.

comment:7 by Olly Betts, 9 years ago

There's only read support for single file databases (at present at least), so you need to create a glass database (not chert) and index the data to that first, and then pack it up with that script James pointed to.

You can specify you want to create a glass database with:

Xapian::WritableDatabase(indexPath, Xapian::DB_CREATE_OR_OVERWRITE|Xapian::DB_BACKEND_GLASS);

comment:8 by Kelson, 9 years ago

@James @Olly

With your help I was able to go ahead. Everything seems to work well: I'm able to create/read single-file Xapian indexes. This really opens new doors for Kiwix/ZIM files. Kudos!

To be able to use it with ZIM files/Kiwix, I still have some requests/questions:

  • this version seems to need librt; I hope this is correctly supported on other OSes (iOS, OSX, Windows, Android). I haven't tested that yet.
  • index needs to be "merged" in another file (ZIM file). Would it be possible to have an openIndex() with a baseline offset as an additional parameter?
  • it is a pity, but a lot of devices are still not able to deal with files over 4GB, Android devices in particular. Many of our Xapian indexes are over this limit. We worked around that for ZIM files this way: https://git.wikimedia.org/blob/openzim/165eab3e154c60b5b6436d653dc7c90f56cf7456/zimlib%2Fsrc%2Ffstream.cpp... But I'm not sure how this can work if the Xapian index is merged within the ZIM file...

comment:9 by Olly Betts, 9 years ago

librt

That's for timing out check_at_least, which is a new feature added in the 1.3 development series - if the functions needed aren't present, this feature is just disabled.

index needs to be "merged" in another file (ZIM file). Would it be possible to have an openIndex() with a baseline offset as an additional parameter?

Just lseek() the fd to where the Xapian database starts before you pass it in.

a lot of devices are still not able to deal with files over 4GB

So you want everything in one file ... which you then split up into several files?

That's not something which the people kindly funding this development require (or if they do it wasn't mentioned in the brief), so it's not likely to happen as part of my current work on this.

I had thought the days of the maximum file size being an issue were a decade or so behind us, but sadly it seems I was optimistic. What's the actual max file size for these devices, and how big are the databases? When the filesize limit was 2GB - 1 byte, we used to work around it by splitting databases into a series of 1GB chunks. The really annoying problem is managing filehandles - for a DB split into a lot of chunks, keeping them all open risks running out of filehandles, while having just the chunk you're currently reading from open means you can end up opening and closing the files a lot.
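The chunking workaround described above comes down to simple offset arithmetic. The following is only a sketch of that arithmetic, not code from Xapian: the names `kChunkSize`, `ChunkPos`, `locate` and `chunks_needed` are hypothetical, and the 1GB chunk size is taken from the description above. It does not address the filehandle management problem, which is the harder part.

```cpp
#include <cstdint>

// Hypothetical chunk size: 1 GiB, as in the workaround described above.
constexpr std::uint64_t kChunkSize = std::uint64_t(1) << 30;

// Position of a byte within a database split into fixed-size chunk files.
struct ChunkPos {
    std::uint64_t chunk;   // index of the chunk file (0-based)
    std::uint64_t offset;  // byte offset inside that chunk file
};

// Map a logical byte offset in the whole database to (chunk, offset in chunk).
constexpr ChunkPos locate(std::uint64_t logical_offset) {
    return ChunkPos{logical_offset / kChunkSize, logical_offset % kChunkSize};
}

// Number of chunk files needed to hold total_size bytes (rounds up).
constexpr std::uint64_t chunks_needed(std::uint64_t total_size) {
    return (total_size + kChunkSize - 1) / kChunkSize;
}
```

With this layout every chunk stays well under a 4GB limit, at the cost of one file (and potentially one open filehandle) per chunk.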

comment:10 by Kelson, 9 years ago

If I can use lseek(), then this is simply perfect, but I cannot find a Xapian::Database constructor accepting an fd at http://xapian.org/docs/apidoc/html/classXapian_1_1Database.html. Is it an undocumented feature, or am I just on the wrong page?

Regarding the file splitting, desktop OSes now have good support for big files. The problem we face is mostly on Android, where low-end devices still don't support exFAT. Splitting the index might be an idea I should give a try.

Otherwise, from my point of view, this ticket is implemented. Thx again for the work. I will use it in Kiwix as soon as this is released.

comment:11 by Olly Betts, 9 years ago

The website documentation is for the stable release, while this code isn't even in git master yet (it's currently on a branch).

You probably have the docs for the version you built in xapian-core/docs/apidoc/ - if you disabled the documentation building, the doxygen comments in the API headers in `xapian-core/include/xapian/` are usually quite easy to read directly.

I already understood that the issue is with some Android devices, but you didn't answer what the maximum size supported actually is. Is it the 4GB - 1 byte which FAT32 apparently supports?

comment:12 by Kelson, 9 years ago

@Olly

Yes, the limit is 4GB with FAT32.

For the lseek(), please let me know as soon as it is documented how to change the Xapian::Database internal file descriptor position. Then I will be able to test the merge with a ZIM file.

Remark: I don't know if this is part of your plan, but of course, if there were a C++ version of "pack-single-file-glass-db" as part of the library, this would be even better.

comment:13 by Olly Betts, 9 years ago

Component: Other → Backend-Glass
Milestone: 1.3.x → 1.3.4

kelson: You just open a file descriptor on your file which has the Xapian DB embedded, lseek() it to where the Xapian DB starts, and then pass the file descriptor to the Xapian::Database constructor (instead of a filename) - Xapian just looks where the descriptor's file position already is.
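The descriptor handling described above might look roughly like the sketch below. It covers only the POSIX side: `open_at_offset` and `db_offset` are hypothetical names, and how the caller learns where the embedded database starts (for example from the container's own header) is assumed to be handled elsewhere.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Open `path` read-only and leave the descriptor positioned at `db_offset`,
// the byte at which the embedded database is assumed to start.
// Returns the descriptor, or -1 on failure (nothing is left open on failure).
inline int open_at_offset(const char* path, off_t db_offset) {
    assert(db_offset >= 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (lseek(fd, db_offset, SEEK_SET) != db_offset) {
        close(fd);
        return -1;
    }
    // The descriptor would now be passed to the Xapian::Database constructor
    // in place of a filename, as described in the comment above.
    return fd;
}
```

The returned descriptor is what would be handed to Xapian instead of a filename; no separate offset parameter is needed, because the file position itself carries that information.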

The "pack-single-file-glass-db" script is now superseded - this functionality has been folded into database compaction, so you can just use xapian-compact or the compaction API.

Hoping to get this merged for 1.3.4, though could conceivably be 1.3.5.

comment:14 by Olly Betts, 9 years ago

Merged in [2d23dc479341f2b50357e5f24edaab798dc775b5]. There's still a bit of sorting out to do, but it seemed a good idea to get this onto master so we stop accruing conflicts with other changes.

comment:15 by Olly Betts, 9 years ago

As of [17d7af6167b1052407314933a66e85938378bdfb] I think this is now fully working.

The single file databases are now relocatable, so you can pack them into a container file and search from within that.

The remaining TODO is allowing injecting the single file database into an existing file as it is produced - this would allow avoiding having to generate an intermediate file and then copy it. I thought I had written code for this already, but I'm not sure where it got to.

comment:16 by Olly Betts, 9 years ago

Resolution: fixed
Status: assigned → closed

[75727faacc18c585fd352f7e12785e346023a26b] implements compacting to an fd, so I think this is all now done.
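For the write side, a container author would typically need to know where the embedded database begins so that it can be found again later. The sketch below shows only that bookkeeping with plain POSIX calls. `prepare_append` is a hypothetical helper, and it assumes that compacting to an fd starts writing at the descriptor's current position, mirroring the read side; check that against the API documentation for the version you build.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Open an existing container file read-write and seek to its end.
// On success, stores the descriptor in *fd_out and returns the offset at
// which newly written data (e.g. a compacted database) would begin.
// Returns -1 on failure, leaving *fd_out untouched.
inline off_t prepare_append(const char* path, int* fd_out) {
    assert(path != nullptr && fd_out != nullptr);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    off_t start = lseek(fd, 0, SEEK_END);
    if (start < 0) {
        close(fd);
        return -1;
    }
    *fd_out = fd;
    return start;
}
```

The returned offset is the value a reader would later lseek() to before opening the embedded database, so it needs to be recorded somewhere in the container format.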

The only restriction is that such databases are read-only (and have to be created via compacting an existing database). There's no fundamental reason they couldn't support writing, but it would be extra work (I think you'd want to have a single freelist shared between the tables, and to sort out some equivalent to atomically updating the version file when committing).
