Opened 10 years ago
Closed 9 years ago
#666 closed enhancement (fixed)
Implement single-file format for glass backend
Reported by: | Will Greenberg | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.4 |
Component: | Backend-Glass | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | cosimo.cecchi@…, kelson@…, philip.chimento@… | Blocked By: | |
Blocking: | | Operating System: | All |
Description
In some cases, it is advantageous to be able to ship a xapian database as a self-contained, singular file. Glass could have an optional construction parameter which allows reading/writing a database in this single-file format.
Related mailing list thread: http://comments.gmane.org/gmane.comp.search.xapian.general/9664
Change History (16)
comment:1 by , 10 years ago
Cc: | added |
---|
comment:2 by , 10 years ago
Cc: | added |
---|
comment:3 by , 10 years ago
Cc: | added |
---|
comment:4 by , 9 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:5 by , 9 years ago
@Olly
That's really good news! I have checked out your code and compiled it. I have also integrated it with the Kiwix indexing code and... nothing has changed: it looks like I still create a "chert" database (a directory with many files, one of which is called 'iamchert'). I am probably doing something wrong; maybe I should change the way I create the Xapian DB? Currently I do:
Xapian::WritableDatabase(indexPath, Xapian::DB_CREATE_OR_OVERWRITE);
comment:6 by , 9 years ago
@Kelson — there's a comment at the top of the source to `pack-single-file-glass-db`, which should help.
comment:7 by , 9 years ago
There's only read support for single file databases (at present at least), so you need to create a glass database (not chert) and index the data to that first, and then pack it up with that script James pointed to.
You can specify you want to create a glass database with:
Xapian::WritableDatabase(indexPath, Xapian::DB_CREATE_OR_OVERWRITE|Xapian::DB_BACKEND_GLASS);
comment:8 by , 9 years ago
@James @Olly
With your help I was able to go ahead. Everything seems to work well: I'm able to create/read single-file Xapian indexes. This really opens new doors for Kiwix/ZIM files. Kudos!
To be able to use it with ZIM files/Kiwix, I still have some requests/questions:
- this version seems to need librt; I hope this is correctly supported on other OSes (iOS, OSX, Windows, Android). I haven't tested this yet.
- index needs to be "merged" into another file (ZIM file). Would it be possible to have an openIndex() with a baseline offset as an additional parameter?
- this is a pity, but a lot of devices are still not able to deal with files over 4GB, Android devices in particular. Many of our Xapian indexes are over this limit. We have worked around this for ZIM files: https://git.wikimedia.org/blob/openzim/165eab3e154c60b5b6436d653dc7c90f56cf7456/zimlib%2Fsrc%2Ffstream.cpp... But I am not sure how this can work if the Xapian index is merged into the ZIM file...
comment:9 by , 9 years ago
> librt
That's for timing out check_at_least, which is a new feature added in the 1.3 development series - if the functions needed aren't present, this feature is just disabled.
> index needs to be "merged" into another file (ZIM file). Would it be possible to have an openIndex() with a baseline offset as an additional parameter?
Just `lseek()` the fd to where the Xapian database starts before you pass it in.
> a lot of devices are still not able to deal with files over 4GB
So you want everything in one file ... which you then split up into several files?
That's not something which the people kindly funding this development require (or if they do it wasn't mentioned in the brief), so it's not likely to happen as part of my current work on this.
I had thought the days of the maximum file size being an issue were a decade or so behind us, but sadly it seems I was optimistic. What's the actual max file size for these devices, and how big are the databases? When the filesize limit was `2GB - 1 byte`, we used to work around it by splitting databases into a series of 1GB chunks. The really annoying problem is managing filehandles - for a DB split into a lot of chunks, keeping them all open risks running out of filehandles, while having just the chunk you're currently reading from open means you can end up opening and closing the files a lot.
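As a rough illustration of the chunking arithmetic mentioned above (not Xapian's actual code), a logical read can be mapped onto fixed-size chunk files as sketched below. The names `ChunkPiece` and `split_read` are hypothetical, and the sketch assumes a non-zero chunk size; it deliberately leaves out the filehandle management, which is the hard part described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One contiguous piece of a logical read that falls inside a single chunk.
// (Hypothetical type, introduced only for this illustration.)
struct ChunkPiece {
    std::uint64_t chunk_index;      // which chunk file to read from
    std::uint64_t offset_in_chunk;  // byte offset inside that chunk file
    std::uint64_t length;           // number of bytes to read from it
};

// Split a read of `length` bytes starting at logical `offset` into pieces
// that each stay within one chunk of `chunk_size` bytes.
// Assumes chunk_size > 0.
std::vector<ChunkPiece> split_read(std::uint64_t offset,
                                   std::uint64_t length,
                                   std::uint64_t chunk_size) {
    std::vector<ChunkPiece> pieces;
    while (length > 0) {
        std::uint64_t index = offset / chunk_size;
        std::uint64_t within = offset % chunk_size;
        // Never read past the end of the current chunk.
        std::uint64_t n = std::min(length, chunk_size - within);
        pieces.push_back({index, within, n});
        offset += n;
        length -= n;
    }
    return pieces;
}
```

With 1GB chunks, a 10-byte read starting 4 bytes before the first chunk boundary yields two pieces: 4 bytes from the end of chunk 0 and 6 bytes from the start of chunk 1. Each piece then needs its chunk file opened, which is where the trade-off between many open filehandles and frequent open/close calls arises.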
comment:10 by , 9 years ago
If I can use lseek(), then this is simply perfect, but I cannot find a Xapian::Database constructor accepting a fd at http://xapian.org/docs/apidoc/html/classXapian_1_1Database.html. Is it an undocumented feature, or am I just on the wrong page?
Regarding the file splitting, desktop OSes now have good support for big files. The problem we face is mostly on Android, where low-end devices still don't support exFAT. Splitting the index might be an idea I should try.
Otherwise, from my point of view, this ticket is implemented. Thx again for the work. I will use it in Kiwix as soon as this is released.
comment:11 by , 9 years ago
The website documentation is for the stable release, while this code isn't even in git master yet (it's currently on a branch).
You probably have the docs for the version you built in `xapian-core/docs/apidoc/` - if you disabled the documentation building, the doxygen comments in the API headers in `xapian-core/include/xapian/` are usually quite easy to read directly.
I already understood that the issue is with some Android devices, but you didn't answer what the maximum size supported actually is. Is it the `4GB - 1 byte` which FAT32 apparently supports?
comment:12 by , 9 years ago
@Olly
Yes, the limit is 4GB with FAT32.
As for lseek(), please let me know as soon as it is documented how to change the Xapian::Database internal file descriptor position. Then I will be able to test the merge with a ZIM file.
Remark: I don't know if this is part of your plan, but of course, if there were a C++ version of "pack-single-file-glass-db" as part of the library, this would be even better.
comment:13 by , 9 years ago
Component: | Other → Backend-Glass |
---|---|
Milestone: | 1.3.x → 1.3.4 |
kelson: You just open a file descriptor on your file which has the Xapian DB embedded, `lseek()` it to where the Xapian DB starts, and then pass the file descriptor to the `Xapian::Database` constructor (instead of a filename) - Xapian just looks where the descriptor's file position already is.
The "pack-single-file-glass-db" script is now superseded - this functionality has been folded into database compaction, so you can just use xapian-compact or the compaction API.
Hoping to get this merged for 1.3.4, though could conceivably be 1.3.5.
comment:14 by , 9 years ago
Merged in [2d23dc479341f2b50357e5f24edaab798dc775b5]. There's still a bit of sorting out to do, but it seemed a good idea to get this onto master so we stop accruing conflicts with other changes.
comment:15 by , 9 years ago
As of [17d7af6167b1052407314933a66e85938378bdfb] I think this is now fully working.
The single file databases are now relocatable, so you can pack them into a container file and search from within that.
The remaining TODO is allowing injecting the single file database into an existing file as it is produced - this would allow avoiding having to generate an intermediate file and then copy it. I thought I had written code for this already, but I'm not sure where it got to.
comment:16 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
[75727faacc18c585fd352f7e12785e346023a26b] implements compacting to an fd, so I think this is all now done.
The only restriction is that such databases are read-only (and have to be created via compacting an existing database). There's no fundamental reason they couldn't support writing, but it would be extra work (I think you'd want to have a single freelist shared between the tables, and to sort out some equivalent to atomically updating the version file when committing).
I have a working implementation of this for glass (read-only currently, packed up with a script) - see git branch `single-file-glass`. I'm still working on it, but it's at the stage where people could usefully try it out.