Opened 6 years ago
Closed 2 years ago
#780 closed enhancement (fixed)
Support for epub
Reported by: | Jugnu | Owned by: | Jugnu |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | Omega | Version: | 1.4.11 |
Severity: | normal | Keywords: | omega file support omindex |
Cc: | Kelson | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description (last modified by )
Git PR for this issue: https://github.com/xapian/xapian/pull/235
EPUB ------
Skipping - unknown MIME type 'application/epub+zip'
Skipping - unknown MIME type 'application/zip'
The current situation is that omindex is able to index EPUB files. However, work is still needed to correctly parse the metadata, such as the overall title, individual chapters, and authors. Several other formats handled in index_file.cc also get basic indexing but lack proper metadata parsing. These can be found by searching for: FIXME: Implement support for metadata.
To do :
- Tests are also needed to ensure that EPUB support generalizes across different generations of EPUB files, their directory structures, etc.
- Correct metadata parsing
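As background for the two MIME types in the "Skipping" messages above: an EPUB is a ZIP container that declares itself through an entry named mimetype containing the string application/epub+zip. The following is a minimal sketch of telling the two apart by content, using only the Python standard library. It is independent of how omindex itself resolves MIME types, and the function names sniff_zip_mime and make_zip are hypothetical helpers for illustration.

```python
import io
import zipfile

EPUB_MIME = "application/epub+zip"
ZIP_MIME = "application/zip"


def sniff_zip_mime(data: bytes) -> str:
    """Return EPUB_MIME if the ZIP archive declares itself as EPUB, else ZIP_MIME.

    Hypothetical helper: it only inspects the 'mimetype' entry that the
    EPUB container format requires, and does no further validation.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        try:
            declared = zf.read("mimetype").decode("ascii", "replace").strip()
        except KeyError:
            # No 'mimetype' entry at all: treat as a plain ZIP archive.
            return ZIP_MIME
    return EPUB_MIME if declared == EPUB_MIME else ZIP_MIME


def make_zip(entries: dict) -> bytes:
    """Build an in-memory ZIP archive from {name: text} (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()


if __name__ == "__main__":
    print(sniff_zip_mime(make_zip({"mimetype": EPUB_MIME})))
    print(sniff_zip_mime(make_zip({"notes.txt": "hello"})))
```

A check like this could back the test item above, since it does not depend on the file extension or on the directory layout inside the archive.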
Change History (16)
comment:1 by , 6 years ago
comment:2 by , 6 years ago
Are these files you are attaching redistributable? The epub says "Copyright © 2009 by John C. Bogle. All rights reserved" on the title page, which suggests not.
If we don't have a licence to redistribute them we shouldn't be redistributing (and please don't attach files that don't have a licence allowing redistribution).
comment:3 by , 6 years ago
Description: | modified (diff) |
---|---|
Summary: | Support for .pkl file → Support for epub and .pkl file |
comment:4 by , 6 years ago
Description: | modified (diff) |
---|
The error in the description is rather strange:
Exception: DatabaseError: Modifications failed (DatabaseError: Error reading block 0 (Protocol error)), and couldn't open at the old revision: Error reading block 0.
Is that repeatable?
Is there anything unusual about the disk partition which you have the database on?
comment:5 by , 6 years ago
I guess this happened because my Vagrant setup sometimes misbehaves. For example, running ls gave "ls: cannot access shared: Protocol error". I no longer have the problem: after some googling about shared drives being inaccessible, I ran vagrant reload and vagrant ssh.
In addition, I was not able to use cgi-bin for queries. Performing a search from the search bar gave an error like "cannot access query template ; protocol error". These errors were also gone after vagrant reload.
comment:6 by , 6 years ago
Resolution: | → incomplete |
---|---|
Status: | new → closed |
comment:7 by , 6 years ago
Resolution: | incomplete |
---|---|
Status: | closed → reopened |
comment:8 by , 6 years ago
Description: | modified (diff) |
---|
comment:10 by , 6 years ago
Description: | modified (diff) |
---|
comment:11 by , 6 years ago
Description: | modified (diff) |
---|---|
Summary: | Support for epub and .pkl file → Support for epub |
comment:12 by , 5 years ago
Cc: | added |
---|
comment:13 by , 3 years ago
I've deleted the random polarities.pkl as it seems unrelated to EPUB so doesn't seem relevant here, and its copyright status was unclear.
comment:14 by , 2 years ago
We now support indexing files using libe-book, and I just noticed libe-book has some "experimental" support for EPUB, which is disabled unless configured with --enable-experimental.
I've not tried it yet, but probably improving that to a suitable standard to no longer be deemed experimental would be a good way to resolve this. I'd expect that's much less work than implementing our own EPUB parsing, and less work to maintain going forwards. It also potentially benefits other projects using libe-book.
comment:15 by , 2 years ago
I had a brief go at trying it, but it seems to require libcss which isn't packaged for Debian.
Looks like libgepub may be a plausible option and is packaged: https://github.com/danigm/libgepub/blob/master/libgepub/gepub-doc.h
comment:16 by , 2 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Implemented support using libgepub in b2109fcdc802565bb148b8d2be6d2b38920fa7e8. There are automated tests with EPUB 2 and EPUB 3 files exported from libreoffice.
Currently libgepub 0.6 or 0.7 should work, but CI only tests 0.6 due to that being what's available on the Ubuntu versions there.
Author and title are extracted successfully. Page count is set to the number of "chapters" which seems about as close as we can get since each chapter seems to be an HTML document. I wrote code for GEPUB_META_DESC but I don't see that actually appearing from the example files I tried with.
The EPUB format seems to support some other metadata types, but I didn't see how to get them.
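To illustrate where these fields live in the file, here is a minimal sketch that reads the metadata straight from the EPUB package document using only the Python standard library. It is not the libgepub code path used in the commit above. It assumes the usual layout, where META-INF/container.xml points at an OPF file whose metadata element holds Dublin Core entries (dc:title, dc:creator, dc:description and so on) and whose spine lists the "chapters". The names read_epub_metadata and build_sample_epub, and the sample book contents, are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(data: bytes) -> dict:
    """Extract Dublin Core fields and the spine length from an EPUB (sketch)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # container.xml names the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        package = ET.fromstring(zf.read(rootfile.attrib["full-path"]))

    # Collect the first occurrence of every dc:* element under <metadata>.
    dc_prefix = "{" + NS["dc"] + "}"
    fields = {}
    metadata = package.find("opf:metadata", NS)
    for el in (metadata if metadata is not None else []):
        if el.tag.startswith(dc_prefix) and el.text:
            fields.setdefault(el.tag[len(dc_prefix):], el.text.strip())

    return {
        "title": fields.get("title"),
        "author": fields.get("creator"),
        "description": fields.get("description"),  # absent in many files
        "chapters": len(package.findall("opf:spine/opf:itemref", NS)),
        "fields": fields,  # any other Dublin Core entries that are present
    }


def build_sample_epub() -> bytes:
    """Build a tiny in-memory EPUB-like archive (made-up sample data)."""
    container = (
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
        '<rootfiles><rootfile full-path="content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>Sample Book</dc:title><dc:creator>A. Writer</dc:creator>"
        "<dc:language>en</dc:language></metadata>"
        '<manifest><item id="c1" href="c1.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="c2" href="c2.xhtml" media-type="application/xhtml+xml"/></manifest>'
        '<spine><itemref idref="c1"/><itemref idref="c2"/></spine></package>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("content.opf", opf)
    return buf.getvalue()


if __name__ == "__main__":
    print(read_epub_metadata(build_sample_epub()))
```

In this sketch a missing dc:description simply comes back as None, which matches the observation that the description does not appear in every file. The "fields" entry shows which other Dublin Core types a given file actually carries.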
For reference, my omindex and omega are both version 1.5.0.