Opened 6 years ago

Closed 2 years ago

#780 closed enhancement (fixed)

Support for epub

Reported by: Jugnu
Owned by: Jugnu
Priority: normal
Milestone: 1.5.0
Component: Omega
Version: 1.4.11
Severity: normal
Keywords: omega file support omindex
Cc: Kelson
Blocked By:
Blocking:
Operating System: Linux

Description (last modified by Jugnu)

Git PR for this issue: https://github.com/xapian/xapian/pull/235

EPUB ------

Skipping - unknown MIME type 'application/epub+zip'

Skipping - unknown MIME type 'application/zip'

Currently omindex is able to index EPUB files. However, work is still needed to parse the metadata correctly, such as the overall title, individual chapters and authors. Several other formats handled in index_file.cc also get basic indexing but lack metadata parsing; they can be found by searching for: FIXME: Implement support for metadata.

To do:

  1. Tests are also needed to ensure that EPUB support generalizes across different generations of EPUB, their directory structures, etc.
  2. Correct metadata parsing
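
As a rough illustration of what "correct metadata parsing" involves, here is a minimal Python sketch (not the omindex implementation) that follows the EPUB container layout: META-INF/container.xml names the OPF package document, whose metadata block carries the Dublin Core title and creator elements. The helper names find_opf_path and read_epub_metadata are hypothetical, and the sketch assumes a well-formed, unencrypted EPUB.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and package (OPF) documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def find_opf_path(epub):
    """Return the archive path of the OPF package document (hypothetical helper)."""
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


def read_epub_metadata(source):
    """Extract the title and authors from an EPUB (hypothetical helper).

    `source` is a filename or a binary file-like object.
    """
    with zipfile.ZipFile(source) as epub:
        package = ET.fromstring(epub.read(find_opf_path(epub)))
    title = package.findtext("opf:metadata/dc:title", default="", namespaces=NS)
    authors = [
        element.text.strip()
        for element in package.findall("opf:metadata/dc:creator", NS)
        if element.text
    ]
    return {"title": title.strip(), "authors": authors}
```

Calling read_epub_metadata("book.epub") on a hypothetical file would return a dictionary with a "title" string and an "authors" list. Older or unusual EPUBs may place or namespace these elements differently, which is the kind of variation the tests in item 1 would need to cover.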

Change History (16)

comment:1 by Jugnu, 6 years ago

For reference, my omindex and omega are both 1.5.0.

comment:2 by Olly Betts, 6 years ago

Are these files you are attaching redistributable? The epub says "Copyright © 2009 by John C. Bogle. All rights reserved" on the title page, which suggests not.

If we don't have a licence to redistribute them we shouldn't be redistributing (and please don't attach files that don't have a licence allowing redistribution).

comment:3 by Jugnu, 6 years ago

Description: modified (diff)
Summary: Support for .pkl file → Support for epub and .pkl file

comment:4 by Olly Betts, 6 years ago

Description: modified (diff)

The error in the description is rather strange:

Exception: DatabaseError: Modifications failed (DatabaseError: Error reading block 0 (Protocol error)), and couldn't open at the old revision: Error reading block 0.

Is that repeatable?

Is there anything unusual about the disk partition which you have the database on?

comment:5 by Jugnu, 6 years ago

I guess this happened because my Vagrant box sometimes misbehaves: for example, running ls reported "ls: cannot access shared: Protocol error". I no longer have the problem; after some googling about being unable to access shared drives, I ran vagrant reload and vagrant ssh.

In addition, I was not able to use cgi-bin for queries: searching from the search bar gave an error like "cannot access query template ; protocol error". These errors also went away after vagrant reload.

comment:6 by Jugnu, 6 years ago

Resolution: incomplete
Status: new → closed

comment:7 by Jugnu, 6 years ago

Resolution: incomplete
Status: closed → reopened

comment:8 by Jugnu, 6 years ago

Description: modified (diff)

comment:9 by Jugnu, 6 years ago

Added support for inner and outer level html files exploration.

comment:10 by Jugnu, 6 years ago

Description: modified (diff)

comment:11 by Jugnu, 6 years ago

Description: modified (diff)
Summary: Support for epub and .pkl file → Support for epub

comment:12 by Kelson, 5 years ago

Cc: Kelson added

comment:13 by Olly Betts, 3 years ago

I've deleted the random polarities.pkl as it seems unrelated to EPUB so doesn't seem relevant here, and its copyright status was unclear.

comment:14 by Olly Betts, 2 years ago

We now support indexing files using libe-book, and I just noticed libe-book has some "experimental" support for EPUB, which is disabled unless configured with --enable-experimental.

I've not tried it yet, but probably improving that to a suitable standard to no longer be deemed experimental would be a good way to resolve this. I'd expect that's much less work than implementing our own EPUB parsing, and less work to maintain going forwards. It also potentially benefits other projects using libe-book.

comment:15 by Olly Betts, 2 years ago

I had a brief go at trying it, but it seems to require libcss which isn't packaged for Debian.

Looks like libgepub may be a plausible option and is packaged: https://github.com/danigm/libgepub/blob/master/libgepub/gepub-doc.h

comment:16 by Olly Betts, 2 years ago

Resolution: fixed
Status: reopened → closed

Implemented support using libgepub in b2109fcdc802565bb148b8d2be6d2b38920fa7e8. There are automated tests with EPUB 2 and EPUB 3 files exported from libreoffice.

Currently libgepub 0.6 or 0.7 should work, but CI only tests 0.6 due to that being what's available on the Ubuntu versions there.

Author and title are extracted successfully. Page count is set to the number of "chapters" which seems about as close as we can get since each chapter seems to be an HTML document. I wrote code for GEPUB_META_DESC but I don't see that actually appearing from the example files I tried with.

The EPUB format seems to support some other metadata types, but I didn't see how to get them.
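
For anyone wanting to see which metadata a given file actually carries, the following Python sketch inspects the OPF package document directly rather than going through libgepub. It counts the spine entries (assuming, as my own reading rather than anything confirmed in this ticket, that these correspond to the "chapters" mentioned above) and collects every Dublin Core element present. The function name summarize_package is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"


def summarize_package(opf_xml):
    """Return (chapter_count, dublin_core) for an OPF document (hypothetical helper).

    `opf_xml` is the content of the package file as bytes or str.
    """
    package = ET.fromstring(opf_xml)

    # Each <itemref> in the spine names one content document in reading order.
    spine = package.find(f"{{{OPF_NS}}}spine")
    chapters = 0
    if spine is not None:
        chapters = len(spine.findall(f"{{{OPF_NS}}}itemref"))

    # Collect every Dublin Core element, keyed by local name (title, creator, ...).
    dublin_core = defaultdict(list)
    metadata = package.find(f"{{{OPF_NS}}}metadata")
    prefix = f"{{{DC_NS}}}"
    for element in (metadata if metadata is not None else []):
        if element.tag.startswith(prefix) and element.text:
            dublin_core[element.tag[len(prefix):]].append(element.text.strip())

    return chapters, dict(dublin_core)
```

Running this on the OPF file from a sample EPUB shows at a glance whether elements such as description, language or publisher are present in the file at all, which helps separate "the file has no such metadata" from "the library does not expose it".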
