
Text extraction libraries: project plan

Omega's omindex indexer currently supports indexing various document formats such as HTML, plain text, and CSV (for a complete list, refer to the Omega Overview). Support for using external libraries instead of external programs was added in the GSoC 2019 project.

This project focuses on adding support for further shared libraries, which avoids the overhead of running an external filter and thus speeds up indexing.

It also includes completing any remaining tasks in omindextest and getting it merged.



Work under review

Libmwaw

Libmwaw is part of the Document Liberation Project and is based on Librevenge. It parses and converts various pre-OS X Mac text formats, graphics formats, and some presentation formats. Omega currently cannot index these file formats. Libmwaw will allow Omega to extract both text and metadata from such files.

License:

  • GNU Library or Lesser General Public License version 2.0 (LGPLv2),
  • Mozilla Public License 2.0 (MPL 2.0)

Formats that will be added:

The relevant code for this can be found in PR 315.


Merged Libraries and work

Add Mimetype to Handler

The MIME type of each file is now passed to the handlers along with the other parameters. See handler.h for the declaration of extract().

This allows a handler that supports multiple formats to choose which parser to use based on the MIME type.

The relevant code for this can be found in PR 304.
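
As an illustration of how a single handler might choose a parser from the MIME type, here is a minimal sketch. The `Parser` enum and `choose_parser()` are hypothetical names invented for this example and are not taken from handler.h. The MIME type prefixes are the registered ones for the three zip-based format families.

```cpp
#include <string>

// Hypothetical parser families; a real handler would call its own parsers.
enum class Parser { OpenDocument, OpenOfficeXML, OOXML, Unknown };

// Return true if `s` begins with `prefix`.
static bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Choose a parser family from the MIME type passed to the handler.
Parser choose_parser(const std::string& mimetype) {
    if (starts_with(mimetype, "application/vnd.oasis.opendocument."))
        return Parser::OpenDocument;
    if (starts_with(mimetype, "application/vnd.sun.xml."))
        return Parser::OpenOfficeXML;
    if (starts_with(mimetype, "application/vnd.openxmlformats-officedocument."))
        return Parser::OOXML;
    return Parser::Unknown;
}
```

Matching on the MIME type rather than the file extension means the handler relies on whatever type omindex has already assigned to the file, so the extension-to-type mapping stays in one place.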

Libarchive

Omega's indexer already supported various zip-based formats, such as OpenDocument Format (ODF), OpenOffice.org XML (.sxi, .sxc, etc.), and OOXML (.docx, .xlsx, etc.). These formats were previously handled using external filters such as unzip.

The Libarchive project provides a C library which can read and write streaming archives in a variety of formats. Omega now uses Libarchive instead of unzip to read the data inside zip-based formats, and then uses various parsers to extract text from that data.
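
To illustrate the step between reading the archive and parsing, here is a sketch of how members of a zip-based document might be routed once each entry's path is known. It makes no Libarchive calls. `MemberUse` and both functions are hypothetical names, not Omega's, and the member names are the conventional ones for ODF packages and Word OOXML files.

```cpp
#include <string>

// What a handler might do with one archive member (hypothetical categories).
enum class MemberUse { Content, Metadata, Ignore };

// Classify a member of an OpenDocument package by its path inside the zip.
// ODF keeps the document body in content.xml and its metadata in meta.xml.
MemberUse classify_odf_member(const std::string& pathname) {
    if (pathname == "content.xml") return MemberUse::Content;
    if (pathname == "meta.xml") return MemberUse::Metadata;
    return MemberUse::Ignore;
}

// Classify a member of a .docx package by its path inside the zip.
// Word files conventionally keep the body in word/document.xml and the
// core properties (title, creator, ...) in docProps/core.xml.
MemberUse classify_docx_member(const std::string& pathname) {
    if (pathname == "word/document.xml") return MemberUse::Content;
    if (pathname == "docProps/core.xml") return MemberUse::Metadata;
    return MemberUse::Ignore;
}
```

In a real handler the pathname would come from the archive entry currently being iterated, and members classified as Ignore could be skipped without reading their data.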

File formats now supported using Libarchive:

OpenDocument Format

  • .odt
  • .ods
  • .odp
  • .odg
  • .odc
  • .odf
  • .odb
  • .odi
  • .odm
  • .ott
  • .ots
  • .otp
  • .otg
  • .otc
  • .otf
  • .oti
  • .oth

OpenOffice.org XML

  • .sxc
  • .stc
  • .sxd
  • .std
  • .sxi
  • .sti
  • .sxm
  • .sxw
  • .sxg
  • .stw

OOXML Format

  • .docx
  • .dotx
  • .xlsx
  • .xltx
  • .pptx
  • .ppsx
  • .potx

The relevant code can be found in handler_libarchive.cc. See PR 300 and PR 303.

Libabw

Omega uses XMLparser and file_to_string() to extract content from AbiWord and compressed AbiWord files.

Libabw is a library based on Librevenge that parses the file formats of AbiWord documents. This allows Omega to index metadata from AbiWord files along with the main content.

License: MPL 2.0

Issues:

  • Some versions of Libabw fail to extract the title from the file. This has since been fixed, but if the user has an older version of Libabw then the title may not be extracted.

Formats supported:

  • .abw
  • .zabw

The relevant code for this can be found in handler_libabw.cc and in PR 307.

Libcdr

CorelDRAW is a vector graphics editor by Corel Corporation. Previously, Omega had no support for CorelDRAW files. Libcdr allows Omega to extract text from such files and thus extends its capability.

Libcdr is based on Librevenge and is a part of the Document Liberation Project.

License: MPL 2.0

Issues

  • Libcdr doesn't seem to extract any data from .cmx files. CMX files opened in LibreOffice Draw on Ubuntu 19.10 don't display their text either. I checked with different fonts in case my system lacked a particular font, but this is not a font issue.
  • Libcdr doesn't seem to extract any metadata from .cdr files. CorelDRAW files are zip-compressed (versions >= X4), so Libarchive along with metaXMLparser might be the solution.
  • .cdr files from version X4 (14) onwards are zip-compressed directories. Libcdr seems to extract content, including different character sets, from .cdr files older than X4.
  • Libcdr versions >= 0.1.6 extract text correctly from .cdr files of version X6 and later. Older versions of Libcdr lack the bug fix needed to extract this text correctly.

Format added:

  • .cdr

The relevant code can be found in handler_libcdr and in PR 311.

Libextractor

Modern file formats have provisions for describing a file using "metadata". The goal of the Libextractor project is to provide an interface for obtaining metadata from various file formats. Omega can use Libextractor to extract metadata from various audio and video formats. Depending on the format, Libextractor provides information such as the name of the software used to create the file, the author, descriptions, album titles, image dimensions, or the duration of a movie.

Libextractor uses helper libraries to perform extraction from the files.

Issues

  • As Libextractor uses external libraries to extract metadata from files, different systems may have different libraries installed, so the files it can extract data from might vary from system to system. Libextractor's documentation does mention, however, that for a corrupt file or a file of unknown format it will just return NULL and not an error.
  • The major hurdle in adding tests for Libextractor is that Libextractor itself may be present (so the file is indexed) while the specific plugin for that file format may or may not be present. Libextractor usually uses 2 plugins: mime and a format-specific plugin such as gstreamer or ogg.
    • If the format-specific plugin is present, it will extract a lot of metadata.
    • If the format-specific plugin is not present, it will extract only the MIME type (from the mime plugin).
    So the test program has to know whether the plugin is present and, on that basis, decide whether to pass, fail, or skip the test. This is solved by changing compare_test() to return an enum instead of a bool, which enables the main program to identify the test result as passed, failed, or skipped.
    • Pass - When all terms are found.
    • Skip - When no terms are found.
    • Fail - When at least one term is not found, but not all of them are missing.
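
The pass/skip/fail rule above can be sketched as follows. This is only an illustration under stated assumptions, not the code in omindextest: `TestResult` and `classify_result()` are hypothetical names, and the indexed terms are passed in as a plain set rather than read from a Xapian database.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Three-way test outcome, replacing a plain bool.
enum class TestResult { Pass, Fail, Skip };

// Compare the terms a testcase expects against the terms actually indexed.
// Note: an empty `expected` list counts as Pass in this sketch.
TestResult classify_result(const std::unordered_set<std::string>& indexed,
                           const std::vector<std::string>& expected) {
    std::size_t found = 0;
    for (const std::string& term : expected) {
        if (indexed.count(term) != 0) ++found;
    }
    if (found == expected.size()) return TestResult::Pass;  // all terms found
    if (found == 0) return TestResult::Skip;  // none found: plugin likely absent
    return TestResult::Fail;                  // some found, some missing
}
```

Treating "no terms found" as a skip rather than a failure reflects the case where the format-specific plugin is not installed, so the file yields nothing to check.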

Formats that can be added:

  • .flac
  • .mp3
  • .ogg
  • .oga
  • .spx
  • .ogv
  • .wav
  • .s3m
  • .XM
  • .IT
  • .flv
  • .avi
  • .mpg
  • .qt
  • .asf

The relevant code for this can be found in handler_libextractor and PR 306.

Add per-testcase flags in omindextest

This changes the representation of a testcase from a vector to a struct in omindexcheck.cc. This enables us to set per-testcase flags such as FAIL_IF_NO_TERMS, which lets a test in Omega pass, fail, or be skipped depending on the type of file in which no terms were found. It also allows other flags to be added in future.

The relevant code for this can be found in PR 306.
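
A minimal sketch of what a struct-based testcase with flags could look like is given below. Only the flag name FAIL_IF_NO_TERMS comes from the text above. The struct layout, the `NO_FLAGS` value, the `Outcome` enum, and `outcome_when_no_terms()` are assumptions made for illustration and may differ from omindexcheck.cc.

```cpp
#include <string>
#include <vector>

// Bit flags that can be combined per testcase; more can be added later.
enum TestFlag : unsigned {
    NO_FLAGS = 0,
    FAIL_IF_NO_TERMS = 1u << 0
};

// A testcase as a struct rather than a bare vector of terms (assumed layout).
struct Testcase {
    std::string file;                // path of the file being checked
    unsigned flags;                  // bitwise OR of TestFlag values
    std::vector<std::string> terms;  // terms expected in the index
};

enum class Outcome { Pass, Fail, Skip };

// Decide what happens when none of the expected terms were indexed:
// fail if the testcase demands terms, otherwise skip.
Outcome outcome_when_no_terms(const Testcase& t) {
    return (t.flags & FAIL_IF_NO_TERMS) ? Outcome::Fail : Outcome::Skip;
}
```

With this shape, a testcase for a format that is always parsed by a required library could set FAIL_IF_NO_TERMS, while one that depends on an optional plugin could leave it unset and be skipped when nothing is extracted.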

Last modified on 01/15/21 16:53:50