Text Extraction
Work Product Summary
The aim of this project is to extend Omega's functionality to extract text and metadata from various file formats for indexing. Several formats that were previously handled by external filters are now handled by shared libraries instead.
Pull requests and Commits
The main pull requests of this project are:
- Add mime-type to handlers
- Add support for ODF formats - Libarchive
- Add support for OOXML formats - Libarchive
- Add support for Abiword documents - Libabw
- Add support for CorelDRAW files - Libcdr
Link to all merged commits
Details about these can be found on the Project Plan page, under the Merged Libraries and work section.
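As a rough illustration of why an archive library is enough for the ODF handler listed above: an ODF file is a ZIP archive whose body text lives in content.xml and whose metadata lives in meta.xml. The sketch below is not Omega's implementation (Omega is written in C++ and uses Libarchive). It swaps in Python's standard zipfile module to show the same idea, and the helper names extract_odf and make_sample_odf are hypothetical. It only reads paragraph elements and the title, which is far less than a real handler would cover.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS_TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS_OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
NS_DC = "http://purl.org/dc/elements/1.1/"


def extract_odf(data: bytes) -> dict:
    """Return paragraph text and title from the bytes of an ODF file.

    Hypothetical helper for illustration; not part of Omega.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        content = ET.fromstring(archive.read("content.xml"))
        meta = ET.fromstring(archive.read("meta.xml"))
    # Collect the text of every paragraph element, in document order.
    paragraphs = [
        "".join(p.itertext()) for p in content.iter(f"{{{NS_TEXT}}}p")
    ]
    title = meta.findtext(f".//{{{NS_DC}}}title", default="")
    return {"text": "\n".join(paragraphs), "title": title}


def make_sample_odf() -> bytes:
    """Build a tiny in-memory archive with the two members read above.

    Sample data only; a real ODF file contains more members.
    """
    content = (
        f'<office:document-content xmlns:office="{NS_OFFICE}" '
        f'xmlns:text="{NS_TEXT}">'
        "<office:body><office:text>"
        "<text:p>Hello</text:p>"
        "<text:p>indexing <text:span>world</text:span></text:p>"
        "</office:text></office:body></office:document-content>"
    )
    meta = (
        f'<office:document-meta xmlns:office="{NS_OFFICE}" '
        f'xmlns:dc="{NS_DC}">'
        "<office:meta><dc:title>Sample</dc:title></office:meta>"
        "</office:document-meta>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("content.xml", content)
        archive.writestr("meta.xml", meta)
    return buf.getvalue()


if __name__ == "__main__":
    print(extract_odf(make_sample_odf()))
```

Because both the text and the metadata come out of one in-process pass over the archive, no external filter program has to be started for each file.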
Work in Progress
I am currently working on adding support for legacy Mac documents using Libmwaw. The PRs under review are:
- Add support for legacy Mac documents - Libmwaw
- PR 306
- Adding support for extracting metadata from Audio/Video files - Libextractor
- Changing the implementation of omindexcheck.
Details about these can be found on the Project Plan page, under the Work under review section.
Future Work
Omega already supports many popular file formats. Support for some obsolete file formats could still be added. Refer to the Future Work page for details about some of them.
Several file formats are still supported through external filters. Handling them with shared libraries instead would be more efficient and would, in some cases, also allow us to extract metadata. There is documentation on how to do this.