Text Extraction

Work Product Summary

The aim of this project is to extend Omega so that it can extract text and metadata from more file formats for indexing. Several formats that were previously handled by external filters are now handled using shared libraries instead.

Pull requests and Commits

The main pull requests of this project are:

Link to all merged commits

Details about these can be found on the Project Plan page, in the "Merged Libraries and work" section.

Work in Progress

Currently, I am working on adding support for legacy Mac documents using libmwaw. The PRs under review are:

Details about these can be found on the Project Plan page, in the "Work under review" section.

Future Work

Omega already supports many popular file formats, but support for some obsolete file formats could still be added. Refer to the Future Work page for details about some of them.

Various file formats are still supported through external filters. Handling these formats with shared libraries instead would make extraction more efficient and, in some cases, would also allow us to extract metadata. There is documentation on how to do this.
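
To illustrate the difference between the two approaches, here is a minimal sketch. It is not Omega's actual code: the `Extracted` struct and both function names are hypothetical, and `extract_in_process` is a toy stand-in for a real parsing library. It assumes a POSIX system where `popen` is available.

```cpp
#include <cassert>
#include <stdexcept>
#include <stdio.h>
#include <string>

// Hypothetical result type: the text to index plus optional metadata.
struct Extracted {
    std::string text;
    std::string title;  // left empty when the handler cannot provide one
};

// External-filter style: spawn a command and capture its standard output.
// A plain filter only yields a stream of text, so no metadata is filled in.
Extracted run_external_filter(const std::string& command) {
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");  // one extra process per file
    if (!pipe) throw std::runtime_error("failed to start filter");
    Extracted out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) {
        out.text.append(buf, n);
    }
    if (pclose(pipe) != 0) throw std::runtime_error("filter failed");
    return out;
}

// Library style: parse in-process and return structured results.
// Toy stand-in for a real library: the first line is treated as the
// title and everything after it as the body text.
Extracted extract_in_process(const std::string& data) {
    Extracted out;
    std::string::size_type eol = data.find('\n');
    if (eol == std::string::npos) {
        out.title = data;
    } else {
        out.title = data.substr(0, eol);
        out.text = data.substr(eol + 1);
    }
    return out;
}
```

The external version pays for a process launch on every file and only gets back a flat stream of text, while the in-process version can return separate fields such as a title alongside the body.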

Last modified on 30/08/20 19:28:18