
Text-Extraction Libraries

Work Product

The aim of this project is to add support for extracting text from various file formats during indexing by using external libraries. The core of the work is the worker and assistant modules, which provide a way to integrate new libraries. These modules handle library errors and run each library in a subprocess, so that a failure in a library cannot crash omindex.
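The isolation idea can be illustrated with a minimal sketch. This is not the omindex implementation (which is written in C++); it is a Python approximation that assumes a hypothetical helper program, here called ASSISTANT_CODE, standing in for an assistant process. The function name extract_text, the ".bad" suffix used to simulate a crash, and the timeout value are all illustrative assumptions.

```python
import subprocess
import sys

# Hypothetical stand-in for an assistant process: it receives a filename
# as its first argument and prints the "extracted" text. Files ending in
# ".bad" simulate a crash inside an external library.
ASSISTANT_CODE = r"""
import os, sys
path = sys.argv[1]
if path.endswith(".bad"):
    os.abort()  # simulate the library crashing the helper process
print("text extracted from " + path)
"""


def extract_text(path, timeout=10.0):
    """Run the helper in a child process.

    Returns the extracted text, or None if the helper crashed, exited
    with an error, or took longer than `timeout` seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", ASSISTANT_CODE, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung helper is treated like any other failure.
        return None
    if result.returncode != 0:
        # The crash stayed inside the child; the caller keeps running.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    for name in ("report.pdf", "broken.bad"):
        print(name, "->", extract_text(name))
```

Because the crash happens in the child process, the caller only sees a failed exit status and can skip the file and continue with the next one.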

Pull Request and Commits

The main pull requests of this project are:

These are the pull requests corresponding to the added libraries:

Link containing all merged commits

Please read Notes for more information about this work.

Work in Progress

Currently, I am working on Omindextest and adding test cases to it. I would like to extend it to test other features of the program, as I consider automated testing of omindex really important.

Future Work

As future work, I think that improving Omindextest would be important. Adding more test cases, both to cover different features and to improve the reliability of omindex, is crucial for developing code that can be maintained in the long term.

Adding new formats and libraries is another possible direction. There is documentation on how to do this, and it is advisable to choose popular formats and libraries with an active community.

Last modified on 22/08/19 13:02:34