
Future Work

Omega supports a wide variety of file formats. Many of them are currently handled by external filters. Adding support for these using shared libraries will be more efficient and, in some cases, will also allow us to extract meta-data. Support can also be added for file formats that are not handled at all yet. Some candidates are listed below.

There is documentation on how to add support for a new file format.
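
To illustrate the overhead that motivates moving away from external filters, here is a minimal sketch of running a text-extraction command as a separate process and capturing its output. It assumes a POSIX system. `run_filter` is a hypothetical helper written for this page, built on `popen`, and is not Omega's actual filter-running code.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Omega): run a shell command and
// return everything it writes to standard output.
// The caller is responsible for quoting any file names in `command`.
std::string run_filter(const std::string& command) {
    // popen() starts a shell, which in turn starts the filter process.
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("could not start filter");
    }

    std::string output;
    char buf[4096];
    size_t n;
    // Read the filter's output until end of file.
    while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0) {
        output.append(buf, n);
    }

    // pclose() waits for the process; a non-zero status means failure.
    int status = pclose(pipe);
    if (status != 0) {
        throw std::runtime_error("filter failed");
    }
    return output;
}
```

With a helper like this, a call such as `run_filter("wpd2text 'file.wpd'")` would return the extracted text, assuming wpd2text is installed. Every call starts a shell and a new process for each file, which is the cost that a shared-library approach avoids.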

Libraries and file formats that can be supported

Libwpd

WordPerfect (.wpd) files are currently supported using the 'wpd2text' command.

This can be replaced with libwpd, a library for reading and writing WordPerfect documents. It imports WordPerfect 4/5/6/7/8/9/10/11 and WordPerfect for Macintosh 1.x/2.x/3.5e files. libwpd is based on librevenge.

License: GNU Library or Lesser General Public License version 2.0 (LGPLv2), Mozilla Public License 2.0 (MPL 2.0)

Formats that it will support:

  • .wpd

LaTeX

LaTeX is very widely used to produce technical and scientific documentation. There aren't many parsers available that can extract text from LaTeX documents perfectly. If a library is available that can extract text and meta-data from LaTeX documents, it would be a very nice addition to Omega's functionality.

Librevenge

There are various libraries under the Document Liberation Project based on Librevenge.

Many of these, such as libe-book, libetonyek, libcdr and libabw, have already been added to Omega. There are still a few for which support can be added. A list of such libraries is available; libwpd is one of them.

Last modified on 30/08/20 17:12:02