
Notes about the problems faced during the project

Adding the bases

The core of this project is the Worker class and the assistant process. Through them, omindex will be able to use external libraries to extract information from different file types, so their communication and run-time behaviour are really important. In this commit you will find the code (a first version) related to this part of the project.

The aim of the Worker class is to communicate with the assistant process through sockets and keep track of its activity. The assistant process, in turn, uses the library and tries to extract all the necessary information.
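To make the division of labour concrete, here is a minimal sketch of a worker talking to an assistant over a socket pair. It is not the code from the commit: the framing (a 4-byte length followed by the payload), the function names and the fake_extract() stand-in for the library call are all assumptions made for illustration, and it ignores details such as EINTR and SIGPIPE that real code would need to handle.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <string>

// Write exactly len bytes, retrying on short writes.
static bool write_all(int fd, const void* buf, std::size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return false;
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Read exactly len bytes, failing if the peer closes the socket early.
static bool read_all(int fd, void* buf, std::size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return false;
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical framing: 4-byte length in host byte order, then the payload.
static bool send_string(int fd, const std::string& s) {
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    return write_all(fd, &len, sizeof(len)) &&
           write_all(fd, s.data(), s.size());
}

static bool recv_string(int fd, std::string& out) {
    std::uint32_t len = 0;
    if (!read_all(fd, &len, sizeof(len))) return false;
    out.resize(len);
    return len == 0 || read_all(fd, &out[0], len);
}

// Stand-in for the call into an external extraction library (assumption).
static std::string fake_extract(const std::string& filename) {
    return "text of " + filename;
}

// Worker side: start an assistant, send it one filename, collect the reply
// and check how the assistant exited.
bool extract_via_assistant(const std::string& filename, std::string& text) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return false;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (child == 0) {
        // Assistant process: receive a filename, reply with extracted text.
        close(fds[0]);
        std::string name;
        bool ok = recv_string(fds[1], name) &&
                  send_string(fds[1], fake_extract(name));
        close(fds[1]);
        _exit(ok ? 0 : 1);
    }

    // Worker: send the request, wait for the answer, then reap the child.
    close(fds[1]);
    bool ok = send_string(fds[0], filename) && recv_string(fds[0], text);
    close(fds[0]);
    int status = 0;
    if (waitpid(child, &status, 0) < 0) return false;
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling extract_via_assistant("doc.pdf", text) should fill text with the assistant's reply. Running the library in a separate process means that a crash inside it ends only the assistant, which the worker can detect from the failed read and the exit status.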

Poppler

Poppler is a PDF rendering library based on the xpdf-3.0 code base. It is a popular library with good documentation, so it is not difficult to find information about it on the internet. At the moment (June 2019) the latest version is 0.77, but I am using older methods for compatibility reasons.

One problem was that libpoppler.so is linked against GNU libstdc++, so we cannot build against it with clang++ -stdlib=libc++. It was therefore necessary to modify the build system and add AC_LINK_IFELSE to check whether we can link poppler into the program. You can read more about this here.

Another option was to use the poppler-glib version, which seems to compile, but we didn't want to add glib as a dependency. In this commit you can see the handler implemented with poppler-glib.

It seems that pdftotext is faster for some files like this. I am not sure why, but I think poppler-cpp may convert from UTF-8 to UTF-16 and back again, or it may be related to the fact that pdftotext uses some internal library.

Inside this commit you will find the code related to this library.

Build System

A lot of problems were related to the build system. As it was my first time with this kind of tool, it took me a while to learn how to modify it properly, and this delayed me a lot. I think the biggest problem was adding PKG_CHECK_MODULES to the system. This macro is defined in pkg.m4, but the system has trouble finding it on some operating systems (such as Mac OS). It was necessary to modify bootstrap and add the different directories where this file could be. Here is a little commit with one of the modifications.

Libe-book

Currently I am coding a handler for e-book formats using libe-book. It seems to work using librevenge. You can read a bit about it here.

One problem was related to the mime types. Most of the files were identified as text/xml or application/octet-stream, so I had to add some mime types to mimemap.tokens. Besides, the library seems to have problems with some particular files, so I opened a ticket.

It is not possible to add some formats (fb2.zip) to mimemap.tokens. Omindex takes everything after the final . as the extension to look up, so foo.fb2.zip would be looked up as extension zip. We will have to fix this if we want to handle this kind of format.
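The lookup problem can be illustrated with a small sketch. The first function mirrors the behaviour described above. The second is only a hypothetical way of also recovering a two-part extension and is not something omindex does. Both function names are invented for this example, and both assume they are given a leaf file name rather than a full path.

```cpp
#include <string>

// Everything after the final '.', as in the lookup described above.
// Returns an empty string if the name contains no '.'.
std::string final_extension(const std::string& filename) {
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos) return std::string();
    return filename.substr(dot + 1);
}

// Hypothetical alternative: everything after the second-to-last '.', so
// that "foo.fb2.zip" yields "fb2.zip". Returns an empty string when the
// name has fewer than two dots.
std::string compound_extension(const std::string& filename) {
    std::string::size_type last = filename.rfind('.');
    if (last == std::string::npos || last == 0) return std::string();
    std::string::size_type prev = filename.rfind('.', last - 1);
    if (prev == std::string::npos) return std::string();
    return filename.substr(prev + 1);
}
```

With these, final_extension("foo.fb2.zip") gives "zip" while compound_extension("foo.fb2.zip") gives "fb2.zip". One possible fix would be to try the compound form first and fall back to the final extension when it is not in the map.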

Libetonyek

There is an issue with this library on some versions of Mac OS. It is possible to install libetonyek through brew install libetonyek, but there is a problem with its dependency liblangtag, which unfortunately cannot be installed with brew. It seems that the only options are using MacPorts or installing it from source.

To solve this problem I installed liblangtag from source on a virtual machine and then compiled omindex to check that everything was okay. It seems to work well, but I am not sure how to modify the travis file. To install it I followed these steps and modified one line of the configure.ac file (I changed libtool --config to glibtool --config at line 329). I reported this issue to brew.

Mimetic

Mimetic is a free, MIT-licensed email (MIME) library written in C++, designed to be easy to integrate yet fast and efficient. To use it I had to become familiar with several RFC standards and their details. Although it seems to be a good library, it does not provide methods to decode the subject or author of an email. You can read more about this problem here.

To solve this, I implemented my own parser, but there turned out to be a lot of edge cases (most of them because some clients don't follow the RFC standards properly), so we decided to discard this library. Here is the commit with this work.
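To give an idea of what decoding a header involves, here is a minimal sketch that decodes a single RFC 2047 encoded-word using the Q encoding. It is not the parser from the commit, and decode_q_word is a name invented for this example. It deliberately ignores the B (base64) encoding, charset conversion, encoded-words mixed with plain text and the rules for whitespace between adjacent encoded-words, which is where many of the edge cases come from.

```cpp
#include <string>

// Value of a hexadecimal digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode one encoded-word of the form =?charset?Q?text?=.
// On success, charset holds the declared charset and bytes holds the raw
// decoded bytes, still in that charset. Returns false for anything else,
// including the B encoding.
bool decode_q_word(const std::string& word, std::string& charset,
                   std::string& bytes) {
    const std::string::size_type npos = std::string::npos;
    if (word.size() < 8) return false;
    if (word.compare(0, 2, "=?") != 0) return false;
    if (word.compare(word.size() - 2, 2, "?=") != 0) return false;

    std::string::size_type q1 = word.find('?', 2);   // ends the charset
    if (q1 == npos || q1 == 2) return false;
    std::string::size_type q2 = word.find('?', q1 + 1);  // ends the encoding
    if (q2 == npos || q2 != q1 + 2) return false;
    if (q2 + 1 > word.size() - 2) return false;      // no room for the text
    if (word[q1 + 1] != 'Q' && word[q1 + 1] != 'q') return false;

    charset = word.substr(2, q1 - 2);
    std::string text = word.substr(q2 + 1, word.size() - 2 - (q2 + 1));

    bytes.clear();
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c == '_') {
            bytes += ' ';                            // '_' stands for a space
        } else if (c == '=') {
            if (i + 2 >= text.size()) return false;  // truncated escape
            int hi = hex_value(text[i + 1]);
            int lo = hex_value(text[i + 2]);
            if (hi < 0 || lo < 0) return false;
            bytes += static_cast<char>(hi * 16 + lo);
            i += 2;
        } else {
            bytes += c;
        }
    }
    return true;
}
```

For example, decoding "=?ISO-8859-1?Q?Andr=E9_Dupont?=" reports the charset ISO-8859-1 and the bytes "Andr", 0xE9, " Dupont". A real handler would still need to convert those bytes to UTF-8.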

Here is more information about Multipurpose Internet Mail Extensions (MIME). Please read this part.

Tesseract

Tesseract is an OCR (Optical Character Recognition) engine which can be added to omindex to handle images. It seems to work well with a simple configuration and hasn't caused any particular problems (yet). Here is the commit.

Documentation

Currently, this is the only documentation about how to add new formats to omindex, so it is important to update it and add the new features of omindex to it. In this PR you can see more about this work.

Improve Worker and Assistant

It was necessary to improve the Worker class and the assistant process. These modifications allow the assistant process to support non-fatal errors. We also improved the error reporting to show more information at indexing time.

Another feature that may be useful in the future is giving the assistant process the ability to take arguments from the command line. Here is an example of it.

Note: It is possible for an assistant process to use another assistant if required. For example, we can use an OCR engine to extract information from images inside a PDF file. To do that, we can instantiate a Worker inside a handler and use it to get the desired information.

GMime 2.6

GMime is a C/C++ library which may be used for the creation and parsing of messages using the Multipurpose Internet Mail Extensions (MIME). This library is a bit complex, but it solves the problem that we had with mimetic and it seems to work really well. At the moment I have some charset issues, but I hope to solve them this week. Here is the commit.

Last modified on 12/08/19 13:51:38