Text extraction libraries: timeline
Community Bonding Period: (May 4 - June 1) Module 1,2
- Get to know the community and interact with its members over the entire bonding period.
- Discuss and find a minor issue directly related to the project that needs to be fixed. (2 days)
- Identify a solution to the issue and discuss with members. (2 days)
- Write code to fix the issue, and write relevant tests and documentation as necessary. Familiarize myself with the development workflow. (3 days)
- Submit a PR, write tests/documentation and work with mentors to get PR merged. (4 days)
- Read https://www.documentliberation.org/projects/#import-libs and http://djvu.sourceforge.net/ to draw up a list of potential libraries that can be added to Xapian. (3 days)
- Present the list to mentors and discuss which libraries are crucial and which ones to focus on for this project. (4 days)
- Discuss and finalize the structure of the code such as where it will be added and what tests and documentation are necessary. (2 days)
Phase 1: (June 1 - June 29) Module 3
- Create a handler, which is a process used by omindex to access a library. (5 days)
- Create a file 'handler_yourlibrary.cc' that includes 'handler.h' and defines the function 'extract' (declared in 'xapian-applications/omega/handler.h').
- Use the library to get necessary information and store it in corresponding arguments such as 'dump', 'title', 'author', etc.
- Update the build system. (6 days)
- Modifying 'configure.ac'
- Check if the library is available or not using 'PKG_CHECK_MODULES', which is a macro that provides an easy way to check for the presence of a given package in the system.
- Other macros that may be useful are:
- 'AC_CHECK_HEADERS', which defines a 'HAVE_header-file' if the header-file provided in arguments exists.
- 'AC_DEFINE', which is used to define a C preprocessor symbol that will indicate the results of a feature test.
- 'AC_COMPILE_IFELSE', which is used to check a syntax feature of a particular language's compiler, or to simply try some library feature.
- 'AC_LINK_IFELSE', which is used to compile and link test programs to test for functions and global variables.
- Modifying 'Makefile.am'
- Add the program to 'EXTRA_PROGRAMS'
- Define variables if necessary:
- 'omindex_yourlibrary_SOURCES'
- 'omindex_yourlibrary_LDADD'
- 'omindex_yourlibrary_CPPFLAGS'
- Add a new worker for the MIME type to omindex. (3 days)
- This can be done in the function 'add_default_libraries' in 'index_file.cc'.
- The preprocessor symbol defined by 'configure.ac', 'HAVE_header-file', is used here: a new worker is created only if the symbol is defined.
- Compile the code to make sure that everything is okay. If the modifications are correct, a new executable 'omindex_yourlibrary' will be present in the working directory.
- Testing and Evaluation (6 days)
- Add unit tests and individual tests for the library. Unit testing here means that I will find some zip files that are licensed for free redistribution and verify that the shared library handler works well on them.
- Testing Omega. Omega's testsuite can be run in the same way as that of xapian-core, by running 'make check' within the 'omega' directory. It runs several small tests such as 'atomparsetest', 'htmlparsetest', 'utf8convertest', etc.
- Make changes based on feedback and discussion during each of the above steps.
- Submit PR and merge.
- Submit Evaluation for Phase 1.
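As a rough illustration of the handler and worker steps in Phase 1, the sketch below shows the shape such code might take. It is not Omega's actual code: the 'extract' signature, the 'default_workers' helper, the MIME type 'application/x-yourformat', and the symbol 'HAVE_YOURLIBRARY_H' are all hypothetical placeholders, and the real declarations in 'xapian-applications/omega/handler.h' and 'index_file.cc' may differ.

```cpp
// Sketch only: every name here is a placeholder, not Omega's real API.
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the 'extract' function that a handler defines.
// Returns true on success and fills the output arguments; on failure it
// returns false and describes the problem in 'error'.
bool extract(const std::string& filename,
             std::string& dump,
             std::string& title,
             std::string& author,
             std::string& error)
{
    std::ifstream in(filename, std::ios::binary);
    if (!in) {
        error = "cannot open " + filename;
        return false;
    }
    // A real handler would call the library here. This placeholder simply
    // copies the file contents so the control flow can be exercised.
    std::ostringstream body;
    body << in.rdbuf();
    dump = body.str();
    // Placeholder metadata: treat the first line as the title.
    title = dump.substr(0, dump.find('\n'));
    author.clear();
    return true;
}

// Hypothetical registry mapping a MIME type to a helper program name. It
// stands in for the workers registered in 'add_default_libraries'.
std::map<std::string, std::string> default_workers()
{
    std::map<std::string, std::string> workers;
#ifdef HAVE_YOURLIBRARY_H  // assumed to be defined by configure when the header is found
    workers["application/x-yourformat"] = "omindex_yourlibrary";
#endif
    return workers;
}
```

When the configure check fails, the symbol is left undefined and no worker is registered, so omindex would simply not offer the format instead of failing to build.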
Phase 2: (June 29 - July 27)
- Libarchive adds support for a variety of file extensions including gzip, bzip2, xz, lzip, etc. This shows the versatility and utility of adding libarchive. As explained in the Modules section of the proposal, I noticed that there is little support for e-book and publishing file formats. libe-book supports various file extensions such as .epub, .pdb, .fb2, .zvr, etc. This wiki page (https://wiki.documentfoundation.org/DLP/Libraries#Import_Libs) provides the complete list.
- Libraries proposed:
- (to be decided) (3 weeks, estimated)
- Create a handler to access the library (3 days)
- Update the build system. (4 days)
- Add worker to omindex. (3 days)
- Compile, test, and document. (5 days)
- libpagemaker (https://wiki.documentfoundation.org/DLP/Libraries/libpagemaker): It is a library that parses the file format of Aldus/Adobe PageMaker documents. (1 week)
- (to be decided) (3 weeks, estimated)
- Each of these will require me to create their individual handlers, update the build system, and add new workers to omindex in a similar manner.
Phase 3: (July 27 - August 24)
- In this phase, I will focus on adding support for libraries related to file formats for digital drawing and graphics. Specifically, I intend to focus on libzmf, libfreehand, and libcdr. The choice of these libraries is subject to approval from the community during the initial bonding and discussion phase.
- The procedure for adding each of these libraries should be the same as the one described in detail in Phase 1 for libarchive.
- Libraries proposed:
- libcdr (https://wiki.documentfoundation.org/DLP/Libraries/libcdr): This is for CorelDRAW. This includes file formats like .cdr, .cmx. (8 days)
- libfreehand (https://wiki.documentfoundation.org/DLP/Libraries/libfreehand): This is used for Adobe FreeHand. (6 days)
- libzmf (https://wiki.documentfoundation.org/DLP/Libraries/libzmf): Zoner Callisto/Draw import library. This includes file extensions such as .zmf. (6 days)
- Each of these will require me to create their individual handlers, update the build system, and add new workers to omindex in a similar manner.
Final Week: (August 24 - August 31)
- This is a buffer week to complete any pending tasks that may be left over.
- Submit final evaluations and prepare final report.
Last modified on 05/05/20 14:35:01