Text extraction libraries: timeline

Community Bonding Period: (May 4 - June 1) Modules 1, 2

  • Get to know the community and interact with its members over the entire bonding period.
  • Discuss and find a minor issue directly related to the project that needs to be fixed. (2 days)
  • Identify a solution to the issue and discuss it with community members. (2 days)
  • Write code to fix the issue, with relevant tests and documentation as necessary. Familiarize myself with the development workflow. (3 days)
  • Submit a PR, write tests/documentation, and work with mentors to get the PR merged. (4 days)
  • Read https://www.documentliberation.org/projects/#import-libs and http://djvu.sourceforge.net/ to draw up a list of potential libraries that could be added to Xapian. (3 days)
  • Present the list to mentors and discuss which libraries are crucial and which ones to focus on for this project. (4 days)
  • Discuss and finalize the structure of the code, such as where it will be added and what tests and documentation are necessary. (2 days)

Phase 1: (June 1 - June 29) Module 3

  • Create a handler, which is a process used by omindex to access a library. (5 days)
    • Create a file 'handler_yourlibrary.cc' that includes 'handler.h' and defines the function 'extract' (declared in 'xapian-applications/omega/handler.h').
    • Use the library to get the necessary information and store it in the corresponding arguments, such as 'dump', 'title', 'author', etc.
  • Update the build system. (6 days)
    • Modifying 'configure.ac'
      • Check whether the library is available using 'PKG_CHECK_MODULES', a macro that uses pkg-config to check for the presence of a given package on the system.
      • Other macros that may be useful are:
        • 'AC_CHECK_HEADERS', which defines 'HAVE_header-file' (in all capitals) for each header file given in its arguments that exists.
        • 'AC_DEFINE', which defines a C preprocessor symbol, typically to record the result of a feature test.
        • 'AC_COMPILE_IFELSE', which tries to compile a test program, for example to check a syntax feature of a particular language's compiler or to try some library feature.
        • 'AC_LINK_IFELSE', which compiles and links a test program, for example to check for functions and global variables.
    • Modifying 'Makefile.am'
      • Add the program to 'EXTRA_PROGRAMS'
      • Define variables if necessary:
        • 'omindex_yourlibrary_SOURCES'
        • 'omindex_yourlibrary_LDADD'
        • 'omindex_yourlibrary_CPPFLAGS'
  • Add a new worker for the MIME type to omindex. (3 days)
    • This can be done in the function 'add_default_libraries' in 'index_file.cc'.
    • The preprocessor symbol defined by 'configure.ac', 'HAVE_header-file', will be used here: if the symbol is defined, a new worker will be created.
    • Compile the code to check that it builds. If the modifications are correct, a new executable 'omindex_yourlibrary' will be present in the working directory.
  • Testing and Evaluation (6 days)
    • Add unit tests and individual tests for the library. Unit testing here means that I will find some zip files that are licensed for free redistribution and verify that the shared library handler works correctly on them.
    • Test Omega. Omega's test suite can be run in the same way as xapian-core's, with 'make check' within the 'omega' directory. It runs several small tests such as 'atomparsetest', 'htmlparsetest', 'utf8converttest', etc.
  • Make changes based on feedback and discussion during each of the above steps.
  • Submit a PR and get it merged.
  • Submit Evaluation for Phase 1.
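
The handler and testing steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the 'extract' signature here is hypothetical (the real declaration lives in 'xapian-applications/omega/handler.h' and may take different arguments), the "library" is replaced by a toy reader that treats the first line of a file as the title, and 'check_extract_on_sample' is an invented helper name.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical signature, modelled on the 'dump', 'title' and 'author'
// arguments mentioned above. The real one is declared in handler.h.
bool extract(const std::string& filename,
             std::string& dump,
             std::string& title,
             std::string& author,
             std::string& error)
{
    std::ifstream in(filename, std::ios::binary);
    if (!in) {
        error = "Couldn't open " + filename;
        return false;
    }
    // A real handler would call into the library here. This stand-in
    // takes the first line as the title and the rest as the text dump.
    std::getline(in, title);
    std::ostringstream rest;
    rest << in.rdbuf();
    dump = rest.str();
    author.clear();  // the toy format carries no author metadata
    return true;
}

// Invented helper in the spirit of the testing step: run the handler
// on a known-good sample file and check that it produced something.
void check_extract_on_sample(const std::string& path)
{
    std::string dump, title, author, error;
    bool ok = extract(path, dump, title, author, error);
    assert(ok && "handler should succeed on a known-good sample");
    assert(!title.empty());
}
```

Reporting failures through an 'error' argument and a boolean return, rather than aborting, is a design choice made for this sketch so that the calling process can decide how to handle a file the library cannot read.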

Phase 2: (June 29 - July 27)

  • libarchive adds support for a variety of archive and compression formats, including gzip, bzip2, xz, lzip, etc., which shows how versatile and useful it is to add. As explained in the Modules section of the proposal, I noticed that there is little support for e-book and publishing file formats. libe-book supports various file extensions such as .epub, .pdb, .fb2, .zvr, etc. The wiki page at https://wiki.documentfoundation.org/DLP/Libraries#Import_Libs provides the complete list.
  • Libraries proposed:
    • (to be decided) (3 weeks, estimated)
      • Create a handler to access the library (3 days)
      • Update the build system. (4 days)
      • Add worker to omindex. (3 days)
      • Compile, test, and document. (5 days)
    • libpagemaker (https://wiki.documentfoundation.org/DLP/Libraries/libpagemaker): a library that parses the file format of Aldus/Adobe PageMaker documents. (1 week)
  • Each of these will require me to create their individual handlers, update the build system, and add new workers to omindex in a similar manner.

Phase 3: (July 27 - August 24)

  • In this phase, I will focus on adding support for libraries that handle file formats for digital drawing and graphics. Specifically, I intend to focus on libzmf, libfreehand, and libcdr. The choice of these libraries is subject to approval from the community during the initial bonding and discussion phase.
  • The procedure for adding each of these libraries is expected to be similar and should follow the method described in detail in Phase 1 for libarchive.
  • Libraries proposed:
    • libcdr (https://wiki.documentfoundation.org/DLP/Libraries/libcdr): a library for CorelDRAW files, covering file formats such as .cdr and .cmx. (8 days)
    • libfreehand (https://wiki.documentfoundation.org/DLP/Libraries/libfreehand): a library for Adobe FreeHand files. (6 days)
    • libzmf (https://wiki.documentfoundation.org/DLP/Libraries/libzmf): an import library for Zoner Callisto/Draw files, covering file extensions such as .zmf. (6 days)
    • Each of these will require me to create their individual handlers, update the build system, and add new workers to omindex in a similar manner.
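
Adding one worker per library "in a similar manner" might look something like the sketch below. Everything specific in it is an assumption made for illustration: the 'HAVE_LIBCDR', 'HAVE_LIBFREEHAND' and 'HAVE_LIBZMF' symbols stand in for whatever 'configure.ac' would actually define, the MIME types are placeholders rather than the real types for these formats, the executable names simply follow the 'omindex_yourlibrary' pattern, and the function is not the real 'add_default_libraries' from 'index_file.cc'.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins for symbols that configure.ac would define (for example
// via AC_DEFINE). They are set by hand here so the sketch compiles.
#define HAVE_LIBCDR 1
#define HAVE_LIBZMF 1
// HAVE_LIBFREEHAND is left undefined to show a skipped worker.

// Maps a MIME type to the name of the helper executable handling it.
using WorkerMap = std::map<std::string, std::string>;

// Hypothetical counterpart of add_default_libraries(): register a
// worker only when the corresponding library was found at configure
// time. The MIME types below are placeholders.
void add_default_libraries_sketch(WorkerMap& workers)
{
#ifdef HAVE_LIBCDR
    workers["application/x-example-cdr"] = "omindex_libcdr";
#endif
#ifdef HAVE_LIBFREEHAND
    workers["application/x-example-freehand"] = "omindex_libfreehand";
#endif
#ifdef HAVE_LIBZMF
    workers["application/x-example-zmf"] = "omindex_libzmf";
#endif
}

// Returns the worker program for a MIME type, or an empty string
// when no library provided one.
std::string find_worker(const WorkerMap& workers,
                        const std::string& mimetype)
{
    assert(!mimetype.empty());
    auto it = workers.find(mimetype);
    return it == workers.end() ? std::string() : it->second;
}
```

With the definitions above, two workers are registered and a lookup for the FreeHand placeholder type returns an empty string, which is how a build without that library would behave in this sketch.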

Final Week: (August 24 - August 31)

  • This is a buffer week for completing any pending tasks.
  • Submit the final evaluations and prepare the final report.

Last modified on 05/05/20 14:35:01