wiki:GSoC2020/TextExtraction/Journal

Community Bonding Week 1: May 4–May 10

6 May 2020

Researched Poppler's classes and how Poppler support was added to Omega.

7 May 2020

Read https://poppler.freedesktop.org/api/cpp/classpoppler_1_1document.html#ad9cc5b66e864e4f0f15024c4a56c1861 to understand the functions used in handler_poppler.cc

8 May 2020

Understood the changes made to configure.ac and Makefile.am for Poppler, read https://developer.gnome.org/anjuta-build-tutorial/stable/library-autotools.html.en, and set up Omega CGI - https://trac.xapian.org/wiki/OmegaExample

Community Bonding Week 2: May 11-May 17

11 May 2020

Started reading about libarchive - https://www.libarchive.org/ and https://github.com/libarchive/libarchive/wiki/Examples

14-15 May 2020

Compiled and installed libarchive 3.4.2; read further about ODF formats, how to use libarchive, and its documentation

16-17 May 2020

Started reading libarchive's internal documentation

Community Bonding Week 3: May 18-May 24

20 May 2020

Read

21 May 2020

Read

22 May 2020

Read the following files in libarchive's internal documentation:

  • archive_read.3.html
  • archive_read_data.3.html
  • archive_entry.3.html
  • archive_read_filter.3.html
  • archive_read_format.3.html
  • archive_read_new.3.html
  • archive_read_open.3.html
  • archive_read_set_options.3.html

Community Bonding Week 4: May 25-May 31

25 May 2020

Set up the coding environment, went through the coding conventions, and looked into how to extract data from content.xml specifically using libarchive

26 May 2020

Read and understand minitar.c (libarchive)

27 May 2020

Started coding for extracting data from OpenDocument format using libarchive

28 May 2020

Read

Extract data from content.xml and styles.xml

29 May 2020

Read about read() system call
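
As a reminder of the points that matter when using read(), here is a minimal sketch of a read loop. It assumes a blocking POSIX file descriptor; `read_all` is a hypothetical helper name for illustration, not a function from Omega or libarchive.

```cpp
#include <cerrno>
#include <string>
#include <unistd.h>

// Hypothetical helper (not from Omega): read from fd until end-of-file,
// appending everything to `out`. Returns false on a read error.
bool read_all(int fd, std::string& out)
{
    char buf[4096];
    while (true) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            // read() may return fewer bytes than requested; keep what we got.
            out.append(buf, static_cast<size_t>(n));
        } else if (n == 0) {
            // Zero means end-of-file (for a pipe: all writers have closed).
            return true;
        } else if (errno != EINTR) {
            // A real error; EINTR only means "interrupted, try again".
            return false;
        }
    }
}
```

The loop is needed because a single read() call can legitimately return a short count, so treating one call as "the whole input" tends to truncate data on pipes and sockets.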

Coding Week 1: June 1–June 7

1 June 2020

Completed handler_libarchive, and made the required additions in Makefile.am and configure.ac

2-3 June 2020

Solved errors regarding libarchive_sources and libpcre, and pushed the repo to the remote: https://github.com/Exter-dg/xapian/commit/9f0fdaf4ef600d4840f68b111534bc699d242644

4-5 June 2020

Tried to optimise the code and check for any errors. Read-

Coding Week 2: June 8-June 14

8-9 June 2020

  • Try and optimize the handler
  • Solve xapian-check-patch errors
  • Test omindex_libarchive on different systems.
  • Started working on AbiWord (.zabw / .abw.gz)

10-11 June 2020

12-14 June 2020

  • Try and solve errors in the handler and index_file.cc

Coding Week 3: June 15-June 21

15 - 19 June 2020

  • Work on the socketpair error in worker.cc, fix coding convention errors, and update the PR.
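
For context on the socketpair work, the following is a minimal sketch of a parent process talking to a forked child over a socketpair. It is not the code in worker.cc; `echo_via_child` is a hypothetical helper, and the sketch assumes a POSIX system and a message small enough to fit in the socket buffer.

```cpp
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not the code in worker.cc): send `msg` to a forked
// child over a socketpair and return whatever the child sends back.
// Returns an empty string on failure. Assumes `msg` is small.
std::string echo_via_child(const std::string& msg)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return std::string();

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return std::string();
    }

    if (child == 0) {
        // Child: use fds[1] only, echoing bytes back until the parent
        // shuts down its writing side.
        close(fds[0]);
        char buf[256];
        ssize_t n;
        while ((n = read(fds[1], buf, sizeof(buf))) > 0) {
            if (write(fds[1], buf, static_cast<size_t>(n)) != n)
                _exit(1);
        }
        _exit(n == 0 ? 0 : 1);
    }

    // Parent: use fds[0] only.
    close(fds[1]);
    std::string reply;
    if (write(fds[0], msg.data(), msg.size()) ==
        static_cast<ssize_t>(msg.size())) {
        // Half-close so the child's read() sees end-of-file.
        shutdown(fds[0], SHUT_WR);
        char buf[256];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof(buf))) > 0)
            reply.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);
    waitpid(child, nullptr, 0);
    return reply;
}
```

Each process closes the end it does not use; forgetting this is a common reason the other side never sees end-of-file. Calling `echo_via_child("ping")` should return the same string once the child has echoed it back.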

20-21 June 2020

Test with Apache OpenOffice documents.

Coding Week 4: June 22-June 28

22-24 June 2020

Used code from the omindextest PR #280 and created sample tests for the formats added using libarchive

25-26 June 2020

Create class omindexcheck and add functions

Coding Week 5: June 29-July 5 (first evaluation due July 3)

29 - 30 June 2020

Read about IPC and fix errors in omindexcheck and the handler.

1 - 2 July 2020

  • Add tests to PR #300
  • Work on omindexcheck
  • Search for any suitable library for LaTeX documents

3 - 4 July 2020

  • Work on mime-type modifications
  • Read about libspectre (PostScript).

Coding Week 6: July 6-July 12

6 - 9 July 2020

  • Complete and refactor mime-type modifications
  • Create MS 2007 files for testing
  • Work on completing and getting PR 300 merged
  • Read about the PostScript format

10 - 12 July 2020

  • Get PR 300 merged
  • Open and work on PR 303 - adding OOXML formats to libarchive

Coding Week 7: July 13-July 19

13 - 14 July 2020

  • Open PR 304 (Add mimetype to handlers)
  • Read about using libextractor as a potential library to extract metadata from audio and video files.
  • Discuss and update the project plan with the proposed libraries (libextractor)

15 - 19 July 2020

  • Make changes to PR 303 and PR 304 (closed)
  • Read libextractor's documentation
  • Solve issues regarding libextractor
  • Create handler_libextractor

Coding Week 8: July 20-July 26

  • Update handler_libextractor
  • Update PR 303
  • Find files for testing libextractor
  • Write tests for libextractor
  • Research new libraries/formats to be added.

Coding Week 9: July 27-August 2 (second evaluation due July 31)

27 July 2020

  • Read libabw and librevenge documentation
  • Work on handler_libabw

July 28 - July 30

  • Try and solve libextractor's test issue.
  • Complete handler_libabw

July 31 - August 2

  • Solve libabw's problem while extracting title.
  • Research the libraries for next phase

Coding Week 10: August 3-August 9

August 3 2020

  • Research WordPerfect and libwpd and understand how it functions.

August 4 - August 7 2020

  • Research CorelDRAW and the cdr format structure
  • Update PR 306 and PR 307
  • Work on handler_libcdr

August 9 2020

  • Make sample cdr and cmx files

Coding Week 11: August 10-August 16

August 10 - August 12 2020

  • Solve issues regarding libcdr and cdr format

August 13 - August 14 2020

  • Update PR 311
  • Research about Zoner DRAW and libzmf

Coding Week 12: August 17-August 23

August 17 - August 18 2020

  • Update PR 311
  • Read about libmwaw and work on handler_libmwaw

August 19 - August 23 2020

  • Solve issues and update PR 306 (libextractor)
  • Make changes for the test to be skipped in omindexcheck
  • Open PR 315 - libmwaw
  • Research about libzmf

Submit code and evaluations: August 24-August 31

August 24 - August 26 2020

  • Update PR 306 (change the implementation of testcase to allow it to store per-testcase flags)
  • Update PR 315
  • Create a draft for project report
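
The per-testcase flags idea mentioned for PR 306 can be illustrated roughly as follows. This is only a sketch under assumptions: every name here (`test_flag`, `SKIP_IF_NO_TERMS`, `testcase`, `check`) is hypothetical and chosen for illustration, and it is not the implementation in the PR.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// All names below are hypothetical and only illustrate the idea.
enum test_result { PASS, FAIL, SKIP };

enum test_flag : unsigned {
    NO_FLAGS = 0,
    // Treat "no terms indexed" as a skip rather than a failure, e.g. when
    // the extraction library could not handle the sample file.
    SKIP_IF_NO_TERMS = 1u << 0
};

struct testcase {
    std::vector<std::string> expected_terms;
    unsigned flags;  // bitwise OR of test_flag values
};

// Compare the terms actually indexed against the expected terms,
// honouring the flags stored on this particular testcase.
test_result check(const testcase& t, const std::vector<std::string>& indexed)
{
    if (indexed.empty() && (t.flags & SKIP_IF_NO_TERMS))
        return SKIP;
    for (const auto& term : t.expected_terms) {
        if (std::find(indexed.begin(), indexed.end(), term) == indexed.end())
            return FAIL;
    }
    return PASS;
}
```

Keeping the flags on each testcase, rather than in one global setting, lets a single test be skipped for one format without relaxing the checks for the others.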

Last modified on 26/08/20 03:13:16