GSoC2014/Performance and Optimisation/Journal – Xapian

wiki:GSoC2014/Performance and Optimisation/Journal

Context Navigation

Community Bonding Week 2: April 28-May 4

date 1

entry 1

date 2

entry 2

Community Bonding Week 3: May 5-May 11

Community Bonding Week 4: May 12-May 19

19-05-2014

Plan for Indexing the imdb movie database

-> The movie database consists of multiple directories ,each directory containing multiple XML files.

-> Each XML file describes a movie.

-> The directories and the files follow a logical naming sequence and thus the entire movie database can be processed easily.

->A sample XML file is as follows:- https://trac.xapian.org/attachment/wiki/GSoC2014/Performance%20and%20Optimisation/Journal/44074.xml

-> Now, the major and the first step is to index all these XML files in a Xapian Database.

-> My plan is to treat each movie or XML file as a document.

-> Now, the thing is, the data in each XML file varies. Some files for example, have the plot description, some don't.

->So, I'm thinking of using term prefixes only for title of the movie and save all other tags in raw text form without any term prefix.

-> So, a user can do both free text searches or query on the title of the movie.

-> Now, Python has a very powerful XML parser called lxml which seems easy to use.

->So, what I'm planning to do is :-

Parse the XML files in Python and put the parsed data as a paragraph in a text file.

The output this will be a huge text file with one paragraph for each movie.

I'll then write the indexer in C++. I'll treat each paragraph as a document. The first line of the paragraph will be the title.

of the movie which will be indexed with a term prefix.All the other data will be indexed without term prefixes.

I'll be using a stemmer too and so that will also get timed.

The plan is to time each API function I'll call and write the timing output to a text file .(Format to be discussed once indexing test is ready).

Coding Week 1: May 20-May 25

Coding Week 2: May 26-June 1

Coding Week 3: June 2-June 8

Optimization of DFR and query expansion schemes

-> The optimization of DFR schemes will take place in the following steps. I will definitely finish it before midterm evaluation. :-

a.) Profile all the DFR schemes and document the output by kcachegrind

b.) Try to reduce the number of mathematical calculations, especially logarithms used in them.

c.) Tighten the bounds in get_maxpart() and get_maxextra() which are loose as of now.

d.) Profile all the DFR schemes again , document the output by kcachegrind and compare.

e.) Profile the DFR query expansion scheme and document the output.

f.) Reduce the number of mathematical calculations in them.

g.) Profile query expansion scheme again.

June 7

-> Sent a pull request for an optimized upper bound and weighting calculations for Inl2 and IfB2. -> Working on optimization of upper bound and weighting calculation in DPH.

June 8 ->Based on Olly's extremely good advice, I made changes to Inl2 and Ifb2 by calculating some complex sub expressions in Init() whhich depend on constant values. They are a part of the pull requests.

-> I also optimized IneB2 weighting scheme by :-

a.) Simplifying expressions and cancelling out common factors.

b.) Caclulating common sub-expressions in Init()

c.) Tightening the upper bound.

I have opened a pull request for IneB2 too.

-> I am now working on DPH. I think I have some new ideas after discussing the upper bound optimizations with Olly today.

-> I will also work on timing the automated tests for the old and new code for each scheme and also profile the new scheme with kcachegrind and document the output.

Coding Week 4: June 9-June 15

June 9

-> I've optimized the DPH code and sent a pull request for it.

-> I've tightened the upper bound with the help of some really subtle mathematical calculations and eliminating common and constant subexpressions by calculating them upfront.

->I've run the automated tests on the new DPH code and they work fine.

-> I also made changes to InL2,IfB2 and IneB2 and optimized them more. They can now be merged once I've worked on the feedback I get from the community. I think they are ready and no more optimization can be done on the code.

-> I'll now be working on optimizing DLH,PL2,BB2 and TfIdf.

June 10

-> I have made changes to DPH as per the feedback given by Olly.

-> I've also added a detailed documentation showing the logic and mathematical derivations of the calculations and terms involved in optimizing the upper bound.

-> I think DPH is ready for merging.

-> I have begun work on DLH and made some progress today. My aim is to finish working on DLH and BB2 by Friday/Saturday. These schemes are mathematically intense and so I need four days.

Coding Week 5: June 16-June 22

Coding Week 6: June 23-June 29 (Midterm deadline June 27)

Coding Week 7: June 30-July 6

Coding Week 8: July 7-July 13

Coding Week 9: July 14-July 20

Coding Week 10: July 21-July 27

Coding Week 11: July 28-August 3

Coding Week 12: August 4-August 10

Coding Week 13: August 11-August 18 (Final evaluation based on work up to August 18)

Last modified 10 years ago Last modified on 26/01/16 10:10:43

Attachments (1)

44074.xml (2.4 KB ) - added by Aarsh Shah 11 years ago. A sample xml file of the imdb movie database

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text