Project Work Product
I have been working on my forks of the Xapian project and the Xapian-evaluation module on GitHub.
To learn more about this project in detail, I recommend visiting the project's main page.
Xapian
Merged
The following lists the work that has been merged in xapian master:
- API support for BM25+ weighting scheme as Xapian::BM25PlusWeight.
- PL2+ weighting scheme support as Xapian::PL2PlusWeight.
- Dir+ weighting scheme support as DIRICHLET_PLUS_SMOOTHING method in Xapian::LMWeight.
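For readers unfamiliar with the "+" variants, the idea behind BM25+ can be sketched outside of Xapian. The snippet below is a minimal illustration of the textbook term-frequency component only (the IDF factor is left out), and it is not the code merged into Xapian. The function names bm25_tf and bm25plus_tf and the default values for k1, b and delta are assumptions chosen for illustration.

```cpp
#include <cassert>

// Hypothetical helper, not part of Xapian: textbook BM25 term-frequency
// component for one query term in one document (IDF factor omitted).
//   tf       - occurrences of the term in the document
//   doc_len  - length of the document
//   avg_len  - average document length in the collection
double bm25_tf(double tf, double doc_len, double avg_len,
               double k1 = 1.0, double b = 0.5) {
    assert(avg_len > 0.0);
    // Length normalisation: long documents get a larger denominator.
    double norm = 1.0 - b + b * doc_len / avg_len;
    return (k1 + 1.0) * tf / (k1 * norm + tf);
}

// Hypothetical helper, not part of Xapian: BM25+ adds a constant delta
// whenever the term occurs at all, which lower-bounds the contribution
// of a matching term even in a very long document.
double bm25plus_tf(double tf, double doc_len, double avg_len,
                   double k1 = 1.0, double b = 0.5, double delta = 1.0) {
    if (tf <= 0.0) return 0.0;  // no match, no bonus
    return bm25_tf(tf, doc_len, avg_len, k1, b) + delta;
}
```

With these illustrative defaults, a single occurrence in a document 100 times longer than average contributes less than 0.05 under BM25 but more than 1 under BM25+, which is the over-penalisation of long documents that the lower bound is meant to counter.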
Link to the commits: https://github.com/xapian/xapian/commits/master?author=ivmarkp (Note: Commits before August 14th are from the application period.)
In progress
The following work is in progress:
- Piv+ normalisation support in Xapian::TfIdfWeight.
Link to opened PR: https://github.com/xapian/xapian/pull/115
Xapian-evaluation
Merged
- Support for the evaluation of BM25+ weighting scheme.
Link to the commits: https://github.com/samuelharden/xapian-evaluation/commits/master?author=ivmarkp
In progress
The following PRs enable the evaluation of the corresponding weighting schemes in the xapian-evaluation module and are under review:
- PR #11 - PL2 weighting scheme.
- PR #12 - PL2+ weighting scheme.
- PR #13 - Dir+ (DIRICHLET_PLUS_SMOOTHING) smoothing method.
- PR #14 - Tf-Idf normalisations including pivoted normalisation (Piv+).
Evaluation run results
The weighting schemes mentioned above were evaluated with their parameters set to the default values, which are based on the experimental observations in the referenced paper. The dataset used for the evaluation runs was obtained from the FIRE team. It is a news collection from two major news providers, with articles from several categories such as Sports, Business and Politics, collected within a specific period of time. All the evaluation results have been put together in a GitHub Gist and can be accessed here. A brief comparison based on the evaluation results follows.
Comparing the MAP (mean average precision) to assess the retrieval effectiveness:
- BM25+: 0.100415 and BM25: 0.101771
BM25 does a slightly better job here, but the dataset we are using is a news collection, and the paper notes that "the MAP improvements of BM25+ over BM25 are much larger on Web collections than on the news collection. In particular, the MAP improvements on all Web collections are statistically significant". So I think we can expect better results from BM25+ if we get a chance to run evaluations on a web collection dataset in the future.
- PL2+: 0.0781953 and PL2: 0.0752646
PL2+ does a better job of retrieving relevant documents, although by a small margin. I believe this could translate into better results at scale in practical use, and PL2+ may show a larger improvement over PL2 if we are able to run evaluations on a web collection dataset as well.
- LMWeight_Dirplus: 0.100168 and LMWeight_Dir: 0.100168
LMWeight with Dir and Dir+ smoothing retrieved the same number of relevant documents, and the MAP was identical for both.
Note: the evaluation of Piv+ normalization will be done once it has been merged into Xapian master.
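As a reminder of what the MAP figures above measure, here is a minimal sketch of the usual definition of mean average precision. It is not the code used by the evaluation module. The names QueryRun, average_precision and mean_average_precision are hypothetical, and the sketch assumes each ranked list contains no duplicate document ids.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical container for one query: the ranked result list and the
// set of documents judged relevant for that query.
struct QueryRun {
    std::vector<std::string> ranked;
    std::set<std::string> relevant;
};

// Average precision for one query: sum precision@k at every rank k where
// a relevant document appears, then divide by the total number of
// relevant documents (so relevant documents never retrieved count as 0).
double average_precision(const std::vector<std::string>& ranked,
                         const std::set<std::string>& relevant) {
    if (relevant.empty()) return 0.0;
    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < ranked.size(); ++i) {
        if (relevant.count(ranked[i]) > 0) {
            ++hits;
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    return sum / static_cast<double>(relevant.size());
}

// MAP: the mean of the per-query average precision values.
double mean_average_precision(const std::vector<QueryRun>& runs) {
    if (runs.empty()) return 0.0;
    double total = 0.0;
    for (const QueryRun& run : runs) {
        total += average_precision(run.ranked, run.relevant);
    }
    return total / static_cast<double>(runs.size());
}
```

For instance, a ranking d1, d2, d3, d4 with relevant documents d1 and d3 gives an average precision of (1/1 + 2/3) / 2, roughly 0.83. Because the score depends on the ranks at which relevant documents appear, two schemes can retrieve the same number of relevant documents and still differ in MAP.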
Xapian-docsprint
I worked on the user guide documentation in this fork, adding the newer weighting schemes along with some older ones.
To be merged later
- PR #14 - This PR updates the documentation with these weighting schemes: BM25+, PL2, PL2+, Dir+ and Tf-Idf (with Piv+ normalization).