
Project Work Product

I have been working on my forks of the Xapian project and the xapian-evaluation module on GitHub.

To learn more about this project in detail, I recommend visiting the project's main page.

Xapian

Merged

The following work has been merged into Xapian master:

  • API support for BM25+ weighting scheme as Xapian::BM25PlusWeight.
  • PL2+ weighting scheme support as Xapian::PL2PlusWeight.
  • Dir+ weighting scheme support as DIRICHLET_PLUS_SMOOTHING method in Xapian::LMWeight.

Link to the commits: https://github.com/xapian/xapian/commits/master?author=ivmarkp (Note: Commits before August 14th are from the application period.)
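To illustrate what the "+" variants change, here is a minimal sketch of the per-term BM25 score with the lower-bound term that BM25+ adds, following the general formula in the referenced paper. It is not Xapian's implementation: the function name `bm25_term_weight`, the pre-computed `idf` argument and the values `k1=1.2`, `b=0.75` are assumptions chosen for illustration only.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75, delta=0.0):
    """Per-term BM25 score with an optional lower bound `delta`.

    delta == 0 gives the classic BM25 term-frequency component;
    delta > 0 adds the BM25+ lower bound for terms that occur in the document.
    All parameter values here are illustrative, not Xapian's defaults.
    """
    if tf <= 0:
        # Terms absent from the document contribute nothing in either variant.
        return 0.0
    # Document length normalisation: long documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * ((k1 + 1.0) * tf / (norm + tf) + delta)


# A single occurrence in a very long document: plain BM25 scores it close
# to zero, while the lower bound keeps it clearly above a non-matching document.
plain = bm25_term_weight(tf=1, doc_len=10000, avg_doc_len=100, idf=2.0)
plus = bm25_term_weight(tf=1, doc_len=10000, avg_doc_len=100, idf=2.0, delta=1.0)
print(plain, plus)
```

The difference between the two scores is always `idf * delta` for a matching term, which is the guaranteed minimum gap between a document that contains the term and one that does not.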

In progress

The following work is in progress:

  • Piv+ normalisation support in Xapian::TfIdfWeight.

Link to the open PR: https://github.com/xapian/xapian/pull/115

Xapian-evaluation

Merged

  • Support for the evaluation of BM25+ weighting scheme.

Link to the commits: https://github.com/samuelharden/xapian-evaluation/commits/master?author=ivmarkp

In progress

The following PRs enable evaluation of the corresponding weighting schemes in the xapian-evaluation module and are under review:

  • PR #11 - PL2 weighting scheme.
  • PR #12 - PL2+ weighting scheme.
  • PR #13 - Dir+ (DIRICHLET_PLUS_SMOOTHING) smoothing method.
  • PR #14 - Tf-Idf normalisations including pivoted normalisation (Piv+).

Evaluation run results

The weighting schemes mentioned above were evaluated with their parameters set to default values, which are based on the experimental observations in the referenced paper. The dataset used for the evaluation runs was obtained from the FIRE team. It is a news collection from two major news providers, with articles from several categories such as Sports, Business and Politics, collected within a specific period of time. All the evaluation results have been put together in a GitHub Gist and can be accessed by visiting here. A brief comparison based on the evaluation results is as follows.

Comparing the MAP (mean average precision) to assess retrieval effectiveness:

  1. BM25+: 0.100415 and BM25: 0.101771

BM25 does slightly better here (a MAP difference of about 0.0014), but the dataset we are using is a news collection, and the paper notes that "the MAP improvements of BM25+ over BM25 are much larger on Web collections than on the news collection. In particular, the MAP improvements on all Web collections are statistically significant". So I think we can expect better results from BM25+ if we get a chance to run evaluations on a web collection dataset in the future.

  2. PL2+: 0.0781953 and PL2: 0.0752646

PL2+ does a better job at retrieving relevant documents, although by a small margin (a MAP difference of about 0.0029). I believe the gain should be more visible at scale in practical use, and PL2+ may show more improvement over PL2 if we are able to run evaluations on a web collection dataset as well.

  3. LMWeight_Dirplus: 0.100168 and LMWeight_Dir: 0.100168

LMWeight with Dir and Dir+ smoothing retrieved the same number of relevant documents, hence the MAP is the same for both.
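For readers unfamiliar with the metric, the following is a minimal sketch of how MAP is commonly computed from ranked results and relevance judgements. It is not the code used by xapian-evaluation: the function names, the dictionary-based input format and the sample document ids are hypothetical, and it assumes that each ranking contains no duplicate ids and that a query with no relevant documents scores 0.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k taken at
    each rank k where a relevant document appears, divided by the total
    number of relevant documents for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        # Assumption: a query without relevant documents scores 0.
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP over all judged queries.

    runs:  {query_id: ranked list of document ids}
    qrels: {query_id: set of relevant document ids}
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


# Hypothetical toy data: two queries with made-up document ids.
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}
print(mean_average_precision(runs, qrels))  # 0.75
```

Because each relevant document contributes the precision at its rank, MAP rewards a scheme for placing relevant documents earlier in the ranking as well as for retrieving them.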

Note: the Piv+ normalisation evaluation will be done after it has been merged into Xapian master.

Xapian-docsprint

I worked on the user guide documentation in this fork, adding the newer weighting schemes along with some older ones.

To be merged later

  • PR #14 - This PR updates the documentation with these weighting schemes: BM25+, PL2, PL2+, Dir+ and Tf-Idf (with Piv+ normalisation).
Last modified on 23/08/16 03:31:56