wiki:GSoC2020/WeightingSchemes/Work Product

Project Work Product

The aim was to add more normalisations to the Tf-Idf Weight. To do this, the API of class Tf-Idf Weight was modified to use enums while maintaining backwards compatibility (previously, three-character strings were used). Some normalisations require the stat "wdfdocmax" (the maximum wdf in a document), so its extraction was also included as part of the project.
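To illustrate how an enum-based API can stay compatible with the older three-character strings, here is a minimal Python sketch. It is not the Xapian implementation: the class names `WdfNorm`, `IdfNorm` and `WtNorm` and the helper `parse_legacy()` are hypothetical, and the letter codes shown are a small illustrative subset that should be checked against the Xapian documentation.

```python
from enum import Enum


# Hypothetical enums: one per position in the legacy three-character string.
class WdfNorm(Enum):
    NONE = "n"
    BOOLEAN = "b"
    LOG = "l"


class IdfNorm(Enum):
    NONE = "n"
    TFIDF = "t"


class WtNorm(Enum):
    NONE = "n"


def parse_legacy(normalizations: str):
    """Translate a legacy string such as "ntn" into a tuple of enum values."""
    if len(normalizations) != 3:
        raise ValueError("expected exactly three characters")
    wdf_char, idf_char, wt_char = normalizations
    # Looking an Enum up by value raises ValueError for unknown characters.
    return WdfNorm(wdf_char), IdfNorm(idf_char), WtNorm(wt_char)
```

With a helper like this, a string-based constructor can simply forward to the enum-based one, so existing callers that pass strings keep working while new callers use the enums directly.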

For more details about the project (e.g. project plan, timeline, etc.), please visit the project's main page.

Pull Requests and Commits

The main pull requests of the project are:

  • Enum parameters for Tf-Idf Weight #302.
  • Support more normalisations #298, #308 and #312.
  • Extract stat "wdfdocmax" #309.
  • Pass "wdfdocmax" to get_sumextra() #310.
  • Update create_from_parameters() for Tf-Idf Weight #314.
  • Add user guide documentation for new normalisations #30.
  • Update README to make it clearer #17.

Link to merged commits in xapian.

Work in progress

Currently, I am working on getting the exact number of unique terms in a document. This can be done by storing the unique terms count in chunked streams, as we do for doclength.

Future Work

  • We need a better plan to extract doc-based stats (like doclength, wdfdocmax, unique terms) compatibly.
  • It would be good to have tests to ensure that the parameters passed to get_sumextra() have the correct values.

For details on Future Work, please visit the Future ideas page.

Evaluation run results

The recently implemented normalisations were evaluated using the FIRE dataset and xapian-evaluation.

A brief summary of the results follows. Each entry lists the wdf, idf and weight normalisations, followed by the mean average precision and the mean relevance precision, respectively.
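For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed for a single query and then averaged over queries. It assumes that "mean relevance precision" refers to R-precision averaged over queries. The function names are hypothetical and this is not the xapian-evaluation code.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision within the top R results, where R = number of relevant docs."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


def mean(values):
    """Arithmetic mean over per-query scores (0.0 for an empty run)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For example, with the ranking `["d1", "d2", "d3", "d4"]` and relevant set `{"d1", "d3"}`, the average precision is (1/1 + 2/3) / 2 and the R-precision is 1/2.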

For easy comparison:

  1. NONE NONE NONE 0.0402904 0.0590923
  2. NONE TFIDF NONE 0.0653683 0.100065

Recently added Normalisations:

  1. AUG TFIDF NONE 0.0994039 0.138121
  2. AUG_AVERAGE TFIDF NONE 0.105225 0.151118
  3. AUG_LOG TFIDF NONE 0.109523 0.149767
  4. MAX TFIDF NONE 0.0490318 0.0753716
  5. NONE GLOBAL_FREQ NONE 0.041572 0.0579738
  6. NONE INCREMENTED_GLOBAL_FREQ NONE 0.0417351 0.0578799
  7. NONE LOG_GLOBAL_FREQ NONE 0.0411464 0.0561072
  8. NONE SQRT_GLOBAL_FREQ NONE 0.0417998 0.0566856
  9. SQRT TFIDF NONE 0.108113 0.151806
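To show why a per-document maximum wdf ("wdfdocmax") is needed by some of these normalisations, here is a minimal Python sketch of the textbook max-tf and augmented-tf forms. It is an assumption that MAX and AUG follow these forms; the exact formulas and constants used in Xapian may differ, and the function names are hypothetical.

```python
from collections import Counter


def wdf_stats(terms):
    """Return per-term wdf counts and the largest wdf in the document."""
    counts = Counter(terms)
    wdfdocmax = max(counts.values(), default=0)
    return counts, wdfdocmax


def wdf_max(wdf, wdfdocmax):
    """Textbook max-tf normalisation: wdf scaled by the document's largest wdf."""
    return wdf / wdfdocmax if wdfdocmax else 0.0


def wdf_aug(wdf, wdfdocmax):
    """Textbook augmented-tf normalisation: 0.5 + 0.5 * wdf / wdfdocmax."""
    if wdf == 0 or wdfdocmax == 0:
        return 0.0
    return 0.5 + 0.5 * wdf / wdfdocmax
```

For a document with the terms `a a a a b`, wdfdocmax is 4, so the term `b` (wdf 1) gets 0.25 under the max-tf form and 0.625 under the augmented form. Both need the maximum wdf of the whole document, which is why that stat has to be available at weighting time.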

For more details on the evaluation run, click here.

Last modified on 28/08/20 11:54:13