Project Work Product

The aim was to add more normalisations to the Tf-Idf Weight. To do this, the API of the Tf-Idf Weight class had to be modified to use enums while maintaining backwards compatibility (previously, three-character strings were used). Some normalisations require the statistic "wdfdocmax" (the maximum wdf in a document), so its extraction was also included as part of the project.
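As a rough illustration of the idea, here is a minimal Python sketch of enum parameters alongside a backwards-compatible three-character string form, plus a helper for the maximum wdf in a document. It is not the Xapian implementation: the names WdfNorm, IdfNorm, WtNorm, TfIdfSketch, from_string() and the wdfdocmax() helper are hypothetical, and the letter codes and defaults are assumptions chosen for illustration.

```python
from enum import Enum


# Hypothetical enums for illustration only; not Xapian's actual definitions.
class WdfNorm(Enum):
    NONE = "n"
    BOOLEAN = "b"
    SQUARE = "s"
    LOG = "l"


class IdfNorm(Enum):
    NONE = "n"
    TFIDF = "t"
    PROB = "p"


class WtNorm(Enum):
    NONE = "n"


class TfIdfSketch:
    """Toy stand-in for a weighting class; it only stores its parameters."""

    def __init__(self, wdf=WdfNorm.NONE, idf=IdfNorm.TFIDF, wt=WtNorm.NONE):
        self.wdf, self.idf, self.wt = wdf, idf, wt

    @classmethod
    def from_string(cls, code):
        """Backwards-compatible entry point taking a three-character string."""
        if len(code) != 3:
            raise ValueError("expected exactly three characters")
        # Enum lookup by value raises ValueError for an unknown letter.
        return cls(WdfNorm(code[0]), IdfNorm(code[1]), WtNorm(code[2]))


def wdfdocmax(wdfs):
    """Maximum wdf over the terms of one document (0 for an empty document)."""
    return max(wdfs.values(), default=0)


if __name__ == "__main__":
    old_style = TfIdfSketch.from_string("ltn")
    new_style = TfIdfSketch(WdfNorm.LOG, IdfNorm.TFIDF, WtNorm.NONE)
    print(old_style.wdf is new_style.wdf)
    print(wdfdocmax({"apple": 3, "pie": 7}))
```

Keeping the string form as a thin wrapper around the enum constructor means both call styles share one code path, which is one plausible way to preserve compatibility.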

For more details about the project (e.g. project plan, timeline), please visit the project's main page.

Pull Requests and Commits

The main pull requests of the project are:

Enum parameters for Tf-Idf Weight #302.
Support more normalisations #298, #308 and #312.
Extract stat "wdfdocmax" #309.
Pass "wdfdocmax" to get_sumextra() #310.
Update create_from_parameters() for Tf-Idf Weight #314.
Add user guide documentation for new normalisations #30.
Update README to make it clearer #17.

Link to merged commits in xapian.

Work in progress

Currently, I am working on getting the exact value of the unique terms statistic. This can be done by storing unique terms in chunked streams, as we do for doclength.

Future Work

We need a better plan to extract doc-based stats (like doclength, wdfdocmax, unique terms) compatibly.
It would be good to have tests to ensure that the parameters passed to get_sumextra() have the correct values.

For details on future work, please visit the Future ideas page.

Evaluation run results

The recently implemented normalisations were evaluated using the FIRE dataset and xapian-evaluation.

A brief summary of the results follows. Each row gives the wdf, idf and weight normalisations, followed by the mean average precision and the mean relevance precision.

For easy comparison:

wdf          idf                      wt    Mean average precision  Mean relevance precision
NONE         NONE                     NONE  0.0402904               0.0590923
NONE         TFIDF                    NONE  0.0653683               0.100065

Recently added normalisations:

wdf          idf                      wt    Mean average precision  Mean relevance precision
AUG          TFIDF                    NONE  0.0994039               0.138121
AUG_AVERAGE  TFIDF                    NONE  0.105225                0.151118
AUG_LOG      TFIDF                    NONE  0.109523                0.149767
MAX          TFIDF                    NONE  0.0490318               0.0753716
NONE         GLOBAL_FREQ              NONE  0.041572                0.0579738
NONE         INCREMENTED_GLOBAL_FREQ  NONE  0.0417351               0.0578799
NONE         LOG_GLOBAL_FREQ          NONE  0.0411464               0.0561072
NONE         SQRT_GLOBAL_FREQ         NONE  0.0417998               0.0566856
SQRT         TFIDF                    NONE  0.108113                0.151806
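
To make the comparison easier to reproduce, the short Python sketch below ranks the figures listed above and expresses each as a relative change against a reference row. The numbers are copied from the results above. Treating NONE TFIDF NONE as the reference is an assumption made for illustration, and the names RESULTS, BASELINE and relative_change() are hypothetical.

```python
# (wdf, idf, wt) -> (mean average precision, mean relevance precision),
# copied from the evaluation results listed above.
RESULTS = {
    ("NONE", "NONE", "NONE"): (0.0402904, 0.0590923),
    ("NONE", "TFIDF", "NONE"): (0.0653683, 0.100065),
    ("AUG", "TFIDF", "NONE"): (0.0994039, 0.138121),
    ("AUG_AVERAGE", "TFIDF", "NONE"): (0.105225, 0.151118),
    ("AUG_LOG", "TFIDF", "NONE"): (0.109523, 0.149767),
    ("MAX", "TFIDF", "NONE"): (0.0490318, 0.0753716),
    ("NONE", "GLOBAL_FREQ", "NONE"): (0.041572, 0.0579738),
    ("NONE", "INCREMENTED_GLOBAL_FREQ", "NONE"): (0.0417351, 0.0578799),
    ("NONE", "LOG_GLOBAL_FREQ", "NONE"): (0.0411464, 0.0561072),
    ("NONE", "SQRT_GLOBAL_FREQ", "NONE"): (0.0417998, 0.0566856),
    ("SQRT", "TFIDF", "NONE"): (0.108113, 0.151806),
}

# Assumption: use the NONE TFIDF NONE row as the reference point.
BASELINE = ("NONE", "TFIDF", "NONE")


def relative_change(key, metric=0):
    """Fractional change of one metric (0 or 1) relative to the baseline row."""
    base = RESULTS[BASELINE][metric]
    return (RESULTS[key][metric] - base) / base


# Rows with the highest value of each metric.
best_map = max(RESULTS, key=lambda k: RESULTS[k][0])
best_mrp = max(RESULTS, key=lambda k: RESULTS[k][1])

if __name__ == "__main__":
    # Print rows from highest to lowest mean average precision.
    for key in sorted(RESULTS, key=lambda k: RESULTS[k][0], reverse=True):
        print(" ".join(key), f"{relative_change(key):+.1%}")
```

With these figures, AUG_LOG has the highest mean average precision and SQRT the highest mean relevance precision.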

For more details on the evaluation run, click here.

Last modified on 28/08/20 11:54:13
