Project Work Product
The aim was to add more normalisations to the Tf-Idf Weight. To support this, the API of the Tf-Idf Weight class was modified to take enums while maintaining backwards compatibility (previously, three-character strings were used). Some normalisations require the stat "wdfdocmax" (the maximum wdf in a document), so its extraction was also included in the project.
For more details about the project (e.g. project plan, timeline, etc.), please visit the project's main page.
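The enum-plus-legacy-string idea can be illustrated with a small, self-contained Python sketch. It is not the Xapian API: the class `TfIdfScheme`, the enum names and the `from_string()` helper are hypothetical, and the character codes in the lookup tables ("n" for none, "t" for TFIDF) are assumptions made for the example.

```python
from enum import Enum


class WdfNorm(Enum):
    # Illustrative subset only; names mirror the results listed on this page.
    NONE = 1
    AUG = 2
    MAX = 3


class IdfNorm(Enum):
    NONE = 1
    TFIDF = 2


class WtNorm(Enum):
    NONE = 1


# Assumed legacy codes: one character per component (wdf, idf, weight).
_LEGACY_WDF = {"n": WdfNorm.NONE}
_LEGACY_IDF = {"n": IdfNorm.NONE, "t": IdfNorm.TFIDF}
_LEGACY_WT = {"n": WtNorm.NONE}


class TfIdfScheme:
    """Hypothetical holder for the three normalisation choices."""

    def __init__(self, wdf=WdfNorm.NONE, idf=IdfNorm.TFIDF, wt=WtNorm.NONE):
        # New-style entry point: callers pass enum members directly.
        self.wdf, self.idf, self.wt = wdf, idf, wt

    @classmethod
    def from_string(cls, code):
        # Old-style entry point: translate a three-character string to enums.
        if len(code) != 3:
            raise ValueError("expected exactly three characters")
        try:
            return cls(_LEGACY_WDF[code[0]],
                       _LEGACY_IDF[code[1]],
                       _LEGACY_WT[code[2]])
        except KeyError as exc:
            raise ValueError(
                f"unknown normalisation code: {exc.args[0]!r}") from None
```

In this sketch, old call sites keep working through `TfIdfScheme.from_string("ntn")`, while new code can name normalisations that never had a character code, such as `TfIdfScheme(WdfNorm.AUG, IdfNorm.TFIDF, WtNorm.NONE)`.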
Pull Requests and Commits
The main pull requests of the project are:
- Enum parameters for Tf-Idf Weight #302.
- Support more normalisations #298, #308 and #312.
- Extract stat "wdfdocmax" #309.
- Pass "wdfdocmax" to get_sumextra() #310.
- Update create_from_parameters() for Tf-Idf Weight #314.
- Add user guide documentation for new normalisations #30.
- Update README to make it clearer #17.
Link to merged commits in xapian.
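To show why several of the pull requests above revolve around "wdfdocmax", here is a minimal Python sketch of two textbook normalisations that divide by the largest wdf in the document. The function names are hypothetical, and the formulas (including the 0.5 smoothing constant) are the commonly cited textbook forms, assumed here for illustration; the exact formulas used in Xapian may differ.

```python
def wdf_max(wdf, wdfdocmax):
    """Scale wdf by the largest wdf in the same document.

    Textbook "max tf" form (an assumption, not taken from the Xapian source).
    """
    return wdf / wdfdocmax


def wdf_aug(wdf, wdfdocmax):
    """Textbook "augmented tf" form with a 0.5 smoothing constant.

    The constant is an assumption made for this example.
    """
    return 0.5 + 0.5 * wdf / wdfdocmax


# wdf of each term in one example document.
wdfs = {"search": 4, "engine": 2, "index": 1}
wdfdocmax = max(wdfs.values())

for term, wdf in wdfs.items():
    print(term, wdf_max(wdf, wdfdocmax), wdf_aug(wdf, wdfdocmax))
```

Neither function can be evaluated from the wdf of a single term alone, which is why the statistic has to be extracted per document and made available to the weighting scheme.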
Work in progress
Currently, I am working on getting the exact number of unique terms. This can be done by storing unique terms in chunked streams, as we do for doclength.
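As a rough illustration of the per-document statistics involved, the following Python sketch derives all three from a tokenised document. The function name `doc_stats` and the dictionary keys are hypothetical, it assumes doclength is the sum of all wdf values, and it says nothing about how the values are stored on disk.

```python
from collections import Counter


def doc_stats(terms):
    """Compute per-document statistics from a list of term occurrences."""
    wdfs = Counter(terms)  # term -> within-document frequency (wdf)
    return {
        # Assumed here to be the sum of all wdf values in the document.
        "doclength": sum(wdfs.values()),
        # Largest wdf of any single term; 0 for an empty document.
        "wdfdocmax": max(wdfs.values(), default=0),
        # Number of distinct terms in the document.
        "uniqueterms": len(wdfs),
    }


stats = doc_stats("a b a c a b".split())
print(stats)
```

All three values depend only on the document itself, so they can be computed once at indexing time rather than at query time.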
Future Work
- We need a better plan to extract doc-based stats (like doclength, wdfdocmax, unique terms) compatibly.
- It would be good to have tests that ensure the parameters passed to get_sumextra() have the correct values.
For details on Future Work, please visit the Future ideas page.
Evaluation run results
The recently implemented normalisations were evaluated using the FIRE dataset and xapian-evaluation.
A brief summary of the results follows. The values after each normalisation are MEAN AVERAGE PRECISION and MEAN RELEVANCE PRECISION, respectively.
Existing normalisations, for easy comparison:
- NONE NONE NONE 0.0402904 0.0590923
- NONE TFIDF NONE 0.0653683 0.100065
Recently added normalisations:
- AUG TFIDF NONE 0.0994039 0.138121
- AUG_AVERAGE TFIDF NONE 0.105225 0.151118
- AUG_LOG TFIDF NONE 0.109523 0.149767
- MAX TFIDF NONE 0.0490318 0.0753716
- NONE GLOBAL_FREQ NONE 0.041572 0.0579738
- NONE INCREMENTED_GLOBAL_FREQ NONE 0.0417351 0.0578799
- NONE LOG_GLOBAL_FREQ NONE 0.0411464 0.0561072
- NONE SQRT_GLOBAL_FREQ NONE 0.0417998 0.0566856
- SQRT TFIDF NONE 0.108113 0.151806
For more details on the evaluation run, click here.
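For readers unfamiliar with the first metric, here is a minimal Python sketch of mean average precision using the standard definition. It is not the xapian-evaluation implementation; the function names and the document ids are made up for the example.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list.

    ranked:   document ids in the order the system returned them
    relevant: ids judged relevant for the query
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
print(mean_average_precision(runs))
```

Because the score rewards relevant documents that appear early in the ranking, a higher value in the lists above means the normalisation tended to rank relevant documents nearer the top.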