Opened 7 years ago
Closed 6 years ago
#763 closed enhancement (fixed)
Track unique term bounds for documents in the collection
Reported by: | Guruprasad Hegde | Owned by: | Guruprasad Hegde |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | Library API | Version: | |
Severity: | normal | Keywords: | |
Cc: | Olly Betts, Gaurav Arora | Blocked By: | |
Blocking: | Operating System: | All |
Description
As mentioned in ticket #756, tracking lower bound and upper bound on unique terms is needed to implement some weight metric accurately.
Raising this ticket to track the implementation of the above-mentioned stat support.
Change History (11)
comment:1 by , 6 years ago
comment:2 by , 6 years ago
The tracking should look a lot like the tracking for the document length bounds. These are stored in the "version" file - i.e. iamglass
for glass. See doclen_lbound
and doclen_ubound
in glass_version.cc
.
Unfortunately that code checks that there's no undecoded data after it has decoded the stats we know about, so we can't just add new stats and have older versions ignore them. In hindsight we should have omitted that check so we could add new stats.
So probably we don't implement this for glass, and instead implement it for honey, which is the next generation backend but still in development. But honey doesn't yet support updating databases - currently you have to compact a glass database to create a honey one, so implementing this for honey without implementing it for glass means that the compacting code which converts from glass to honey needs to calculate these bounds as it loops over all the documents - probably as it does the termlist table.
That code is in backends/honey/honey_compact.cc
, line 1866 currently. That loop needs to count how terms have a non-zero wdf to get the number of unique terms in each document, and then track lower and upper bounds on that as we work through the table (the lower bound should ignore 0, since such documents won't be involved in weighted queries). And then store those in the iamhoney
file.
comment:3 by , 6 years ago
In hindsight we should have omitted that check so we could add new stats.
It's too late to fix that for glass, but I've dropped that check for honey now (in 004e8c1045ef199cfca54274704dfc47e3762d27) so at least we don't end up in this situation again if we want to add new stats once honey's format becomes stable.
comment:4 by , 6 years ago
Thanks olly for the clear details. I have gone through the code a few times and got some idea about what's happening.
Fn call for compacting termlist table line 2234: merge_docid_keyed(out, inputs, offset, t->type);
Should I pass version_file_out
reference to the above function? So I can set values for newly added data members in HoneyVersion
class and then call write()
method.
comment:5 by , 6 years ago
Another question: Once updating Honey database is available (I guess it will be HoneyWritableDatabase
), all the stats will be updated within that class. Hence this fix (within compact()
code) is just a temporary one right? What happens when compact()
is called when source database is Honey
, is it allowed?
comment:6 by , 6 years ago
You definitely should not be calling write()
there - the new version file should only get written once, after everything else in the compaction run is complete. Writing it earlier is (a) wasted effort and (b) makes it appear there's a working database there when it's actually incomplete - especially bad if the run is interrupted at that point (e.g. run xapian-compact
and hit Ctrl+C).
You could pass in a reference or pointer to version_file_out
. Or pass the min and max bound variables by reference.
HoneyWritableDatabase
(or similar) will also need to track these stats, but the fix here isn't just temporary - it's useful to be able to efficiently convert a database to from the old format to the new one, so I think we will keep that feature.
You can compact a honey database to another honey database (currently there are bugs in some cases, but I think a simple compaction works). There it should just copy the bounds from the source database (or merge the bounds if there are multiple sources - i.e. take the minimum of the lower bounds and the maximum of the upper bounds, taking care to skip any sources where unique_terms is always zero).
comment:7 by , 6 years ago
After adding new stats as data members in HoneyVersion
class, I added getters and updated merge_stats()
method.
Problem is in honey_compact.cc
merge_stats()
takes GlassDatabase
version stat getters as arguments which don't have getters for new stats. Hence gives error no matching function call.
I added a default value for new stat parameters. This is the correct fix?
follow-up: 9 comment:8 by , 6 years ago
Also Instead of using current_wdf
of each term, termlist_size
can be used to update bounds for the number of unique terms?
comment:9 by , 6 years ago
Can you show a patch of the changes you're talking about in comment:7?
Replying to gp1308:
Also Instead of using
current_wdf
of each term,termlist_size
can be used to update bounds for the number of unique terms?
The correct count of unique terms ought to exclude those for which wdf == 0
, so to get that we'd need to actually look at current_wdf
- termlist_size
will often be more than the correct value.
However, currently for efficiency we approximate like this:
Xapian::termcount GlassTermList::get_unique_terms() const { LOGCALL(DB, Xapian::termcount, "GlassTermList::get_unique_terms", NO_ARGS); // get_unique_terms() really ought to only count terms with wdf > 0, but // that's expensive to calculate on demand, so for now let's just ensure // unique_terms <= doclen. RETURN(min(termlist_size, doclen)); }
So the bound here needs to based on the same thing, so it's actually a bound on the value that can return.
At some point it's likely we'll start storing the number of unique terms in a similar way to how we store the document length. That's probably not going to happen for glass now though, as it would be hard to start doing so compatibly.
comment:11 by , 6 years ago
Milestone: | → 1.5.0 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Merged in 8e1d29f6eadae96ea72fec56f168e75abaf1b45f.
olly, Can you please suggest how to start with this task?