Opened 6 years ago

Closed 6 years ago

#763 closed enhancement (fixed)

Track unique term bounds for documents in the collection

Reported by: Guruprasad Hegde Owned by: Guruprasad Hegde
Priority: normal Milestone: 1.5.0
Component: Library API Version:
Severity: normal Keywords:
Cc: Olly Betts, Gaurav Arora Blocked By:
Blocking: Operating System: All

Description

As mentioned in ticket #756, tracking lower bound and upper bound on unique terms is needed to implement some weight metric accurately.

Raising this ticket to track the implementation of the above-mentioned stat support.

Change History (11)

comment:1 by Guruprasad Hegde, 6 years ago

olly, Can you please suggest how to start with this task?

comment:2 by Olly Betts, 6 years ago

The tracking should look a lot like the tracking for the document length bounds. These are stored in the "version" file - i.e. iamglass for glass. See doclen_lbound and doclen_ubound in glass_version.cc.

Unfortunately that code checks that there's no undecoded data after it has decoded the stats we know about, so we can't just add new stats and have older versions ignore them. In hindsight we should have omitted that check so we could add new stats.

So probably we don't implement this for glass, and instead implement it for honey, which is the next generation backend but still in development. But honey doesn't yet support updating databases - currently you have to compact a glass database to create a honey one, so implementing this for honey without implementing it for glass means that the compacting code which converts from glass to honey needs to calculate these bounds as it loops over all the documents - probably as it does the termlist table.

That code is in backends/honey/honey_compact.cc, line 1866 currently. That loop needs to count how terms have a non-zero wdf to get the number of unique terms in each document, and then track lower and upper bounds on that as we work through the table (the lower bound should ignore 0, since such documents won't be involved in weighted queries). And then store those in the iamhoney file.

Last edited 6 years ago by Olly Betts (previous) (diff)

comment:3 by Olly Betts, 6 years ago

In hindsight we should have omitted that check so we could add new stats.

It's too late to fix that for glass, but I've dropped that check for honey now (in 004e8c1045ef199cfca54274704dfc47e3762d27) so at least we don't end up in this situation again if we want to add new stats once honey's format becomes stable.

comment:4 by Guruprasad Hegde, 6 years ago

Thanks olly for the clear details. I have gone through the code a few times and got some idea about what's happening.

Fn call for compacting termlist table line 2234: merge_docid_keyed(out, inputs, offset, t->type);

Should I pass version_file_out reference to the above function? So I can set values for newly added data members in HoneyVersion class and then call write() method.

Last edited 6 years ago by Guruprasad Hegde (previous) (diff)

comment:5 by Guruprasad Hegde, 6 years ago

Another question: Once updating Honey database is available (I guess it will be HoneyWritableDatabase), all the stats will be updated within that class. Hence this fix (within compact() code) is just a temporary one right? What happens when compact() is called when source database is Honey, is it allowed?

comment:6 by Olly Betts, 6 years ago

You definitely should not be calling write() there - the new version file should only get written once, after everything else in the compaction run is complete. Writing it earlier is (a) wasted effort and (b) makes it appear there's a working database there when it's actually incomplete - especially bad if the run is interrupted at that point (e.g. run xapian-compact and hit Ctrl+C).

You could pass in a reference or pointer to version_file_out. Or pass the min and max bound variables by reference.

HoneyWritableDatabase (or similar) will also need to track these stats, but the fix here isn't just temporary - it's useful to be able to efficiently convert a database to from the old format to the new one, so I think we will keep that feature.

You can compact a honey database to another honey database (currently there are bugs in some cases, but I think a simple compaction works). There it should just copy the bounds from the source database (or merge the bounds if there are multiple sources - i.e. take the minimum of the lower bounds and the maximum of the upper bounds, taking care to skip any sources where unique_terms is always zero).

comment:7 by Guruprasad Hegde, 6 years ago

After adding new stats as data members in HoneyVersion class, I added getters and updated merge_stats() method. Problem is in honey_compact.cc merge_stats() takes GlassDatabase version stat getters as arguments which don't have getters for new stats. Hence gives error no matching function call.

I added a default value for new stat parameters. This is the correct fix?

comment:8 by Guruprasad Hegde, 6 years ago

Also Instead of using current_wdf of each term, termlist_size can be used to update bounds for the number of unique terms?

Last edited 6 years ago by Guruprasad Hegde (previous) (diff)

in reply to:  8 comment:9 by Olly Betts, 6 years ago

Can you show a patch of the changes you're talking about in comment:7?

Replying to gp1308:

Also Instead of using current_wdf of each term, termlist_size can be used to update bounds for the number of unique terms?

The correct count of unique terms ought to exclude those for which wdf == 0, so to get that we'd need to actually look at current_wdf - termlist_size will often be more than the correct value.

However, currently for efficiency we approximate like this:

Xapian::termcount
GlassTermList::get_unique_terms() const
{
    LOGCALL(DB, Xapian::termcount, "GlassTermList::get_unique_terms", NO_ARGS);
    // get_unique_terms() really ought to only count terms with wdf > 0, but
    // that's expensive to calculate on demand, so for now let's just ensure
    // unique_terms <= doclen.
    RETURN(min(termlist_size, doclen));
}   

So the bound here needs to based on the same thing, so it's actually a bound on the value that can return.

At some point it's likely we'll start storing the number of unique terms in a similar way to how we store the document length. That's probably not going to happen for glass now though, as it would be hard to start doing so compatibly.

comment:11 by Olly Betts, 6 years ago

Milestone: 1.5.0
Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.