Opened 15 years ago

Closed 14 years ago

#213 closed defect (fixed)

Expose statistics to user defined Xapian::Weight subclasses

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.1.0
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Olly Betts Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Currently, The Xapian::Weight::Internal class (which is, as of last night, the class holding the statistics for the whole collection used by the weight objects) is not publically visible. This means that it would be impossible, for example, for a user to write a weighting class equivalent to, say, the BM25Weight class, using the public API, because the statistics aren't available.

After cleaning up the weighting calculation system, I believe the Xapian::Weight::Internal class is now nearly clean enough that it could reasonably be made public, allowing custom weighting classes access to all the statistics currently available.

We might want to make the termfreq and reltermfreq members private, since they're likely to be accessed mainly through the accessor functions anyway. Also we might want to combine them into a single map with entries holding both the termfreq and the reltermfreq, since it's usual to want to access both the termfreq and the reltermfreq for a particular term at the same time.

Also, we might want to call the class Xapian::Stats, instead of Xapian::Weight::Internal, to reflect the Stats being part of the public API, but this would require an ABI change, so would have to wait for 1.1.0. (We could keep the API compatible by making Xapian::Weight::Internal a typedef for Xapian::Stats, I think; currently Stats (with no namespace) is a typedef for Xapian::Weight::Internal).

Change History (13)

comment:1 by Richard Boulton, 15 years ago

Blocking: 120 added

Marking for 1.0.x, since making the interface public could be done without changing the ABI. I don't have any pressing need for this though (it just seems useful), so I'm not going to complain if this is pushed back to 1.1.0, when we'd be able to make it public with a more sensible name.

comment:2 by Olly Betts, 15 years ago

I don't think we need a new public API class for this - we can just add five non-virtual protected methods to Xapian::Weight - for example:

Xapian::doccount Xapian::Weight::get_termfreq() const {

return internal->get_termfreq(tname);


Hmm, or perhaps we should just read these values from the Weight::Internal object once when create() is called - then we wouldn't need to store "internal" or "tname", but instead we'd need to store:

average_length termfreq collection_size rset_size rel_termfreq

But that's mostly orthogonal - it's cleaner to wrap these in getter methods, and then what we store isn't important, except that we can't change what we store until 1.1.0.

comment:3 by Richard Boulton, 15 years ago

Owner: changed from New Bugs to Richard Boulton

Not having to store tname in the Weight class would make various things cleaner.

Currently, we have to pass "" as the tname when making the extra-weight stuff,

and we special-case this in Xapian::Weight::Internal::get_reltermfreq(), which isn't really very nice.

Additionally, to implement the weighting stuff for OP_SYNONYM queries, I'm having to invent terms (and carefully check they're not in the query) so that the weighting object for the SynonymPostlist can get the right statistics. Which is terribly unpleasant.

I think adding members to Xapian::Weight is better than adding accessor methods: because Xapian::Weight::Internal isn't public, the accessor methods couldn't be inline, so a Weight subclass might well want to cache some of them anyway, if it wants to reuse them.

There's no particular reason that we couldn't add members holding the statistics right now, I think; we just can't get rid of tname and internal until 1.1.0 (at the earliest).

comment:4 by Richard Boulton, 15 years ago

Cc: olly@… added
Status: newassigned

Actually, there is a reason that we can't add member variables to Xapian::Weight right now - it'll change the size of the class and break ABI compatibility.

What we _could_ do is change the Xapian::Weight::Internal class _again_ to something which just contains the relevant stats for the weight object (ie, the same global statistics as the current Weight::Internal class, but only the termfreq and reltermfreq for the term which the Weight class is for). This would allow me to do the synonym stuff more easily, and be a step towards making the relevant statistics publically available (which we could then do in 1.1.0 by replacing the internal class with members).

comment:5 by Richard Boulton, 15 years ago

Just to note - I've committed the "change the Xapian::Weight::Internal class _again_ to something which just contains the relevant stats for the weight object" bit in SVN HEAD.

comment:6 by Olly Betts, 15 years ago

Operating System: All
Summary: Make Xapian::Weight::Internal part of the public APIExpose statistics to user defined Xapian::Weight subclasses

Re-titling to reflect what we want to achieve, rather than a particular approach.

comment:7 by Olly Betts, 15 years ago

Once this is done, we should add a better example of a user-defined weighting scheme to docs/sorting.rst.

comment:9 by Richard Boulton, 15 years ago

Description: modified (diff)
Milestone: 1.1

comment:10 by Richard Boulton, 15 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:11 by Olly Betts, 14 years ago

I'm currently working on this.

comment:12 by Richard Boulton, 14 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: assignednew

Marking as owned by Olly, since I believe he has a partially written solution for this.

comment:13 by Olly Betts, 14 years ago

Description: modified (diff)
Status: newassigned

Yes - I was working on it yesterday in fact (while I waited for my DSL to be fixed).

comment:14 by Olly Betts, 14 years ago

Resolution: fixed
Status: assignedclosed

Fixed in trunk in r12151.

Note: See TracTickets for help on using tickets.