Ticket #213 (assigned defect)

Opened 14 months ago

Last modified 3 weeks ago

Expose statistics to user defined Xapian::Weight subclasses

Reported by: richard Owned by: richard
Priority: normal Milestone: 1.1.0
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: olly Blocked By:
Operating System: All Blocking:

Description (last modified by richard) (diff)

Currently, The Xapian::Weight::Internal class (which is, as of last night, the class holding the statistics for the whole collection used by the weight objects) is not publically visible. This means that it would be impossible, for example, for a user to write a weighting class equivalent to, say, the BM25Weight class, using the public API, because the statistics aren't available.

After cleaning up the weighting calculation system, I believe the

Xapian::Weight::Internal class is now nearly clean enough that it could reasonably be made public, allowing custom weighting classes access to all the statistics currently available.

We might want to make the termfreq and reltermfreq members private, since they're likely to be accessed mainly through the accessor functions anyway. Also we might want to combine them into a single map with entries holding both the termfreq and the reltermfreq, since it's usual to want to access both the termfreq and the reltermfreq for a particular term at the same time.

Also, we might want to call the class Xapian::Stats, instead of Xapian::Weight::Internal, to reflect the Stats being part of the public API, but this would require an ABI change, so would have to wait for 1.1.0. (We could keep the API compatible by making Xapian::Weight::Internal a typedef for Xapian::Stats, I think; currently Stats (with no namespace) is a typedef for Xapian::Weight::Internal).

Change History

Changed 14 months ago by richard

  • blocking set to 120

Marking for 1.0.x, since making the interface public could be done without changing the ABI. I don't have any pressing need for this though (it just seems useful), so I'm not going to complain if this is pushed back to 1.1.0, when we'd be able to make it public with a more sensible name.

Changed 14 months ago by olly

I don't think we need a new public API class for this - we can just add five non-virtual protected methods to Xapian::Weight - for example:

Xapian::doccount Xapian::Weight::get_termfreq() const {

return internal->get_termfreq(tname);

}

Hmm, or perhaps we should just read these values from the Weight::Internal object once when create() is called - then we wouldn't need to store "internal" or "tname", but instead we'd need to store:

average_length termfreq collection_size rset_size rel_termfreq

But that's mostly orthogonal - it's cleaner to wrap these in getter methods, and then what we store isn't important, except that we can't change what we store until 1.1.0.

Changed 14 months ago by richard

  • owner changed from newbugs to richard

Not having to store tname in the Weight class would make various things cleaner.

Currently, we have to pass "" as the tname when making the extra-weight stuff,

and we special-case this in Xapian::Weight::Internal::get_reltermfreq(), which isn't really very nice.

Additionally, to implement the weighting stuff for OP_SYNONYM queries, I'm having to invent terms (and carefully check they're not in the query) so that the weighting object for the SynonymPostlist? can get the right statistics. Which is terribly unpleasant.

I think adding members to Xapian::Weight is better than adding accessor methods: because Xapian::Weight::Internal isn't public, the accessor methods couldn't be inline, so a Weight subclass might well want to cache some of them anyway, if it wants to reuse them.

There's no particular reason that we couldn't add members holding the statistics right now, I think; we just can't get rid of tname and internal until 1.1.0 (at the earliest).

Changed 14 months ago by richard

  • cc olly@… added
  • status changed from new to assigned

Actually, there is a reason that we can't add member variables to Xapian::Weight right now - it'll change the size of the class and break ABI compatibility.

What we _could_ do is change the Xapian::Weight::Internal class _again_ to something which just contains the relevant stats for the weight object (ie, the same global statistics as the current Weight::Internal class, but only the termfreq and reltermfreq for the term which the Weight class is for). This would allow me to do the synonym stuff more easily, and be a step towards making the relevant statistics publically available (which we could then do in 1.1.0 by replacing the internal class with members).

Changed 14 months ago by richard

Just to note - I've committed the "change the Xapian::Weight::Internal class _again_ to something which just contains the relevant stats for the weight object" bit in SVN HEAD.

Changed 14 months ago by olly

  • summary changed from Make Xapian::Weight::Internal part of the public API to Expose statistics to user defined Xapian::Weight subclasses

Re-titling to reflect what we want to achieve, rather than a particular approach.

Changed 14 months ago by trac

  • platform set to All

Changed 13 months ago by olly

Once this is done, we should add a better example of a user-defined weighting scheme to docs/sorting.rst.

Changed 9 months ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 9 months ago by richard

  • blocking deleted

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

Changed 3 weeks ago by olly

I'm currently working on this.

Note: See TracTickets for help on using tickets.