Opened 4 years ago
Last modified 4 months ago
#804 assigned enhancement
Improve clustering API
Reported by: | James Aylett | Owned by: | Olly Betts
---|---|---|---
Priority: | highest | Milestone: | 1.5.0
Component: | Library API | Version: | git master
Severity: | normal | Keywords: |
Cc: | | Blocked By: |
Blocking: | | Operating System: | All
Description
The clustering API we have is a reasonable first draft, but it can be improved. Here are some initial thoughts, although not all of them will necessarily make things better.
- Could do with iterators, as one of the common operations is going to be iterating over the clusters in the set, then over the documents inside each one. My code looks very C-like at the moment :) We could use the same pattern as `MSetIterator` and `ESetIterator`, with an iterator class holding a (ref-counted PIMPL) container object and (size - index) - that means the end iterator is cheap to construct as it's (NULL, 0), and a test against the end is just `member_var == 0`. We could probably make these classes entirely inline. Relatedly, it would be great to support C++11 range-for for all our iterators, which is fairly easy for the trivial version that just gives you `operator*()` at each position - but then you can't, for example, get the wdf as well as the term name from a `TermIterator` with a range-for loop, and you even have to rewrite your loop if you later need that.
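  The (ref-counted container, size-index) pattern described above can be sketched in a few lines. This is a stand-alone toy to illustrate the idea, not Xapian code - the names `Seq` and `iterator` are invented for the example. The iterator holds a ref-counted handle plus an offset counted down from the end, so the end iterator is just (NULL, 0), the end test is a single integer comparison, everything is inline, and range-for works:

  ```cpp
  #include <cstddef>
  #include <memory>
  #include <vector>

  // Toy container illustrating the MSetIterator-style pattern: the iterator
  // holds a ref-counted handle plus (size - index), so end() is cheap to
  // construct as (null handle, 0) and the end test is `off_from_end == 0`.
  class Seq {
      std::shared_ptr<std::vector<int>> internal;  // stands in for the PIMPL

  public:
      explicit Seq(std::vector<int> v)
          : internal(std::make_shared<std::vector<int>>(std::move(v))) {}

      class iterator {
          std::shared_ptr<std::vector<int>> internal;
          std::size_t off_from_end;  // 0 means "at the end"

      public:
          // Default constructor gives the cheap end iterator: (NULL, 0).
          iterator() : internal(nullptr), off_from_end(0) {}
          iterator(std::shared_ptr<std::vector<int>> p, std::size_t off)
              : internal(std::move(p)), off_from_end(off) {}

          int operator*() const {
              return (*internal)[internal->size() - off_from_end];
          }
          iterator& operator++() { --off_from_end; return *this; }
          // Comparing offsets suffices because end() always has offset 0.
          bool operator==(const iterator& o) const {
              return off_from_end == o.off_from_end;
          }
          bool operator!=(const iterator& o) const { return !(*this == o); }
      };

      iterator begin() const { return iterator(internal, internal->size()); }
      iterator end() const { return iterator(); }
  };
  ```

  With `begin()`/`end()` defined like this, `for (int x : seq)` works, but as noted it only hands you `operator*()` - there's no way for the loop variable to also expose a second accessor such as wdf.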
- The public API doesn't do a huge amount, because it's unreachable. I can make all sorts of things out of `FreqSource`, but I can't actually use them for clustering AFAICT. So I can't, for instance, create my own vector space and cluster within that without subclassing `KMeans`.
- Access to the original weight in the `MSet` - unless this is `PointType::get_weight()`, in which case the docstring is misleading (it says it's TF-IDF). Even then, it'd be nice to access the original order within the `MSet` as well.
  Just looked at the code: it calculates TF-IDF directly to compute term weight and magnitude. That's probably okay, but it feels a little odd to me that this happens in the `Point` constructor rather than in the `FreqSource`.
- There's only one similarity measure, and there doesn't seem to be a way to set another even if I implemented my own.
  Indeed - `CosineDistance` objects are created and used inside xapian-core library code, which (a) means the `Similarity` and `CosineDistance` classes aren't currently useful in the public API, and (b) means that these methods being virtual is just needless overhead (unless the compiler is capable of devirtualising all the calls).
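  A user-replaceable similarity measure could look something like the sketch below. This is hypothetical API design, not the current Xapian classes: a virtual base the user could subclass and hand to the clusterer, with cosine similarity over sparse term-weight maps as the stock implementation:

  ```cpp
  #include <cmath>
  #include <map>
  #include <string>

  // Hypothetical pluggable similarity interface (not Xapian's actual API).
  // A clusterer would hold a pointer to one of these instead of hard-wiring
  // CosineDistance internally.
  struct Similarity {
      virtual double similarity(const std::map<std::string, double>& a,
                                const std::map<std::string, double>& b) const = 0;
      virtual ~Similarity() {}
  };

  // Stock implementation: cosine similarity of two sparse weight vectors.
  struct CosineSimilarity : Similarity {
      double similarity(const std::map<std::string, double>& a,
                        const std::map<std::string, double>& b) const override {
          double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
          for (const auto& kv : a) {
              norm_a += kv.second * kv.second;
              auto it = b.find(kv.first);
              if (it != b.end()) dot += kv.second * it->second;
          }
          for (const auto& kv : b) norm_b += kv.second * kv.second;
          if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
          return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
      }
  };
  ```

  With the stock measure exposed this way the virtual call overhead only exists when the user actually supplies a custom subclass, which addresses point (b).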
- I suspect that being able to specify a term prefix and only initialise the vector space on terms with that prefix would be helpful. You can do this via a stopper, but that's going to be less efficient when clustering lots of docs.
  It's also more convenient to just specify the prefix. When the input is `Document` termlists (as it currently has to be), `skip_to(prefix)` is just a `next()` loop internally because of how these are encoded on disk, so the efficiency gain would mainly be that we could stop decoding once we'd got past the prefix of interest. That's partly an implementation detail - we could store data to allow actually skipping ahead in a document termlist.
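  The "stop decoding once past the prefix" idea can be illustrated with a sorted termlist - here a plain `std::vector<std::string>` stands in for a `Document` termlist, and the helper name is invented for the example:

  ```cpp
  #include <algorithm>
  #include <string>
  #include <vector>

  // Illustrative helper (not Xapian API): collect the terms with a given
  // prefix from a sorted termlist. We seek to the first term >= prefix and
  // stop as soon as a term no longer starts with it, so the tail of the
  // termlist never needs to be decoded at all.
  std::vector<std::string>
  terms_with_prefix(const std::vector<std::string>& sorted_terms,
                    const std::string& prefix) {
      std::vector<std::string> out;
      auto it = std::lower_bound(sorted_terms.begin(), sorted_terms.end(),
                                 prefix);
      for (; it != sorted_terms.end(); ++it) {
          if (it->compare(0, prefix.size(), prefix) != 0)
              break;  // past the prefix of interest - stop early
          out.push_back(*it);
      }
      return out;
  }
  ```

  A stopper would instead test every term in the list; the early break above is the saving being described.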
- I wonder if we should convert to an integer-indexed (but sparse) vector space on initialisation. Using terms throughout is almost certainly slower. Changing that should also mean that `Point` means a bit more to end users who are controlling things themselves, because they could build a vector space unrelated to terms.
  For matching and in the backends we stick with terms, but that's mostly because otherwise we'd need a way to account for differing numeric term ids across databases, and that isn't a concern here. Using strings may not be as bad as you might guess, since most terms are short and differ early on, and with integers you'd have to convert - but I think it's worth trying with integers.
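  Interning terms to small integer ids on initialisation might look like this (illustrative only - `TermIdMap` is not a Xapian class):

  ```cpp
  #include <cstddef>
  #include <string>
  #include <unordered_map>

  // Illustrative term interner: each distinct term gets the next dense
  // integer id, so the clustering inner loops can index sparse vectors by
  // int instead of comparing strings. The string->id conversion is paid
  // once per (document, term) pair at initialisation time.
  class TermIdMap {
      std::unordered_map<std::string, unsigned> ids;

  public:
      unsigned intern(const std::string& term) {
          // emplace only inserts if the term is new; either way we get
          // back the term's id. ids.size() is evaluated before insertion.
          auto result = ids.emplace(term, static_cast<unsigned>(ids.size()));
          return result.first->second;
      }
      std::size_t size() const { return ids.size(); }
  };
  ```

  A vector space built this way is also decoupled from terms: anything the caller can map to dense integer ids (slots, external features) could feed the same `Point` machinery.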
- I don't know if this is feasible, but it'd be nice, given a `Cluster`, to be able to get some stats about it, or at least stats about the `Point`s within it - distance from the centroid, for instance, which I can compute directly via the public API, but which would be helpful to have built in sometimes. (For instance, if you want to name a cluster, you can either run an algorithm to work that out over all the points, or you can store a topic for each document in its document data and just use the one closest to the centroid. I guess you could just use the first doc in the cluster, though.)
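The kind of per-cluster statistic asked for in the last point - distance of each point from the cluster centroid - is easy enough to sketch with plain dense vectors standing in for the real `Point`/`Centroid` classes (hypothetical helpers, not Xapian API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helpers, not Xapian API. Dense std::vector<double> stands
// in for a Point in the cluster's vector space.

// Mean of a non-empty set of points, component by component.
std::vector<double> centroid(const std::vector<std::vector<double>>& pts) {
    std::vector<double> c(pts[0].size(), 0.0);
    for (const auto& p : pts)
        for (std::size_t i = 0; i < p.size(); ++i)
            c[i] += p[i];
    for (auto& x : c) x /= pts.size();
    return c;
}

// Euclidean distance between two points of equal dimension.
double distance_from(const std::vector<double>& a,
                     const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}
```

Picking the document nearest the centroid (e.g. to name the cluster) is then just a min over `distance_from(point, centroid(points))`.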
Change History (6)
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 20 months ago
Description: | modified (diff) |
---|
Related to this is the `Xapian::Diversify` API - this uses clustering internally, and the result is a `Xapian::DocumentSet` object. Two things strike me about this API:
(a) I'm not convinced the `Diversify` class needs to exist at all - the usage pattern currently is:
```cpp
Xapian::MSet mset = enquire.get_mset(0, 100);
Xapian::Diversify diversify(k, r, lambda, b, sigma_sqr);
Xapian::DocumentSet dset = diversify.get_dmset(mset);
```
We could just provide a new `MSet` method to do this with one fewer object/step:

```cpp
Xapian::MSet mset = enquire.get_mset(0, 100);
Xapian::DocumentSet dset = mset.diversify(k, r, lambda, b, sigma_sqr);
```
(b) It seems inconvenient that the result of this isn't an `MSet` object - currently, to implement an application which optionally diversifies results, you need two versions of the result display code: one using `MSet` and one using `DocumentSet`.
Problem (b) is perhaps relevant to clustering too, though maybe clustered results typically require custom display code anyway.
comment:3 by , 12 months ago
Priority: | normal → highest |
---|---|
Status: | new → assigned |
We really want to resolve this for the next stable release series since neither the clustering nor diversification APIs have been in a stable release yet and it would be much more satisfactory to sort them out before doing so.
comment:4 by , 6 months ago
It looks to me like diversification has an ordering bug.
`Diversify::Internal::compute_diff_dmset()` finds the documents which weren't promoted by diversification and returns them, and they're added to the returned `DocumentSet` after the promoted documents. However, it does this by iterating over `points`, which is an `unordered_set`, so the iteration order is arbitrary - and that iteration order determines the order in which these documents are added to, and thus appear in, the `DocumentSet`.
The sole testcase we seem to have for diversification only checks that the process completes and returns a non-empty `DocumentSet`, so it completely misses this problem.
I haven't checked the paper yet, but if there's no specified reordering of non-promoted documents I think we should preserve the relative order from the original ranking.
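The suggested fix can be sketched as follows: rather than emitting the non-promoted documents in `unordered_set` iteration order, walk the original ranking and keep whatever wasn't promoted, which preserves the original relative order by construction (the helper name and use of bare docids are illustrative, not the actual `compute_diff_dmset()` code):

```cpp
#include <unordered_set>
#include <vector>

// Illustrative sketch of the proposed fix: the non-promoted documents are
// collected by walking the original ranking, so their relative order is
// the original rank order rather than unordered_set iteration order.
std::vector<unsigned>
diff_in_rank_order(const std::vector<unsigned>& original_ranking,
                   const std::unordered_set<unsigned>& promoted) {
    std::vector<unsigned> rest;
    for (unsigned did : original_ranking) {
        if (!promoted.count(did))
            rest.push_back(did);
    }
    return rest;
}
```

This is also easy to test deterministically, unlike the current arbitrary-order behaviour.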
comment:5 by , 6 months ago
Checking the paper, it seems we should preserve the relative order of promoted documents (which I'm fairly sure isn't currently the case in our implementation).
It seems the expectation is that only the promoted documents ("promoted" is my terminology; it isn't used in the paper) are returned as the final results, but in Xapian I think we want to keep the non-promoted documents in the results, since that makes it simpler to implement paging through all the results (you could diversify the top 100, then page on from there as you would in the undiversified case). So we effectively want to do a stable partition of the `MSet` such that all promoted documents come before all non-promoted ones. We may even be able to implement that in-place!
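The stable partition described here maps directly onto `std::stable_partition`, which does indeed operate in place. A sketch with bare docids standing in for `MSet` items (the helper name is invented for illustration):

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Illustrative sketch: move the promoted documents to the front of the
// ranking, preserving the original relative order within each group.
// std::stable_partition guarantees stability and works in place.
void promote_stably(std::vector<unsigned>& ranking,
                    const std::unordered_set<unsigned>& promoted) {
    std::stable_partition(ranking.begin(), ranking.end(),
                          [&promoted](unsigned did) {
                              return promoted.count(did) != 0;
                          });
}
```

After this, paging past the diversified head of the results just continues through the same reordered sequence, as hoped above.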
comment:6 by , 4 months ago
Working on making `diversify()` into a method of `MSet` which reorders the object it's called on, so usage will look like:

```cpp
Xapian::MSet mset = enquire.get_mset(0, 100);
mset.diversify(k, r, lambda, b, sigma_sqr);
```
I've added some thoughts (inline, and in italics) to the description.
One thing I'm quite unsure about is how general to try to make this - should it just be able to cluster documents by terms, or cluster documents by anything (e.g. cluster on value slots, or external data referenced by docid, or some mixture)? And should it be able to cluster other things?
More general is potentially more useful, but risks complicating the API and implementation, and because of that might end up being slower and harder to use for the actual main intended use case.