Opened 10 years ago
Last modified 18 months ago
#670 new enhancement
double set_collapse_key with different values
Reported by: | Felix Ostmann | Owned by: | Olly Betts |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: |
Blocking: | | Operating System: | All
Description
I am importing product data from many merchants. Each merchant sometimes has the same product more than once. So I build one value from merchant_id CONCAT product_title and another value from merchant_id.
Now I want to collapse over merchant_id CONCAT product_title to get only the best product for this search from each merchant (if there is more than one with the same product_title), and after that I want to collapse over merchant_id with count=5 to get the best 5 products from each merchant.
Currently I have implemented this via a MatchDecider (a simple unique-value check) on merchant_id CONCAT product_title, but this gives me the first product with a given title from a merchant, not the best one.
So, perhaps there is a way to call set_collapse_key multiple times?
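As an illustration (not the reporter's actual code), the workaround described above might look roughly like this in C++; the slot numbers and the unique-value bookkeeping are assumptions:

```cpp
// Sketch of the workaround: a MatchDecider that accepts only the first
// document seen for each merchant_id CONCAT product_title value.
#include <xapian.h>
#include <set>
#include <string>

const Xapian::valueno SLOT_MERCHANT_TITLE = 0;  // merchant_id CONCAT product_title
const Xapian::valueno SLOT_MERCHANT = 1;        // merchant_id alone

class UniqueValueDecider : public Xapian::MatchDecider {
    mutable std::set<std::string> seen;
  public:
    bool operator()(const Xapian::Document &doc) const override {
        // Accept a document only the first time its slot value turns up.
        // The decider is called in the order the matcher considers the
        // documents, not in ranked order, which is why this keeps the
        // first product per title rather than the best one.
        return seen.insert(doc.get_value(SLOT_MERCHANT_TITLE)).second;
    }
};

Xapian::MSet search(Xapian::Database &db, const Xapian::Query &query) {
    UniqueValueDecider decider;
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    // Existing single collapse: best 5 documents per merchant...
    enquire.set_collapse_key(SLOT_MERCHANT, 5);
    // ...while the decider filters out repeated merchant+title values.
    return enquire.get_mset(0, 10, 0, NULL, &decider);
}
```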
Change History (3)
comment:1 by , 10 years ago
Summary: | double collapse with different values → double set_collapse_key with different values |
---|
comment:3 by , 5 years ago
Priority: | normal → low |
---|---|
Version: | 1.2.20 → git master |
I still think we could probably support this limited special case for multiple collapsing, though I'm not sure how widely useful it would be. A few others have asked about multiple collapse keys but in the more general case, which I don't think is feasible to support.
In API terms, this could perhaps be done by specifying a collapse on merchant with collapse_max set to 5 (using the existing API) plus some way to say "when selecting those 5, make sure they have unique product_title within each merchant" - perhaps via:
enquire.set_collapse_key(SLOT_MERCHANT, 5, SLOT_PRODUCT_TITLE);
Then internally we'd need to fetch the value in slot SLOT_PRODUCT_TITLE for candidate documents and eliminate duplicates based on that too.
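To make the proposal concrete, a hypothetical usage sketch follows; the three-argument overload is only the suggestion above and does not exist in the current API, and the slot numbers and indexing lines are assumptions:

```cpp
// Hypothetical usage of the suggested overload; set_collapse_key() does not
// currently take a third argument, and the slot numbers are assumptions.
#include <xapian.h>

const Xapian::valueno SLOT_MERCHANT = 1;
const Xapian::valueno SLOT_PRODUCT_TITLE = 2;

// At index time each document would carry the two values separately, e.g.:
//   doc.add_value(SLOT_MERCHANT, merchant_id);
//   doc.add_value(SLOT_PRODUCT_TITLE, product_title);

Xapian::MSet search(Xapian::Database &db, const Xapian::Query &query) {
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    // Keep at most 5 documents per merchant, and require those 5 to have
    // distinct product_title values within the merchant.
    enquire.set_collapse_key(SLOT_MERCHANT, 5, SLOT_PRODUCT_TITLE);
    return enquire.get_mset(0, 10);
}
```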
Sorry for failing to respond for ages.
I read this when it was reported, and thought there was a reason why we don't already support this, but couldn't quite remember all the details. So I left it a while to think about and it slipped my mind until I saw this in the ticket list today.
I'm not sure I can remember all the details still, but this is probably enough to be useful:
For a normal collapse, we just maintain a "prototype" MSet, containing the best N documents we've seen so far. This prototype MSet contains at most count documents with any particular collapse key. Any document which isn't as good as the worst entry in this prototype MSet can simply be ignored, and if we're sorting primarily by relevance, there are various short-cuts we can perform based on that minimum weight (e.g. an OR can be turned into an AND once the maximum weight either side can return is less than the minimum required to make the prototype MSet).
Each time we have a new candidate document which is better ranked than the lowest ranked document in the prototype MSet, we look at its collapse key. If we have fewer than count documents with that collapse key, we can just add it to the prototype MSet and drop the lowest ranked document instead. If we already have count documents with that key and it ranks lower than all of them, we just discard it - otherwise we add the document and discard the lowest ranked of those we already have with that collapse key.
One problem with collapsing on two different aspects is that we lose this one-in/one-out property of the process. With the double collapse you want, each candidate document can knock out two existing documents which are in the prototype MSet, so we'd have to keep a much larger prototype set to make sure we haven't discarded a document we might end up needing. I think we would need to keep as many extra documents as there are potential candidates left to consider. In general, that's up to Database.get_doccount() less the number of documents already considered. For some queries we can probably bound that more tightly, but it's still a lot more than we usually keep.
And because we're keeping a lot more documents around, the minimum weight required to make it into the prototype MSet will stay much lower, which will limit the ability of the matcher to short-cut.
So there's more data to keep around and less chance to short-cut the process, which means this is likely to incur quite a performance hit compared to a single collapse.
But your situation is actually a special one - one of your collapse keys is a prefix of the other, so the "collapse sets" of one will always be subsets of collapse sets of the other. I've a feeling that means it's possible to handle this case more efficiently than the general multi-collapse case, though I've not looked at it in detail. I'm also not sure what the API for it might look like.
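For reference, a simplified sketch of the single-collapse bookkeeping described above; this is purely illustrative, with assumed types, and not the actual Xapian matcher code (which also tracks collapse counts and other state):

```cpp
// Illustrative sketch only: maintain the best `msize` documents seen so far,
// with at most `collapse_max` entries sharing any one collapse key.
#include <algorithm>
#include <string>
#include <vector>

struct Candidate {
    double weight;
    std::string collapse_key;
};

struct ProtoMSet {
    std::vector<Candidate> docs;  // kept sorted, best (highest weight) first
    size_t msize;                 // requested MSet size
    size_t collapse_max;          // max documents allowed per collapse key

    // The weight needed to get into the proto-MSet; the matcher can use this
    // bound for short-cuts such as the OR -> AND optimisation mentioned above.
    double min_weight() const {
        return docs.size() < msize ? 0.0 : docs.back().weight;
    }

    void consider(const Candidate &c) {
        // Anything not better than the current worst entry can be ignored.
        if (docs.size() >= msize && c.weight <= docs.back().weight) return;

        // Find the entries already sharing this candidate's collapse key.
        size_t same_key = 0;
        size_t worst_same_key = docs.size();
        for (size_t i = 0; i < docs.size(); ++i) {
            if (docs[i].collapse_key == c.collapse_key) {
                ++same_key;
                worst_same_key = i;  // sorted, so the last hit is the worst
            }
        }

        if (same_key >= collapse_max) {
            // Already collapse_max docs with this key: replace the worst of
            // them if the candidate beats it, otherwise discard the candidate.
            if (c.weight <= docs[worst_same_key].weight) return;
            docs.erase(docs.begin() + worst_same_key);
        } else if (docs.size() >= msize) {
            // One in, one out: drop the overall lowest-ranked document.
            docs.pop_back();
        }

        // Insert the candidate, keeping the list sorted by descending weight.
        auto pos = std::upper_bound(docs.begin(), docs.end(), c,
            [](const Candidate &a, const Candidate &b) {
                return a.weight > b.weight;
            });
        docs.insert(pos, c);
    }
};
```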