Opened 10 years ago

Last modified 17 months ago

#670 new enhancement

double set_collapse_key with different values

Reported by: Felix Ostmann
Owned by: Olly Betts
Priority: low
Milestone:
Component: Matcher
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I am importing product data from many merchants. Sometimes a merchant has the same product more than once.

So I build one value from merchant_id CONCAT product_title and another value from merchant_id.

Now I want to collapse on merchant_id CONCAT product_title to keep only the best product from a merchant when it has more than one with the same product_title, and after that I want to collapse on merchant_id with count=5 to get the best 5 products from each merchant.

Currently I have implemented this via a MatchDecider (a simple uniqueness check on the value) for merchant_id CONCAT product_title, but that gives me the first product with a given title from a merchant, not the best one.
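
Roughly, the current workaround looks like this (a minimal sketch; the slot numbers are just placeholders for whichever value slots are used at index time):

#include <xapian.h>
#include <set>
#include <string>

const Xapian::valueno SLOT_MERCHANT = 0;        // value: merchant_id
const Xapian::valueno SLOT_MERCHANT_TITLE = 1;  // value: merchant_id CONCAT product_title

class UniqueValueDecider : public Xapian::MatchDecider {
    // mutable because MatchDecider::operator() is const.
    mutable std::set<std::string> seen;
  public:
    bool operator()(const Xapian::Document& doc) const override {
        // Keep only the first document seen with each merchant+title value.
        // "First" means first in the order the matcher considers documents,
        // which need not be the best-ranked duplicate - hence the problem.
        return seen.insert(doc.get_value(SLOT_MERCHANT_TITLE)).second;
    }
};

// Usage: collapse on merchant (best 5 per merchant) plus the decider:
//   Xapian::Enquire enquire(db);
//   enquire.set_collapse_key(SLOT_MERCHANT, 5);
//   UniqueValueDecider decider;
//   Xapian::MSet mset = enquire.get_mset(0, 10, 0, NULL, &decider);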

So, perhaps there is a way to call set_collapse_key multiple times?

Change History (3)

comment:1 by Felix Ostmann, 10 years ago

Summary: double collapse with different values → double set_collapse_key with different values

comment:2 by Olly Betts, 9 years ago

Sorry for failing to respond for ages.

I read this when it was reported, and thought there was a reason why we don't already support this, but couldn't quite remember all the details. So I left it a while to think about and it slipped my mind until I saw this in the ticket list today.

I'm not sure I can remember all the details still, but this is probably enough to be useful:

For a normal collapse, we just maintain a "prototype" MSet, containing the best N documents we've seen so far. This prototype MSet contains at most count documents with any particular collapse key. Any document which isn't as good as the worst entry in this prototype MSet can simply be ignored, and if we're sorting primarily by relevance, there are various short-cuts we can perform based on that minimum weight (e.g. an OR can be turned into an AND once the maximum weight either side can return is less than the minimum required to make the prototype MSet).

Each time we have a new candidate document which is better ranked than the lowest ranked document in the prototype MSet, we look at its collapse key. If we have fewer than count documents with that collapse key, we just add the candidate to the prototype MSet and drop the lowest ranked document. If we already have count documents with that collapse key and the candidate ranks lower than all of them, we just discard it - otherwise we add the candidate and discard the lowest ranked of the documents we already have with that collapse key.
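
To make that bookkeeping concrete, here is a simplified illustrative sketch of the single-collapse logic (written just for explanation, not the actual matcher code; the linear scans stand in for the real data structures):

#include <algorithm>
#include <string>
#include <vector>

struct Hit { double weight; std::string collapse_key; };

struct ProtoMSet {
    size_t max_size;       // how many results we ultimately want
    size_t collapse_max;   // at most this many hits per collapse key ("count")
    std::vector<Hit> hits; // kept sorted, best (highest weight) first

    double min_weight() const {
        return hits.size() < max_size ? 0.0 : hits.back().weight;
    }

    void consider(const Hit& h) {
        // Candidates no better than the current minimum can simply be ignored.
        if (hits.size() == max_size && h.weight <= min_weight()) return;

        // Collect existing hits with the same collapse key (hits are sorted,
        // so the last index found is the lowest ranked of them).
        std::vector<size_t> same;
        for (size_t i = 0; i < hits.size(); ++i)
            if (hits[i].collapse_key == h.collapse_key) same.push_back(i);

        if (same.size() >= collapse_max) {
            // Already have collapse_max with this key: replace the lowest
            // ranked of them if the candidate beats it, else discard it.
            size_t worst = same.back();
            if (h.weight <= hits[worst].weight) return;
            hits.erase(hits.begin() + worst);
        } else if (hits.size() == max_size) {
            // One-in/one-out: drop the overall lowest ranked hit instead.
            hits.pop_back();
        }
        hits.insert(std::upper_bound(hits.begin(), hits.end(), h,
                                     [](const Hit& a, const Hit& b) {
                                         return a.weight > b.weight;
                                     }),
                    h);
    }
};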

One problem with collapsing on two different aspects is that we lose this one-in/one-out property of the process. With the double collapse you want, each candidate document can knock out two existing documents which are in the prototype MSet, so we'd have to keep a much larger prototype set to make sure we haven't discarded a document we might end up needing. I think we would need to keep as many extra documents as there are potential candidates left to consider. In general, that's up to Database.get_doccount() less the number of documents already considered. For some queries we can probably bound that more tightly, but it's still a lot more than we usually keep.

And because we're keeping a lot more documents around, the minimum weight required to make it into the prototype MSet will stay much lower, which will limit the ability of the matcher to short-cut.

So there's more data to keep around, and less chance to short-cut the process, so this is likely to incur quite a performance hit compared to a single collapse.

But your situation is actually a special one - one of your collapse keys is a prefix of the other, so the "collapse sets" of one will always be subsets of collapse sets of the other. I've a feeling that means it's possible to handle this case more efficiently than the general multi-collapse case, though I've not looked at it in detail. I'm also not sure what the API for it might look like.

Last edited 17 months ago by Olly Betts

comment:3 by Olly Betts, 5 years ago

Priority: normal → low
Version: 1.2.20 → git master

I still think we could probably support this limited special case for multiple collapsing, though I'm not sure how widely useful it would be. A few others have asked about multiple collapse keys but in the more general case, which I don't think is feasible to support.

In API terms, this could perhaps be done by specifying a collapse on merchant with collapse_max set to 5 (using the existing API) plus some way to say "when selecting those 5, make sure they have a unique product_title within each merchant" - perhaps via:

enquire.set_collapse_key(SLOT_MERCHANT, 5, SLOT_PRODUCT_TITLE);

Then internally we'd need to fetch the value in slot SLOT_PRODUCT_TITLE for candidate documents and eliminate duplicates based on that too.
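
For illustration, the indexing side this assumes and the proposed call might look like this (slot numbers are placeholders, and the three-argument set_collapse_key() overload is only the suggestion above, not an existing method):

#include <xapian.h>
#include <string>

const Xapian::valueno SLOT_MERCHANT = 0;
const Xapian::valueno SLOT_PRODUCT_TITLE = 1;

void index_product(Xapian::WritableDatabase& db,
                   const std::string& merchant_id,
                   const std::string& product_title) {
    Xapian::Document doc;
    // Store each aspect in its own value slot so the matcher can fetch them.
    doc.add_value(SLOT_MERCHANT, merchant_id);
    doc.add_value(SLOT_PRODUCT_TITLE, product_title);
    // ... add terms, document data, etc. ...
    db.add_document(doc);
}

// Query time, with the hypothetical overload:
//   enquire.set_collapse_key(SLOT_MERCHANT, 5, SLOT_PRODUCT_TITLE);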
