Opened 16 years ago

Closed 8 years ago

#211 closed enhancement (fixed)

Dynamic summaries / snippets

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 1.3.5
Component: Library API
Version: SVN trunk
Severity: minor
Keywords:
Cc: daevaorn@…
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Xapian should include features to allow dynamic summaries to be generated from snippets of text around where the query terms occur in a matching document.

This has been asked about several times on the mailing list, for example:

http://thread.gmane.org/gmane.comp.search.xapian.general/5097

Change History (13)

comment:1 by Richard Boulton, 16 years ago

Some Python code which implements this is available at: http://xappy.googlecode.com/svn/trunk/xappy/highlight.py

However, the approach it takes has some shortcomings. In particular, it has no handling for phrases, so terms which appear in the query only as part of a phrase can be highlighted individually.

Also, of course, it's implemented in Python, so it isn't accessible to users of other languages.

comment:2 by Olly Betts, 16 years ago

Operating System: All
Status: new → assigned

I wonder if we could have some sort of positionlist->postlist adaptor class and rerun the query on the positionlists for a matching document.

Problem is where to put the document boundaries - using certain punctuation (as xappy appears to) makes sense, but we'd need to generate those positions. That could be done at index time and then compressed with interpolative coding I guess...
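As a rough illustration of the index-time idea, the sketch below records the term position at which each punctuation-delimited segment starts, then maps a term position back to its segment. It is only a sketch under assumptions: the function names and the choice of ".", "!" and "?" as boundary characters are hypothetical, not Xapian or xappy APIs, and it does no compression. The boundary list it produces is a sorted list of integers, which is the kind of data interpolative coding is designed for.

```python
import re
from bisect import bisect_right

# Hypothetical tokenizer: words, plus sentence-ending punctuation as breaks.
WORD_OR_BREAK = re.compile(r"\w+|[.!?]")

def index_with_boundaries(text):
    """Return (terms, boundaries).

    boundaries is a sorted list of term positions; each entry is the
    position of the first term after a sentence break. A break at the very
    end of the text records len(terms).
    """
    terms, boundaries = [], []
    for token in WORD_OR_BREAK.findall(text):
        if token in ".!?":
            # Skip breaks before the first term and repeated breaks ("...").
            if terms and (not boundaries or boundaries[-1] != len(terms)):
                boundaries.append(len(terms))
        else:
            terms.append(token.lower())
    return terms, boundaries

def segment_of(position, boundaries):
    """Return the 0-based segment number containing a term position."""
    return bisect_right(boundaries, position)
```

With boundaries stored like this, each segment could be treated as a pseudo-document when the query is rerun over the position lists.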

comment:4 by Olly Betts, 16 years ago

Description: modified (diff)
Owner: changed from New Bugs to Olly Betts
Status: assigned → new

comment:5 by Alex Koshelev, 15 years ago

Cc: daevaorn@… added

comment:6 by Olly Betts, 11 years ago

Milestone: 1.3.x

Mihai's gsoc branch implements this, and we should be able to merge that for a 1.3.x release.

comment:7 by Olly Betts, 10 years ago

Milestone: 1.3.x → 1.3.3

comment:8 by Olly Betts, 10 years ago

Milestone: 1.3.3 → 1.3.2

I'm in the process of merging this, starting at r18020, so this will be in 1.3.2.

comment:9 by Olly Betts, 10 years ago

Milestone: 1.3.2 → 1.3.3
Status: new → assigned

All now merged except for the example. A "how to" section in the "getting started" guide would also be good. Neither of these is worth holding up 1.3.2 for at this point.

comment:10 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.3.4

comment:11 by Olly Betts, 8 years ago

Would be good to resolve for 1.4.0 - the implementation on trunk hasn't been in a stable release yet - once it has been, compatibility becomes a concern.

comment:12 by Olly Betts, 8 years ago

Current status:

The approach from the paper Mihai implemented doesn't directly consider the query, but instead looks at the top few documents matched and builds a document language model. In theory this is a nice approach: it has a sound theoretical basis, and it will consider interesting terms outside of the query. In practice, it turns out to have a serious drawback: it sometimes selects a snippet which doesn't contain any of the query terms, and users find that surprising (quite reasonably I think). It's also slower than is ideal.
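To make the drawback concrete, here is a toy illustration. It is not the algorithm from the paper or from the gsoc branch: it assumes a plain unigram model built from the top documents and scores each sentence by the mean model probability of its terms. All names are hypothetical. Because the query plays no part in the score, nothing stops a sentence without any query term from winning.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a stand-in for real term generation."""
    return re.findall(r"\w+", text.lower())

def build_model(top_documents):
    """Unigram model: relative frequency of each term in the top documents."""
    counts = Counter()
    for doc in top_documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def best_sentence(sentences, model):
    """Pick the sentence with the highest mean term probability."""
    def score(sentence):
        terms = tokenize(sentence)
        if not terms:
            return 0.0
        return sum(model.get(t, 0.0) for t in terms) / len(terms)
    return max(sentences, key=score)

# The query was "xapian", but the model is dominated by other terms.
top_documents = [
    "xapian search engine library",
    "search engine ranking with probabilistic search",
    "probabilistic search engine ranking",
]
sentences = [
    "Xapian was released under the GPL.",
    "Probabilistic ranking drives the search engine.",
]
chosen = best_sentence(sentences, build_model(top_documents))
```

Here `chosen` is the second sentence, which never mentions "xapian".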

We also had a patch for generating snippets from fastmail, but that has different drawbacks - for example: its segmenting of text doesn't exactly match what TermGenerator produces, so it fails to highlight in some cases; also it considers each term in turn, so doesn't prefer a snippet containing more terms from the query.

So I've taken the best ideas from each, and implemented a new snippet generating algorithm. A key design choice is that it makes a single pass over the text we're generating the snippet from (with scope to terminate early). It prefers occurrences of the query terms in contexts containing "interesting" non-query terms. And it also handles exact phrases and wildcards (both selecting snippets based on them, and highlighting them).
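The general shape of such an algorithm can be sketched as a sliding window that is scored as it moves over the text once. This is a minimal sketch under assumptions, not the code that was committed: the function names, the fixed window width, the weights (1 per distinct query term, 0.5 per "interesting" term) and the `<b>` markers are all made up for illustration, and it ignores phrases, wildcards and early termination.

```python
import re
from collections import Counter, deque

def best_window(words, query_terms, interesting_terms, width):
    """Single pass over lowercased words; return (start, end) or None.

    A window only qualifies if it contains at least one query term.
    """
    in_window = Counter()
    window = deque()
    best_score, best_span = 0.0, None
    for pos, word in enumerate(words):
        window.append(word)
        in_window[word] += 1
        if len(window) > width:
            old = window.popleft()
            in_window[old] -= 1
            if in_window[old] == 0:
                del in_window[old]
        distinct_query = sum(1 for t in query_terms if t in in_window)
        if distinct_query == 0:
            continue
        bonus = sum(1 for t in interesting_terms if t in in_window)
        score = distinct_query + 0.5 * bonus
        if score > best_score:
            best_score = score
            best_span = (pos - len(window) + 1, pos + 1)
    return best_span

def snippet(text, query_terms, interesting_terms, width=6):
    """Return the best window with query terms wrapped in <b>...</b>."""
    words = re.findall(r"\w+", text)
    lowered = [w.lower() for w in words]
    span = best_window(lowered, set(query_terms), set(interesting_terms), width)
    if span is None:
        return ""
    start, end = span
    return " ".join(
        f"<b>{w}</b>" if w.lower() in query_terms else w
        for w in words[start:end]
    )
```

For example, with the query terms {"xapian", "search"} and the interesting terms {"probabilistic", "ranking"}, the text "Xapian is a library. Many people use search engines daily. Xapian supports probabilistic ranking for search queries." gives "<b>Xapian</b> supports probabilistic ranking for <b>search</b>". Unlike a scheme that considers each term in turn, the window containing both query terms plus the interesting terms wins.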

This code is in production with one of my clients and seems to be working well, so I think we should merge this for 1.4.0. Leaving pegged on 1.3.4 for now, but I'm happy to slip it to a later 1.3.x if it's holding up 1.3.4.

comment:13 by Olly Betts, 8 years ago

Milestone: 1.3.4 → 1.3.5

Ticket retargeted after milestone closed

comment:14 by Olly Betts, 8 years ago

Resolution: fixed
Status: assigned → closed

Committed the new implementation to git master in [b2f228d7c64d36f38ad17ea34c618e3f7744a86d].

There are still a few rough edges, but it's time to get this code out there.
