Opened 20 years ago

Last modified 12 months ago

#22 assigned enhancement

Handle single characters components of hyphenated words specially

Reported by: Olly Betts Owned by: Olly Betts
Priority: low Milestone: 2.0.0
Component: QueryParser Version: git master
Severity: normal Keywords:
Cc: Robert Pollak, Richard Boulton Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Some common punctuation (notably -) is treated as a word break when indexing, and as a phrase generator when searching. The problem is that many common cases end up creating phrase searches with one or two character terms which are very common, and these search are slow with a big database.

Examples include: e-mail cd-r d-i-y

This could perhaps be addressed by a smarter word identifying algorithm. When indexing and searching, we could decide never to generate a single character term in certain circumstances (maybe also apply the same rules for two character terms).

So "e-mail" would be indexed as "email" not "e" and "mail". And similarly for searching. In general the extra conflation this gives seems useful (although email is apparently dutch for enamel...)

The query parser probably wouldn't apply this rule to quoted phrase searches - otherwise searching for "o freddled gruntbuggly" would search for "ofreddled gruntbuggly" and tragically not find any matches (I'm sure there are less esoteric examples - a search for "i robot" say...)

Change History (31)

comment:1 by Olly Betts, 20 years ago

Status: newassigned, robert.pollak@fabasoft.com

Underscored identifiers (e.g. O_STREAMING) and filenames (e.g. /usr/local/bin) are currently phrase searches. We could also make these terms, but it's also useful to be able to search for individual components of these... Same for email addresses.

We could consider indexing them as a single term and split (and perhaps also the set of prefixes, so that "/usr/local" matches "/usr/local/bin" - not sure if that's better or not). The query parser would generate a single term.

A version number (e.g. 2.4.25) could also be handle specially by indexing as terms 2.4 and 2.4.25 (a bit like prefixes for filenames suggested above)...

comment:2 by Olly Betts, 19 years ago

No calls to c_str in xapian-examples!

comment:3 by Olly Betts, 19 years ago

Tsk, wrong bug...

comment:4 by Olly Betts, 17 years ago

Version: 0.7.5SVN HEAD

SVN HEAD now treats "'" as a word character.

comment:5 by Richard Boulton, 17 years ago

Blocking: 118 added

Added this as a dependency for 1.0 release. (Though maybe the system is good enough already - feel free to remove the dependency if you think so, Olly.)

comment:6 by Olly Betts, 17 years ago

This still definitely needs doing, though the apostrophe changes do tackle one common case (at least for English).

I think this and "making indextext.cc into a class" are potentially candidates for delaying a little though. I was going to look at them once the "definites" were done and see (a) how long that takes and (b) how hard this looks once I make a start.

comment:7 by Richard Boulton, 17 years ago

Cc: richard@… added

comment:8 by Olly Betts, 17 years ago

Component: Library APIQueryParser
Priority: highhighest

SVN HEAD now just treats '_' as a word character. After looking at the terms generated from real documents this seemed the best approach.

'/' and '\' are still treated as phrase generators. I think that's probably the best approach overall for those - they don't seem to be problematic cases or very common (and adding a lot of terms for them has its downside). The same holds for e-mail addresses.

Still to do:

Some special handling for '-' and '.' with one and perhaps two character components is still to think about.

Ditto for floating point numbers and version numbers.

I also want to make the categories of characters configurable - so if for example you wanted filenames to be single terms you could set that up. We need good defaults categories though.

comment:9 by Richard Boulton, 17 years ago

Can I suggest also handling for commas in numeric terms; these should be ignored, so that (eg) 600,000 is indexed as 600000.

Of course this wouldn't work so well in locales where the decimal point is a comma; so would need to be configurable as well, I suppose.

comment:10 by Olly Betts, 17 years ago

I think including ',' and '.' in otherwise numeric terms (or perhaps terms which start with a digit or something?) is a good idea, but it's probably better overall to just keep them in (it certainly avoids opening a can of worms at this point...)

I'm going to use the wiki to track individual todo items:

http://wiki.xapian.org/BraveNewTerms

comment:11 by Olly Betts, 17 years ago

Blocking: 118 removed
Priority: highestnormal
Severity: normalenhancement

1.0' is now a single term, as are 2.6.18', 1,024', 5,25', `1,234.567', etc.

So the remaining issue here is that of applying some special handling to cases with hyphens, like e-mail', cd-r', `d-i-y', etc.

Some grepping of text in /usr/share/doc suggests that this isn't a clear win - some cases are better, but others are worse, so this needs careful investigation and so I'm not intending to implement it for 1.0.0 now.

Therefore, I'm removing the blocker status.

comment:12 by Olly Betts, 16 years ago

Blocking: 160 added
Operating System: All

We should consider this before 1.1 in case we want to make further indexing strategy changes.

comment:13 by Olly Betts, 16 years ago

Bug#23 is now fixed, which will help a lot of these cases.

Once 1.0.4 is out, we should try to gather examples of slow queries to get an idea of whether further work is required, and if so where. Omega can easily be set up to log queries which take more than N seconds, which is useful for this.

comment:14 by Richard Boulton, 16 years ago

I've had some reports from a customer of very slow phrase searches with hyphenated words, so I'm thinking about this issue again.

One thing which occurred to me is that we could try storing the data in the positionlist table differently. Currently, we key position lists by "did + tname". We might do better if we keyed the entries by "tname + did". Here's my logic:

"did + tname" is good for indexing, because when we add new documents with sequentially allocated IDs, we're just appending to the end of the table. However, when searching, we'll probably be accessing lots of documents, but be accessing a very small proportion of the terms. "did + tname" means that all the position lists for each document are stored together, so we have to skip over all the position lists for terms we're not interested in in every document between the current one, and the next document we're checking.

If we keyed by "tname + did" instead, a given search would only have to skip over the position lists for the term we're interested in for each document between the current one and the next document we're checking, so we'd get much better locality of reference.

On the downside, updating the index would probably be quite a bit harder - we'd be performing a merge operation, similar to that done to generate the posting list table.

comment:15 by Olly Betts, 16 years ago

"did + tname" means that all the position lists for each document are stored together, so we have to skip over all the position lists for terms we're not interested in in every document between the current one, and the next document we're checking.

We don't literally "skip over" here - we use "FlintTable::get_exact_entry()", so the cost of moving to a new entry isn't linear in the distance between entries. There's a cost which is roughly the log of that distance, though that's reading parents blocks which we'll generally have cached so this is quite likely to be swamped by other costs anyway.

Improving the locality of references may improve search performance, but it's clearly going to seriously affect indexing performance, and will also make postition.DB larger (more dead space due to less sequential updating) so I think we'd need to get a big improvement to search performance to justify making this change.

Do you have some examples of bad queries? I've wondered if hyphenated terms with a single character component should be handled specially (e.g. e-mail, cd-r, d-i-y could be indexed and searched as email, cdr, diy) which addresses a lot of the examples I've seen, and I think will generally improve results.

comment:16 by Richard Boulton, 16 years ago

The main reported problem was with "a-table", in a database of furniture, so both table and "a" are common terms; so your comments would apply, and fix that problem.

I'm getting hold of the data, and will try out indexing with "tname + did", and see how it affects things.

comment:18 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1

comment:19 by Richard Boulton, 16 years ago

Blocking: 160 removed

comment:20 by Olly Betts, 16 years ago

Description: modified (diff)

Update description.

comment:21 by Olly Betts, 15 years ago

Did you get any numbers for whether tname + did is better than did + tname?

Other than that, if we're going to make any changes to indexing, 1.1.0 would be better than 1.1.x so leaving milestone as that for now.

comment:22 by Olly Betts, 15 years ago

Description: modified (diff)

comment:23 by Richard Boulton, 15 years ago

I didn't get any numbers, no. IIRC, we changed the term parsing to avoid performing this particular phrase search, instead.

comment:24 by Olly Betts, 15 years ago

Summary: Eliminate common cases which cause a slow phrase searchHandle single characters components of hyphenated words specially

You mean changed it specifically for them? Because Xapian handles hyphenated words the same as it did in 1.0.4...

I don't think the change will speed up this case much if at all, and it will prevent sequential update of the positions Btree when appending documents which will tend to slow down positional search with uncompacted databases, so in the absence of benchmarks, I'm inclined to forget about the idea.

I think for 1.1.0 we should consider special handling of single characters and hyphens though.

comment:25 by Olly Betts, 15 years ago

Milestone: 1.1.01.1.1

I think that the single character hyphen handling needs more consideration that we can give it before 1.1.0 - my investigations before showed it made some cases worse as well as some better. We may still be able to do it for 1.1.x anyway - e.g. a purely search time change would be fine, as would adding different indexing "strategies". So -> milestone:1.1.1

comment:26 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.7

Triaging milestone:1.1.1 bugs.

comment:27 by Olly Betts, 15 years ago

Severity: minormajor

This would be an incompatible change to make in 1.2.x, so it either needs to be 1.1.x or 1.3.x. Not a release blocker though.

comment:28 by Olly Betts, 15 years ago

Priority: normalhigh
Severity: majornormal

It would be better to set priority to indicate the priority!

comment:29 by Olly Betts, 15 years ago

Milestone: 1.1.71.3.0

Bumping to stay on track.

We could potentially add this in 1.2.x if it was done in an optional way, though I suspect this is 1.3.x material now.

comment:30 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

comment:31 by Olly Betts, 8 years ago

Milestone: 1.3.x1.4.x

Phrase performance is much improved in master (especially with glass), so the overhead of handling these cases as phrases is less of an issue than it was. In the interests of releasing 1.4.0, I think this needs to be bumped.

comment:32 by Olly Betts, 12 months ago

Milestone: 1.4.x2.0.0
Priority: highlow
Version: SVN trunkgit master

There have been numerous further improvements in phrase performance and I haven't seen reports of slow phrase performance for a while.

I think we should still look into special handling here - it seems likely to improve results by conflating d-i-y and diy, e-mail and email, cd-r and cdr, etc but it's not the high priority is once was because we've addressed the slow phrase search cases.

Note: See TracTickets for help on using tickets.