Opened 21 years ago
Last modified 20 months ago
#22 assigned enhancement
Handle single character components of hyphenated words specially
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | low | Milestone: | 2.0.0 |
Component: | QueryParser | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Robert Pollak, Richard Boulton | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Some common punctuation (notably '-') is treated as a word break when indexing, and as a phrase generator when searching. The problem is that many common cases end up creating phrase searches with one or two character terms which are very common, and these searches are slow with a big database.
Examples include: e-mail, cd-r, d-i-y.
This could perhaps be addressed by a smarter word identifying algorithm. When indexing and searching, we could decide never to generate a single character term in certain circumstances (maybe also apply the same rules for two character terms).
So "e-mail" would be indexed as "email" not "e" and "mail". And similarly for searching. In general the extra conflation this gives seems useful (although email is apparently dutch for enamel...)
The query parser probably wouldn't apply this rule to quoted phrase searches - otherwise searching for "o freddled gruntbuggly" would search for "ofreddled gruntbuggly" and tragically not find any matches (I'm sure there are less esoteric examples - a search for "i robot" say...)
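As a rough illustration of the idea in the description (just a sketch under assumed rules, not how Xapian's QueryParser or TermGenerator is actually implemented), a word splitter could join the hyphen-separated components whenever one of them is a single character, and otherwise keep them as separate terms:

```cpp
// Illustrative sketch only: conflate hyphenated words which contain a
// single character component into one term; otherwise keep the pieces
// separate (which becomes a phrase search at query time).
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> terms_for_hyphenated(const std::string& word) {
    std::vector<std::string> parts;
    std::string piece;
    std::istringstream in(word);
    while (std::getline(in, piece, '-'))
        if (!piece.empty()) parts.push_back(piece);

    bool has_single_char_component = false;
    for (const auto& p : parts)
        if (p.size() == 1) has_single_char_component = true;

    if (has_single_char_component) {
        // "e-mail" -> "email", "cd-r" -> "cdr", "d-i-y" -> "diy".
        std::string joined;
        for (const auto& p : parts) joined += p;
        return {joined};
    }
    return parts;  // e.g. "long-standing" -> "long", "standing"
}

int main() {
    for (const char* w : {"e-mail", "cd-r", "d-i-y", "long-standing"}) {
        std::cout << w << " ->";
        for (const auto& t : terms_for_hyphenated(w)) std::cout << ' ' << t;
        std::cout << '\n';
    }
}
```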
Change History (31)
comment:1 by , 21 years ago
Status: | new → assigned |
---|---|
Cc: | robert.pollak@fabasoft.com added |
comment:5 by , 18 years ago
Blocking: | 118 added |
---|
Added this as a dependency for 1.0 release. (Though maybe the system is good enough already - feel free to remove the dependency if you think so, Olly.)
comment:6 by , 18 years ago
This still definitely needs doing, though the apostrophe changes do tackle one common case (at least for English).
I think this and "making indextext.cc into a class" are potentially candidates for delaying a little though. I was going to look at them once the "definites" were done and see (a) how long that takes and (b) how hard this looks once I make a start.
comment:7 by , 18 years ago
Cc: | added |
---|
comment:8 by , 18 years ago
Component: | Library API → QueryParser |
---|---|
Priority: | high → highest |
SVN HEAD now just treats '_' as a word character. After looking at the terms generated from real documents this seemed the best approach.
'/' and '\' are still treated as phrase generators. I think that's probably the best approach overall for those - they don't seem to be problematic cases or very common (and adding a lot of terms for them has its downside). The same holds for e-mail addresses.
Still to do:
Some special handling for '-' and '.' with one (and perhaps two) character components still needs thinking about.
Ditto for floating point numbers and version numbers.
I also want to make the categories of characters configurable - so if, for example, you wanted filenames to be single terms, you could set that up. We need good default categories though.
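A hypothetical sketch of what configurable character categories could look like (none of these names exist in Xapian; the defaults below just mirror the behaviour described in this comment):

```cpp
// Hypothetical sketch: a table of character categories shared by the
// indexer and the query parser, with per-application overrides.
#include <cctype>
#include <iostream>
#include <map>

enum class CharCategory { WORD, PHRASE_GENERATOR, SEPARATOR };

struct CharCategories {
    std::map<char, CharCategory> overrides;

    CharCategory get(char ch) const {
        auto it = overrides.find(ch);
        if (it != overrides.end()) return it->second;
        // Default categories, roughly matching the behaviour above:
        // letters, digits and '_' form words; '-', '.', '/' and '\'
        // generate phrases; everything else just separates words.
        if (std::isalnum(static_cast<unsigned char>(ch)) || ch == '_')
            return CharCategory::WORD;
        if (ch == '-' || ch == '.' || ch == '/' || ch == '\\')
            return CharCategory::PHRASE_GENERATOR;
        return CharCategory::SEPARATOR;
    }
};

int main() {
    CharCategories cats;
    // An application which wants whole filenames as single terms might do:
    cats.overrides['/'] = CharCategory::WORD;
    std::cout << (cats.get('/') == CharCategory::WORD) << '\n';  // 1
    std::cout << (cats.get('-') == CharCategory::WORD) << '\n';  // 0
}
```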
comment:9 by , 18 years ago
Can I suggest also handling commas in numeric terms: these should be ignored, so that (e.g.) 600,000 is indexed as 600000.
Of course this wouldn't work so well in locales where the decimal point is a comma, so it would need to be configurable as well, I suppose.
comment:10 by , 18 years ago
I think special handling of ',' and '.' in otherwise numeric terms (or perhaps terms which start with a digit or something?) is a good idea, but it's probably better overall to just keep them in the term rather than strip them (it certainly avoids opening a can of worms at this point...)
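A minimal sketch of this "keep them in" behaviour (illustrative only, not Xapian's actual tokeniser): ',' and '.' are retained in a term when flanked by digits, so "600,000", "2.6.18" and "1,234.567" each become a single term:

```cpp
// Sketch: keep ',' and '.' inside a term only when surrounded by digits.
#include <cctype>
#include <iostream>
#include <string>

static bool is_digit(char ch) {
    return std::isdigit(static_cast<unsigned char>(ch)) != 0;
}

static bool keep_separator(const std::string& s, std::string::size_type i) {
    if (s[i] != ',' && s[i] != '.') return false;
    return i > 0 && i + 1 < s.size() && is_digit(s[i - 1]) && is_digit(s[i + 1]);
}

int main() {
    for (const char* raw_c : {"600,000", "2.6.18", "1,234.567", "end."}) {
        std::string raw(raw_c), term;
        for (std::string::size_type i = 0; i < raw.size(); ++i) {
            char ch = raw[i];
            if (std::isalnum(static_cast<unsigned char>(ch)) || keep_separator(raw, i))
                term += ch;
        }
        // e.g. "600,000" -> "600,000", but a trailing "end." -> "end".
        std::cout << raw << " -> " << term << '\n';
    }
}
```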
I'm going to use the wiki to track individual todo items:
comment:11 by , 18 years ago
Blocking: | 118 removed |
---|---|
Priority: | highest → normal |
Severity: | normal → enhancement |
'1.0' is now a single term, as are '2.6.18', '1,024', '5,25', '1,234.567', etc.
So the remaining issue here is that of applying some special handling to cases with hyphens, like 'e-mail', 'cd-r', 'd-i-y', etc.
Some grepping of text in /usr/share/doc suggests that this isn't a clear win - some cases are better, but others are worse, so this needs careful investigation, and I'm not intending to implement it for 1.0.0 now.
Therefore, I'm removing the blocker status.
comment:12 by , 17 years ago
Blocking: | 160 added |
---|---|
Operating System: | → All |
We should consider this before 1.1 in case we want to make further indexing strategy changes.
comment:13 by , 17 years ago
Bug#23 is now fixed, which will help a lot of these cases.
Once 1.0.4 is out, we should try to gather examples of slow queries to get an idea of whether further work is required, and if so where. Omega can easily be set up to log queries which take more than N seconds, which is useful for this.
comment:14 by , 17 years ago
I've had some reports from a customer of very slow phrase searches with hyphenated words, so I'm thinking about this issue again.
One thing which occurred to me is that we could try storing the data in the positionlist table differently. Currently, we key position lists by "did + tname". We might do better if we keyed the entries by "tname + did". Here's my logic:
"did + tname" is good for indexing, because when we add new documents with sequentially allocated IDs, we're just appending to the end of the table. However, when searching, we'll probably be accessing lots of documents, but be accessing a very small proportion of the terms. "did + tname" means that all the position lists for each document are stored together, so we have to skip over all the position lists for terms we're not interested in in every document between the current one, and the next document we're checking.
If we keyed by "tname + did" instead, a given search would only have to skip over the position lists for the term we're interested in for each document between the current one and the next document we're checking, so we'd get much better locality of reference.
On the downside, updating the index would probably be quite a bit harder - we'd be performing a merge operation, similar to that done to generate the posting list table.
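A toy illustration of the two key orderings (the real position table keys are packed binary; readable "field|field" strings stand in for them here so the two sort orders are easy to see - with "tname|did" all the entries for one term end up adjacent, which is the locality argument, while "did|tname" keeps appends sequential during indexing):

```cpp
// Sketch: compare the sort orders produced by "did|tname" vs "tname|did" keys.
#include <cstdio>
#include <map>
#include <string>

int main() {
    struct Posting { unsigned did; const char* term; };
    const Posting postings[] = {
        {1, "mail"}, {1, "zebra"}, {2, "apple"}, {2, "mail"}, {3, "mail"},
    };

    // std::map keeps its keys sorted, much as the Btree does.
    std::map<std::string, const char*> by_did_tname, by_tname_did;
    for (const auto& p : postings) {
        char did[16];
        std::snprintf(did, sizeof(did), "%08u", p.did);  // fixed width so strings sort numerically
        by_did_tname[std::string(did) + "|" + p.term] = "<positionlist>";
        by_tname_did[std::string(p.term) + "|" + did] = "<positionlist>";
    }

    std::puts("did|tname order:");
    for (const auto& e : by_did_tname) std::puts(e.first.c_str());
    std::puts("tname|did order:");
    for (const auto& e : by_tname_did) std::puts(e.first.c_str());
}
```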
comment:15 by , 17 years ago
"did + tname" means that all the position lists for each document are stored together, so we have to skip over all the position lists for terms we're not interested in in every document between the current one, and the next document we're checking.
We don't literally "skip over" here - we use "FlintTable::get_exact_entry()", so the cost of moving to a new entry isn't linear in the distance between entries. There's a cost which is roughly the log of that distance, though that's reading parent blocks which we'll generally have cached, so this is quite likely to be swamped by other costs anyway.
Improving the locality of reference may improve search performance, but it's clearly going to seriously affect indexing performance, and will also make position.DB larger (more dead space due to less sequential updating), so I think we'd need to get a big improvement to search performance to justify making this change.
Do you have some examples of bad queries? I've wondered if hyphenated terms with a single character component should be handled specially (e.g. e-mail, cd-r, d-i-y could be indexed and searched as email, cdr, diy) which addresses a lot of the examples I've seen, and I think will generally improve results.
comment:16 by , 17 years ago
The main reported problem was with "a-table" in a database of furniture, so both "table" and "a" are common terms; your comments would apply, and would fix that problem.
I'm getting hold of the data, and will try out indexing with "tname + did", and see how it affects things.
comment:18 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1 |
comment:19 by , 17 years ago
Blocking: | 160 removed |
---|
comment:21 by , 16 years ago
Did you get any numbers for whether tname + did is better than did + tname?
Other than that, if we're going to make any changes to indexing, 1.1.0 would be better than 1.1.x so leaving milestone as that for now.
comment:22 by , 16 years ago
Description: | modified (diff) |
---|
comment:23 by , 16 years ago
I didn't get any numbers, no. IIRC, we changed the term parsing to avoid performing this particular phrase search, instead.
comment:24 by , 16 years ago
Summary: | Eliminate common cases which cause a slow phrase search → Handle single character components of hyphenated words specially |
---|
You mean changed it specifically for them? Because Xapian handles hyphenated words the same as it did in 1.0.4...
I don't think the change will speed up this case much if at all, and it will prevent sequential update of the positions Btree when appending documents, which will tend to slow down positional search with uncompacted databases; so in the absence of benchmarks, I'm inclined to forget about the idea.
I think for 1.1.0 we should consider special handling of single characters and hyphens though.
comment:25 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
I think that the single character hyphen handling needs more consideration than we can give it before 1.1.0 - my investigations before showed it made some cases worse as well as some better. We may still be able to do it for 1.1.x anyway - e.g. a purely search time change would be fine, as would adding different indexing "strategies". So -> milestone:1.1.1
comment:27 by , 15 years ago
Severity: | minor → major |
---|
This would be an incompatible change to make in 1.2.x, so it either needs to be 1.1.x or 1.3.x. Not a release blocker though.
comment:28 by , 15 years ago
Priority: | normal → high |
---|---|
Severity: | major → normal |
It would be better to set priority to indicate the priority!
comment:29 by , 15 years ago
Milestone: | 1.1.7 → 1.3.0 |
---|
Bumping to stay on track.
We could potentially add this in 1.2.x if it was done in an optional way, though I suspect this is 1.3.x material now.
comment:30 by , 13 years ago
Milestone: | 1.3.0 → 1.3.x |
---|
comment:31 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|
Phrase performance is much improved in master (especially with glass), so the overhead of handling these cases as phrases is less of an issue than it was. In the interests of releasing 1.4.0, I think this needs to be bumped.
comment:32 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Priority: | high → low |
Version: | SVN trunk → git master |
There have been numerous further improvements in phrase performance and I haven't seen reports of slow phrase performance for a while.
I think we should still look into special handling here - it seems likely to improve results by conflating d-i-y and diy, e-mail and email, cd-r and cdr, etc, but it's not the high priority it once was because we've addressed the slow phrase search cases.
Underscored identifiers (e.g. O_STREAMING) and filenames (e.g. /usr/local/bin) are currently phrase searches. We could also make these single terms, but it's also useful to be able to search for individual components of these... Same for email addresses.
We could consider indexing them both as a single term and split into components (and perhaps also as the set of prefixes, so that "/usr/local" matches "/usr/local/bin" - not sure if that's better or not). The query parser would generate a single term.
A version number (e.g. 2.4.25) could also be handled specially by indexing it as the terms 2.4 and 2.4.25 (a bit like the prefixes for filenames suggested above)...
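A hedged sketch of the prefix idea for paths and version numbers (not an existing QueryParser feature; the helper name is made up): at index time emit one term per prefix, so a query parsed to the single term "/usr/local" matches a document containing "/usr/local/bin":

```cpp
// Sketch: generate one term per prefix of a path or version number.
#include <iostream>
#include <string>
#include <vector>

static std::vector<std::string> prefix_terms(const std::string& text, char sep) {
    std::vector<std::string> terms;
    std::string::size_type pos = 0;
    while ((pos = text.find(sep, pos + 1)) != std::string::npos)
        terms.push_back(text.substr(0, pos));
    terms.push_back(text);  // the full string is always indexed too
    return terms;
}

int main() {
    // "/usr/local/bin" -> "/usr", "/usr/local", "/usr/local/bin"
    for (const auto& t : prefix_terms("/usr/local/bin", '/')) std::cout << t << '\n';
    // "2.4.25" -> "2", "2.4", "2.4.25"
    for (const auto& t : prefix_terms("2.4.25", '.')) std::cout << t << '\n';
}
```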