Ticket #22 (assigned enhancement)
Handle single-character components of hyphenated words specially
| Reported by: | olly | Owned by: | olly |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.1.0 |
| Component: | QueryParser | Version: | SVN trunk |
| Severity: | minor | Keywords: | |
| Cc: | robert.pollak, richard | Blocked By: | |
| Operating System: | All | Blocking: | |
Description (last modified by olly)
Some common punctuation (notably -) is treated as a word break when indexing, and as a phrase generator when searching. The problem is that many common cases end up creating phrase searches containing very common one- or two-character terms, and these searches are slow with a big database.
Examples include: e-mail, cd-r, d-i-y
This could perhaps be addressed by a smarter word-identifying algorithm. When indexing and searching, we could decide never to generate a single-character term in certain circumstances (and maybe apply the same rules to two-character terms).
So "e-mail" would be indexed as "email", not as "e" and "mail", and similarly for searching. In general the extra conflation this gives seems useful (although "email" is apparently Dutch for enamel...)
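A minimal sketch of one possible rule, in Python rather than Xapian's own code. The function name `terms_for` and the parameter `max_short_len` are hypothetical, and the ticket leaves the exact "certain circumstances" open. The sketch assumes the rule is: if any component of a hyphenated word is at most `max_short_len` characters long, join all the components into one term; otherwise split on the hyphen as before.

```python
import re

# Alphanumeric runs joined by single hyphens (ASCII only, for brevity).
_WORD_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def terms_for(text, max_short_len=1):
    """Split text into terms, joining hyphenated words with short parts.

    Hypothetical rule: a hyphenated word is conflated into a single term
    when any of its components has at most max_short_len characters.
    """
    terms = []
    for match in _WORD_RE.finditer(text.lower()):
        parts = match.group(0).split("-")
        if len(parts) > 1 and any(len(p) <= max_short_len for p in parts):
            # e.g. "e-mail" -> "email" instead of "e", "mail"
            terms.append("".join(parts))
        else:
            # no short component: the hyphen stays a plain word break
            terms.extend(parts)
    return terms


print(terms_for("send e-mail about cd-r and d-i-y"))
print(terms_for("a well-known cd-rom"))
print(terms_for("a well-known cd-rom", max_short_len=2))
```

With the default setting only single-character components trigger the join, so "cd-rom" is still split; passing `max_short_len=2` covers the two-character variant floated above.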
The query parser probably wouldn't apply this rule to quoted phrase searches - otherwise searching for "o freddled gruntbuggly" would search for "ofreddled gruntbuggly" and tragically not find any matches (I'm sure there are less esoteric examples - a search for "i robot" say...)
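One way to get that behaviour on the query side is sketched below in Python. Here `parse_query` and its output format are hypothetical and are not the Xapian QueryParser API. The sketch assumes that only components actually joined by a hyphen in the query text are ever conflated, so space-separated words inside a quoted phrase are never merged, while a hyphenated word is treated the same way inside and outside quotes.

```python
import re

# Alphanumeric runs joined by single hyphens (ASCII only, for brevity).
_WORD_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Either a double-quoted segment (group 1) or a run of unquoted text (group 2).
_SEGMENT_RE = re.compile(r'"([^"]*)"|([^"]+)')


def _conflate(word):
    """Join a hyphenated word if it has a single-character component."""
    parts = word.split("-")
    if len(parts) > 1 and any(len(p) == 1 for p in parts):
        return ["".join(parts)]
    return parts


def parse_query(query):
    """Return a list of ("term", str) and ("phrase", [str, ...]) items."""
    items = []
    for seg in _SEGMENT_RE.finditer(query.lower()):
        quoted, unquoted = seg.group(1), seg.group(2)
        if quoted is not None:
            # Quoted phrase: words separated by spaces stay separate terms.
            terms = []
            for word in _WORD_RE.finditer(quoted):
                terms.extend(_conflate(word.group(0)))
            if terms:
                items.append(("phrase", terms))
        else:
            for word in _WORD_RE.finditer(unquoted):
                terms = _conflate(word.group(0))
                if len(terms) == 1:
                    items.append(("term", terms[0]))
                else:
                    # Hyphen with no short component: implicit phrase.
                    items.append(("phrase", terms))
    return items


print(parse_query('"i robot" e-mail well-known'))
```

Under these assumptions "i robot" remains a two-term phrase, "e-mail" becomes the single term "email", and "well-known" is still an implicit phrase.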
