Opened 17 years ago
Last modified 12 months ago
#225 assigned defect
Spelling algorithm should consider frequency and not just edit-distance
Reported by: | Philip Neustrom | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Library API | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
As described here: http://thread.gmane.org/gmane.comp.search.xapian.general/5740/focus=5743
If the spelling correction algorithm considered frequency and edit-distance
(using some reasonable heuristic) we would see dramatically better results.
The current spelling algorithm will only correct words that never appear in the
spelling index. (Since 1.2.3, it will offer a correction for a word when the correction has a higher frequency than the word)
Attachments (3)
Change History (19)
by , 17 years ago
Attachment: | spelling_frequency.diff added |
---|
comment:1 by , 17 years ago
Component: | Backend-Flint → Library API |
---|---|
op_sys: | Linux → All |
Operating System: | → All |
rep_platform: | PC → All |
Status: | new → assigned |
comment:3 by , 17 years ago
Description: | modified (diff) |
---|---|
Owner: | changed from | to
Status: | assigned → new |
comment:4 by , 16 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1.1 |
Status: | new → assigned |
It would be good to improve this during the 1.1.x series.
[Changed link to list discussion to point to gmane for easier browsing of the thread.]
comment:6 by , 15 years ago
Priority: | normal → low |
---|
My main reservation here is that the algorithm seems rather arbitrary - it's more satisfactory to have a mathematical model or other justification. I worry if one term indexes most documents it might get suggested too often. It also seems this might require significantly more work, but it might not make a measurable difference even if it does, and better suggestions are worth some extra work.
Perhaps we should consider suggestions which require only one (or perhaps two) extra edits - still arbitrary, but at least the behaviour is more constrained. And this would allow us to still cull a lot of entries which can't have a low enough edit distance.
Setting priority low for now.
comment:8 by , 14 years ago
Priority: | low → high |
---|
I'd like to sort this out fairly soon so priority -> high.
comment:9 by , 14 years ago
I've sat down and thought through a simple mathematical model for spelling mistakes - I'll attach an RST write up of this.
by , 14 years ago
Attachment: | spelling-formulae.rst added |
---|
Simple mathematical model of spelling errors
comment:10 by , 14 years ago
It seem trac messes up the UTF-8 characters when previewing that rst file - view the rst file itself for a better version.
Thinking about it, p is going to vary by user, but we're probably talking something like 0.01 to 0.001. It might well be the actual best value isn't the same as the true probability (since we make various simplifying assumptions) so perhaps it is best to tune p for best results rather than try too hard to determine a "true" value.
To efficiently implement this model, it would be useful to track an upper bound on the spelling frequency, which is easy to do, but we don't currently, and seems like it will need an incompatible database format change.
But it's easy to address the specific point about not returning any correction if the word is in the spelling dictionary (as it may be if misspelled in the indexed documents) - I've addressed that in trunk r14859.
comment:11 by , 14 years ago
Priority: | high → normal |
---|
comment:12 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
May well be addressed by nsmetanin's gsoc branch.
Marking for 1.3.x now.
comment:13 by , 9 years ago
Milestone: | 1.3.x → 1.3.4 |
---|
Changing the algorithm can be done in 1.4.x, but tracking the upper bound on spelling frequency is something we should try to do before that, as that's probably going to need a database format tweak, and I'd like to get the glass format stable sooner rather than later, so marking that for 1.3.4.
comment:14 by , 9 years ago
Milestone: | 1.3.4 → 1.4.x |
---|
[8d64696118cf8a1e777d5914dc4a133db5e44de2] implements tracking of an upper bound on the spelling word frequency. The rest can be done in 1.4.x.
comment:15 by , 7 years ago
Description: | modified (diff) |
---|
comment:16 by , 5 years ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Version: | SVN trunk → git master |
comment:17 by , 12 months ago
Description: | modified (diff) |
---|---|
Milestone: | 1.5.0 → 2.0.0 |
It'd be really good to address the remainder of this, but it now doesn't require a database format change so postponing.
An example implementaiton