Opened 16 years ago

Last modified 5 months ago

#225 assigned defect

Spelling algorithm should consider frequency and not just edit-distance

Reported by: Philip Neustrom Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Library API Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

As described here: http://thread.gmane.org/gmane.comp.search.xapian.general/5740/focus=5743

If the spelling correction algorithm considered frequency and edit-distance (using some reasonable heuristic) we would see dramatically better results. The current spelling algorithm will only correct words that never appear in the spelling index. (Since 1.2.3, it will offer a correction for a word when the correction has a higher frequency than the word)

Attachments (3)

spelling_frequency.diff (2.6 KB ) - added by Philip Neustrom 16 years ago.
An example implementaiton
spelling_frequency_updated.patch (2.6 KB ) - added by Olly Betts 14 years ago.
Patch updated for trunk
spelling-formulae.rst (1.0 KB ) - added by Olly Betts 14 years ago.
Simple mathematical model of spelling errors

Download all attachments as: .zip

Change History (19)

by Philip Neustrom, 16 years ago

Attachment: spelling_frequency.diff added

An example implementaiton

comment:1 by Olly Betts, 16 years ago

Component: Backend-FlintLibrary API
op_sys: LinuxAll
Operating System: All
rep_platform: PCAll
Status: newassigned

comment:3 by Olly Betts, 16 years ago

Description: modified (diff)
Owner: changed from New Bugs to Olly Betts
Status: assignednew

comment:4 by Olly Betts, 15 years ago

Description: modified (diff)
Milestone: 1.1.1
Status: newassigned

It would be good to improve this during the 1.1.x series.

[Changed link to list discussion to point to gmane for easier browsing of the thread.]

comment:5 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.4

Triaging milestone:1.1.1 bugs.

comment:6 by Olly Betts, 15 years ago

Priority: normallow

My main reservation here is that the algorithm seems rather arbitrary - it's more satisfactory to have a mathematical model or other justification. I worry if one term indexes most documents it might get suggested too often. It also seems this might require significantly more work, but it might not make a measurable difference even if it does, and better suggestions are worth some extra work.

Perhaps we should consider suggestions which require only one (or perhaps two) extra edits - still arbitrary, but at least the behaviour is more constrained. And this would allow us to still cull a lot of entries which can't have a low enough edit distance.

Setting priority low for now.

comment:7 by Olly Betts, 15 years ago

Milestone: 1.1.41.2.0

Bumping to stay on schedule.

comment:8 by Olly Betts, 14 years ago

Priority: lowhigh

I'd like to sort this out fairly soon so priority -> high.

by Olly Betts, 14 years ago

Patch updated for trunk

comment:9 by Olly Betts, 14 years ago

I've sat down and thought through a simple mathematical model for spelling mistakes - I'll attach an RST write up of this.

by Olly Betts, 14 years ago

Attachment: spelling-formulae.rst added

Simple mathematical model of spelling errors

comment:10 by Olly Betts, 14 years ago

It seem trac messes up the UTF-8 characters when previewing that rst file - view the rst file itself for a better version.

Thinking about it, p is going to vary by user, but we're probably talking something like 0.01 to 0.001. It might well be the actual best value isn't the same as the true probability (since we make various simplifying assumptions) so perhaps it is best to tune p for best results rather than try too hard to determine a "true" value.

To efficiently implement this model, it would be useful to track an upper bound on the spelling frequency, which is easy to do, but we don't currently, and seems like it will need an incompatible database format change.

But it's easy to address the specific point about not returning any correction if the word is in the spelling dictionary (as it may be if misspelled in the indexed documents) - I've addressed that in trunk r14859.

Last edited 7 years ago by Olly Betts (previous) (diff)

comment:11 by Olly Betts, 13 years ago

Priority: highnormal

comment:12 by Olly Betts, 11 years ago

Milestone: 1.2.x1.3.x

May well be addressed by nsmetanin's gsoc branch.

Marking for 1.3.x now.

comment:13 by Olly Betts, 8 years ago

Milestone: 1.3.x1.3.4

Changing the algorithm can be done in 1.4.x, but tracking the upper bound on spelling frequency is something we should try to do before that, as that's probably going to need a database format tweak, and I'd like to get the glass format stable sooner rather than later, so marking that for 1.3.4.

comment:14 by Olly Betts, 8 years ago

Milestone: 1.3.41.4.x

[8d64696118cf8a1e777d5914dc4a133db5e44de2] implements tracking of an upper bound on the spelling word frequency. The rest can be done in 1.4.x.

comment:15 by Olly Betts, 7 years ago

Description: modified (diff)

comment:16 by Olly Betts, 4 years ago

Milestone: 1.4.x1.5.0
Version: SVN trunkgit master

comment:17 by Olly Betts, 5 months ago

Description: modified (diff)
Milestone: 1.5.02.0.0

It'd be really good to address the remainder of this, but it now doesn't require a database format change so postponing.

Note: See TracTickets for help on using tickets.