1

This is generic algorithms stuff too so please dont stop reading if you see solr in text (please skip first 3 lines)

In Solr, For spell checking component I set extendedResults to get the frequencies of the corrected word and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

Now how can I improve this Algorithm to select marine instead of market (based on something more than edit distance and frequency stuff)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector Algorithm (which is great) but again I ran in same problems.

1
  • adding soundex sounds good to me. for a given word if you find a good "soundex", then propose that to the user. if not, proceed with the two other variables: freq and distance. Commented Mar 1, 2012 at 11:58

3 Answers 3

3

In this particular case, you could improve the results by using a metric which recognises transpositions - 'marien' differs from 'marine' by two substitutions, but only one transposition, so if you do that, it seems closer than 'market'.

The classic Levenshtein edit distance measure only deals with insertions, deletions, and substitutions. However, the Damerau–Levenshtein distance deals with transposition as well.

Sign up to request clarification or add additional context in comments.

Comments

0

All this is ranking problem. What you need to do is create a method which will take several signals & using some formula give importance to each word. The ranker will come into the picture after the user is typing & after you have fetched words. At this stage you need to order your results thats where the ranker comes into the picture.

Now to address your specific problem. Lets say your ranking function just takes 2 signals (frequency & soundex). If you want marine instead of market all you need to do is give more weightage to soundex signals & less weightage to frequency (lets say 70/30). These weightages can be emperically adjusted based on trial & error or they can be machine learned. That way frequency of word occurrence which is accurate in other cases is not totally ignored it still has some say.

1 Comment

Thanks for your answer! I know it is ranking problem. I was specifically asking will soundex improve my spell check Algorithm or can I use some other technique to improve it. I think here you are talking about "Learning to Rank" stuff which is not really required in this case. +1
0

I have used soundex/metaphone Algorithm on top of Edit Distance+ Transposition & it is working great.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.