0

I've city names, like Newyork. Somebody may misspell it and write it as Newyoork. I want an algorithm or library which can match such patterns with some confidence level.

Any idea how to do it or if any pattern recognition library is present?

It is same as when we type something in Google with spelling mistake in it but instead it automatically corrects the mistake.

1
  • 4
    "Newyork" is also misspelled. The correct spelling is "New York City". Just sayin'. Commented Feb 17, 2012 at 11:04

3 Answers 3

5

Comparing two strings A and B :-

Levenshtein comparison counts the number of steps (adding, replacing or removing a single character at a time) to change A into B.

Ratcliff comparison gives a match score (100 for two equal strings) based on the longest common sequence between A and B, with recurrence to compare the remaining sub-strings after each iteration.

Here are my C# implementations as string extender functions

public static class StringExtender
{
    public static int CompareLevenshtein(this string A, string B)
    { return A.CompareLevenshtein(B, false); }

    public static int CompareLevenshtein(this string A, string B, bool CaseSensitive)
    {
        int LA = A.Length;
        int LB = B.Length;
        if (LA == 0)
            return LB;
        else if (LB == 0)
            return LA;
        else
        {
            string A1 = CaseSensitive ? A : A.ToUpperInvariant();
            string B1 = CaseSensitive ? B : B.ToUpperInvariant();
            int[,] D = new int[LA + 1, LB + 1];
            for (int i = 0; i <= LA; i++)
                D[i, 0] = i;
            for (int i = 0; i <= LB; i++)
                D[0, i] = i;
            for (int i = 1; i <= LA; i++)
            {
                char AI = A1[i - 1];
                for (int j = 1; j <= LB; j++)
                {
                    char BJ = B1[j - 1];
                    int cost = (AI == BJ ? 0 : 1);
                    D[i, j] = MimumumOf3(D[i - 1, j] + 1, 
                                         D[i, j - 1] + 1, 
                                         D[i - 1, j - 1] + cost);
                }
            }
            return D[LA, LB];
        }
    }

    private static int MimumumOf3(int i, int j, int k)
    { return Math.Min(Math.Min(i, j), k); }

    public static float CompareStringsRatcliff(this string A, string B)
    { return CompareStringsRatcliff(A, B, false); }

    public static float CompareStringsRatcliff(this string A, string B, bool CaseSensitive)
    {
        int LA = A.Length;
        int LB = B.Length;

        if ((LA == 0) || (LB == 0))
            return 0.0f;
        else
        {
            string A1 = (CaseSensitive ? A : A.ToUpperInvariant());
            string B1 = (CaseSensitive ? B : B.ToUpperInvariant());
            if (A1.Equals(B1))
                return 100.0f;
            else
                return CompareStringsRatcliffInternal(A1, 0, LA - 1, B1, 0, LB - 1) * 200.0f / (LA + LB);
        }
    }

    private static int CompareStringsRatcliffInternal(string A, int StartIndexA, int EndIndexA, string B, int StartIndexB, int EndIndexB)
    {
        if ((StartIndexA > EndIndexA) || (StartIndexB > EndIndexB))
            return 0;
        else
        {
            int NewStartIndexA = 0;
            int NewStartIndexB = 0;
            int BestMatches = 0;
            for (int Ai = StartIndexA; Ai <= EndIndexA; Ai++)
                for (int Bi = StartIndexB; Bi <= EndIndexB; Bi++)
                {
                    int i = 0;
                    int Matches = 0;
                    while ((Ai + i <= EndIndexA) && (Bi + i <= EndIndexB) && (A[Ai + i] == B[Bi + i]))
                    {
                        Matches++;

                        if (Matches > BestMatches)
                        {
                            NewStartIndexA = Ai;
                            NewStartIndexB = Bi;
                            BestMatches = Matches;
                        }
                        i++;
                    }
                }

            if (BestMatches > 0)
            {
                BestMatches += CompareStringsRatcliffInternal(A, NewStartIndexA + BestMatches, EndIndexA, B, NewStartIndexB + BestMatches, EndIndexB);
                BestMatches += CompareStringsRatcliffInternal(A, StartIndexA, NewStartIndexA - 1, B, StartIndexB, NewStartIndexB - 1);
            }

            return BestMatches;
        }
    }
}
1
2

You already quoted in your own question one service to use: Google itself. Looking at Google Maps API may be helpful too, I believe, but I don't remember if there was an auto-correction feature in their geolocation API.

Another possibility is to use a spell checker (on Windows where you have Microsoft Office, you can use the API provided by Microsoft Office; on other platforms, analog products are available). Get all the suggestions for a word from a spell checker, and compare them to the list of cities, to be sure that the suggestion is a geographic location, not a word from a dictionary.

3
  • +1, but obviously a spell checker won't be much help as they usually don't contain many city names. A list of known cities is needed instead. Commented Feb 17, 2012 at 10:33
  • Depending on some licensed api may be something I'd like to avoid. Commented Feb 17, 2012 at 10:52
  • +1 for the suggestion to use an existing spell checker. Something like aspell is easy to incorporate into software and you could use custom dictionaries with just city names. Commented Feb 17, 2012 at 14:12
1

You want Soundex, which is intended to do exactly this: recognize proper names that people have misspelt in various ways.

a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as PostgreSQL, MySQL, MS SQL Server and Oracle)...

2
  • 1
    Depends on what kind of misspelling they are. If they are keyed by people who don't know how those names should be spelled, than Soundex is appropriate. But if they were just keyed in by sloppy typists, Levenshtein distance is more appropriate. I suspect Google uses the later. Commented Feb 17, 2012 at 10:35
  • I suspect Google uses something more sophisticated than either - after all, they have quite some natural-language know-how (see also Google Translate). They're probably using some sort of adaptive Bayesian algorithm (with feedback from search result usage), with lots of manual tweaks and exceptions, that is, statistics with some hard knowledge sprinkled on top. Commented Feb 17, 2012 at 11:06

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.