
I need to implement a string-matching algorithm to determine which strings most closely match. I see that the Hamming distance is a good matching algorithm when the strings are of equal, fixed length.

Is there any advantage in the quality of matching if I were to use the Levenshtein distance formula instead? I know this method is less efficient, given that it accounts for variable-length strings, but what I'm really concerned with here is the quality of the matches. Also, are there any better algorithms out there I may want to consider? I work in Java, if that makes any difference.

http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Hamming_distance

Much thanks

  • Can you describe how you would grade the quality of a match? That's a subjective measure, so you will get better answers if you can describe your goal. Commented Dec 7, 2009 at 16:28
  • For 2 strings, say AHDJD and KDLOS, I want to judge how 'close' they are to one another. So AAAAA and AAAAA would be a 100% match. BAAAA and AAAAA would be like 97-ish percent, KAAAA and AAAAA would be like 93% close... BJKDZ and AAAAA would be hardly alike... Does that help? Commented Dec 7, 2009 at 16:35

2 Answers


Consider the strings: "abcdefg" and "bcdefgh".

The Levenshtein distance is 2. The Hamming distance (operating on characters rather than bits) is 7.

So it really depends on whether you want to treat those strings as similar or not. Hamming distance has its appropriate uses, but "will these strings look similar to a human being?" is not one of them.
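For concreteness, here is a minimal Java sketch (not part of the original answer) that computes both distances for the example strings; the Levenshtein method is the standard two-row dynamic-programming formulation, and the class and method names are only for illustration.

```java
public final class DistanceDemo {

    /** Counts positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Standard two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                            // build b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                            // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,             // insertion
                                 prev[j] + 1),                // deletion
                        prev[j - 1] + substitutionCost);      // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(hamming("abcdefg", "bcdefgh"));     // 7
        System.out.println(levenshtein("abcdefg", "bcdefgh")); // 2
    }
}
```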


3 Comments

I see. Sounds like Hamming distance is out then, as your illustration makes Levenshtein appear more appropriate. Are there any other algorithms you know of that I may want to consider?
You say in another comment that you want d("aaaaa","kaaaa") > d("aaaaa","baaaa"). Levenshtein doesn't do that; it's 1 in both cases. I'm afraid I don't know of any other algorithms that are relevant. But maybe by breaking each character up into (say) 2-bit chunks, and calculating Levenshtein distance over those chunks instead of over chars, you might get something a little closer to what you want. Still not right, though, since the change O -> P flips more bits than the change P -> Q.
Ah, how about this: you could calculate the root mean square distance between corresponding characters. ('a' - 'k')^2 is 100, ('a'-'b')^2 is only 1. This would mean that "abc" is much closer to "cba" than "amz" is to "zam", though, so you still might not get results you like.
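If it helps, here is a rough Java sketch of the root-mean-square idea from the comment above. It assumes equal-length strings, and the class and method names are my own, not taken from any library.

```java
public final class CharRmsDistance {

    /** Root mean square of the per-position character-code differences. */
    static double rmsDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Strings must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length(); i++) {
            int diff = a.charAt(i) - b.charAt(i);
            sumOfSquares += (double) diff * diff;
        }
        return Math.sqrt(sumOfSquares / a.length());
    }

    public static void main(String[] args) {
        // 'a' and 'k' are 10 code points apart, 'a' and 'b' only 1 apart,
        // so "kaaaa" scores as farther from "aaaaa" than "baaaa" does.
        System.out.println(rmsDistance("aaaaa", "baaaa")); // ~0.45
        System.out.println(rmsDistance("aaaaa", "kaaaa")); // ~4.47
        System.out.println(rmsDistance("abc", "cba"));     // ~1.63
    }
}
```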

You may find the Bitap algorithm interesting.

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.
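For reference, here is a rough Java sketch of the substitution-only ("k mismatches") variant of bitap; the full Levenshtein variant adds insertion and deletion terms to the per-character update. The class and method names are illustrative rather than from any particular library.

```java
import java.util.Arrays;

public final class Bitap {

    /**
     * Returns the start index of the first substring of text that matches
     * pattern with at most k character mismatches, or -1 if there is none.
     */
    static int fuzzySearch(String text, String pattern, int k) {
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > 62) throw new IllegalArgumentException("pattern too long for a 64-bit mask");

        // r[d] tracks partial matches using at most d mismatches.
        // Convention: a 0 bit at position j means "the first j pattern
        // characters currently match a suffix of the text read so far".
        long[] r = new long[k + 1];
        Arrays.fill(r, ~1L);

        // One bitmask per character: bit i is 0 iff pattern.charAt(i) == c.
        long[] patternMask = new long[Character.MAX_VALUE + 1];
        Arrays.fill(patternMask, ~0L);
        for (int i = 0; i < m; i++) {
            patternMask[pattern.charAt(i)] &= ~(1L << i);
        }

        for (int i = 0; i < text.length(); i++) {
            long charMask = patternMask[text.charAt(i)];
            long oldRdMinus1 = r[0];

            r[0] |= charMask;  // exact matches only at error level 0
            r[0] <<= 1;

            for (int d = 1; d <= k; d++) {
                long tmp = r[d];
                // Either extend an existing match on an exactly matching
                // character, or spend one more mismatch on the current one.
                r[d] = (oldRdMinus1 & (r[d] | charMask)) << 1;
                oldRdMinus1 = tmp;
            }

            if ((r[k] & (1L << m)) == 0) {
                return i - m + 1;  // whole pattern matched, ending at index i
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(fuzzySearch("the quick brown fox", "quack", 1)); // 4
        System.out.println(fuzzySearch("the quick brown fox", "quack", 0)); // -1
    }
}
```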

