
I need to implement a string-matching algorithm to determine which strings most closely match. I see that the Hamming distance is a good matching algorithm when the strings are of equal, fixed length.

Is there any advantage in the quality of matching if I were to use the Levenshtein distance formula instead? I know this method is less efficient, given that it accounts for variable-length strings, but what I'm really concerned with here is the quality of the matches. Also, are there any better algorithms out there I may want to consider? I work in Java, if that makes any difference.

http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Hamming_distance

Much thanks

  • Can you describe how you would grade the quality of a match? That's a subjective measure, so you will get better answers if you can describe your goal. Commented Dec 7, 2009 at 16:28
  • For 2 strings, say AHDJD and KDLOS, I want to judge how 'close' they are to one another. So AAAAA and AAAAA would be a 100% match. BAAAA and AAAAA would be like 97-ish percent, KAAAA and AAAAA would be like 93% close... BJKDZ and AAAAA would be hardly alike... Does that help? Commented Dec 7, 2009 at 16:35

2 Answers


Consider the strings: "abcdefg" and "bcdefgh".

The Levenshtein distance is 2. The Hamming distance (operating on characters rather than bits) is 7.

So it really depends on whether you want to treat those strings as similar or not. Hamming distance has its appropriate uses, but "will these strings look similar to a human being?" is not one of them.
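For concreteness, here is a minimal Java sketch (not part of the original answer) that computes both distances for the example strings; the Levenshtein method is the standard two-row dynamic-programming formulation, and the class and method names are only for illustration.

```java
public final class DistanceDemo {

    /** Counts positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Standard two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                            // build b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                            // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,             // insertion
                                 prev[j] + 1),                // deletion
                        prev[j - 1] + substitutionCost);      // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(hamming("abcdefg", "bcdefgh"));     // 7
        System.out.println(levenshtein("abcdefg", "bcdefgh")); // 2
    }
}
```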


3 Comments

I see. Sounds like Hamming distance is out then, as your illustration makes Levenshtein appear more appropriate. Are there any other algorithms you know of that I may want to consider?
You say in another comment that you want d("aaaaa","kaaaa") > d("aaaaa","baaaa"). Levenshtein doesn't do that; it's 1 in both cases. I'm afraid I don't know of any other algorithms that are relevant. But maybe by breaking each character up into (say) 2-bit chunks, and calculating Levenshtein distance over those chunks instead of over chars, you might get something a little closer to what you want. Still not right, though, since the change O -> P flips more bits than the change P -> Q.
Ah, how about this: you could calculate the root mean square distance between corresponding characters. ('a' - 'k')^2 is 100, ('a'-'b')^2 is only 1. This would mean that "abc" is much closer to "cba" than "amz" is to "zam", though, so you still might not get results you like.
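If it helps, here is a rough Java sketch of the root-mean-square idea from the comment above. It assumes equal-length strings, and the class and method names are my own, not taken from any library.

```java
public final class CharRmsDistance {

    /** Root mean square of the per-position character-code differences. */
    static double rmsDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Strings must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length(); i++) {
            int diff = a.charAt(i) - b.charAt(i);
            sumOfSquares += (double) diff * diff;
        }
        return Math.sqrt(sumOfSquares / a.length());
    }

    public static void main(String[] args) {
        // 'a' and 'k' are 10 code points apart, 'a' and 'b' only 1 apart,
        // so "kaaaa" scores as farther from "aaaaa" than "baaaa" does.
        System.out.println(rmsDistance("aaaaa", "baaaa")); // ~0.45
        System.out.println(rmsDistance("aaaaa", "kaaaa")); // ~4.47
        System.out.println(rmsDistance("abc", "cba"));     // ~1.63
    }
}
```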

You may find the Bitap algorithm interesting.

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.
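For reference, here is a rough Java sketch of the substitution-only ("k mismatches") variant of bitap; the full Levenshtein variant adds insertion and deletion terms to the per-character update. The class and method names are illustrative rather than from any particular library.

```java
import java.util.Arrays;

public final class Bitap {

    /**
     * Returns the start index of the first substring of text that matches
     * pattern with at most k character mismatches, or -1 if there is none.
     */
    static int fuzzySearch(String text, String pattern, int k) {
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > 62) throw new IllegalArgumentException("pattern too long for a 64-bit mask");

        // r[d] tracks partial matches using at most d mismatches.
        // Convention: a 0 bit at position j means "the first j pattern
        // characters currently match a suffix of the text read so far".
        long[] r = new long[k + 1];
        Arrays.fill(r, ~1L);

        // One bitmask per character: bit i is 0 iff pattern.charAt(i) == c.
        long[] patternMask = new long[Character.MAX_VALUE + 1];
        Arrays.fill(patternMask, ~0L);
        for (int i = 0; i < m; i++) {
            patternMask[pattern.charAt(i)] &= ~(1L << i);
        }

        for (int i = 0; i < text.length(); i++) {
            long charMask = patternMask[text.charAt(i)];
            long oldRdMinus1 = r[0];

            r[0] |= charMask;  // exact matches only at error level 0
            r[0] <<= 1;

            for (int d = 1; d <= k; d++) {
                long tmp = r[d];
                // Either extend an existing match on an exactly matching
                // character, or spend one more mismatch on the current one.
                r[d] = (oldRdMinus1 & (r[d] | charMask)) << 1;
                oldRdMinus1 = tmp;
            }

            if ((r[k] & (1L << m)) == 0) {
                return i - m + 1;  // whole pattern matched, ending at index i
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(fuzzySearch("the quick brown fox", "quack", 1)); // 4
        System.out.println(fuzzySearch("the quick brown fox", "quack", 0)); // -1
    }
}
```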

