answered Aug 23, 2017 at 12:51

The Levenshtein distance is a useful metric for gauging how different two strings are. It counts the minimum number of single-character insertions, deletions and substitutions needed to get from one string to the other.

For instance, if you compare abcdef and bcdef position by position, every character differs, yet only one deletion is needed to get from one to the other.

So you could compute your percentage as 100 * distance / max_length:

perl -MList::Util=max -MText::LevenshteinXS -le '
  ($x, $y) = @ARGV;
  print 100 * distance($x, $y) / max(length $x, length $y)
  ' -- "$string1" "$string2"

Or in awk:

awk '
    function min(x, y) {
      return x < y ? x : y
    }
    function max(x, y) {
      return x > y ? x : y
    }
    # the extra parameters after s and t are local variables
    function lev(s, t,    m, n, i, j, c, d) {
      m = length(s)
      n = length(t)

      for(i=0;i<=m;i++) d[i,0] = i
      for(j=0;j<=n;j++) d[0,j] = j

      for(i=1;i<=m;i++) {
        for(j=1;j<=n;j++) {
          c = substr(s,i,1) != substr(t,j,1)
          d[i,j] = min(d[i-1,j]+1,min(d[i,j-1]+1,d[i-1,j-1]+c))
        }
      }

      return d[m,n]
    }

    BEGIN {
      print 100 * lev(ARGV[1], ARGV[2]) / max(length(ARGV[1]), length(ARGV[2]))
      exit
    }' "$string1" "$string2"
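
As a quick sanity check of the abcdef/bcdef example above, the awk approach can be wrapped in a shell function. This is only a sketch: the name pct_diff is made up for illustration, the two-decimal output format is my choice, and treating two empty strings as 0% different (to avoid dividing by zero) is an assumption rather than part of the metric.

```shell
# pct_diff STRING1 STRING2 (hypothetical helper name)
# Prints the Levenshtein distance as a percentage of the longer string's length.
pct_diff() {
  awk '
    function min(x, y) { return x < y ? x : y }
    function max(x, y) { return x > y ? x : y }
    # The extra parameters after s and t are awk-style local variables.
    function lev(s, t,    m, n, i, j, c, d) {
      m = length(s)
      n = length(t)
      for (i = 0; i <= m; i++) d[i,0] = i
      for (j = 0; j <= n; j++) d[0,j] = j
      for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
          # c is 1 when the characters differ, 0 when they match
          c = substr(s,i,1) != substr(t,j,1)
          d[i,j] = min(d[i-1,j]+1, min(d[i,j-1]+1, d[i-1,j-1]+c))
        }
      }
      return d[m,n]
    }
    BEGIN {
      len = max(length(ARGV[1]), length(ARGV[2]))
      # Assumption: two empty strings are reported as 0% different.
      pct = 0
      if (len > 0) pct = 100 * lev(ARGV[1], ARGV[2]) / len
      printf "%.2f\n", pct
      exit
    }' "$1" "$2"
}

pct_diff abcdef bcdef     # one deletion out of 6 characters: 16.67
pct_diff kitten sitting   # three edits out of 7 characters: 42.86
```

Note that dividing by the longer length keeps the result between 0 and 100, since the distance can never exceed the length of the longer string.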

Stéphane Chazelas
