Fuzzy diff n by n files

Question

I want to measure the pairwise difference between each pair of n files, along the lines of:

parallel --tag 'diff {1} {2} | wc -l' ::: * ::: *

This breaks down on binary files, and because it counts lines, a single extremely long changed line counts the same as a short one.

How do I generate a fuzzy diff over n files?

Ole Tange · Accepted Answer · 2018-01-22 08:01:09Z

Use ssdeep to generate a file of fuzzy hashes (the backtick substitution assumes filenames without whitespace):

ssdeep `find .  -type f` > hash

This lists the pairs with 90 <= similarity score < 100 (the grep matches the score in parentheses at the end of each line):

ssdeep -m hash `find .  -type f` | grep -E '9[0-9].$'

This only works when the files share long identical stretches (blocks of around 1% of the file size).
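If a threshold other than 90 is wanted, the grep pattern gets awkward. A minimal sketch of an alternative is to parse the score numerically with awk. The function name filter_score is hypothetical, and it assumes every match line ends in a parenthesised integer score such as "(95)", which is how the ssdeep -m output appears to be formatted:

```shell
# filter_score MIN MAX
# Reads match lines on stdin and prints those whose trailing "(NN)" score
# lies in [MIN, MAX]. Defaults to 90..99, mirroring the grep above.
# Assumption: each line ends in a parenthesised integer score.
filter_score() {
    awk -v min="${1:-90}" -v max="${2:-99}" '
        # match() sets RSTART/RLENGTH to the position of the trailing "(NN)"
        match($0, /\([0-9]+\)$/) {
            # strip the parentheses and force a numeric value
            score = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (score >= min && score <= max) print
        }'
}
```

It would be used as a drop-in for the grep, for example: ssdeep -m hash `find .  -type f` | filter_score 75 99. Lines without a trailing score are silently dropped.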

answered Jan 21, 2018 at 1:35

