Bash command to list all files based on content

Question

I am trying to take all the files in a directory and list and sort them by the uniqueness of their content.

Example:

Say we have these 7 files in a directory:

uniquefile1.txt, uniquefile2.txt, samefile1.txt, samefile2.txt, equalfile1.txt, equalfile2.txt, equalfile3.txt

where uniquefile1.txt and uniquefile2.txt have different content, all the samefile*.txt files have the same content as each other, and all the equalfile*.txt files have the same content as each other.

Expected output:

uniquefile1.txt
uniquefile2.txt
samefile1.txt, samefile2.txt
equalfile1.txt, equalfile2.txt, equalfile3.txt

I have been experimenting with hashing and md5sum, but have not been able to get anything that accomplishes exactly that.

I want to accomplish this using utilities like grep, xargs, sed, awk, find, and locate mixed with some other coreutils if necessary.

And how should unique files with different names be sorted? I.e., what if there are reg-file.txt and somefile.txt?
– RomanPerekhrest, Commented Oct 4, 2017 at 18:36
md5sum * | sort is not quite what you are asking for, but it is simple and it will bring groups of identical files together, which is often all one needs. It does need post-processing to do exactly what you want.
– NickD, Commented Oct 4, 2017 at 19:52

Kusalananda · Accepted Answer · 2017-10-04 19:30:54Z

This is a modified part of an answer I wrote yesterday:

$ cksum file* | awk '{ ck[$1$2] = ck[$1$2] ? ck[$1$2] ", " $3 : $3 } END { for (i in ck) print ck[i] }'
file3, file5
file1, file2, file4

In your case you would use *.txt or even * (if all you have in the directory are the files you'd like to compare) rather than file*.

The result tells us that file3 and file5 have the same contents, as do file1, file2, and file4 (in this example).

The standard cksum utility will output three columns for each file. The first is a checksum, the second is a file size, and the third is a filename.
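
As a quick illustration of those three columns, here is a minimal sketch. The function name show_cksum_fields is made up for this example and is not a standard tool; it assumes a filename without newlines or leading/trailing blanks.

```shell
# Hypothetical helper: print the three cksum columns of one file, labelled.
show_cksum_fields() {
    # cksum prints: <CRC checksum> <size in bytes> <pathname>
    # read stores everything after the second field in "name", so a name
    # with embedded spaces survives.
    cksum -- "$1" | {
        read -r sum size name
        printf 'checksum=%s size=%s name=%s\n' "$sum" "$size" "$name"
    }
}
```

Running show_cksum_fields somefile.txt prints one labelled line for that file, which makes it easy to see which column the awk code refers to as $1, $2 and $3.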

The awk code will use the checksum and size as a key in the array ck and store the filenames that have the same key in a comma-separated string for that key. At the end, the filenames (comma-separated string) are printed out.

The funny-looking expression

ck[$1$2] = ck[$1$2] ? ck[$1$2] ", " $3 : $3

just means "if ck[$1$2] is set to anything, then assign ck[$1$2] ", " $3 to ck[$1$2] (appending a filename with a comma in-between), otherwise just assign $3 (it's the first filename with this key)".
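
If your filenames may contain spaces, a slightly more defensive variant of the same idea might look like the sketch below. It is not the command above, just an illustration under a few assumptions: the function name group_by_cksum is hypothetical, filenames contain no newlines, and cksum separates its columns with single spaces. It keeps the checksum and size apart in the key, takes the whole rest of the line as the filename, and tests for an existing key with "in" rather than relying on the stored string being non-empty.

```shell
# Hypothetical helper: group the given files by (checksum, size).
group_by_cksum() {
    cksum -- "$@" | awk '
    {
        key = $1 " " $2                   # checksum and size, kept apart by a space
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)  # the rest of the line is the filename
        if (key in ck)
            ck[key] = ck[key] ", " name   # another file with this key
        else
            ck[key] = name                # first file with this key
    }
    END { for (key in ck) print ck[key] }'
}
```

You would call it as group_by_cksum *.txt. The order of the output lines is whatever order awk happens to walk the array in, so do not rely on it.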

To sort the output on the number of items in each list, pass the output through

awk -F, '{ print NF, $0 }' | sort -n | cut -d ' ' -f 2-

... as a post-processing stage. This will obviously break if any filename contains a comma.

Or use

cksum file* | awk '{ n[$1$2]++; ck[$1$2] = ck[$1$2] ? ck[$1$2] ", " $3 : $3 } END { for (i in ck) print n[i], ck[i] }' | sort -n | cut -d ' ' -f 2-

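which does not have any issues with commas in filenames, although a filename containing whitespace is still cut at its first space or tab because only $3 is used.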

Leave the cut out if you'd like to see the number of filenames on each line of output.
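
Wrapped up as a function, the count-then-sort approach could be sketched like this. The name groups_by_size is made up for the example, and the sketch assumes filenames without newlines. It separates the count from the list with a tab, so names containing spaces or commas pass through unchanged.

```shell
# Hypothetical helper: print groups of identical files, smallest group first.
groups_by_size() {
    cksum -- "$@" | awk '
    {
        key = $1 " " $2                   # checksum and size
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)  # the rest of the line is the filename
        if (n[key]++)
            ck[key] = ck[key] ", " name   # key seen before: append
        else
            ck[key] = name                # first file with this key
    }
    END { for (key in ck) printf "%d\t%s\n", n[key], ck[key] }' |
    sort -n |    # order by the leading count; ties fall back to the whole line
    cut -f 2-    # drop the count column (cut splits on tabs by default)
}
```

With the seven files from the question, groups_by_size *.txt should list the two unique files first, then the samefile pair, then the three equalfile names. Drop the cut stage if you want to keep the counts.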

For a huge number of files, you may want to use

find . -type f -exec cksum {} +

rather than just

cksum *
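
Put together with the grouping step, the find variant might look like the following sketch. The function name group_tree_by_cksum is hypothetical, and filenames are assumed not to contain newlines. Note that find also descends into subdirectories, and the names it prints carry their path prefix (for example ./sub/file.txt).

```shell
# Hypothetical helper: checksum every regular file under a directory
# (default: the current one) and group files with identical content.
group_tree_by_cksum() {
    # "{} +" lets find pass many pathnames to each cksum invocation,
    # so a very long file list never hits the argument-length limit.
    find "${1:-.}" -type f -exec cksum {} + | awk '
    {
        key = $1 " " $2                   # checksum and size
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)  # the rest of the line is the pathname
        if (key in ck)
            ck[key] = ck[key] ", " name
        else
            ck[key] = name
    }
    END { for (key in ck) print ck[key] }'
}
```

The order of names inside a group follows the order in which find visits the files, which is not alphabetical in general.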

Instead of just listing them in the order they are checked, how would I go about sorting them from unique files to the largest group of identical files?
– Hopsain, Commented Oct 4, 2017 at 19:02

Stéphane Chazelas · Accepted Answer · 2017-10-06 07:21:10Z

I'd use perl:

perl -MDigest::SHA -le '
  for $f (@ARGV) {
    $d = Digest::SHA->new(256);
    $d->addfile($f);
    push @{$h{$d->digest}}, $f;
  }
  print join ", ", @{$h{$_}} for keys %h' -- *.txt

We're building an associative array whose keys are the SHA-256 hashes of the files and whose values are the lists of files with each hash.

It makes it easy to sort the output by number of occurrences for instance with:

perl -MDigest::SHA -le '
  for $f (@ARGV) {
    $d = Digest::SHA->new(256);
    $d->addfile($f);
    push @{$h{$d->digest}}, $f;
  }
  print join ", ", @{$h{$_}} for sort {@{$h{$a}} <=> @{$h{$b}}} keys %h' -- *.txt

Or even sort the list of files in each set by file name:

perl -MDigest::SHA -le '
  for $f (@ARGV) {
    $d = Digest::SHA->new(256);
    $d->addfile($f);
    push @{$h{$d->digest}}, $f;
  }
  print join ", ", sort {$a cmp $b} @{$h{$_}} for 
    sort {@{$h{$a}} <=> @{$h{$b}}} keys %h' -- *.txt
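
For comparison, roughly the same grouping can be sketched with shell tools only. This is not the perl code above, just an illustration under stated assumptions: it needs sha256sum from GNU coreutils (not a POSIX utility), the function name group_by_sha256 is made up, and filenames must not contain newlines or backslashes, since sha256sum escapes those and shifts the columns.

```shell
# Hypothetical helper: group files by SHA-256 digest, smallest group first.
# Assumes GNU coreutils sha256sum is installed.
group_by_sha256() {
    sha256sum -- "$@" | awk '
    {
        digest = substr($0, 1, 64)   # 64 hexadecimal digits
        name = substr($0, 67)        # skip the two-character separator
        if (n[digest]++)
            files[digest] = files[digest] ", " name
        else
            files[digest] = name
    }
    END { for (d in files) printf "%d\t%s\n", n[d], files[d] }' |
    sort -n |    # order by group size; ties fall back to the whole line
    cut -f 2-    # drop the count column
}
```

Called as group_by_sha256 *.txt, it lists the files in each group in the order the shell passed them, which for a glob is already sorted by name.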
