The docstring of group_equivalent_files should explain what I'm trying to do. Specifically, I want to collect equivalent files into groups, as opposed to just finding pairs of equivalent files.
Example:
Filesystem with only hello1.txt and hello2.txt equivalent:
ls ./
hello1.txt foo.txt dir_one
ls ./dir_one
hello2.txt bar.txt
Expected Output:
[['./hello1.txt', './dir_one/hello2.txt']]
Pointing out any potential improvements or errors would be really appreciated. I'm most hesitant about my implementation of group_by_equality. The files could be compared in chunks, and I'm thinking that the File class might be unnecessary.
import collections
import hashlib
import itertools
import os
import os.path

def group_equivalent_files():
    """
    Overview:
    Find all groups of files under pwd whose byte contents are exactly
    equivalent. pwd is understood to be the directory from which this script
    was run. The filenames of equivalent files are printed to stdout.

    Algorithm:
    group_equivalent_files attempts to successively group files. First by file
    size. Then by a hash of each file's first 100 bytes. The remaining,
    possibly equivalent files are then checked byte by byte for equivalency.
    The goal is to generally minimize the total number of bytes read and
    compared.

    Assumptions:
    - the combined size of all the files under pwd can fit into RAM

    Example:
    Filesystem with only hello1.txt and hello2.txt equivalent:
    ls ./
    hello1.txt foo.txt dir_one
    ls ./dir_one
    hello2.txt bar.txt
    Expected Output:
    [['./hello1.txt', './dir_one/hello2.txt']]
    """

    def walk_files():
        """
        Recursively walk through and process every file underneath pwd. Also
        group processed files by file size. Return a dictionary of all the
        files, with file size as the key and a list of filenames with the
        associated size as the value.
        """
        files_by_size = collections.defaultdict(list)
        for root, _, files in os.walk("."):
            for filename in files:
                full_filename = os.path.join(root, filename)
                files_by_size[os.path.getsize(full_filename)].append(
                    full_filename)
        return files_by_size

    def get_n_bytes(filename, n):
        """
        Return the first n bytes of filename in a bytes object. If n is -1 or
        greater than the size of the file, return all of the file's bytes.
        """
        # The with block closes the file even if read() raises.
        with open(filename, "rb") as in_file:
            return in_file.read(n)

    def group_by_hash(files_by_size):
        """
        files_by_size is a dictionary with file size as key and a list of
        associated full filenames as value.
        Group the files referred to in files_by_size by file size and by the
        hash of each file's first 100 bytes. Return a dictionary with
        (file size, hash) as key and a list of associated files as value.
        """
        def get_hash(file_contents):
            return hashlib.sha256(file_contents).digest()

        files_by_hash = collections.defaultdict(list)
        for file_size, files in files_by_size.items():
            for filename in files:
                file_hash = get_hash(get_n_bytes(filename, 100))
                # Keep the size in the key so that files of different sizes
                # sharing the same first 100 bytes stay in separate groups.
                files_by_hash[(file_size, file_hash)].append(filename)
        return files_by_hash

    def group_by_equality(files_by_hash):
        """
        files_by_hash is a dictionary mapping a hash-based key to a list of
        associated files.
        Group the files referred to in files_by_hash according to byte
        equality. Return list of lists of filenames whose entire byte contents
        are exactly equivalent.
        """
        class File():
            def __init__(self, filename, file_contents):
                self.filename = filename
                self.file_contents = file_contents

            def __eq__(self, other):
                return self.file_contents == other.file_contents

        files_by_equality = list()
        for filenames in files_by_hash.values():
            files = [
                File(filename, get_n_bytes(filename, -1))
                for filename in filenames]
            # itertools.groupby only merges adjacent equal items, so sort by
            # contents first to bring equal files next to each other.
            files.sort(key=lambda f: f.file_contents)
            for _, group in itertools.groupby(files):
                equal_files = list(group)
                if len(equal_files) >= 2:
                    files_by_equality.append(
                        [f.filename for f in equal_files])
        return files_by_equality

    files_by_size = walk_files()
    files_by_hash = group_by_hash(files_by_size)
    equal_files = group_by_equality(files_by_hash)
    print(equal_files)


if __name__ == "__main__":
    group_equivalent_files()
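
For reference, the chunked comparison mentioned above could look roughly like the following. This is only a sketch and not part of the script: files_equal, group_by_equality_chunked and the 64 KiB chunk_size are hypothetical names and values chosen for illustration. It compares two files chunk by chunk, so neither file has to be held in memory in full, and it groups filenames without a File class by comparing each file against one representative per group.

```python
def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files contain exactly the same bytes.

    Reads both files in chunks of chunk_size bytes (an arbitrary,
    illustrative default) instead of loading them whole.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:
                # Both chunks are empty: both files ended together.
                return True


def group_by_equality_chunked(filenames):
    """Group filenames by content; keep only groups of two or more.

    Each file is compared against the first member of every existing
    group, so no wrapper class around the file contents is needed.
    """
    groups = []
    for filename in filenames:
        for group in groups:
            if files_equal(group[0], filename):
                group.append(filename)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([filename])
    return [group for group in groups if len(group) >= 2]
```

The trade-off is that a candidate group with many distinct files needs a quadratic number of pairwise comparisons in the worst case, although each comparison stops at the first differing chunk. The standard library's filecmp.cmp(a, b, shallow=False) performs a similar content comparison and might be usable in place of files_equal.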