Grouping Equivalent Files in Filesystem

Question

The docstring in group_equivalent_files should be self-explanatory for what I'm trying to do. I'm specifically trying to group by equivalent files as opposed to just finding pairs of equivalent files.

Example:

Filesystem with only hello1.txt and hello2.txt equivalent:

ls ./

hello1.txt foo.txt dir_one

ls ./dir_one

hello2.txt bar.txt

Expected Output:

[['./hello1.txt', './dir_one/hello2.txt']]

Pointing out any potential improvements or errors would be really appreciated. I'm most hesitant about my implementation of group_by_equality. The files could be compared in chunks, and I'm thinking that the File class might be unnecessary.

import collections
import hashlib
import itertools
import os
import os.path


def group_equivalent_files():
    """
    Overview:
    Find all groups of files under pwd whose byte contents are exactly
    equivalent. pwd is understood to be directory from which this script was
    run. The filenames of equivalent files are printed to stdout.

    Algorithm:
    group_equivalent_files attempts to successively group files. First by file
    size. Then by a hash of each file's first 100 bytes. The remaining,
    possibly equivalent files are then checked byte by byte for equivalency.
    The goal is to generally minimize the total number of bytes read and
    compared.

    Assumptions:
    - the combined size of all the files under pwd can fit into RAM

    Example:
    Filesystem with only hello1.txt and hello2.txt equivalent:
    ls ./
    hello1.txt foo.txt dir_one

    ls ./dir_one
    hello2.txt bar.txt

    Expected Output:
    [['./hello1.txt', './dir_one/hello2.txt']]
    """

    def walk_files():
        """
        Recursively walk through and process every file underneath pwd. Also
        group processed files by file size. Return a dictionary of all the
        files, with file size as the key and a list of filenames with the
        associated size as the value
        """
        files_by_size = collections.defaultdict(list)
        for root, _, files in os.walk("."):
            for filename in files:
                full_filename = os.path.join(root, filename)
                files_by_size[os.path.getsize(full_filename)].append(
                    full_filename)
        return files_by_size

    def get_n_bytes(filename, n):
        """
        Return the first n bytes of filename in a bytes object. If n is -1 or
        greater than size of the file, return all of the file's bytes.
        """
        in_file = open(filename, "rb")
        file_contents = in_file.read(n)
        in_file.close()
        return file_contents

    def group_by_hash(files_by_size):
        """
        files_by_size is a dictionary with file size as key and a list of
        associated full filenames as value

        Group by the files referred to in files_by_size according to hash of
        file's first 100 bytes. Return dictionary with file hash as key and
        list of associated files as value.
        """
        def get_hash(file_contents):
            return hashlib.sha256(file_contents).digest()

        files_by_hash = collections.defaultdict(list)
        for file_size, files in files_by_size.items():
            for filename in files:
                file_hash = get_hash(get_n_bytes(filename, 100))
                files_by_hash[file_hash].append(filename)
        return files_by_hash

    def group_by_equality(files_by_hash):
        """
        files_by_hash is a dictionary with file hash as key and list of
        associated files as value.

        Group the files referred to in files_by_hash according to byte
        equality. Return list of lists of filenames whose entire byte contents
        are exactly equivalent.
        """
        class File():
            def __init__(self, filename, file_contents):
                self.filename = filename
                self.file_contents = file_contents

            def __eq__(self, other):
                return self.file_contents == other.file_contents

        files_by_equality = list()
        for file_hash, filenames in files_by_hash.items():
            files = [
                File(filename, get_n_bytes(filename, -1))
                for filename in filenames]
            for i, g in itertools.groupby(files):
                equal_files = list(g)
                if len(equal_files) >= 2:
                    files_by_equality.append([f.filename for f in equal_files])
        return files_by_equality

    files_by_size = walk_files()
    files_by_hash = group_by_hash(files_by_size)
    equal_files = group_by_equality(files_by_hash)
    print(equal_files)


group_equivalent_files()

You might be interested in this answer to a very similar question.
– Graipher, Commented Jun 9, 2017 at 9:33

200_success · Accepted Answer · 2017-06-09 19:31:36Z

First impressions

This is very neatly organized code. My main criticism is that it's verbose: a function contains helper functions, one of which defines a class, which in turn has its own methods.

The function would be more usable if it accepted a parameter (the root directory) and returned its results instead of printing them.

Bug

itertools.groupby() doesn't help you, because it doesn't behave the way you probably expect.

From the documentation:

Make an iterator that returns consecutive keys and groups from the iterable. … It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

Therefore, your group_by_equality(files_by_hash) function is broken. If it encounters three files with the same size and digest, with contents like

'a' * 100 + 'x'
'a' * 100 + 'y'
'a' * 100 + 'x'

… it would fail to recognize files 1 and 3 as equal, since groupby() breaks them up into three groups.
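The difference is easy to reproduce without touching the filesystem. The sketch below uses three in-memory byte strings as stand-ins for file contents and contrasts `itertools.groupby()` on unsorted input with a dictionary that collects equal values regardless of order. The names `contents` and `group_all` are illustrative only and do not appear in the question's code.

```python
import collections
import itertools

# Stand-ins for the contents of three files with the same size and prefix.
contents = [b"a" * 100 + b"x", b"a" * 100 + b"y", b"a" * 100 + b"x"]

# groupby() only merges *consecutive* equal items, so this yields three groups.
consecutive_groups = [list(g) for _, g in itertools.groupby(contents)]


def group_all(items):
    """Group the indexes of equal items regardless of position (items must be hashable)."""
    groups = collections.defaultdict(list)
    for index, item in enumerate(items):
        groups[item].append(index)
    return [indexes for indexes in groups.values() if len(indexes) >= 2]


print(len(consecutive_groups))  # 3
print(group_all(contents))      # [[0, 2]]
```

Sorting the input first would also make `groupby()` usable, but the `File` class in the question defines only `__eq__`, so it cannot be sorted as written.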

Further consideration

You might be able to improve performance a little more, by changing the order in which the classification is done. For each group of files that look similar, analyze it fully as soon as possible, to take advantage of filesystem caching.
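One way to read that suggestion is as a depth-first refinement: split the files by the cheapest characteristic, then finish each candidate group completely before moving on to the next. The following is only a sketch of that idea, not the code originally posted in this answer; the names `first_bytes`, `whole_contents`, `refine` and `group_depth_first` are hypothetical, and the last stage assumes (as the question does) that file contents fit in memory.

```python
import collections
import os


def first_bytes(path, n=100):
    """Return the first n bytes of the file at path."""
    with open(path, "rb") as in_file:
        return in_file.read(n)


def whole_contents(path):
    """Return every byte of the file (assumes each file fits in memory)."""
    with open(path, "rb") as in_file:
        return in_file.read()


def refine(paths, classifiers):
    """Yield lists of paths that agree on every classifier, applied in order."""
    if len(paths) < 2:
        return  # a lone file cannot have a duplicate
    if not classifiers:
        yield sorted(paths)
        return
    groups = collections.defaultdict(list)
    for path in paths:
        groups[classifiers[0](path)].append(path)
    for group in groups.values():
        # Finish this candidate group completely before starting the next one.
        yield from refine(group, classifiers[1:])


def group_depth_first(root="."):
    """Return groups of byte-identical files found under root."""
    paths = [os.path.join(directory, name)
             for directory, _, names in os.walk(root)
             for name in names]
    return list(refine(paths, [os.path.getsize, first_bytes, whole_contents]))
```

Because each group is fully resolved right after its first 100 bytes were read, the second read of those files has a better chance of being served from the filesystem cache.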

characteristic = f(path); new_group_dict[summary + (characteristic,)].append(path): so when reading the whole file, the whole file contents get added to the dict key? I hope you don't have large duplicate files.
– Maarten Fabré, Commented Jun 10, 2017 at 17:28
As an alternative, what about using every characteristic except entire file contents as part of the dict key? So after the files have been classified, group the files by equality in a separate step.
– John McCann, Commented Jun 10, 2017 at 21:16
If file size and memory usage were concerns, then I wouldn't bother doing a byte-by-byte comparison at all. Rather, for the final test, I would hash them using a SHA-256 on the entire file. Considering that there are 2^256 possible hash values, all equally likely, with no known collision attacks, it's virtually impossible to make two different files with the same size collide. (If such a false positive ever happens, I'd blame a stray cosmic ray. Then I'd publish a paper about it.)
– 200_success, Commented Jun 11, 2017 at 20:56
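A minimal sketch of the whole-file hashing mentioned in the last comment, assuming the candidates have already been narrowed down by size. The file is fed to `hashlib.sha256` in chunks, so memory use stays bounded by the chunk size. The function names and the 1 MiB default are illustrative choices, not taken from the thread.

```python
import collections
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as in_file:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: in_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_full_digest(filenames):
    """Return lists of filenames that share the same whole-file digest."""
    groups = collections.defaultdict(list)
    for filename in filenames:
        groups[sha256_of_file(filename)].append(filename)
    return [group for group in groups.values() if len(group) >= 2]
```

Only the fixed-length digests are kept as dictionary keys, so large duplicate files no longer end up in memory.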

Maarten Fabré · Accepted Answer · 2017-06-11 20:55:38Z

This is an interesting question, and it is similar to a problem I tackled not too long ago. I don't have tons of experience with IO, but I'll see what I can do.

methods

Why make all the different methods submethods of group_equivalent_files()? Is there a specific reason for this?

walk_files

You hardcoded the entry point for the file walking to '.'. I would pass this as an argument to the function, with '.' as the default.

I don't like the unclear variable _ either, but that's a matter of taste. And I would call this group_by_size, in the same manner as group_by_hash and group_by_equality.

In total this boils down to:

def group_by_size(startpoint='.'):
    # docstring can mostly remain the same
    files_by_size = collections.defaultdict(list)
    for root, dirnames, files in os.walk(startpoint):
        for filename in files:
            full_filename = os.path.join(root, filename)
            files_by_size[os.path.getsize(full_filename)].append(
                full_filename)
    return files_by_size

get_n_bytes

Use a context manager here, so the file is closed even if read() raises:

def get_n_bytes(filename, n):
    """
    Return the first n bytes of filename in a bytes object. If n is -1 or
    greater than size of the file, return all of the file's bytes.
    """
    with open(filename, "rb") as in_file:
        return in_file.read(n)

group_by_hash

- I would only parse those file sizes where len(files) > 1.
- You need to use (filesize, file_hash) as the key for the files_by_hash dict, to prevent files with different sizes but the same beginning from colliding.
- If you read fewer bytes than the SHA-256 digest will produce, why hash? If you need something in a str format, you can use something like base64 encoding.
- I would make the number of bytes read an argument (perhaps with 100 as default).

Together I come up with something like this

def group_by_hash(files_by_size, bytes_to_check=100):

    def get_hash(file_contents):
        return hashlib.sha256(file_contents).hexdigest()

    files_by_hash = collections.defaultdict(list)
    for key, files in files_by_size.items():
        if len(files) > 1:
            for filename in files:
                file_hash = get_hash(get_n_bytes(filename, bytes_to_check))
                # or alternatively
                # file_hash = base64.a85encode(get_n_bytes(filename, bytes_to_check))
                files_by_hash[(key, file_hash)].append(filename)
    return files_by_hash

group_by_equality

You read the whole file into memory. I think it would be better to iterate over the different files simultaneously. Since there can be more than 2 files with the same size and beginning, I split those into couples with itertools.combinations.

def group_by_equality(files_by_hash):
    for key, filenames in files_by_hash.items():
        num_files = len(filenames)
        if num_files > 1:
            yield key, files_are_equal(filenames)

This just passes the groups with more than 1 file on to files_are_equal to compare them.

def file_iterator(filename, chunksize=512):
    # inspired by https://stackoverflow.com/a/1035360/1562285
    with open(filename, 'rb') as f:
        b = f.read(chunksize)
        while b:
            yield b
            b = f.read(chunksize)

This iterates over a file in chunks.

files_are_equal

This is the magic method that simultaneously iterates over all similar files

def files_are_equal(filenames):
    numfiles = len(filenames)
    # every unordered pair of file indexes that still has to be compared
    file_combinations = set(frozenset(i) for i in (itertools.combinations(range(numfiles), r=2)))
    # results[i] holds the indexes of the files that agree with file i so far
    results = [set(range(numfiles)) for j in range(numfiles)]
    file_iterators = itertools.zip_longest(*(file_iterator(filename) for filename in filenames))
    for file_contents in file_iterators:
        # iterate over a copy, because pairs are removed inside the loop
        for pair in list(file_combinations):
            i, j = pair
            if file_contents[i] != file_contents[j]:
                # discard the pair itself; `-= frozenset((i, j))` would remove nothing
                file_combinations.discard(pair)
                results[i] -= {j}
                results[j] -= {i}
                if not any(len(row) > 1 for row in results):
                    return None

    return {tuple(filenames[i] for i in sorted(s)) for s in results if len(s) > 1}

This iterates in chunks over all the files together.
Each file has a set (in the results list) of the files it still agrees with.
For each combination of 2 files, the chunks get compared:
- if they agree, just go on
- if they differ:
  - remove this couple from the combinations to compare
  - remove each file from the other's set in results
  - check whether there are still files which agree so far; if not, return None

If there are still files agreeing when the iteration ends, make a tuple of each set in results with more than 1 element, and use a set to remove duplicates.
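For cross-checking the output of files_are_equal, a simpler reference can be built on the standard library's `filecmp.cmp`. This is a sketch of an alternative technique, not part of the approach above: each file is compared against one representative per group, so a file may be read more than once. The name `group_with_filecmp` is made up for this example.

```python
import filecmp


def group_with_filecmp(filenames):
    """Partition filenames into groups of byte-identical files."""
    groups = []  # the first entry of each group serves as its representative
    for filename in filenames:
        for group in groups:
            # shallow=False compares the contents instead of trusting os.stat() data.
            if filecmp.cmp(group[0], filename, shallow=False):
                group.append(filename)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append([filename])
    return [group for group in groups if len(group) > 1]
```

Running both functions on the same list of candidates and comparing the groups is a cheap sanity check for the chunked implementation.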

Tying it together

imports

import itertools
import os
import collections
# import base64  # used alternatively
import hashlib

These go at the top, obviously.

calling everything

files_by_size = group_by_size()
files_by_hash = group_by_hash(files_by_size)
equal_files = group_by_equality(files_by_hash)

Testing

I made a lot of dummy files in a tempfile.TemporaryDirectory, with differences in the length, beginning and end of the file, to test this.

def get_all_files(startpoint='.'):
    all_files = {0: list()}
    for root, dirnames, files in os.walk(startpoint):
        for filename in files:
            full_filename = os.path.join(root, filename)
            all_files[0].append(full_filename)
    return all_files

This is used in testing as an alternative, to pass a less prepared collection on to group_by_equality.

import tempfile
with tempfile.TemporaryDirectory() as tempdir:
    dummycontent = ''.join(str(i) for i in list(range(99)) * 10)
    for i in range(12):
        for j in range(i):
            suffix = str(100*i+j).zfill(4)
            with open('%s/file%s'%(tempdir, suffix), 'w') as fh:
                fh.write(str(int(j > 5)))  # something different at the beginning of the file
                fh.write(dummycontent)
                fh.write(str(i)) # something different at the end of the file

    files_by_size = group_by_size(tempdir)
    files_by_hash = group_by_hash(files_by_size)
    files_by_hash2 = group_by_hash(files_by_size, -1)
    files_by_hash3 = group_by_hash(files_by_hash, -1)
    equal_files = list(group_by_equality(files_by_hash))
    all_files = get_all_files(tempdir)
    equal_files2 = list(group_by_equality(all_files))

The results are too large to paste here, but it seems to work.

_ is the conventional name for a "throwaway" variable, whose value does not matter.
– 200_success, Commented Jun 9, 2017 at 22:08
I know this only gives a list of duplicate couples, so if there are n duplicates they all get read n-1 times, but at least they don't need to be read into memory in whole.
– Maarten Fabré, Commented Jun 10, 2017 at 17:30
And you might speed things up considerably by reading the files in larger chunks than 1 byte in file_iterator(filename). I suggest adding this as an argument to the function.
– Maarten Fabré, Commented Jun 10, 2017 at 17:32
I wanted to isolate those submethods within group_equivalent_files since they're currently only used by that function.
– John McCann, Commented Jun 10, 2017 at 21:32
I improved the method to compare complete file contents considerably and added a methodology to test the algorithms.
– Maarten Fabré, Commented Jun 11, 2017 at 21:00
