Here is my source code, which groups duplicate files together, written in Python 2.7. I would appreciate any smarter ideas for grouping duplicate files more efficiently. Any general advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"], and I want to group together all files that have exactly the same content. Suppose in this example that "1.txt", "2.txt", "3.txt" are identical, and "4.txt", "5.txt", "6.txt" share a common header but only "4.txt" and "6.txt" are identical in their entire content; then the output should be two groups: "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".
My main idea:
- To avoid reading the full content of every file, I generate a hash code for each file's header (in my example, I define the header to be the first ten bytes of the file).
- If two or more files share the same header hash, I read their full contents and check whether they are exactly identical. A compact sketch of this two-stage idea follows.
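To make the two-stage idea concrete, here is a minimal sketch of the bucketing step on its own (bucket_by and key_func are illustrative names only, not part of the code below): group items by a cheap key first, so the expensive comparison only runs inside buckets that actually collide.

from collections import defaultdict

def bucket_by(items, key_func):
    # Group items by key; only buckets with two or more members
    # can possibly contain duplicates.
    buckets = defaultdict(list)
    for item in items:
        buckets[key_func(item)].append(item)
    return [group for group in buckets.values() if len(group) >= 2]

Stage 1 would call this with the cheap header hash as key_func; stage 2 then compares full contents only within the surviving buckets.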
Source code in Python 2.7:
from collections import defaultdict
import itertools

def group_duplicate_files(files):
    # Stage 1: bucket files by a cheap hash of their header.
    hash_buf = defaultdict(list)
    for f in files:
        hash_buf[hash_header(f)].append(f)

    # Stage 2: within each bucket, compare full contents pairwise.
    result = []
    for file_list in hash_buf.values():
        if len(file_list) < 2:
            continue
        for file_pair in itertools.combinations(file_list, 2):
            if compare_full_content(file_pair[0], file_pair[1]):
                n = set(file_pair)
                # Merge into an existing group if they share a file;
                # content equality is transitive, so merging with the
                # first intersecting group is enough.
                for i, r in enumerate(result):
                    if r & n:
                        result[i] = r | n
                        break
                else:
                    result.append(n)
    return result

def hash_header(filename):
    # Hash only the first ten bytes so the first pass stays cheap;
    # the with-block guarantees the file is closed.
    with open(filename, "rb") as f:
        return hash(f.read(10))

def compare_full_content(filename1, filename2):
    # Read both files in binary mode and compare their contents
    # directly; both files are closed when the with-block exits.
    with open(filename1, "rb") as file1, open(filename2, "rb") as file2:
        return file1.read() == file2.read()

if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
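For the example above, where "1.txt", "2.txt", "3.txt" are identical and "4.txt", "6.txt" are identical, the expected output is two groups; the order of the groups (and of the elements within each set) may vary, since dict ordering is arbitrary:

[set(['1.txt', '2.txt', '3.txt']), set(['4.txt', '6.txt'])]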