Here is my source code, which groups duplicate files together, written in Python 2.7. I would appreciate any smarter ideas for grouping duplicate files more efficiently. Any general advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"], and I want to group together all files that have exactly the same content. Suppose in this example that "1.txt", "2.txt", "3.txt" are identical, and "4.txt", "5.txt", "6.txt" share a common header but only "4.txt" and "6.txt" are identical in their entire content; then the output should be two groups: "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".
My main idea:
- To avoid reading the full content of every file, I generate a hash code for each file's header (in my example, I define the header to be the first ten bytes of the file).
- If two or more files share the same header hash, I read their full contents and check whether they are exactly identical. A compact sketch of this two-stage idea follows.
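To make the two-stage idea concrete, here is a minimal sketch of the bucketing step on its own (bucket_by and key_func are illustrative names only, not part of the code below): group items by a cheap key first, so the expensive comparison only runs inside buckets that actually collide.

from collections import defaultdict

def bucket_by(items, key_func):
    # Group items by key; only buckets with two or more members
    # can possibly contain duplicates.
    buckets = defaultdict(list)
    for item in items:
        buckets[key_func(item)].append(item)
    return [group for group in buckets.values() if len(group) >= 2]

Stage 1 would call this with the cheap header hash as key_func; stage 2 then compares full contents only within the surviving buckets.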
Source code in Python 2.7:
from collections import defaultdict
import itertools

def group_duplicate_files(files):
    # Stage 1: bucket files by a cheap hash of their header.
    hash_buf = defaultdict(list)
    for f in files:
        hash_buf[hash_header(f)].append(f)

    # Stage 2: within each bucket, compare full contents pairwise.
    result = []
    for file_list in hash_buf.values():
        if len(file_list) < 2:
            continue
        for file_pair in itertools.combinations(file_list, 2):
            if compare_full_content(file_pair[0], file_pair[1]):
                n = set(file_pair)
                # Merge into an existing group if they share a file;
                # content equality is transitive, so merging with the
                # first intersecting group is enough.
                for i, r in enumerate(result):
                    if r & n:
                        result[i] = r | n
                        break
                else:
                    result.append(n)
    return result

def hash_header(filename):
    # Hash only the first ten bytes so the first pass stays cheap;
    # the with-block guarantees the file is closed.
    with open(filename, "rb") as f:
        return hash(f.read(10))

def compare_full_content(filename1, filename2):
    # Read both files in binary mode and compare their contents
    # directly; both files are closed when the with-block exits.
    with open(filename1, "rb") as file1, open(filename2, "rb") as file2:
        return file1.read() == file2.read()

if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
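For the example above, where "1.txt", "2.txt", "3.txt" are identical and "4.txt", "6.txt" are identical, the expected output is two groups; the order of the groups (and of the elements within each set) may vary, since dict ordering is arbitrary:

[set(['1.txt', '2.txt', '3.txt']), set(['4.txt', '6.txt'])]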