I need to compare two large CSV files. For each line of file1, I have to iterate over every line of file2 and do some computation on different columns.

Part of the Python code I tried:

import csv

def getOverlap(a,b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


masterlist = [row for row in c2]

for hosts_row in c1:
    chr1 = hosts_row[3]
    a1 = [int(hosts_row[4]),int(hosts_row[5])]
    found = False
    for master_row in masterlist:
        if hosts_row[7] == master_row[7]:
            c3.writerow(hosts_row)

            chr2 = master_row[3]

            b1 = [int(master_row[4]),int(master_row[5])]
            if getOverlap(a1,b1) != 0 and chr1 == chr2:
                c5.writerow(hosts_row)
            else:
                c6.writerow(hosts_row)


            found = True
            break
    if not found:
        c4.writerow(hosts_row)
        found2 = False
        for master_row2 in masterlist:
            chr2 = master_row2[3]
            b1 = [int(master_row2[4]),int(master_row2[5])]
            if getOverlap(a1,b1) != 0 and chr1 == chr2:
                c7.writerow(hosts_row)
                found2 = True
                break
        if not found2:
            c8.writerow(hosts_row)

But it takes about 5 to 6 hours to run. Is there a quicker way to do this? I have 16 GB of RAM.

  • It may or may not help to know the sizes of the files. Commented Mar 21, 2014 at 12:12
  • Why not load them into a DB and run an SQL query? Commented Mar 21, 2014 at 12:49
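
A minimal sketch of the database idea from the comment above, using Python's built-in sqlite3 module. The table and column names, the in-memory database, and the sample rows are all assumptions for illustration; only the column positions (3 = chromosome, 4 and 5 = interval bounds, 7 = match key) come from the question's code.

```python
import sqlite3

# Hypothetical sample rows laid out like the question's CSV rows:
# index 3 = chromosome, 4 = start, 5 = end, 7 = match key.
hosts = [
    ("h1", "", "", "chr1", "100", "200", "", "geneA"),
    ("h2", "", "", "chr1", "500", "600", "", "geneB"),
    ("h3", "", "", "chr2", "100", "200", "", "geneZ"),
]
master = [
    ("m1", "", "", "chr1", "150", "250", "", "geneA"),
    ("m2", "", "", "chr2", "500", "600", "", "geneB"),
]

con = sqlite3.connect(":memory:")

def load(table, rows):
    # Table names are fixed strings in this sketch, not user input.
    con.execute(
        f"CREATE TABLE {table} "
        "(id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER, match_key TEXT)"
    )
    con.executemany(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)",
        ((r[0], r[3], int(r[4]), int(r[5]), r[7]) for r in rows),
    )

load("hosts", hosts)
load("master", master)

# An index lets the join look up keys instead of scanning the whole table.
con.execute("CREATE INDEX master_key_idx ON master (match_key)")

# Pairs with the same key, plus a 0/1 flag: same chromosome and overlapping intervals.
pairs = con.execute(
    """
    SELECT h.id, m.id,
           (h.chrom = m.chrom
            AND MIN(h.end_pos, m.end_pos) > MAX(h.start_pos, m.start_pos))
    FROM hosts AS h JOIN master AS m ON h.match_key = m.match_key
    ORDER BY h.id, m.id
    """
).fetchall()

# Host rows whose key never appears in the master table.
unmatched = con.execute(
    "SELECT id FROM hosts "
    "WHERE match_key NOT IN (SELECT match_key FROM master) ORDER BY id"
).fetchall()
```

Unlike the loop in the question, which stops at the first master row with a matching key, this join returns every matching pair, so the results would need filtering if only the first match matters.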

2 Answers

The point is not how big your files are; it is a question of your goal and algorithm design.
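
As one illustration of the algorithm-design point, the key match in the question could use a dictionary built once from file2, so each file1 row needs a single lookup rather than a full scan. This is only a sketch: `build_index` and the sample rows are hypothetical, and it assumes column 7 is the match key, as in the question's code.

```python
def build_index(master_rows, key_col=7):
    """Map each key to the first master row that carries it.

    Keeping only the first row mirrors the `break` in the question's inner loop.
    """
    index = {}
    for row in master_rows:
        index.setdefault(row[key_col], row)
    return index

# Hypothetical rows shaped like the question's CSV rows (key in column 7).
master_rows = [
    ["m1", "", "", "chr1", "150", "250", "", "geneA"],
    ["m2", "", "", "chr2", "500", "600", "", "geneB"],
    ["m3", "", "", "chr3", "700", "800", "", "geneA"],
]

index = build_index(master_rows)
hit = index.get("geneA")      # first master row with this key
miss = index.get("geneZ")     # None when no master row has the key
```

The overlap search in the unmatched branch would need a different structure, for example master rows grouped by chromosome and sorted by start position; that part is not shown here.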

  • One point is to define what counts as a difference.
  • If the rows are in the same order in both files, then two rows at the same position are different exactly when their columns differ.

So maybe you should first consider sorting the CSV files so the row order is identical, and then you can simply use the filecmp module.
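
A minimal sketch of the sort-then-compare idea, assuming each file fits in memory for sorting and that whole-row equality is the comparison wanted. `sort_csv` and `same_rows` are hypothetical helper names. Note that `filecmp.cmp` with `shallow=False` compares file contents and only reports whether the two files are identical, not which rows differ.

```python
import csv
import filecmp
import os
import tempfile

def sort_csv(src, dst):
    """Write the rows of src to dst in sorted order."""
    with open(src, newline="") as f:
        rows = sorted(csv.reader(f))
    with open(dst, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def same_rows(path_a, path_b):
    """Return True if both CSV files hold the same rows, ignoring row order."""
    with tempfile.TemporaryDirectory() as tmp:
        sorted_a = os.path.join(tmp, "a_sorted.csv")
        sorted_b = os.path.join(tmp, "b_sorted.csv")
        sort_csv(path_a, sorted_a)
        sort_csv(path_b, sorted_b)
        # shallow=False forces a content comparison instead of trusting os.stat data.
        return filecmp.cmp(sorted_a, sorted_b, shallow=False)
```

For files too large to sort in memory, an external sort would be needed before the comparison step.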

I realize this answer does not give a complete coded solution, but it offers some material to think about. It is just too long for a single comment.

Use the Meld application on Linux (Ubuntu) to compare two files line by line.

1 Comment

This will exhaust the computer's memory if the files are very large. If each file is about 6 GB and the computer has only 8 GB of RAM (like an average laptop today), you can see what is going to happen. So this answer is quite problematic.
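
To illustrate the memory concern, here is a sketch of a line-by-line comparison that reads both files in lockstep, so only one line from each file is held in memory at a time. `first_difference` is a hypothetical helper, and it assumes the two files are already in the same row order.

```python
from itertools import zip_longest

def first_difference(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if the files match."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        # zip_longest pads the shorter file with None, so a length mismatch counts as a difference.
        for lineno, (line_a, line_b) in enumerate(zip_longest(fa, fb), start=1):
            if line_a != line_b:
                return lineno
    return None
```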
