Comparing Two CSV in Python

Question

I am formatting CSV in Python to get the desired result but my code doesn't seem to work properly.

I have the first CSV file in format:

2,a
1,a
4,a
5,a
3,a
1,a
3,b
2,b
1,a

The second CSV file in format:

1,a,123
1,a,234
2,a,456
2,b,345
3,a,789
3,b,232
4,a,987

As the first CSV file is not sorted, the second CSV file is sorted in increasing order with respect to first column

I want the output in format:

2,a,456
1,a,123
4,a,987
5,a
3,a,789
1,a,234
3,b,232
2,b,345
1,a

The result is printed in respect to the first CSV, if the first CSV file combination is not found in the second CSV File, for example if 5,a is not in the second CSV file then just 5,a is printed in its respected position. The first CSV file contains many duplicates, while in the second CSV file each row is unique.

Here is my Code

for (num,alpha) in first_csv:
    value_found = True
    for (num1,alpha1,num2) in second_csv:
        if (num == num1 and alpha == alpha1):
            csv_out +=  str(num) + ',' + str(alpha) + ',' + str(number)
            value_found = False
    if value_found:
        count+=1
        if count == 1:
            csv_out += str(num) + ',' + str(alpha)

first_csv and second_csv are tuples I have created after reading the CSV files with the code:

with open('first_csv.csv') as f:
    f.readline()
    first_csv = tuple(csv.reader(f, delimiter=','))


with open('second_csv.csv') as f:
    f.readline()
    second_csv = tuple(csv.reader(f, delimiter=','))

But it is not printing the desired output, where am I doing wrong?

Have you considered, instead, using SQLite to store the data into two tables, then perform a join operation on those two tables to get the results you want? Trying to hamstring it together with Python alone will get painful fast. — Makoto
– Makoto, Commented Jan 21, 2014 at 7:42
@makoto Do you have query which can do this, i am not good at database. — user3218088
– user3218088, Commented Jan 21, 2014 at 8:03
The example indicates that each match in second file is used only once -- the third 1,a matches nothing. Could you discuss this requirement in more detail? — Janne Karila
– Janne Karila, Commented Jan 21, 2014 at 8:08
@janneKarila as the third 1,a has no match in second csv file so it is just placed at its relative position of first csv file. the third entry of 1,a matches nothing in second csv file so it is empty and kept at correct position of first csv file. — user3218088
– user3218088, Commented Jan 21, 2014 at 8:15
@Makoto i made the query but in the end it is automatically sorting the result on the first column, i don't why. I am using POSTGRES. I want output in the same ordering as first csv file. — user3218088
– user3218088, Commented Jan 21, 2014 at 8:17

Janne Karila · Accepted Answer · 2014-01-21 09:23:26Z

This collects the num2 values from second file into a dictionary of deques. When matches are found, they are removed using deque.popleft to use each match only once, in the same order as they occurred in the second file.

from collections import defaultdict, deque

with open('second_csv.csv') as f:
    next(f) #skip header
    dic = defaultdict(deque)
    for num1,alpha1,num2 in csv.reader(f, delimiter=','):
        dic[num1, alpha1].append(num2)

with open('first_csv.csv') as f, open('out.csv', 'wb') as fout:
    next(f) #skip header
    csv_out = csv.writer(fout)
    for num,alpha in csv.reader(f, delimiter=','):
        try:
            num2 = dic[num,alpha].popleft()
            csv_out.writerow([num,alpha,num2])
        except IndexError:
            csv_out.writerow([num,alpha])

a deque retains the order of items that you add by append and remove by popleft.
a dict does fast lookup by key which can be a tuple
a defaultdict(deque) is a dict that creates an empty deque automatically when you access a missing key, so you can append to it directly.

thanks, i know this would have taken some of your time, i am new to stackoverflow so i don't have points to upvote you for your answer, i this is the first time i am seeing the things you have used, could you please shed some light and let me know what i was doing wrong. Again many thanks.
@user3218088 The main problem in your approach is that you don't keep track of matches. Using a tuple to hold second_csv makes lookup unnecessarily difficult and does not allow you to remove matches.

Steinar Lima · Accepted Answer · 2014-01-21 08:28:27Z

This should do the trick. Note that, for each iteration over first_csv, the worst-case scenario is that it would have to iterate to the end of what's left of the data from the second csv file (the rows gets pop-ed if matched).

import csv

with open("second_csv.csv") as in_file:
    reader = csv.reader(in_file)
    lookup = list(reader)

with open("first_csv.csv") as in_file, open('output.csv', 'wb') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    for row in reader:
        for i, data in enumerate(lookup):
            if row == data[:2]:
                row = lookup.pop(i)
                break
        writer.writerow(row)

output.csv

2,a,456
1,a,123
4,a,987
5,a
3,a,789
1,a,234
3,b,232
2,b,345
1,a

thanks Steinar: this works too. Doesn't have reputation to upvote your answer, sorry mate.

Collectives™ on Stack Overflow

Comparing Two CSV in Python

2 Answers 2

2 Comments

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Related