Comparing 2 Huge csv Files in Python

Question

I have 2 csv files.

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"

I want to compare these 2 files and write the differences to another csv file. I used the Python code below (version 2.7):

#!/usr/bin/env python
import difflib
import csv

with open('./Input/file1', 'r' ) as t1:
    fileone = t1.readlines()
with open('./Input/file2', 'r' ) as t2:
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

When I execute it, this is the output I get:

Actual Output

Vivek,20,2000,"USA"

But my expected output is below, since the record for "Vinoth" appears 2 times in file1 but only 1 time in file2.

Expected Output

Vinoth,12,2548.245,"140,North Street,India"
Vivek,20,2000,"USA"

Questions

How can I get the expected output?
Also, how can I write the file name and line number of each differing record to the output file?

A couple of questions: 1) Are the files huge, as in larger than available memory? 2) How many GB of data are in each file?
– FredrikHedman, Commented Jan 17, 2020 at 10:40
I don't understand your criteria. If Karthick is not present in your new file, why should Vinoth be? Could you explain a little more, please?
– Javier Lopez Tomas, Commented Jan 17, 2020 at 10:53
@JavierLópezTomás Karthick is found once in each of the two files, while there's only one Vinoth line in file2 and two in file1. He also wants to consider the number of times a line appears.
– Plopp, Commented Jan 17, 2020 at 10:57
@FredrikHedman Yes, the files are huge. Approximately 3.5 GB.
– Vinoth Karthick, Commented Jan 18, 2020 at 7:37

James · Accepted Answer · 2020-01-18 13:37:19Z

The issue you are running into is that the in keyword only checks whether an item is present, not how many times it occurs. If you are open to using an external package, you can do this pretty quickly with pandas.
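Before the pandas code, the difference between a membership test and a count comparison can be seen in a minimal standard-library sketch. The two lists below hold the data rows from the question; the variable names are made up for illustration, and collections.Counter is swapped in here only to show the idea.

```python
from collections import Counter

# Data rows from the question (headers omitted).
file1_rows = [
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Karthick,10,10.245,"140,North Street,India"',
]
file2_rows = [
    'Karthick,10,10.245,"140,North Street,India"',
    'Vivek,20,2000,"USA"',
    'Vinoth,12,2548.245,"140,North Street,India"',
]

# Membership test: a row that occurs twice in file1 still counts as
# "present" in file2 as long as file2 has it at least once.
missing_by_membership = [row for row in file1_rows if row not in file2_rows]

# Count comparison: subtracting Counters keeps only the surplus occurrences.
surplus_in_file1 = Counter(file1_rows) - Counter(file2_rows)
surplus_in_file2 = Counter(file2_rows) - Counter(file1_rows)

print(missing_by_membership)   # []
print(dict(surplus_in_file1))  # one extra Vinoth row
print(dict(surplus_in_file2))  # the Vivek row
```

The membership test finds nothing missing from file1, while the count comparison reports the second Vinoth row.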

import pandas as pd

df1 = pd.read_csv('Input/file1.csv')
df2 = pd.read_csv('Input/file2.csv')

# create a new column with the count of how many times the row exists
df1['count'] = 0
df2['count'] = 0
df1['count'] = df1.groupby(df1.columns.to_list()[:-1]).cumcount() + 1
df2['count'] = df2.groupby(df2.columns.to_list()[:-1]).cumcount() + 1

# merge the two data frames with an outer join, add an indicator variable
# to show where each row (including the count) exists.
df_all = df1.merge(df2, on=df1.columns.to_list(), how='outer', indicator='exists')
print(df_all)
# prints:
#   EmployeeName  Age    Salary                 Address  count      exists
# 0       Vinoth   12  2548.245  140,North Street,India      1        both
# 1       Vinoth   12  2548.245  140,North Street,India      2   left_only
# 2     Karthick   10    10.245  140,North Street,India      1        both
# 3        Vivek   20  2000.000                     USA      1  right_only

# clean up the exists column and export the rows that do not exist in both frames
df_all['exists'] = (df_all.exists.str.replace('left_only', 'file1')
                                 .str.replace('right_only', 'file2'))
df_all.query('exists != "both"').to_csv('update.csv', index=False)

Edit: non-pandas version

You can check for differences in the counts of identical lines by using the row as a key and its count as the value.

from collections import defaultdict

c1 = defaultdict(int)
c2 = defaultdict(int)

with open('./Input/file1', 'r' ) as t1:
    for line in t1:
        c1[line.strip()] += 1

with open('./Input/file2', 'r' ) as t2:
    for line in t2:
        c2[line.strip()] += 1

# create a set of all rows
all_keys = set()
all_keys.update(c1)
all_keys.update(c2)

# find the difference in the number of instances of the row
out = []
for k in all_keys:
    diff = c1[k] - c2[k]
    if diff == 0:
        continue
    if diff > 0:
        out.extend([k + ',file1'] * diff) # add which file it came from
    if diff < 0:
        out.extend([k + ',file2'] * abs(diff)) # add which file it came from

with open('update.csv', 'w') as outFile:
    outFile.write('\n'.join(out))
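The question also asks for the file name and line number of each differing record, which the counting code above does not track. One possible extension, sketched here under assumptions, stores the line numbers of every occurrence instead of a bare count. The function names index_lines and diff_with_locations are hypothetical, and because duplicate rows are indistinguishable, reporting the last surplus occurrences is just a convention chosen for this sketch.

```python
from collections import defaultdict


def index_lines(path):
    """Map each line (without its newline) to the 1-based line numbers where it occurs."""
    positions = defaultdict(list)
    with open(path, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            positions[line.rstrip("\r\n")].append(line_number)
    return positions


def diff_with_locations(path1, path2):
    """Return (row, file name, line number) for every surplus occurrence of a row."""
    lines1 = index_lines(path1)
    lines2 = index_lines(path2)
    report = []
    for row in set(lines1) | set(lines2):
        found1 = lines1.get(row, [])
        found2 = lines2.get(row, [])
        surplus = len(found1) - len(found2)
        if surplus > 0:
            # Assumption: report the last `surplus` occurrences in the first file.
            report.extend((row, path1, n) for n in found1[-surplus:])
        elif surplus < 0:
            # Negative surplus: the last abs(surplus) occurrences in the second file.
            report.extend((row, path2, n) for n in found2[surplus:])
    # Order by file name, then line number, so the output is deterministic.
    report.sort(key=lambda item: (item[1], item[2]))
    return report


def write_report(report, out_path):
    """Write each differing row followed by its file name and line number."""
    with open(out_path, "w") as out_file:
        for row, file_name, line_number in report:
            out_file.write("{},{},{}\n".format(row, file_name, line_number))
```

A call such as write_report(diff_with_locations('./Input/file1', './Input/file2'), 'update.csv') would then produce one output line per differing record. Like the counting version, this keeps every distinct row in memory, plus a list of line numbers, so it may not fit files that are larger than available memory.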

We do not have the pandas module. Is there any way to do this without using an external package?
Sure, see the updated answer. The collections module is part of the standard library.

I value -u 2 · Accepted Answer · 2021-05-04 20:02:33Z

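Use the pandas DataFrame.compare method (available since pandas 1.1; it requires both frames to have identical row and column labels):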

import pandas as pd

f1 = pd.read_csv('file_1.csv')
f2 = pd.read_csv('file_2.csv')

# rows of f1 whose values differ from the row at the same position in f2
changed = f1.compare(f2)
change = f1[f1.index.isin(changed.index)]
print(change)
