
I am trying to find the intersecting subset between two pretty big CSV files of phone numbers (one has 600k rows, and the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D numpy arrays, and then using numpy intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.

import pandas as pd
import numpy as np

df_dnc = pd.read_csv('dncTest.csv', names=['phone'])
df_test = pd.read_csv('phoneTest.csv', names=['phone'])

dnc_phone = df_dnc['phone']
test_phone = df_test['phone']

np.intersect1d(dnc_phone, test_phone)
  • Are your CSV files the same in structure (i.e. are you looking for the same lines, or the same CSV fields)? Commented Jun 8, 2017 at 23:37
  • How much time does each step currently take? Which step is the bottleneck? What is the current total running time, and what total running time are you targeting? By the way, you can set squeeze=True to get a Series straight away and skip the dnc_phone = df_dnc['phone'] part. Commented Jun 8, 2017 at 23:38
  • @zwer, yes, they have the same structure and I am comparing the same CSV fields, which are phone numbers. Commented Jun 8, 2017 at 23:39
  • If this is a one-off thing, I would stick with pandas; if you need to do this at least a few times or need to do other comparisons, I would load this up into a database. Commented Jun 8, 2017 at 23:42
  • pd.read_csv is reputed to be the fastest Python CSV reader. numpy intersect probably works with a mix of merging, sorting, and checking for duplicates. If the numbers are already sorted and unique, you may be able to streamline that. Commented Jun 9, 2017 at 0:28
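
Putting the last two comments together, a hedged sketch of the suggested tweaks (squeeze=True existed in 2017-era pandas and has since been removed; assume_unique=True is a real np.intersect1d parameter, but it is only safe if both inputs truly contain no duplicates):

import numpy as np
import pandas as pd

# squeeze=True returns a Series directly, skipping the df['phone'] step
dnc_phone = pd.read_csv('dncTest.csv', names=['phone'], squeeze=True)
test_phone = pd.read_csv('phoneTest.csv', names=['phone'], squeeze=True)

# assume_unique=True lets np.intersect1d skip its duplicate-handling pass
common = np.intersect1d(dnc_phone, test_phone, assume_unique=True)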

2 Answers


I will give you a general solution with some Python pseudo-code. What you are trying to solve here is a classic problem from the book "Programming Pearls" by Jon Bentley.

This is solved very efficiently with just a simple bit array, hence my comment asking how long (how many digits) the phone numbers are.

Let's say the phone number is at most 10 digits long; then the maximum phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set (bit set or not set, respectively), so we are going to use 9 999 999 999 bits to identify each number, i.e.:

  • bits[0] identifies the number 0 000 000 000
  • bits[193] identifies the number 0 000 000 193
  • having the number 659 234-4567 would be addressed by bits[6592344567]

Doing so, we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1192 MB, around 1.2 GB of memory.
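
To make the bit addressing concrete, here is a minimal sketch using a plain Python bytearray instead of a bit-array package (the helper names are mine, purely for illustration):

MAX_NUMBER = 9999999999

# one bit per possible 10-digit number: MAX_NUMBER // 8 + 1 bytes, ~1.2 GB, all zero
bits = bytearray(MAX_NUMBER // 8 + 1)

def set_bit(n):
    # mark number n as seen: byte n // 8, bit n % 8
    bits[n >> 3] |= 1 << (n & 7)

def test_bit(n):
    # True if number n has been marked
    return bool(bits[n >> 3] & (1 << (n & 7)))

set_bit(6592344567)           # the 659 234-4567 example above
assert test_bit(6592344567)
assert not test_bit(193)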

I think that holding the intersection of numbers at the end will use far less space than the bit representation: at most 600k ints will be stored, and 64 bits * 600k ≈ 4.6 MB (in practice a Python int is not stored that efficiently and may use much more); if these are strings you'll probably end up with even higher memory requirements.

Parsing a phone number string from the CSV file (line by line or with a buffered file reader), converting it to a number, and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.

from bitstring import BitArray


max_number = 9999999999

# one bit per possible 10-digit phone number, all initially 0
found_phone_numbers = BitArray(length=max_number + 1)


# replace this function with one that opens the file and yields
# the next phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number


def calculate_intersect():
    # should open file1 and iterate over the numbers in it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers.set(True, number)  # mark the bit for this number

    # open the second file and check whether each number's bit is set
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number


number_intersection = set(calculate_intersect())

print(number_intersection)
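
With the dummy data above, this prints the intersection {224322323, 1232422} (set ordering may vary).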

I used BitArray from the bitstring pip package, and it needed around 2 seconds to initialize the entire bit string. Afterwards, scanning the files uses constant memory. At the end I used a set to store the matching numbers.

Note 1: This algorithm can be modified to use just a list instead of a set. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset so that duplicates do not match again.

Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).


  1. Read the 600k phone numbers into a set.
  2. Read the larger file row by row, checking each number against the set.
  3. Write matches to an output file immediately.

That way you don't have to load all the data in memory at once.
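
A minimal sketch of this approach, reusing the filenames from the question (the chunk size and output filename are my own choices):

import pandas as pd

# step 1: the 600k-row file fits in memory, so load it into a set
small = set(pd.read_csv('dncTest.csv', names=['phone'])['phone'])

# steps 2 and 3: stream the 300M-row file in chunks and write matches
# out immediately, so the big file is never fully in memory
with open('matches.csv', 'w') as out:
    for chunk in pd.read_csv('phoneTest.csv', names=['phone'], chunksize=1000000):
        chunk[chunk['phone'].isin(small)].to_csv(out, header=False, index=False)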

Comments

This will have disastrous performance compared with using the compiled-code facilities provided in Pandas and NumPy.
Other things being equal, certainly. But it minimizes the memory footprint. If the OP is already getting decent performance, there's no need to try it, I agree. (Unfortunately the question didn't explain what problem needs to be solved.)
