
I have two lists of email addresses: new_emails.tsv and old_emails.tsv

There are about 10 million rows in old_emails.tsv and about 1.5 million rows in new_emails.tsv. I want to create a new .tsv file of the email addresses that are in old_emails.tsv but not in new_emails.tsv. The reason is that in a later step I need to remove that set of emails from my MySQL database.

The two files have different headers, i.e.:

new_emails.tsv has ['ACCTNUM', 'CUST_ID', 'EMAIL', 'CODE']
old_emails.tsv has ['ACCTNUM', 'EMAIL', 'OPTION']

To solve this, I pull the email field from both files into their own lists, convert the lists to sets, and find the difference (the overloaded '-' operator). With the emails now in an exclusion_emails list, I need to use that list to pull the matching rows from old_emails.tsv and write those rows to a new file called exclusion_emails.tsv. However, turning my exclusion_emails list back into the corresponding rows from old_emails.tsv is an extremely slow process. Is there a way to improve the performance? My full code is here:

import csv

def csv_to_list(file):
    output_list = []
    with open(file, 'rb') as f_new_emails:
        reader = csv.reader(f_new_emails, delimiter='\t')
        for line in reader:
            output_list.append(line)
    return output_list

new_emails_list = csv_to_list('new_emails.tsv')
old_emails_list = csv_to_list('old_emails.tsv')

# Get the index for the email field
def get_email_index(alist):
    if 'EMAIL' in alist:
        return alist.index('EMAIL')
    elif 'email' in alist:
        return alist.index('email')

s_new_emails = set([row[get_email_index(new_emails_list[0])] for row in new_emails_list])
s_old_emails = set([row[get_email_index(old_emails_list[0])] for row in old_emails_list])

exclusion_emails = [email for email in (s_old_emails - s_new_emails)]

# print("%s emails in the new list" % len(new_emails_list))
# print("%s emails in the old list" % len(old_emails_list))
# print("%s emails in the old list but not in the new list" % len(exclusion_emails))


# Creating the new file
exclusion_rows = []
operations = 0
with open('exclusions.tsv', 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')

    for email in exclusion_emails:
        for row in old_emails_list:
            operations += 1
            if email in row:
                writer.writerow(row)
                break

print(len(exclusion_rows))

Any help would be appreciated!

  • given unsorted lists, this is at best O(n log n). Try putting them into a Python set; then it becomes an O(n) operation. Commented Apr 12, 2016 at 20:43
  • set(old_emails_list) fails with a TypeError: unhashable type: 'list' Commented Apr 12, 2016 at 20:51
  • Try using a Bloom filter. This can reduce the lookup cost considerably. pypi.python.org/pypi/pybloom/1.0.2 (a rough sketch follows these comments) Commented Apr 12, 2016 at 20:53
  • Without looking closely, I'm guessing this error is because old_emails_list is actually a list of lists. You might try set([tuple(x) for x in old_emails_list]) Commented Apr 12, 2016 at 20:58
  • This appears to be slower. Using wc -l and timing for 60 seconds, my method writes ~686 rows; setting old_emails_list = set([tuple(x) for x in old_emails_list]) writes only about 300 rows per minute. Commented Apr 12, 2016 at 21:08
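For the Bloom-filter suggestion above, a rough sketch might look like the following, using the pybloom package linked in the comment (the capacity and error_rate values are illustrative, not prescribed by anyone here). Note that a Bloom filter can report false positives, so a small fraction of old emails could be wrongly treated as present in the new file and left out of the exclusion list:

from pybloom import BloomFilter  # the package linked in the comment above

new_email_index = get_email_index(new_emails_list[0])
old_email_index = get_email_index(old_emails_list[0])

# Hold the ~1.5 million new emails in a Bloom filter instead of a plain set.
new_filter = BloomFilter(capacity=2000000, error_rate=0.001)
for row in new_emails_list[1:]:
    new_filter.add(row[new_email_index])

# "not in new_filter" is guaranteed correct; a hit may be a false positive,
# in which case that old row is (conservatively) left out of the exclusions.
exclusion_rows = [row for row in old_emails_list[1:]
                  if row[old_email_index] not in new_filter]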

2 Answers


This is taking a long time because you are comparing every excluded email with every old email record (roughly 8.5 million times 10 million is about 85 TRILLION loop iterations...).

There are three main improvements.

First, the column indexes cannot change from record to record, so do not recompute them on every iteration. Also, since the first entry in each list is the header row, exclude it from your searches (otherwise the header itself could end up in the output file, e.g. if the old and new EMAIL header fields are capitalized differently):

new_email_index = get_email_index(new_emails_list[0])
old_email_index = get_email_index(old_emails_list[0])
s_new_emails = set([row[new_email_index] for row in new_emails_list[1:]])
s_old_emails = set([row[old_email_index] for row in old_emails_list[1:]])

The old_email_index will also be useful when building the final record list.

Second, leave the exclusions as a set instead of converting them to a list:

exclusion_emails = s_old_emails - s_new_emails

Third, when you build exclusions.tsv, iterate over the old email list so you can grab the email address (using the index you already saved above rather than recomputing it every time):

with open('exclusions.tsv', 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')

    for row in old_emails_list[1:]:
        # Set membership is an O(1) lookup, so this is a single pass over the old rows.
        if row[old_email_index] in exclusion_emails:
            writer.writerow(row)

And finally, exclusion_rows stays an empty list, since your code never actually appends anything to it, so don't let the 0 it prints at the end confuse you.
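Putting the three changes together, a complete sketch might look like this (keeping the Python 2 file modes and file names from the question, and assuming the header is the upper-case EMAIL shown there; checking row[old_email_index] not in s_new_emails directly is equivalent to building the set difference first):

import csv

def csv_to_list(path):
    # Read every row of a tab-delimited file into a list of lists.
    with open(path, 'rb') as f:
        return list(csv.reader(f, delimiter='\t'))

new_emails_list = csv_to_list('new_emails.tsv')
old_emails_list = csv_to_list('old_emails.tsv')

# Compute the column indexes once, from the header rows only.
new_email_index = new_emails_list[0].index('EMAIL')
old_email_index = old_emails_list[0].index('EMAIL')

# Emails present in the new file (header row skipped).
s_new_emails = set(row[new_email_index] for row in new_emails_list[1:])

with open('exclusions.tsv', 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for row in old_emails_list[1:]:
        # One pass over the old rows, each with an O(1) set lookup.
        if row[old_email_index] not in s_new_emails:
            writer.writerow(row)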


3 Comments

Clarifying the final point (iterating over the old email list): the main speedup is that instead of iterating over a set (O(n)) and searching a list (O(n)) for each element, you iterate over the list (O(n)) and look each element up in the set (O(1)).
This worked perfectly and was much faster than my approach.
@Asif: if you have the patience, I would like to see a table with the times taken for all the solutions (the initial one, Matt Jordan's, and mine)

This awk version is just to show the O(n) idea; if it works fast enough, rewrite it in Python (a rough Python translation is sketched at the end of this answer).

awk -F"\t" 'NR==FNR { a[$3]=1; next; }; 
            $2 && !a[$2] {print} ' new_emails.tsv old_emails.tsv > new.tsv

Explanation:

  • line 1 saves the emails of new_emails in array a
  • line 2: if a (non-empty) email of old_emails is not in a, write its record

(if possible, compare the times taken and show us...)
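For comparison, a rough Python translation of the same single-pass idea (a sketch only, in the Python 2 style of the question; it assumes the column positions the awk script uses, EMAIL as the 3rd field of new_emails.tsv and the 2nd field of old_emails.tsv, and streams the old file instead of loading it into memory):

import csv

# Collect the new emails (3rd column, index 2), like awk's a[$3]=1.
with open('new_emails.tsv', 'rb') as f:
    new_emails = set(row[2] for row in csv.reader(f, delimiter='\t'))

# Stream the old file and keep rows whose (non-empty) email is not in the set,
# like awk's  $2 && !a[$2] {print}.
with open('old_emails.tsv', 'rb') as f_old, open('new.tsv', 'wb') as f_out:
    writer = csv.writer(f_out, delimiter='\t')
    for row in csv.reader(f_old, delimiter='\t'):
        if row[1] and row[1] not in new_emails:
            writer.writerow(row)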

3 Comments

@Asif: I added an explanation; if possible, test whether this works, and how fast it runs
It ran extremely fast, but it seems that it is including rows with empty 'EMAIL' fields
@Asif: added a "non-empty" constraint. Is this what you were saying?
