I have two lists of email addresses: new_emails.tsv and old_emails.tsv.
There are about 10 million rows in old_emails.tsv and about 1.5 million rows in new_emails.tsv. I want to create a new .tsv file containing the email addresses that are in old_emails.tsv but not in new_emails.tsv, because in a later step I need to remove that set of emails from my MySQL database.
The two files have different headers, i.e.:
new_emails.tsv has ['ACCTNUM', 'CUST_ID', 'EMAIL', 'CODE']
old_emails.tsv has ['ACCTNUM', 'EMAIL', 'OPTION']
To solve this, I pull the email field from each file into its own list, convert the lists to sets, and take the difference (the overloaded '-' operator). With the resulting emails in an exclusion_emails list, I then need to pull the matching rows from old_emails.tsv and put those rows in a new file called exclusion_emails.tsv. However, turning my exclusion_emails list into a list of the rows taken from old_emails.tsv is extremely slow, because for every excluded email I scan old_emails_list again from the start. Is there a way to improve this performance? My full code is here:
import csv
def csv_to_list(file):
    output_list = []
    with open(file, 'rb') as f_new_emails:
        reader = csv.reader(f_new_emails, delimiter='\t')
        for line in reader:
            output_list.append(line)
    return output_list
new_emails_list = csv_to_list('new_emails.tsv')
old_emails_list = csv_to_list('old_emails.tsv')
# Get the index for the email field
def get_email_index(alist):
    if 'EMAIL' in alist:
        return alist.index('EMAIL')
    elif 'email' in alist:
        return alist.index('email')
s_new_emails = set([row[get_email_index(new_emails_list[0])] for row in new_emails_list])
s_old_emails = set([row[get_email_index(old_emails_list[0])] for row in old_emails_list])
exclusion_emails = [email for email in (s_old_emails - s_new_emails)]
# print("%s emails in the new list" % len(new_emails_list))
# print("%s emails in the old list" % len(old_emails_list))
# print("%s emails in the old list but not in the new list" % len(exclusion_emails))
# Creating the new file
exclusion_rows = []
operations = 0
with open('exclusions.tsv', 'wb') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for email in exclusion_emails:
        for row in old_emails_list:
            operations += 1
            if email in row:
                writer.writerow(row)
                # Record the row so the final count printed below is not always 0
                exclusion_rows.append(row)
                break
print(len(exclusion_rows))
Any help would be appreciated!

O(n*log n). Try putting them into a Python set; then it becomes an O(n) operation. old_emails_list is actually a list of lists. You might try set([tuple(x) for x in old_emails_list])
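To make the set suggestion concrete, here is a minimal sketch of a single-pass version. It assumes Python 3 (the question's code opens files in 'rb'/'wb' mode, which suggests Python 2), and the helper names find_email_index, load_emails and write_exclusions are made up for illustration. The idea is to keep only the emails from new_emails.tsv in a set, then stream old_emails.tsv once and write every row whose email is not in that set, so each row costs one set lookup instead of a scan over the whole list.

```python
import csv


def find_email_index(header):
    """Return the position of the email column, matching case-insensitively."""
    lowered = [name.strip().lower() for name in header]
    return lowered.index('email')


def load_emails(path):
    """Read one TSV file and return the set of values in its email column."""
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        email_idx = find_email_index(next(reader))  # first row is the header
        return {row[email_idx] for row in reader if row}


def write_exclusions(old_path, new_path, out_path):
    """Write rows of old_path whose email is absent from new_path.

    Returns the number of data rows written (the header is not counted).
    """
    new_emails = load_emails(new_path)
    written = 0
    with open(old_path, newline='') as src, \
            open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst, delimiter='\t')
        header = next(reader)
        email_idx = find_email_index(header)
        writer.writerow(header)
        for row in reader:
            # Set membership is O(1) on average, so this loop is a single
            # linear pass over the old file.
            if row and row[email_idx] not in new_emails:
                writer.writerow(row)
                written += 1
    return written
```

Calling write_exclusions('old_emails.tsv', 'new_emails.tsv', 'exclusions.tsv') would then replace the nested loop. Two differences from the code in the question are worth checking against your data: this sketch compares only the email column (the original's "email in row" matches any field), and it writes every old row with an excluded email, including duplicates, rather than stopping at the first match. It also never holds old_emails.tsv in memory, only the 1.5 million new emails.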