3
\$\begingroup\$

Whenever a row ID (oddly placed in column 8, i.e. row[7]) appears again after its first occurrence, I want to write that row to a second file. My code is extremely slow on the input, a 40-column CSV with about a million rows. This is what I have:

import csv

def in_out_gorbsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # skip the header row
    checklist = []  # IDs seen so far
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            # first occurrence of this ID
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            # repeated ID
            outf2.writerow(row)
\$\endgroup\$

1 Answer

3
\$\begingroup\$

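Since checklist is a list, each "not in" test has to scan the elements one by one until it finds a match or reaches the end. In other words, a single lookup is \$O(n)\$, and with about a million rows the loop as a whole is \$O(n^2)\$. Use a set() instead (adding IDs with checklist.add(id_num) rather than append): membership tests on a set take \$O(1)\$ time on average, which makes the whole loop \$O(n)\$ and much faster.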

Also, don't forget to close the open file handles, for example by opening the files in with blocks.
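
Both suggestions can be combined in a rough sketch like the one below. It assumes Python 3 (the question's code is Python 2, so the file modes and the next() call differ), and the function and parameter names are made up for illustration:

```python
import csv

def split_repeated_ids(in_path, first_path, repeat_path, id_col=7):
    """Write the first row seen for each ID to first_path and any
    later rows with the same ID to repeat_path.

    Hypothetical helper: names and the id_col default are assumptions.
    """
    seen = set()  # set membership is O(1) on average, unlike a list
    # newline='' is what the csv module expects for files in Python 3;
    # the with block closes all three files, even if an error occurs.
    with open(in_path, newline='') as src, \
         open(first_path, 'w', newline='') as first_out, \
         open(repeat_path, 'w', newline='') as repeat_out:
        reader = csv.reader(src)
        first_writer = csv.writer(first_out, lineterminator='\n')
        repeat_writer = csv.writer(repeat_out, lineterminator='\n')
        next(reader, None)  # skip the header row, as the original code does
        for row in reader:
            id_num = row[id_col]
            if id_num in seen:
                repeat_writer.writerow(row)
            else:
                seen.add(id_num)
                first_writer.writerow(row)
```

The only changes that matter for speed are replacing the list with a set and append with add; the with block is there so the files get closed.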

\$\endgroup\$
1
  • 1
    \$\begingroup\$ This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off! \$\endgroup\$ Commented Nov 30, 2014 at 10:13
