0

I am trying to process a text file of more than 1GB and save the data into a MySQL database using Python.

I have pasted some sample code below:

import os
import MySQLdb as mdb

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")

file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"

file_open = open(file_path, 'r')

for line in file_open:
    result_words = line.split('\t')
    query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
    query += " VALUES (%s,%s,'%s',%s) " % (result_words[0],result_words[1],result_words[2],result_words[3])
    cursor = conn.cursor()
    cursor.execute( query )
    conn.commit()

Actually there are more than 18 columns the data is being inserted into; I have pasted only four as an example.

When I run the above code, the execution takes several hours.

My questions are:

  1. Is there an alternative way to process the 1GB text file in Python very quickly?
  2. Is there any framework that can process the 1GB text file and save the data into the database very quickly?
  3. How can I process a large (1GB) text file within minutes (is that possible?) and save the data into the database? My main concern is that we need to process the 1GB file as fast as possible, not over hours.

Edited Code

query += " VALUES (%s,%s,'%s',%s) " % (int(result_words[0] if result_words[0] != '' else ''),int(result_words[2] if result_words[2] != '' else ''),result_words[3] if result_words[3] != '' else '',result_words[4] if result_words[4] != '' else '')

Actually I am submitting the values in the above format (after checking whether each value exists).

  • Have you tried this approach and measured how well it performs? If you don't know what the bottleneck in your program is (disk, parsing the file, or storage in the DB), then blindly optimizing one of them isn't going to give much of a speedup. Commented Nov 19, 2012 at 10:10
  • Yeah, I tried the above approach and it took more than 7 hrs, so I approached SO for the exact way to process it. Commented Nov 19, 2012 at 10:13
  • Why are you doing that in your edit? That's what passing the values as parameters to the cursor is for - you don't have to worry about types. Don't use string formatting to build SQL queries. Commented Nov 19, 2012 at 10:36
  • Actually, this is not valid code. int('') raises ValueError. Commented Nov 19, 2012 at 10:38
  • Actually, sometimes I get integer values from the list after splitting, so I convert them to the int data type (which is the type I created for that field in the database). Commented Nov 19, 2012 at 10:42

3 Answers

5

Bit of a wild guess, but I'd say the conn.commit() for every line in the file would make a big difference. Try moving it outside the loop. You also don't need to recreate the cursor in each iteration of the loop - just do it once before the loop.
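A minimal sketch of what that restructuring might look like, reusing the connection settings, file path, table and columns from the question (the query string is built the same way as in the question; only the cursor creation and the commit are moved):

import MySQLdb as mdb

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data',
                   host='localhost', charset="utf8")
cursor = conn.cursor()          # create the cursor once, before the loop

file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"
with open(file_path, 'r') as file_open:
    for line in file_open:
        result_words = line.split('\t')
        query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
        query += " VALUES (%s,%s,'%s',%s) " % (result_words[0], result_words[1],
                                               result_words[2], result_words[3])
        cursor.execute(query)

conn.commit()                   # commit once, after all the inserts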



2

As well as what Tim has said, I would have a look at MySQL's LOAD DATA INFILE. Do any necessary pre-processing in Python and write that to a separate file which MySQL has access to, then execute the appropriate query and let MySQL do the loading.
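For illustration, a rough sketch of how that might be issued from Python, assuming the pre-processed, tab-delimited output has been written to a hypothetical /tmp/preprocessed.txt and that LOCAL INFILE is allowed by both the client library and the server:

import MySQLdb as mdb

# local_infile=1 is needed for LOAD DATA LOCAL; the server must permit it too
conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data',
                   host='localhost', charset="utf8", local_infile=1)
cursor = conn.cursor()
cursor.execute("""
    LOAD DATA LOCAL INFILE '/tmp/preprocessed.txt'
    INTO TABLE PerformaceReport
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    (campaignID, keywordID, keyword, avgPosition)
""")
conn.commit()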

Or, possibly rewrite the Python code to what it should be anyway (you should be passing the values as parameters, not doing string manipulation - SQL injection attacks, for one):

query = 'insert into something(a, b, c, d) values(%s, %s, %s, %s)'
with open('file.tab') as fin:
    # each row becomes a tuple of parameters; the driver handles quoting/escaping
    values = (row.split('\t')[:4] for row in fin)
    cursor.executemany(query, values)
conn.commit()


0
import os
import MySQLdb as mdb
import csv

def read_file():
    file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"
    with open(file_path, 'r') as infile:
        file_open = csv.reader(infile, delimiter='\t')
        cache = []
        for line in file_open:
            cache.append(line)
            if len(cache) > 500:
                # hand rows to executemany in batches of ~500
                yield cache
                cache = []
        yield cache  # remaining rows

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")
cursor = conn.cursor()
query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition) VALUES (%s,%s,%s,%s)"
for rows in read_file():
    try:
        cursor.executemany(query, rows)
    except mdb.Error:
        conn.rollback()
    else:
        conn.commit()

The code is untested and might contain minor errors, but it should be faster, though not as fast as using LOAD DATA INFILE.

9 Comments

Fixed, thanks guys. I'm also not sure whether executemany can take a generator; I will check back later.
In the above code you are yielding a line, but we need the data from that line. As I told you, I have many columns, i.e. I split the line, take the individual elements and submit them to the destination fields as values. How does that work with just yield?
@Kouripm instead of yield line you would change it to yield line[0], line[23], whatever...
@jon: oh fine, then the original code I pasted doesn't suit, I think. Please have a look at the edited code above.
Also, when I tried the above example it showed this error for except mdb.Errors: AttributeError: 'module' object has no attribute 'Errors'
