0

I am trying to process a text file of more than 1GB and save the data into a MySQL database using Python.

I have pasted some sample code below:

import os
import MySQLdb as mdb

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")

file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"

file_open = open(file_path, 'r')

for line in file_open:
    result_words = line.split('\t')
    query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
    query += " VALUES (%s,%s,'%s',%s) " % (result_words[0],result_words[1],result_words[2],result_words[3])
    cursor = conn.cursor()
    cursor.execute( query )
    conn.commit()

Actually there are more than 18 columns the data is being inserted into; I have pasted only four as an example.

When I run the above code, the execution takes several hours.

My questions are:

  1. Is there an alternative way to process the 1GB text file in Python very quickly?
  2. Is there any framework that can process the 1GB text file and save the data into the database very quickly?
  3. How can I process a large (1GB) text file within minutes (is that possible?) and save the data into the database? My main concern is that we need to process the 1GB file as fast as possible, not over hours.

Edited Code

query += " VALUES (%s,%s,'%s',%s) " % (int(result_words[0] if result_words[0] != '' else ''),int(result_words[2] if result_words[2] != '' else ''),result_words[3] if result_words[3] != '' else '',result_words[4] if result_words[4] != '' else '')

Actually I am submitting the values in the above format (after checking whether each value exists).

  • Have you tried this approach and measured how well it performs? If you don't know what the bottleneck in your program is (disk, parsing the file, or storage in the DB), then blindly optimizing one of them isn't going to give much of a speedup. Commented Nov 19, 2012 at 10:10
  • Yeah, I tried the above approach and it took more than 7 hrs, so I approached SO for the exact way to process it. Commented Nov 19, 2012 at 10:13
  • Why are you doing that in your edit? That's what passing the values as parameters to the cursor is for - you don't have to worry about types. Don't use string formatting to build SQL queries. Commented Nov 19, 2012 at 10:36
  • Actually, this is not valid code. int('') raises ValueError. Commented Nov 19, 2012 at 10:38
  • Actually, sometimes I get integer values from the list after splitting, so I convert them to the int data type (which is the type I created for that field in the database). Commented Nov 19, 2012 at 10:42

3 Answers

5

Bit of a wild guess, but I'd say the conn.commit() for every line in the file would make a big difference. Try moving it outside the loop. You also don't need to recreate the cursor in each iteration of the loop - just do it once before the loop.
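A minimal sketch of what that restructuring might look like, reusing the connection settings, file path, table and columns from the question (the query string is built the same way as in the question; only the cursor creation and the commit are moved):

import MySQLdb as mdb

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data',
                   host='localhost', charset="utf8")
cursor = conn.cursor()          # create the cursor once, before the loop

file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"
with open(file_path, 'r') as file_open:
    for line in file_open:
        result_words = line.split('\t')
        query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
        query += " VALUES (%s,%s,'%s',%s) " % (result_words[0], result_words[1],
                                               result_words[2], result_words[3])
        cursor.execute(query)

conn.commit()                   # commit once, after all the inserts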



2

As well as what Tim has said, I would have a look at MySQL's LOAD DATA INFILE. Do any necessary pre-processing in Python and write that to a separate file which MySQL has access to, then execute the appropriate query and let MySQL do the loading.
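For illustration, a rough sketch of how that might be issued from Python, assuming the pre-processed, tab-delimited output has been written to a hypothetical /tmp/preprocessed.txt and that LOCAL INFILE is allowed by both the client library and the server:

import MySQLdb as mdb

# local_infile=1 is needed for LOAD DATA LOCAL; the server must permit it too
conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data',
                   host='localhost', charset="utf8", local_infile=1)
cursor = conn.cursor()
cursor.execute("""
    LOAD DATA LOCAL INFILE '/tmp/preprocessed.txt'
    INTO TABLE PerformaceReport
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    (campaignID, keywordID, keyword, avgPosition)
""")
conn.commit()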

Or, possibly rewrite the Python code to what it should be anyway (you should be passing the values as parameters, not doing string manipulation - SQL injection attacks, for one):

query = 'insert into something(a, b, c, d) values(%s, %s, %s, %s)'
with open('file.tab') as fin:
    # each row becomes a tuple of parameters; the driver handles quoting/escaping
    values = (row.split('\t')[:4] for row in fin)
    cursor.executemany(query, values)
conn.commit()


0
import os
import MySQLdb as mdb
import csv

def read_file():
    file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"
    with open(file_path, 'r') as infile:
        file_open = csv.reader(infile, delimiter='\t')
        cache = []
        for line in file_open:
            cache.append(line)
            if len(cache) > 500:
                # hand rows to executemany in batches of ~500
                yield cache
                cache = []
        yield cache  # remaining rows

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")
cursor = conn.cursor()
query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition) VALUES (%s,%s,%s,%s)"
for rows in read_file():
    try:
        cursor.executemany(query, rows)
    except mdb.Error:
        conn.rollback()
    else:
        conn.commit()

The code is untested and might contain minor errors, but it should be faster, though not as fast as using LOAD DATA INFILE.

9 Comments

Fixed, thanks guys. I'm also not sure whether executemany can take a generator; I will check back later.
In the above code you are yielding a line, but we need the data from that line. As I told you, I have many columns, i.e. I split the line, take the individual elements and submit them to the destination fields as values. How does that work with just yield?
@Kouripm instead of yield line you would change it to yield line[0], line[23], whatever...
@jon: oh fine, then the original code I pasted doesn't suit, I think. Please have a look at the edited code above.
Also, when I tried the above example it showed this error for except mdb.Errors: AttributeError: 'module' object has no attribute 'Errors'
