
I have a script that reads a file like this:

with open(filepath + filename, 'r') as inFile:
    for line in inFile:
        ...

The above can read (and process) a 7 GB file with its different columns in about 3 minutes. My problem is the inserts into the database: they are too slow. I tried mimicking the way an SSIS dataflow performs its inserts, but I didn't succeed (I don't know what it uses either). I tried pyodbc (the classic execute and executemany, but they are too slow even with fast_executemany). I also used bcpandas (the equivalent of bulk inserts, at ~5K rows/s), but performance still suffers because I have to convert my data to a DataFrame, which takes roughly half of the overall time. I also tried printing the data and running the script from SSIS so it would do the inserts, but Python's print is not meant for that and is too slow. Finally, I tried writing a new file for SSIS to read, but that takes as much time as an executemany.
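For reference, the pyodbc attempt looked roughly like this (a sketch, not my exact code; the table and column names are placeholders):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=test;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter batches to the server in bulk

insert_query = "INSERT INTO table1 (col1, col2, col3) VALUES (?, ?, ?)"

batch = []
with open("C:/file.txt", "r") as infile:
    for line in infile:
        col1, col2, col3 = line.rstrip("\n").split("|")[:3]
        batch.append((col1, col2, col3))
        if len(batch) == 100000:  # flush in chunks to bound memory use
            cursor.executemany(insert_query, batch)
            conn.commit()
            batch = []
if batch:  # flush the remainder
    cursor.executemany(insert_query, batch)
    conn.commit()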

Is there any other way of doing this that I missed? What method does SSIS use to perform dataflow inserts? My database is MSSQL.

import pyodbc  # tested executemany()
import time
import urllib.parse
import pandas as pd
import bcpandas
import sqlalchemy

# start time
startTime = time.time()

## file parameters
filepath = "C:/"
filename = "file.txt"

# URL-encode the ODBC connection string before handing it to SQLAlchemy
conn_str = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};Server=localhost;Database=test;Trusted_Connection=yes;"
)
alchemy_eng = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=" + conn_str)
bcpandas_eng = bcpandas.SqlCreds.from_engine(alchemy_eng)

### process file
counter = 0
insert_params = []

with open(filepath + filename, 'r') as infile:
    for line in infile:
        # temp table partial structure
        col1,col2,col3,.. = line.split('|')
        if col2 == "valueA":
            col2 = None
        # couple more ..
        col3 = col3.replace(',', '.')
        # couple more ..
        # list of elements
        insert_params.append((col1,col2,col3,..))
        counter += 1
        # flush every million rows, not all at once, to avoid a memory crash
        if counter == 1000000:
            df = pd.DataFrame(insert_params)
            df.columns = ['col1',...]
            bcpandas.to_sql(df, 'table1', bcpandas_eng, if_exists="append")
            # reset
            insert_params = []
            counter = 0

# flush whatever is left over after the loop
if insert_params:
    df = pd.DataFrame(insert_params)
    df.columns = ['col1',...]
    bcpandas.to_sql(df, 'table1', bcpandas_eng, if_exists="append")

# end time
endTime = time.time()
# elapsed time
print("elapsed time", endTime - startTime)
  • Please explain better what you are trying to do and show the code you have tried so far for the insertion. Commented Dec 14, 2022 at 9:14
  • Kind of a lame solution, but what if you process your file, save it to disk, and then use BULK INSERT? learn.microsoft.com/en-us/sql/t-sql/statements/… Commented Dec 14, 2022 at 9:41
  • @AtanasAtanasov I tried saving the file, but I waste too much time on just saving. Commented Dec 14, 2022 at 9:44
  • @gtomer Which solution would you like to see? I tried three. Commented Dec 14, 2022 at 9:46
  • What do you mean by 'inserts to the database'? Adding rows? Commented Dec 14, 2022 at 9:50
