
I have a Postgres database, and I have inserted some data into a table. Because of issues with the internet connection, some of the data couldn't be written. The file that I am trying to write into the database is large (about 330712484 rows; even the wc -l command takes a while to complete).

Now, the column row_id is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert those specific rows. (I estimate only about 1.8% of the data isn't in the table ...) As a first step, I tried to check whether the primary keys were already in the database, like so:

import csv
import psycopg2

# connector is the connection string / DSN; fileName is the CSV file
conn = psycopg2.connect(connector)
cur  = conn.cursor()

with open(fileName) as f:

    # read and parse the CSV header
    header = f.readline().strip()
    header = list(csv.reader([header]))[0]
    print(header)

    # check only the first few rows
    for i, l in enumerate(f):
        if i > 10: break
        print(l.strip())

        row_id = l.split(',')[0]

        # one SELECT per row to check whether the primary key is already present
        query = 'select * from raw_data.chartevents where row_id={}'.format(row_id)
        cur.execute(query)
        print(cur.fetchall())

cur.close()
conn.close()

Even for the first few rows of data, checking whether the primary key exists takes a really long time.

What would be the fastest way of doing this?
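For reference, the per-row SELECTs above can be batched into a single query with ANY, which removes one network round trip per row. A minimal sketch of that check, reusing connector and fileName from the snippet above (the batch size is an arbitrary choice):

import psycopg2

BATCH = 10000    # arbitrary batch size

def missing_ids(cur, ids):
    # return the ids from this batch that are NOT already in the table
    cur.execute('select row_id from raw_data.chartevents where row_id = any(%s)', (ids,))
    found = {r[0] for r in cur.fetchall()}
    return [i for i in ids if i not in found]

conn = psycopg2.connect(connector)
cur  = conn.cursor()

missing = []
with open(fileName) as f:
    next(f)                                  # skip the CSV header
    batch = []
    for line in f:
        batch.append(int(line.split(',')[0]))
        if len(batch) >= BATCH:
            missing.extend(missing_ids(cur, batch))
            batch = []
    if batch:                                # last partial batch
        missing.extend(missing_ids(cur, batch))

cur.close()
conn.close()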

2 Comments
  • Are the ids sequential with no gaps? Commented Aug 15, 2017 at 12:08
  • I am not sure of that, unfortunately. The data is anonymized, and the row_ids are part of the data. I would like to say yes, but they are not in order ... Commented Aug 15, 2017 at 14:55

2 Answers


The fastest way to insert data into PostgreSQL is the COPY protocol, which is implemented in psycopg2. COPY will not allow you to check whether the target id already exists, though. The best option is to COPY your file's contents into a temporary table, then INSERT or UPDATE from it, as in the Batch Update article I wrote on my http://tapoueh.org blog a while ago.

With a recent enough version of PostgreSQL (9.5 or newer) you may use:

INSERT INTO ...
SELECT * FROM copy_target_table
    ON CONFLICT (pkey_name) DO NOTHING
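
Put together, the whole flow could look roughly like this in psycopg2. This is only a sketch: it reuses connector, fileName, the raw_data.chartevents table, and the row_id primary key from the question, plus the copy_target_table staging name from the snippet above.

import psycopg2

conn = psycopg2.connect(connector)
cur  = conn.cursor()

# staging table with the same structure as the target
cur.execute('create temp table copy_target_table (like raw_data.chartevents including defaults)')

# bulk-load the whole file with the COPY protocol
with open(fileName) as f:
    cur.copy_expert('copy copy_target_table from stdin with (format csv, header true)', f)

# move the rows over, skipping primary keys that already exist
cur.execute('''
    insert into raw_data.chartevents
    select * from copy_target_table
    on conflict (row_id) do nothing
''')

conn.commit()
cur.close()
conn.close()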

5 Comments

Thanks Dimitri. I had used COPY as a first attempt, but my flaky wireless connection kept dropping, and I had to abandon that idea. That's why I had to split the file into smaller chunks and then commit batches from each of them ...
I did go through your blog. Very impressive ideas. I will try incorporating them when needed.
I just deleted the entire table and started another batch upload. This time, I will save the rows that fail to commit so I know which ones are bad.
If you need to use COPY and triage bad data, see pgloader, which implements that exactly.
Thanks! This looks like what I am looking for!

Can I offer a workaround?

The index will be checked for each row inserted. Also, Postgres performs the whole insert in a single transaction, so you are effectively storing all of this data before any of it is committed.

Could I suggest you drop the indexes to avoid this slowdown, split the file into smaller files using head -n [int] > newfile or something similar, and then perform the COPY commands separately for each one?
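
In the same spirit, the chunking can also be done from Python with psycopg2, committing after each chunk so that a dropped connection only loses the chunk in flight. A sketch under the same assumptions as above (connector, fileName, and the raw_data.chartevents table from the question; the chunk size is arbitrary, and dropping/rebuilding the indexes is left out):

import itertools
from io import StringIO

import psycopg2

CHUNK = 1000000    # rows per chunk; arbitrary

conn = psycopg2.connect(connector)
cur  = conn.cursor()

with open(fileName) as f:
    f.readline()                                  # skip the CSV header
    while True:
        rows = list(itertools.islice(f, CHUNK))
        if not rows:
            break
        buf = StringIO(''.join(rows))             # this chunk as an in-memory file
        cur.copy_expert('copy raw_data.chartevents from stdin with (format csv)', buf)
        conn.commit()                             # one transaction per chunk

cur.close()
conn.close()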

3 Comments

I uploaded the entire file by initially splitting it into smaller files. However, I didn't log the values that had problems when inserting the data, so this is what I am stuck with. I have half a mind to delete the table and recreate the entire thing ...
I was hoping that someone would be able to tell me some cool way of doing this instead of recreating the entire table ...
I feel your pain, been there many times. Sometimes it's easier to get the data to where you want it and then clean it up after.
