
I have a Postgres database, and I have inserted some data into a table. Because of issues with the internet connection, some of the data couldn't be written. The file that I am trying to write into the database is large (about 330712484 rows; even the wc -l command takes a while to complete).

Now, the column row_id is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert those specific rows. (I estimate only about 1.8% of the data isn't in the table ...) As a first step, I tried to check whether the primary keys were already in the database, like so:

import csv
import psycopg2

# connector is the connection string / DSN; fileName is the CSV file
conn = psycopg2.connect(connector)
cur  = conn.cursor()

with open(fileName) as f:

    # read and parse the CSV header
    header = f.readline().strip()
    header = list(csv.reader([header]))[0]
    print(header)

    # check only the first few rows
    for i, l in enumerate(f):
        if i > 10: break
        print(l.strip())

        row_id = l.split(',')[0]

        # one SELECT per row to check whether the primary key is already present
        query = 'select * from raw_data.chartevents where row_id={}'.format(row_id)
        cur.execute(query)
        print(cur.fetchall())

cur.close()
conn.close()

Even for the first few rows of data, checking whether the primary key exists takes a really long time.

What would be the fastest way of doing this?
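For reference, the per-row SELECTs above can be batched into a single query with ANY, which removes one network round trip per row. A minimal sketch of that check, reusing connector and fileName from the snippet above (the batch size is an arbitrary choice):

import psycopg2

BATCH = 10000    # arbitrary batch size

def missing_ids(cur, ids):
    # return the ids from this batch that are NOT already in the table
    cur.execute('select row_id from raw_data.chartevents where row_id = any(%s)', (ids,))
    found = {r[0] for r in cur.fetchall()}
    return [i for i in ids if i not in found]

conn = psycopg2.connect(connector)
cur  = conn.cursor()

missing = []
with open(fileName) as f:
    next(f)                                  # skip the CSV header
    batch = []
    for line in f:
        batch.append(int(line.split(',')[0]))
        if len(batch) >= BATCH:
            missing.extend(missing_ids(cur, batch))
            batch = []
    if batch:                                # last partial batch
        missing.extend(missing_ids(cur, batch))

cur.close()
conn.close()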

2 Comments
  • Are the ids sequential with no gaps? Commented Aug 15, 2017 at 12:08
  • I am not sure of that, unfortunately. The data is anonymized, and the row_ids are part of the data. I would like to say yes, but they are not in order ... Commented Aug 15, 2017 at 14:55

2 Answers


The fastest way to insert data into PostgreSQL is the COPY protocol, which is implemented in psycopg2. COPY will not allow you to check whether the target id already exists, though. The best option is to COPY your file's contents into a temporary table, then INSERT or UPDATE from it, as in the Batch Update article I wrote on my http://tapoueh.org blog a while ago.

With a recent enough version of PostgreSQL (9.5 or newer) you may use:

INSERT INTO ...
SELECT * FROM copy_target_table
    ON CONFLICT (pkey_name) DO NOTHING
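
Put together, the whole flow could look roughly like this in psycopg2. This is only a sketch: it reuses connector, fileName, the raw_data.chartevents table, and the row_id primary key from the question, plus the copy_target_table staging name from the snippet above.

import psycopg2

conn = psycopg2.connect(connector)
cur  = conn.cursor()

# staging table with the same structure as the target
cur.execute('create temp table copy_target_table (like raw_data.chartevents including defaults)')

# bulk-load the whole file with the COPY protocol
with open(fileName) as f:
    cur.copy_expert('copy copy_target_table from stdin with (format csv, header true)', f)

# move the rows over, skipping primary keys that already exist
cur.execute('''
    insert into raw_data.chartevents
    select * from copy_target_table
    on conflict (row_id) do nothing
''')

conn.commit()
cur.close()
conn.close()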

5 Comments

Thanks Dimitri. I had used COPY as a first attempt, but my flaky wireless connection kept dropping, and I had to abandon that idea. That's why I had to split the file into smaller chunks and then commit batches from each of them ...
I did go through your blog. Very impressive ideas. I will try incorporating them when needed.
I just deleted the entire table and started another batch upload. This time, I will save the rows that fail to commit so I know which ones are bad.
If you need to use COPY and triage bad data, see pgloader, which implements that exactly.
Thanks! This looks like what I am looking for!

Can I offer a workaround?

The index will be checked for each row inserted. Also, Postgres performs the whole insert in a single transaction, so you are effectively storing all of this data before any of it is committed.

Could I suggest you drop the indexes to avoid this slowdown, split the file into smaller files using head -n [int] > newfile or something similar, and then perform the COPY commands separately for each one?
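
In the same spirit, the chunking can also be done from Python with psycopg2, committing after each chunk so that a dropped connection only loses the chunk in flight. A sketch under the same assumptions as above (connector, fileName, and the raw_data.chartevents table from the question; the chunk size is arbitrary, and dropping/rebuilding the indexes is left out):

import itertools
from io import StringIO

import psycopg2

CHUNK = 1000000    # rows per chunk; arbitrary

conn = psycopg2.connect(connector)
cur  = conn.cursor()

with open(fileName) as f:
    f.readline()                                  # skip the CSV header
    while True:
        rows = list(itertools.islice(f, CHUNK))
        if not rows:
            break
        buf = StringIO(''.join(rows))             # this chunk as an in-memory file
        cur.copy_expert('copy raw_data.chartevents from stdin with (format csv)', buf)
        conn.commit()                             # one transaction per chunk

cur.close()
conn.close()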

3 Comments

I uploaded the entire file by initially splitting it into smaller files. However, I didn't log the values that had problems when inserting the data, so this is what I am stuck with. I have half a mind to delete the table and recreate the entire thing ...
I was hoping that someone would be able to tell me some cool way of doing this instead of recreating the entire table ...
I feel your pain, been there many times. Sometimes it's easier to get the data to where you want it and then clean it up after.
