I have a Postgres database, and I have inserted some data into a table. Because of issues with the internet connection, some of the data couldn't be written. The file that I am trying to write into the database is large (about 330712484 rows; even the wc -l command takes a while to complete).
Now, the column row_id is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert just those specific rows. (I estimate only about 1.8% of the data is missing from the table ...) To start, I tried to check whether the primary keys were already in the database, like so:
import csv
import psycopg2

conn = psycopg2.connect(connector)
cur = conn.cursor()

with open(fileName) as f:
    # Parse the header line to get the column names.
    header = f.readline().strip()
    header = list(csv.reader([header]))[0]
    print(header)

    # Check only the first few rows, one query per row_id.
    for i, line in enumerate(f):
        if i > 10:
            break
        print(line.strip())
        row_id = int(line.split(',')[0])
        cur.execute(
            'select * from raw_data.chartevents where row_id = %s',
            (row_id,),
        )
        print(cur.fetchall())

cur.close()
conn.close()
Even for the first few rows of data, checking whether the primary key exists takes a really long time.
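I assume part of the problem is doing one round trip per row, so checking the keys in batches is the first thing I would try. A rough, untested sketch of what I mean (the batch size is just a guess):

# Untested sketch: check the keys in batches instead of one query per row.
# psycopg2 adapts a Python list to a Postgres array, so "= any(%s)" works.
import csv
import psycopg2

BATCH_SIZE = 10000  # just a guess at a reasonable batch size

conn = psycopg2.connect(connector)
cur = conn.cursor()

def missing_ids(id_batch):
    # Return the ids from this batch that are NOT already in the table.
    cur.execute(
        'select row_id from raw_data.chartevents where row_id = any(%s)',
        (id_batch,),
    )
    found = {r[0] for r in cur.fetchall()}
    return [i for i in id_batch if i not in found]

with open(fileName) as f:
    reader = csv.reader(f)
    next(reader)  # skip the header line
    batch = []
    for row in reader:
        batch.append(int(row[0]))  # row_id is the first column
        if len(batch) >= BATCH_SIZE:
            print(missing_ids(batch))
            batch = []
    if batch:
        print(missing_ids(batch))

cur.close()
conn.close()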
What would be the fastest way of doing this?
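For example, would loading the file into a staging table with COPY and then doing a single set-based insert be the right direction? A rough sketch of what I have in mind (untested; assumes Postgres 9.5+ for ON CONFLICT, enough disk space for a second copy of the data, and a made-up staging table name):

# Untested sketch: copy the whole file into a temp staging table once, then
# let Postgres insert only the missing rows in a single statement.
import psycopg2

conn = psycopg2.connect(connector)
cur = conn.cursor()

# staging_chartevents is a name I made up; it mirrors the target table.
cur.execute('create temp table staging_chartevents '
            '(like raw_data.chartevents)')

with open(fileName) as f:
    # copy_expert streams the CSV; HEADER skips the first line.
    cur.copy_expert(
        'copy staging_chartevents from stdin with (format csv, header)', f)

# row_id is the primary key, so rows that already exist are simply skipped.
cur.execute('insert into raw_data.chartevents '
            'select * from staging_chartevents '
            'on conflict (row_id) do nothing')
conn.commit()

cur.close()
conn.close()

(If ON CONFLICT isn't available, I guess a WHERE NOT EXISTS anti-join against the main table would do the same thing.)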
The row_id values are part of the data. I would like to say they are in order, but unfortunately, they are not ...