
Here is the program I am developing in Python:

Step 1 - We receive a JSON file (the size could be in GBs, e.g. 50 GB or more) from the source system onto our server.

Step 2 - I load the JSON into a Pandas DataFrame using df = pd.read_json(jsonfile)

Step 3 - I write it out to CSV using df.to_csv(temp_csvfile, index=False, header=False, ...)

Step 4 - I use psycopg2 to open a PostgreSQL connection and cursor:

    curr = conn.cursor()

Step 5 - I read the CSV back and load it using copy_from:

    with open(temp_csvfile, 'r') as f:
        curr.copy_from(f, ...)
    conn.commit()
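
Putting these steps together, my current flow looks roughly like this (the table name and connection details below are just placeholders):

    import pandas as pd
    import psycopg2

    jsonfile = "input.json"           # placeholder path
    temp_csvfile = "temp_output.csv"  # placeholder path

    # Steps 2-3: load the whole JSON into a DataFrame, then dump it to CSV
    df = pd.read_json(jsonfile)
    df.to_csv(temp_csvfile, index=False, header=False, sep="\t")

    # Step 4: PostgreSQL connection and cursor
    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="myuser", password="mypassword")
    curr = conn.cursor()

    # Step 5: stream the CSV into the target table with COPY
    # (copy_from reads plain delimited text, so quoted CSV fields are not handled)
    with open(temp_csvfile, "r") as f:
        curr.copy_from(f, "my_table", sep="\t")
    conn.commit()
    conn.close()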

I am seeking feedback on the points below:

a. Will loading the JSON into a Pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?

b. At Step 5 I am again opening the file in read mode; will the same issue come up here, since it might load the file into memory (am I missing anything here)?

c. Is there a better way of doing this?

d. Can Dask be used, since it provides reading data in chunks (I am not familiar with it)?

Please advise

  • You are right to doubt this approach; it will certainly lead to out-of-memory issues. Can you not get the data as multiple JSON files instead of one huge file? The source system would also be better off generating multiple dataset files. Commented Nov 13, 2021 at 18:07
  • Thanks Anand, but right now we cannot control the source systems. Commented Nov 14, 2021 at 8:55
  • OK, have you tried it, and did it result in any error? Can you share more details? Commented Nov 14, 2021 at 8:56
  • Also, you can read such large JSON files in chunks; refer to this answer (stackoverflow.com/a/67126154/14973743). Commented Nov 14, 2021 at 9:01
  • Thanks Anand, I am trying the same with the pandas chunksize option. Commented Nov 14, 2021 at 11:28

1 Answer


You could split your input JSON file into many smaller files, and also use the chunksize parameter when reading the file content into a pandas DataFrame. In addition, the psycopg2 copy_from function supports a buffer size parameter, so you can stream each chunk into PostgreSQL rather than loading the whole CSV at once. Alternatively, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
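
As a rough sketch of that combination, here is one way it could look, assuming the JSON is newline-delimited (so pandas can stream it with lines=True and chunksize) and a hypothetical target table my_table; paths, connection details, and column handling are placeholders:

    import io
    import pandas as pd
    import psycopg2

    jsonfile = "input.json"  # placeholder path
    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="myuser", password="mypassword")
    curr = conn.cursor()

    # Stream the JSON in chunks instead of loading it all at once.
    # chunksize requires lines=True, i.e. newline-delimited JSON (one record per line).
    reader = pd.read_json(jsonfile, lines=True, chunksize=100_000)

    for chunk in reader:
        # Write each chunk to an in-memory buffer rather than a temp CSV on disk.
        buf = io.StringIO()
        chunk.to_csv(buf, index=False, header=False, sep="\t")
        buf.seek(0)
        # copy_from streams the buffer into the table; size is the read buffer size.
        curr.copy_from(buf, "my_table", sep="\t", size=8192)

    conn.commit()
    conn.close()

If plain INSERTs are preferred over COPY, psycopg2.extras.execute_batch(curr, insert_sql, rows, page_size=...) can load each chunk in batches instead, as described in the linked article. Regarding point (d) in the question, dask.dataframe can also read newline-delimited JSON lazily in partitions; a minimal sketch, again with placeholder names:

    import dask.dataframe as dd

    # Each partition is a regular pandas DataFrame; blocksize controls partition size.
    ddf = dd.read_json("input.json", lines=True, blocksize=2**27)

    for part in ddf.to_delayed():
        chunk = part.compute()  # materializes only this partition in memory
        # ... write the chunk to a buffer and copy_from as shown above ...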

References:

  1. Loading 20gb json file in pandas
  2. Loading dataframes data into postgresql table article
  3. Read a large json file into pandas