
Here is the program I am developing in Python:

Step 1 - We receive a JSON file (the size could be in GBs, e.g. 50 GB or more) from the source system onto our server.

Step 2 - I load the JSON into a Pandas DataFrame using df = pd.read_json(jsonfile)

Step 3 - I write it out to CSV using df.to_csv(temp_csvfile, index=False, header=False, ...)

Step 4 - I use psycopg2 to open a PostgreSQL connection and cursor:

    curr = conn.cursor()

Step 5 - I read the CSV back and load it using copy_from:

    with open(temp_csvfile, 'r') as f:
        curr.copy_from(f, ...)
    conn.commit()
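
Putting these steps together, my current flow looks roughly like this (the table name and connection details below are just placeholders):

    import pandas as pd
    import psycopg2

    jsonfile = "input.json"           # placeholder path
    temp_csvfile = "temp_output.csv"  # placeholder path

    # Steps 2-3: load the whole JSON into a DataFrame, then dump it to CSV
    df = pd.read_json(jsonfile)
    df.to_csv(temp_csvfile, index=False, header=False, sep="\t")

    # Step 4: PostgreSQL connection and cursor
    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="myuser", password="mypassword")
    curr = conn.cursor()

    # Step 5: stream the CSV into the target table with COPY
    # (copy_from reads plain delimited text, so quoted CSV fields are not handled)
    with open(temp_csvfile, "r") as f:
        curr.copy_from(f, "my_table", sep="\t")
    conn.commit()
    conn.close()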

I am seeking feedback on the points below:

a. Will loading the JSON into a Pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?

b. At Step 5 I am again opening the file in read mode; will the same issue come up here, since it might load the file into memory (am I missing anything here)?

c. Is there a better way of doing this?

d. Can Dask be used, since it provides reading data in chunks (I am not familiar with it)?

Please advise

  • You are right to doubt this approach; it will certainly lead to out-of-memory issues. Can you not get the data as multiple JSON files instead of one huge file? The source system would also be better off generating multiple dataset files. Commented Nov 13, 2021 at 18:07
  • Thanks Anand, but right now we cannot control the source systems. Commented Nov 14, 2021 at 8:55
  • OK, have you tried it, and did it result in any error? Can you share more details? Commented Nov 14, 2021 at 8:56
  • Also, you can read such large JSON files in chunks; refer to this answer (stackoverflow.com/a/67126154/14973743). Commented Nov 14, 2021 at 9:01
  • Thanks Anand, I am trying the same with the pandas chunksize option. Commented Nov 14, 2021 at 11:28

1 Answer


You could split your input JSON file into many smaller files, and also use the chunksize parameter when reading the file content into a pandas DataFrame. In addition, the psycopg2 copy_from function supports a buffer size parameter, so you can stream each chunk into PostgreSQL rather than loading the whole CSV at once. Alternatively, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
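
As a rough sketch of that combination, here is one way it could look, assuming the JSON is newline-delimited (so pandas can stream it with lines=True and chunksize) and a hypothetical target table my_table; paths, connection details, and column handling are placeholders:

    import io
    import pandas as pd
    import psycopg2

    jsonfile = "input.json"  # placeholder path
    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="myuser", password="mypassword")
    curr = conn.cursor()

    # Stream the JSON in chunks instead of loading it all at once.
    # chunksize requires lines=True, i.e. newline-delimited JSON (one record per line).
    reader = pd.read_json(jsonfile, lines=True, chunksize=100_000)

    for chunk in reader:
        # Write each chunk to an in-memory buffer rather than a temp CSV on disk.
        buf = io.StringIO()
        chunk.to_csv(buf, index=False, header=False, sep="\t")
        buf.seek(0)
        # copy_from streams the buffer into the table; size is the read buffer size.
        curr.copy_from(buf, "my_table", sep="\t", size=8192)

    conn.commit()
    conn.close()

If plain INSERTs are preferred over COPY, psycopg2.extras.execute_batch(curr, insert_sql, rows, page_size=...) can load each chunk in batches instead, as described in the linked article. Regarding point (d) in the question, dask.dataframe can also read newline-delimited JSON lazily in partitions; a minimal sketch, again with placeholder names:

    import dask.dataframe as dd

    # Each partition is a regular pandas DataFrame; blocksize controls partition size.
    ddf = dd.read_json("input.json", lines=True, blocksize=2**27)

    for part in ddf.to_delayed():
        chunk = part.compute()  # materializes only this partition in memory
        # ... write the chunk to a buffer and copy_from as shown above ...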

References:

  1. Loading 20gb json file in pandas
  2. Loading dataframes data into postgresql table article
  3. Read a large json file into pandas