Reading nested json to pandas dataframe

Question

I have the URL below, which returns a JSON response. I need to read this JSON into a pandas DataFrame and perform operations on it. The JSON is nested: it contains multiple lists and dicts within dicts.

URL: 'http://api.nobelprize.org/v1/laureate.json'

I have tried the code below:

import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
print(df.head(5))

Output:

  id       firstname    surname        born        died  \
0  1  Wilhelm Conrad    Röntgen  1845-03-27  1923-02-10   
1  2      Hendrik A.    Lorentz  1853-07-18  1928-02-04   
2  3          Pieter     Zeeman  1865-05-25  1943-10-09   
3  4           Henri  Becquerel  1852-12-15  1908-08-25   
4  5          Pierre      Curie  1859-05-15  1906-04-19   

             bornCountry bornCountryCode                bornCity  \
0  Prussia (now Germany)              DE  Lennep (now Remscheid)   
1        the Netherlands              NL                  Arnhem   
2        the Netherlands              NL              Zonnemaire   
3                 France              FR                   Paris   
4                 France              FR                   Paris   

       diedCountry diedCountryCode   diedCity gender  \
0          Germany              DE     Munich   male   
1  the Netherlands              NL        NaN   male   
2  the Netherlands              NL  Amsterdam   male   
3           France              FR        NaN   male   
4           France              FR      Paris   male   

                                              prizes  
0  [{'year': '1901', 'category': 'physics', 'shar...  
1  [{'year': '1902', 'category': 'physics', 'shar...  
2  [{'year': '1902', 'category': 'physics', 'shar...  
3  [{'year': '1903', 'category': 'physics', 'shar...  
4  [{'year': '1903', 'category': 'physics', 'shar...

But here prizes comes back as a list. If I create a separate dataframe for prizes, its affiliations column is again a list. I want every field to come out as a separate column. Some entries may not have prizes, so that case needs to be handled as well.

I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we have to use meta and errors='ignore' here, but I am not able to get it working. I'd appreciate your input. Thanks.

Akshay Sehgal · Accepted Answer · 2022-01-30 16:15:21Z


You would have to do this in a few steps.

The first step is to extract the top-level records with record_path = ['laureates'].
The second step is to extract the nested prize records with record_path = ['laureates', 'prizes'], using the id from the parent record as the meta path.
The third step is to combine the two datasets by joining on the id column.
The last step is to drop the unnecessary columns and store the result.

import pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
data = resp.json()  # parse the response body once

# One row per laureate
df0 = pd.json_normalize(data, record_path=['laureates'])
# One row per prize, carrying the parent laureate's id as 'laureates.id'
df1 = pd.json_normalize(data, record_path=['laureates', 'prizes'], meta=[['laureates', 'id']])
# Join on the laureate id, then drop the raw list column and the duplicate key
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(['prizes', 'laureates.id'], axis=1)

print('Shape of data ->',output.shape)
print('Columns ->',output.columns)

Shape of data -> (975, 18)

Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
       'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
       'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
       'affiliations', 'overallMotivation'],
      dtype='object')
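
The question also asks about laureates that have no prizes. The following is a minimal sketch of one way to handle that, not a tested solution for the live API: it uses a small made-up sample (the records and the names `parents`, `prizes` and `flat` are assumptions for illustration) and relies on the fact that `record_path` expects the key to be present in every record, so records without `prizes` are filtered out before normalizing and brought back with a left join.

```python
import pandas as pd

# Hypothetical sample mirroring the API's shape; the second laureate has no "prizes" key.
sample = {
    "laureates": [
        {"id": "1", "firstname": "A", "prizes": [
            {"year": "1901", "category": "physics"},
            {"year": "1911", "category": "chemistry"},
        ]},
        {"id": "2", "firstname": "B"},
    ]
}
laureates = sample["laureates"]

# Parent table: one row per laureate, without the nested list column.
parents = pd.json_normalize(laureates).drop(columns=["prizes"], errors="ignore")

# Child table: one row per prize, tagged with the parent id.
# Records lacking "prizes" are skipped because record_path needs the key in every record.
with_prizes = [rec for rec in laureates if rec.get("prizes")]
prizes = pd.json_normalize(with_prizes, record_path="prizes", meta=["id"])

# A left join keeps laureates with no prize rows; their prize columns become NaN.
flat = parents.merge(prizes, on="id", how="left")
print(flat)
```

With the default inner join used in the answer above, a laureate without prizes would silently disappear from the output, so `how="left"` is the design choice that matters here.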


5 Comments

Shivangi._k Over a year ago

Thanks, this helps. But ideally we should have 968 records after the merge, as per the JSON, and this gives more than that. Also, can you share the logic for affiliations too? I am facing a similar issue there: it is a list within prizes. I have also shared an alternate solution in the answer below, which I got from another Slack query.

Akshay Sehgal Over a year ago

The number of records is higher because the merged output has one row per prize, and the nested prizes JSON contains more prizes than there are laureates: a few laureates received more than one prize, so they appear more than once in the data.
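
A quick way to check this is to count the prizes per laureate before merging. This is only a sketch on a made-up sample (the records and the names `prize_counts`, `total_rows` and `multi` are assumptions), but the same counting should apply to the parsed `laureates` list from the API.

```python
import pandas as pd

# Hypothetical sample: laureate "1" has two prizes, laureate "2" has one.
laureates = [
    {"id": "1", "prizes": [{"year": "1903"}, {"year": "1911"}]},
    {"id": "2", "prizes": [{"year": "1902"}]},
]

# Number of prizes per laureate id (0 if the key is missing).
prize_counts = pd.Series({rec["id"]: len(rec.get("prizes", [])) for rec in laureates})

# The merged frame has one row per prize, so its length is the sum of these counts.
total_rows = int(prize_counts.sum())

# Laureates that contribute more than one row to the merged output.
multi = prize_counts[prize_counts > 1]

print(len(laureates), "laureates ->", total_rows, "rows")
print(multi)
```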

Akshay Sehgal Over a year ago

I tried affiliations, but there are some data issues. Namely, some of the empty affiliations contain [[]] instead of nested JSON. This causes problems when working with pd.json_normalize.
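
One possible workaround, sketched here on a made-up sample rather than the live data, is to build the rows by hand and keep only the affiliation entries that are dicts, so that a malformed `[[]]` is treated as "no affiliation". The sample records, the `affiliation.` column prefix and the name `affiliations_df` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the second prize has the malformed empty value [[]].
laureates = [
    {"id": "1", "prizes": [
        {"year": "1901", "category": "physics",
         "affiliations": [{"name": "Munich University", "city": "Munich", "country": "Germany"}]},
    ]},
    {"id": "2", "prizes": [
        {"year": "1902", "category": "peace", "affiliations": [[]]},
    ]},
]

rows = []
for laureate in laureates:
    for prize in laureate.get("prizes", []):
        # Keep only dict entries; this turns the malformed [[]] into [].
        affiliations = [a for a in prize.get("affiliations", []) if isinstance(a, dict)]
        base = {"id": laureate["id"], "year": prize.get("year"), "category": prize.get("category")}
        # Emit one row per affiliation, or a single row without affiliation fields.
        for aff in affiliations or [{}]:
            rows.append({**base, **{f"affiliation.{k}": v for k, v in aff.items()}})

affiliations_df = pd.DataFrame(rows)
print(affiliations_df)
```

The resulting frame has one row per (laureate, prize, affiliation) and can be joined back to the laureate table on `id`, in the same way as the prizes table.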

Akshay Sehgal Over a year ago

Do mark if the answer helped. cheers.

Shivangi._k Over a year ago

Yes, the ideal way is json_normalize, but affiliations is causing an issue. I will wait for further input from others. Thanks.

Shivangi._k · Accepted Answer · 2022-01-30 18:32:18Z


Found an alternate solution as well, with less code. This works.

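import pandas as pd, requests
from flatten_json import flatten

# winners is assumed to be the parsed JSON from the question's URL
winners = requests.get('http://api.nobelprize.org/v1/laureate.json').json()
data = winners['laureates']
# Flatten each laureate record into a single-level dict with '.'-separated keys
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)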

(968, 43)
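
To see why this approach yields one row per laureate but many columns, here is a rough pure-Python sketch of the flattening idea. `flatten_record` is a hypothetical helper written for illustration, not the `flatten_json` implementation, and the sample records are made up. List positions become part of the column name, so a second prize produces extra columns instead of an extra row.

```python
import pandas as pd

def flatten_record(obj, sep=".", prefix=""):
    """Hypothetical helper: recursively flatten dicts and lists into one flat dict."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become key segments
    else:
        return {prefix: obj}  # scalar leaf: store it under the accumulated key
    flat = {}
    for key, value in items:
        new_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(value, sep, new_prefix))
    return flat

# Hypothetical sample: the first laureate has two prizes.
records = [
    {"id": "1", "prizes": [
        {"year": "1903", "affiliations": [{"name": "X"}]},
        {"year": "1911"},
    ]},
    {"id": "2", "prizes": [{"year": "1902"}]},
]

df_wide = pd.DataFrame([flatten_record(rec) for rec in records])
print(df_wide.columns.tolist())
print(df_wide.shape)
```

This wide layout keeps the row count equal to the number of laureates, at the cost of columns such as `prizes.1.year` that are empty for most rows.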
