JSON to Python Pandas dataframe

Question

I'm new to Python and have been focusing on learning pandas and xlsxwriter to help automate some workflows. I've attached a snippet of a JSON file that I got access to and have been unable to transform into a pandas DataFrame.

If I use pd.read_json(filename), it messes up variationProducts and productAttributes by lumping their content together in one cell.

Question: How would I take this JSON file and make it look like the pandas DataFrame output at the bottom?

[
  {
    "ID": "12345",
    "productName": "Product A ",
    "minPrice": "$89.00",
    "maxPrice": "$89.00",
    "variationProducts": [
      {
        "variantColor": "JJ0BVE7",
        "variantSize": "080",
        "sellingPrice": "$89.00",
        "inventory": 3
      },
      {
        "variantColor": "JJ0BVE7",
        "variantSize": "085",
        "sellingPrice": "$89.00",
        "inventory": 6
      }
    ],
    "productAttributes": [
        {
        "ID": "countryOfOrigin",
        "value": "Imported"
      },
      {
        "ID": "csProductCode",
        "value": "1100"
      }
    ]
  },
  {
    "ID": "23456",
    "productName": "Product B",
    "minPrice": "$29.99",
    "maxPrice": "$69.00",
    "variationProducts": [
      {
        "variantColor": "JJ169Q0",
        "variantSize": "050",
        "sellingPrice": "$69.00",
        "inventory": 55
      },
      {
        "variantColor": "JJ123Q0",
        "variantSize": "055",
        "sellingPrice": "$69.00",
        "inventory": 5
      }
    ],
   "productAttributes": [
        {
        "ID": "countryOfOrigin",
        "value": "Imported"
      },
      {
        "ID": "csProductCode",
        "value": "1101"
      }
    ]
  }
]

I made this example output in Excel. The variationProducts are summed at the variantColor level, so for Product A the inventory is the sum of both variants, despite them having different variantSizes:

    ID     productName  maxPrice  minPrice  countryOfOrigin  csProductCode  variantColor  inventory
    12345  Product A    $89       $89       Imported         1100           JJ0BVE7       9
    23456  Product B    $69       $30       Imported         1101           JJ169Q0       55
    23456  Product B    $69       $30       Imported         1101           JJ123Q0       5

Andy Hayden · Accepted Answer · 2017-10-27 16:14:42Z

You can use json_normalize:

In [11]: pd.io.json.json_normalize(d, "variationProducts", ["ID", "maxPrice", "minPrice", "productName"], record_prefix=".")
Out[11]:
   .inventory .sellingPrice .variantColor .variantSize     ID maxPrice minPrice productName
0           3        $89.00       JJ0BVE7          080  12345   $89.00   $89.00  Product A
1           6        $89.00       JJ0BVE7          085  12345   $89.00   $89.00  Product A
2          55        $69.00       JJ169Q0          050  23456   $69.00   $29.99   Product B
3           5        $69.00       JJ123Q0          055  23456   $69.00   $29.99   Product B

In [12]: pd.io.json.json_normalize(d, "productAttributes", ["ID", "maxPrice", "minPrice", "productName"], record_prefix=".")
Out[12]:
               .ID    .value     ID maxPrice minPrice productName
0  countryOfOrigin  Imported  12345   $89.00   $89.00  Product A
1    csProductCode      1100  12345   $89.00   $89.00  Product A
2  countryOfOrigin  Imported  23456   $69.00   $29.99   Product B
3    csProductCode      1101  23456   $69.00   $29.99   Product B

You can then join/merge these two together...
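
One way the join/merge step could look, as a minimal sketch rather than the answer's own code. It assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize, and uses a trimmed copy of the question's data. The attr_ prefix and the variable names are illustrative choices. It sums inventory per variantColor and turns each attribute ID into its own column:

```python
import pandas as pd

# Trimmed copy of the question's data (assumption: the real file has the same shape).
data = [
    {
        "ID": "12345", "productName": "Product A",
        "minPrice": "$89.00", "maxPrice": "$89.00",
        "variationProducts": [
            {"variantColor": "JJ0BVE7", "variantSize": "080", "inventory": 3},
            {"variantColor": "JJ0BVE7", "variantSize": "085", "inventory": 6},
        ],
        "productAttributes": [
            {"ID": "countryOfOrigin", "value": "Imported"},
            {"ID": "csProductCode", "value": "1100"},
        ],
    },
    {
        "ID": "23456", "productName": "Product B",
        "minPrice": "$29.99", "maxPrice": "$69.00",
        "variationProducts": [
            {"variantColor": "JJ169Q0", "variantSize": "050", "inventory": 55},
            {"variantColor": "JJ123Q0", "variantSize": "055", "inventory": 5},
        ],
        "productAttributes": [
            {"ID": "countryOfOrigin", "value": "Imported"},
            {"ID": "csProductCode", "value": "1101"},
        ],
    },
]

meta = ["ID", "productName", "maxPrice", "minPrice"]

# One row per variant, with the product-level fields repeated on each row.
variants = pd.json_normalize(data, "variationProducts", meta)

# Sum inventory per product and colour; sort=False keeps first-seen order.
inventory = variants.groupby(
    meta + ["variantColor"], sort=False, as_index=False
)["inventory"].sum()

# One row per attribute; the prefix avoids a clash between the two "ID" keys.
attrs = pd.json_normalize(data, "productAttributes", ["ID"], record_prefix="attr_")

# Turn each attribute ID into its own column, one row per product.
attrs_wide = (
    attrs.pivot(index="ID", columns="attr_ID", values="attr_value")
    .reset_index()
    .rename_axis(columns=None)
)

# Attach the attribute columns to the per-colour inventory rows.
result = inventory.merge(attrs_wide, on="ID", how="left")
result = result[meta + ["countryOfOrigin", "csProductCode", "variantColor", "inventory"]]
print(result)
```

The prices are left as the original strings here; rounding them as in the Excel mock-up would be a separate step.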

Thanks for sharing this. When I run it I get an error: TypeError: string indices must be integers. How do I fix this?
@Abhay What version of pandas are you using? Are you testing this on a different dictionary to the one you provided?
I'm using pandas version '0.20.3', and yes, I'm trying to use it on the main JSON file, which has a lot more variables in it.
@Andy I loaded the data using json (instead of pandas) and it works now. Not sure what the difference is, but it works! Thanks for helping me with this. Cheers!

Sunnysinh Solanki · Accepted Answer · 2017-10-27 16:10:29Z

I think you'll have to do a little parsing of the data to convert it into a format that pandas can read properly.

First, use json.load on the opened file to read the JSON data into one Python object, which will be a list.

Next, convert this list so that each element is a dictionary whose keys are the column names and whose values are the values you want in those columns.

Once you have the list ready like that, you can use pandas.DataFrame(list).
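
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the io.StringIO object stands in for an opened file, the data is a trimmed single-product copy of the question's snippet, and the names products, rows and flat are made up for the example.

```python
import io
import json

import pandas as pd

# Stand-in for an opened JSON file; with a real file use `with open(path) as fh:`.
fh = io.StringIO("""
[
  {
    "ID": "12345",
    "productName": "Product A",
    "variationProducts": [
      {"variantColor": "JJ0BVE7", "variantSize": "080", "inventory": 3},
      {"variantColor": "JJ0BVE7", "variantSize": "085", "inventory": 6}
    ],
    "productAttributes": [
      {"ID": "countryOfOrigin", "value": "Imported"},
      {"ID": "csProductCode", "value": "1100"}
    ]
  }
]
""")

# Step 1: parse the file into plain Python objects (a list of dicts).
products = json.load(fh)

# Step 2: build one flat dict per variant.
rows = []
for product in products:
    # Top-level scalar fields become columns directly.
    base = {
        key: value
        for key, value in product.items()
        if key not in ("variationProducts", "productAttributes")
    }
    # Each attribute's ID becomes a column name holding its value.
    base.update({attr["ID"]: attr["value"] for attr in product["productAttributes"]})
    for variant in product["variationProducts"]:
        rows.append({**base, **variant})

# Step 3: a list of flat dicts converts directly to a DataFrame.
flat = pd.DataFrame(rows)
print(flat)
```

Each variant gets its own row here; summing inventory per variantColor would be a groupby on top of this frame.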

Sunnysinh Solanki · Accepted Answer · 2017-10-27 16:31:02Z

l = [
  {
    "ID": "12345",
    "productName": "Product A ",
    "minPrice": "$89.00",
    "maxPrice": "$89.00",
    "variationProducts": [
      {
        "variantColor": "JJ0BVE7",
        "variantSize": "080",
        "sellingPrice": "$89.00",
        "inventory": 3,
      },
      {
        "variantColor": "JJ0BVE7",
        "variantSize": "085",
        "sellingPrice": "$89.00",
        "inventory": 6,
      }
    ],
    "productAttributes": [
        {
        "ID": "countryOfOrigin",
        "value": "Imported"
      },
      {
        "ID": "csProductCode",
        "value": "1100"
      }
    ]
  },
  {
    "ID": "23456",
    "productName": "Product B",
    "minPrice": "$29.99",
    "maxPrice": "$69.00",
    "variationProducts": [
      {
        "variantColor": "JJ169Q0",
        "variantSize": "050",
        "sellingPrice": "$69.00",
        "inventory": 55,
      },
      {
        "variantColor": "JJ123Q0",
        "variantSize": "055",
        "sellingPrice": "$69.00",
        "inventory": 5,
      }
    ],
   "productAttributes": [
        {
        "ID": "countryOfOrigin",
        "value": "Imported"
      },
      {
        "ID": "csProductCode",
        "value": "1101"
      }
    ]
  }
]


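import pandas as pd
from itertools import zip_longest  # this was izip_longest on Python 2

final_list = []
for val in l:
    # Start each row from the product's top-level fields.
    d = {key: val[key] for key in val.keys() if key not in ['variationProducts', 'productAttributes']}
    # Pair variants with attributes; the shorter list is padded with None.
    for prods, attrs in zip_longest(val['variationProducts'], val['productAttributes']):
        if prods:
            d.update(prods)
        if attrs:
            d.update({attrs['ID']: attrs['value']})
        # Copy so later updates to d do not change rows already stored.
        final_list.append(d.copy())

pd.DataFrame(final_list)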

This goes a bit over my head, but I tried running it and got this error on line 4: AttributeError: 'str' object has no attribute 'keys'
