3

I would like to convert a csv file into a json file using python 2.7. Down below is the python code I tried but it is not giving me expected result. Also, I would like to know if there is any simplified version than mine. Any help is appreciated.

Here is my csv file (SampleCsvFile.csv):

zipcode,date,state,val1,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6

Here is the expected json file (ExpectedJsonFile.json):

{
        "zipcode": "95110", 
        "state": "CA", 
        "subset": [
            {
                "date": "2015-05-01",
                "val1": "50", 
                "val2": "30.00", 
                "val3": "5.00", 
                "val4": "3.00", 
                "val5": "3"
            }, 
            {
                "date": "2015-06-01", 
                "val1": "67", 
                "val2": "31.00", 
                "val3": "5.00", 
                "val4": "3.00", 
                "val5": "4"
            }, 
            {
                "date": "2015-07-01", 
                "val1": "97", 
                "val2": "32.00", 
                "val3": "5.00", 
                "val4": "3.00", 
                "val5": "6"
            }
        ]

}

Here's the python code I tried:

import pandas as pd
from itertools import groupby 
import json    

df = pd.read_csv('SampleCsvFile.csv')

names = df.columns.values.tolist()
data = df.values

master_list2 = [ (d["zipcode"], d["state"], d) for d in [dict(zip(names, d)) for d in data] ]
intermediate2 = [(k, [x[2] for x in list(v)]) for k,v in groupby(master_list2, lambda t: (t[0],t[1]) )]
nested_json2 = [dict(zip(names,(k[0][0], k[0][1], k[1]))) for k in [(i[0], i[1]) for i in intermediate2]]

#print json.dumps(nested_json2, indent=4)
with open('ExpectedJsonFile.json', 'w') as outfile:
     outfile.write(json.dumps(nested_json2, indent=4))

1 Answer 1

2

Since you are using pandas already, I tried to get as much mileage as I could out of dataframe methods. I also ended up wandering fairly far afield from your implementation. I think the key here, though, is don't try to get too clever with your list and/or dictionary comprehensions. You can very easily confuse yourself and everyone who reads your code.

import pandas as pd
from itertools import groupby 
from collections import OrderedDict
import json    

df = pd.read_csv('SampleCsvFile.csv', dtype={
            "zipcode" : str,
            "date" : str,
            "state" : str,
            "val1" : str,
            "val2" : str,
            "val3" : str,
            "val4" : str,
            "val5" : str
        })

results = []

for (zipcode, state), bag in df.groupby(["zipcode", "state"]):
    contents_df = bag.drop(["zipcode", "state"], axis=1)
    subset = [OrderedDict(row) for i,row in contents_df.iterrows()]
    results.append(OrderedDict([("zipcode", zipcode),
                                ("state", state),
                                ("subset", subset)]))

print json.dumps(results[0], indent=4)
#with open('ExpectedJsonFile.json', 'w') as outfile:
#    outfile.write(json.dumps(results[0], indent=4))

The simplest way to have all the json datatypes written as strings, and to retain their original formatting, was to force read_csv to parse them as strings. If, however, you need to do any numerical manipulation on the values before writing out the json, you will have to allow read_csv to parse them numerically and coerce them into the proper string format before converting to json.

Sign up to request clarification or add additional context in comments.

10 Comments

When I try to execute your code, I get the type error: 95110 is not JSON serializable. Also, the output doesn't have double quotes to numerical values and truncates the decimal portion. Any idea on how to resolve it?
Here's the output of "print result" (without the json.dumps): {'subset': [{'val3': 5.0, 'date': '2015-05-01', 'val2': 30.0, 'val1': 50L, 'val5': 3L, 'val4': 3.0}, {'val3': 5.0, 'date': '2015-06-01', 'val2': 31.0, 'val1': 67L, 'val5': 4L, 'val4': 3.0}, {'val3': 5.0, 'date': '2015-07-01', 'val2': 32.0, 'val1': 97L, 'val5': 6L, 'val4': 3.0}], 'state': 'CA', 'zipcode': 95110}
I adjusted the answer to force all the datatypes to be read as strings, which will resolve the quoting and numerical formatting issues. I'm not sure about the type error you're experiencing, as I can't replicate that, but since the datatype of the parsed zipcode is different now there is a chance that will be resolved as well.
New code works like a charm. Thanks a lot Aniket, I really appreciate it.
one question though, is there a way to change/set the order of elements? For example, I would like to see state and zipcode on the top and subset in the bottom. Right now, the order seems to be random.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.