
I am trying to create the following data frame from the dictionary below. Is there an efficient solution?

data_dict = {
    'Total_Amount' : '150.00',
    'LinkAPI' : [{"ConfidenceScore":4},{"ConfidenceScore":9}],
    'RecordID' : 5687,
    'ClientId' : 45,
    'Customer_Number' : ["HDMO70232"],
    'RowNumber' : 0,
    'Invoice_Number' : '',
    'Customer_Name' : 'HD MOTORCYCLES SIS/SVC'
}

The number of rows in the dataframe should equal the number of items in the 'LinkAPI' list. For the data above, the dataframe should look like this:

   ClientId           Customer_Name Customer_Number Invoice_Number                 LinkAPI  RecordID  RowNumber Total_Amount
0        45  HD MOTORCYCLES SIS/SVC     [HDMO70232]                 {'ConfidenceScore': 4}      5687          0       150.00
1        45  HD MOTORCYCLES SIS/SVC     [HDMO70232]                 {'ConfidenceScore': 9}      5687          0       150.00

I have tried two solutions to implement this, but I hope there is a better way to create the dataframe.

solution-1:

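import pandas as pd

items_number = len(data_dict['LinkAPI'])
# Repeat every value once per LinkAPI item; keep the LinkAPI list itself as the column.
df_dict = {k : [data_dict[k] for _ in range(items_number)] if k != 'LinkAPI' else data_dict[k]
           for k in data_dict.keys()}
df = pd.DataFrame(df_dict)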

solution-2:

LinkAPI = data_dict["LinkAPI"]

# Take the column order from data_dict itself so that each row's values
# line up with the columns they belong to.
columns = list(data_dict)
df_new = pd.DataFrame(columns=columns)
for i, conf in enumerate(LinkAPI):
    # Repeat every field as-is, substituting the i-th LinkAPI item for the list.
    df_new.loc[i] = [conf if k == "LinkAPI" else data_dict[k] for k in columns]

3 Answers

3

Use json_normalize:

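import pandas as pd

cols = ['Total_Amount','RecordID','ClientId','Customer_Number',
        'RowNumber','Invoice_Number','Customer_Name']
# pandas >= 1.0 exposes json_normalize at the top level;
# the older `from pandas.io.json import json_normalize` import is deprecated.
df = pd.json_normalize(data, 'LinkAPI', cols)
# data borrowed from HYRY
print(df)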
   ConfidenceScore  test Total_Amount Invoice_Number  RowNumber  \
0              4.0   NaN       150.00                         0   
1              9.0   NaN       150.00                         0   
2              8.0   NaN      1500.00                         1   
3             10.0   NaN      1500.00                         1   
4             20.0   NaN      1500.00                         1   
5              NaN   2.0      1500.00                         1   

  Customer_Number  ClientId           Customer_Name  RecordID  
0       HDMO70232        45  HD MOTORCYCLES SIS/SVC      5687  
1       HDMO70232        45  HD MOTORCYCLES SIS/SVC      5687  
2       HDMO70232       415  HD MOTORCYCLES SIS/SVC     56287  
3       HDMO70232       415  HD MOTORCYCLES SIS/SVC     56287  
4       HDMO70232       415  HD MOTORCYCLES SIS/SVC     56287  
5       HDMO70232       415  HD MOTORCYCLES SIS/SVC     56287  
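
For the single dictionary in the question, the same idea can be sketched as follows. This assumes pandas 1.0 or later (where the function is available as pd.json_normalize); the name meta_cols is only illustrative.

```python
import pandas as pd

data_dict = {
    'Total_Amount': '150.00',
    'LinkAPI': [{"ConfidenceScore": 4}, {"ConfidenceScore": 9}],
    'RecordID': 5687,
    'ClientId': 45,
    'Customer_Number': ["HDMO70232"],
    'RowNumber': 0,
    'Invoice_Number': '',
    'Customer_Name': 'HD MOTORCYCLES SIS/SVC',
}

# Every key except the nested list becomes a repeated "meta" column.
meta_cols = [k for k in data_dict if k != 'LinkAPI']

# record_path='LinkAPI' yields one row per item of the nested list.
df = pd.json_normalize(data_dict, record_path='LinkAPI', meta=meta_cols)
print(df)
```

Note that the dictionaries inside LinkAPI are expanded into their own columns (here ConfidenceScore) rather than kept as dict objects in a LinkAPI column.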

1 Comment

Thanks for the answer
1

I changed your data to a list of dicts:

data = [
{
    'Total_Amount' : '150.00',
    'LinkAPI' : [{"ConfidenceScore":4},{"ConfidenceScore":9}],
    'RecordID' : 5687,
    'ClientId' : 45,
    'Customer_Number' : ["HDMO70232"],
    'RowNumber' : 0,
    'Invoice_Number' : '',
    'Customer_Name' : 'HD MOTORCYCLES SIS/SVC'
},
{
    'Total_Amount' : '1500.00',
    'LinkAPI' : [{"ConfidenceScore":8},{"ConfidenceScore":10}, {"ConfidenceScore":20}, {"test":2}],
    'RecordID' : 56287,
    'ClientId' : 415,
    'Customer_Number' : ["HDMO70232"],
    'RowNumber' : 1,
    'Invoice_Number' : '',
    'Customer_Name' : 'HD MOTORCYCLES SIS/SVC'
},
]

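import numpy as np
import pandas as pd

df = pd.DataFrame(data)

# Flatten every LinkAPI list into one frame, indexed by the row each item came from.
df2 = pd.DataFrame(np.concatenate(df.LinkAPI).tolist(), 
                   index=np.repeat(df.index, df.LinkAPI.str.len().astype(int)))

# Joining on the index repeats each parent row once per LinkAPI item.
df.drop("LinkAPI", axis=1).join(df2)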

The output:

   ClientId           Customer_Name Customer_Number Invoice_Number  RecordID  RowNumber Total_Amount  ConfidenceScore  test
0        45  HD MOTORCYCLES SIS/SVC     [HDMO70232]                     5687          0       150.00              4.0   NaN
0        45  HD MOTORCYCLES SIS/SVC     [HDMO70232]                     5687          0       150.00              9.0   NaN
1       415  HD MOTORCYCLES SIS/SVC     [HDMO70232]                    56287          1      1500.00              8.0   NaN
1       415  HD MOTORCYCLES SIS/SVC     [HDMO70232]                    56287          1      1500.00             10.0   NaN
1       415  HD MOTORCYCLES SIS/SVC     [HDMO70232]                    56287          1      1500.00             20.0   NaN
1       415  HD MOTORCYCLES SIS/SVC     [HDMO70232]                    56287          1      1500.00              NaN   2.0
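
If the goal is the exact layout from the question, with each LinkAPI item kept as a dict in its own row, DataFrame.explode is another option. This is a different technique from the one above, sketched here for the single data_dict, and it assumes pandas 0.25 or later, where explode was introduced.

```python
import pandas as pd

data_dict = {
    'Total_Amount': '150.00',
    'LinkAPI': [{"ConfidenceScore": 4}, {"ConfidenceScore": 9}],
    'RecordID': 5687,
    'ClientId': 45,
    'Customer_Number': ["HDMO70232"],
    'RowNumber': 0,
    'Invoice_Number': '',
    'Customer_Name': 'HD MOTORCYCLES SIS/SVC',
}

# Wrapping the dict in a list gives a one-row frame whose LinkAPI cell
# holds the whole list.
df = pd.DataFrame([data_dict])

# explode() emits one row per list item and repeats the other columns;
# reset_index() replaces the duplicated index (0, 0) with 0, 1.
df = df.explode('LinkAPI').reset_index(drop=True)
print(df)
```

Only the exploded column is split, so Customer_Number stays a one-element list in every row.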


0

I don't know if it's an option, but if you can alter your dictionary to have equal-length lists for all entries (e.g. by repeating the values currently in your data_dict), you can just use pd.DataFrame(data_dict). In your case, each entry of your dictionary has to have a length of 2, as that is the length of the longest entry in your dictionary (LinkAPI):

import pandas as pd
pd.set_option("display.width", 300)  # You can ignore this

data_dict = {
    'Total_Amount' : ['150.00'] * 2,
    'LinkAPI' : [{"ConfidenceScore":4},{"ConfidenceScore":9}],
    'RecordID' : [5687] * 2,
    'ClientId' : [45] * 2,
    'Customer_Number' : ["HDMO70232"] * 2,
    'RowNumber' : [0] * 2,
    'Invoice_Number' : [''] * 2,
    'Customer_Name' : ['HD MOTORCYCLES SIS/SVC'] * 2
}

df = pd.DataFrame(data_dict)

print(df)

Which gives you the following dataframe:

   ClientId           Customer_Name Customer_Number Invoice_Number                  LinkAPI  RecordID  RowNumber Total_Amount
0        45  HD MOTORCYCLES SIS/SVC       HDMO70232                 {u'ConfidenceScore': 4}      5687          0       150.00
1        45  HD MOTORCYCLES SIS/SVC       HDMO70232                 {u'ConfidenceScore': 9}      5687          0       150.00

EDIT:

To clarify: to read a dictionary into a dataframe, pandas requires every list-like entry (each key in your dictionary becomes a column in your dataframe) to be of equal length; scalar values are simply repeated for every row. Otherwise, it will throw a ValueError:

ValueError: arrays must all be same length
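
A small sketch of that behaviour, using a cut-down version of the question's columns: a scalar is broadcast to the length of the list-like entries, while list-like entries of different lengths raise a ValueError. The exact error text may differ between pandas versions.

```python
import pandas as pd

# A scalar alongside a list: the scalar is repeated for every row.
ok = pd.DataFrame({'ClientId': 45, 'Score': [4, 9]})
print(ok)

# Two list-like entries with different lengths cannot be aligned.
try:
    pd.DataFrame({'Customer_Number': ['HDMO70232'], 'Score': [4, 9]})
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```

This is why the original data_dict fails as-is: Customer_Number is a list of length 1, while LinkAPI is a list of length 2.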
