pandas combining dataframe

Question

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array

javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)

This code comes out with the following:

                Town  Java Jobs
435          York,NY       3593
212       NewYork,NY       3585
584       Seattle,WA       2080
624       Chicago,IL       1920
301        Boston,MA       1571
...
79        Holland,MI          5
38      Manhattan,KS          5
497        Vernon,IL          5
30        Clayton,MO          5
90       Waukegan,IL          5

[653 rows x 2 columns] 

                 Town  Python Jobs
160       NewYork,NY         2949
11           York,NY         2938
349       Seattle,WA         1321
91        Chicago,IL         1312
167        Boston,MA         1117

383       Hanover,NH            5
209      Bulverde,TX            5
203     Salisbury,NC            5
67       Rockford,IL            5
256       Ventura,CA            5

[416 rows x 2 columns]

I want to make a new dataframe that uses the town names as an index and has a column for each java and python. However, some of the towns will only have results for one of the languages.

you could also do given your original code result = pd.merge(pythonFrame, javeFrame, on='Town', how='outer').set_index('Town') — tipanverella
– tipanverella, Commented Jun 3, 2016 at 19:11

unutbu · Accepted Answer · 2016-06-04 00:51:26Z

import pandas as pd

javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
     'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
     'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])

result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
#               Java Jobs  Python Jobs
# Town                                
# York,NY          3593.0       2938.0
# NewYork,NY       3585.0       2949.0
# Seattle,WA       2080.0       1321.0
# Chicago,IL       1920.0       1312.0
# Boston,MA        1571.0       1117.0
# Holland,MI          5.0          NaN
# Manhattan,KS        5.0          NaN
# Vernon,IL           5.0          NaN
# Clayton,MO          5.0          NaN
# Waukegan,IL         5.0          NaN
# Hanover,NH          NaN          5.0
# Bulverde,TX         NaN          5.0
# Salisbury,NC        NaN          5.0
# Rockford,IL         NaN          5.0
# Ventura,CA          NaN          5.0

pd.merge will by default join two DataFrames on all columns shared in common. In this case, javaFrame and pythonFrame share only the Town column in common. So by default pd.merge would join the two DataFrames on the Town column.

how='outer causes pd.merge to use the union of the keys from both frames. In other words it causes pd.merge to return rows whose data come from either javaFrame or pythonFrame even if only one DataFrame contains the Town. Missing data is fill with NaNs.

result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town') is, I think, what they are expecting!

piRSquared · Accepted Answer · 2016-06-03 23:25:17Z

Use pd.concat

df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)

              Java Jobs  Python Jobs
Boston,MA        1571.0       1117.0
Bulverde,TX         NaN          5.0
Chicago,IL       1920.0       1312.0
Clayton,MO          5.0          NaN
Hanover,NH          NaN          5.0
Holland,MI          5.0          NaN
Manhattan,KS        5.0          NaN
NewYork,NY       3585.0       2949.0
Rockford,IL         NaN          5.0
Salisbury,NC        NaN          5.0
Seattle,WA       2080.0       1321.0
Ventura,CA          NaN          5.0
Vernon,IL           5.0          NaN
Waukegan,IL         5.0          NaN
York,NY          3593.0       2938.0

Collectives™ on Stack Overflow

pandas combining dataframe

2 Answers 2

1 Comment

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Related