How to merge two DataFrames on both indexes and columns

Question

Here's the problem: I know how to merge two DataFrames on indexes or on columns, but I am not able to merge them on both indexes and columns.

I have two DataFrames and I want to merge them on indexes (which are dates) and on column id. I created some sample data to better explain my problem.

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ["A", "B"] * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ["B", "A"] * len(dates), "x": np.random.random(2 * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

Desired output:

           id    w         x
2020-01-31  A  0.5  0.226547
2020-02-28  B  0.5  0.947632
2020-01-31  A  0.5  0.428309
2020-02-28  B  0.5  0.594420

Please note that I am searching for a general solution, where a and b do not necessarily contain the same indexes or the same elements in id.

Your example is a bit ambiguous: should the id column be something like ['A','B','B','A'], so that merging on both the index and the id column actually matters?
– Ben.T, Commented May 8, 2020 at 20:22

Ben.T · Accepted Answer · 2020-05-08 20:53:37Z

IIUC, you can use set_index with append=True to add id to the index, then join, then reset_index to turn id back into a column:

print(a.set_index('id', append=True)
       .join(b.set_index('id', append=True), how='outer')
       .reset_index('id'))
           id    w         x
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

or the opposite direction with merge:

print(a.reset_index()
       .merge(b.reset_index(), on=['index', 'id'], how='outer')
       .set_index('index'))
           id    w         x
index                       
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

Just to be sure that this is what you want to do, let's assume a and b look like this, with another id:

a = pd.DataFrame({"id": ["A", "B", 'B','A'] , "w": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         w
#2020-01-31  A  0.764141
#2020-02-28  B  0.002861
#2020-01-31  B  0.357424
#2020-02-28  A  0.909695

b = pd.DataFrame({"id": ["A", "B", 'C','A'], "x": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         x
#2020-01-31  A  0.456081
#2020-02-28  B  0.981803
#2020-01-31  C  0.867357
#2020-02-28  A  0.986028

Then the result of the method with join is:

           id         w         x
2020-01-31  A  0.764141  0.456081
2020-01-31  B  0.357424       NaN
2020-01-31  C       NaN  0.867357
2020-02-28  A  0.909695  0.986028
2020-02-28  B  0.002861  0.981803
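
As a possible alternative to the set_index/reset_index round trip, merge can, as far as I know (pandas 0.23 and later), take index level names and column names together in on=, provided the index has a name. Below is a minimal sketch using the same values as the example frames above; the index name "date" is an assumption added for illustration, since the original frames have an unnamed index.

```python
from datetime import date

import pandas as pd

dates = [date(2020, 1, 31), date(2020, 2, 28)] * 2

# Values copied from the example frames above. The index name "date" is an
# assumption: merge needs a name to refer to the index in on=.
a = pd.DataFrame(
    {"id": ["A", "B", "B", "A"], "w": [0.764141, 0.002861, 0.357424, 0.909695]},
    index=pd.Index(dates, name="date"),
)
b = pd.DataFrame(
    {"id": ["A", "B", "C", "A"], "x": [0.456081, 0.981803, 0.867357, 0.986028]},
    index=pd.Index(dates, name="date"),
)

# "date" is an index level and "id" is a column; both are listed in on=.
merged = a.merge(b, on=["date", "id"], how="outer")
print(merged)
```

The result should contain the same five (date, id) pairs as the join version, with NaN where only one frame has the pair, and "date" kept as the index.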

Thanks, I hoped for some hidden pd.merge argument, but this works!

Scott Boston · Answer · 2020-05-08 20:05:22Z

Use a helper column based on cumcount, and give the indexes a name to make it easier to merge on them:

a['helper'] = a.groupby([a.index, 'id']).cumcount()
b['helper'] = b.groupby([b.index, 'id']).cumcount()
a = a.rename_axis('date')
b = b.rename_axis('date')

a.merge(b, on=['date','id','helper']).drop('helper', axis=1)

Output:

           id    w         x
date                        
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  A  0.5  0.594420
2020-02-28  B  0.5  0.428309
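
To illustrate why the helper column matters, here is a small sketch with made-up values (the frames and numbers below are hypothetical, not the data from the question). When the same (date, id) pair occurs more than once in both frames, a plain merge pairs every left duplicate with every right duplicate; numbering the duplicates with cumcount pairs them one-to-one in order of appearance instead.

```python
import pandas as pd

# Hypothetical frames in which the pair ("2020-01-31", "A") appears twice.
idx = pd.Index(["2020-01-31", "2020-01-31"], name="date")
a = pd.DataFrame({"id": ["A", "A"], "w": [0.5, 0.7]}, index=idx)
b = pd.DataFrame({"id": ["A", "A"], "x": [1.0, 2.0]}, index=idx)

# Plain merge: 2 left duplicates x 2 right duplicates = 4 rows.
crossed = a.merge(b, on=["date", "id"])
print(len(crossed))

# cumcount numbers the rows inside each (date, id) group: 0, 1, ...
a_h = a.assign(helper=a.groupby([a.index, "id"]).cumcount())
b_h = b.assign(helper=b.groupby([b.index, "id"]).cumcount())

# Merging on the helper as well pairs the n-th duplicate with the n-th duplicate.
paired = a_h.merge(b_h, on=["date", "id", "helper"]).drop(columns="helper")
print(paired)
```

Whether one-to-one pairing by order of appearance is the right semantics depends on the data; it only makes sense if the duplicates are meant to line up positionally.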

davide Over a year ago

Thanks, but this solution does not seem to generalize well. It won't work if:

ids = ["A", "B"]
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)}
ids.append("C")
b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))}
a = pd.DataFrame(a, index=dates * (len(ids) - 1))
b = pd.DataFrame(b, index=dates * len(ids))

pyOliv · Answer · 2020-05-08 20:25:04Z

You can simply add a new column using b['w'] = a['w']. This is not really a merge, but a copy from a into b.

The full code is:

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
ids = ["A", "B"]
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

b['w'] = a['w']
print(b)
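
One caveat with column assignment, sketched below with hypothetical values: pandas aligns the assigned Series on the index only, so the id column plays no part in the matching. If the two frames list their ids in a different order for the same dates, values end up attached to the wrong id.

```python
import pandas as pd

# Hypothetical frames: same dates, but ids listed in opposite order.
idx = pd.Index(["2020-01-31", "2020-02-28"], name="date")
a = pd.DataFrame({"id": ["A", "B"], "w": [0.1, 0.2]}, index=idx)
b = pd.DataFrame({"id": ["B", "A"], "x": [10.0, 20.0]}, index=idx)

# Equivalent to b['w'] = a['w'] on a copy: alignment uses the date index only.
assigned = b.assign(w=a["w"])
print(assigned)
# The row for id "B" on 2020-01-31 now carries w=0.1, which belongs to id "A" in a.
```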

EDIT: Another way to obtain the result you want (though I'm not so sure, because of the duplicate 'id' column). Please let me know the structure of the ids in the two DataFrames:

import pandas as pd

a = pd.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'w'],
    index=['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])
print(a)

b = pd.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'x'],
    index=['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'])
print(b)

# concat with axis=1 places the frames side by side, aligned on the index only
c = pd.concat([a, b], axis=1)
print(c)
print(c)

output:

           id    w
2020-01-01  A  0.5
2020-01-02  B  1.0
2020-01-03  C  1.5
2020-01-04  D  2.0
           id    x
2020-01-02  A  0.5
2020-01-03  B  1.0
2020-01-04  C  1.5
2020-01-05  D  2.0
             id    w   id    x
2020-01-01    A  0.5  NaN  NaN
2020-01-02    B  1.0    A  0.5
2020-01-03    C  1.5    B  1.0
2020-01-04    D  2.0    C  1.5
2020-01-05  NaN  NaN    D  2.0
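
The output above has two id columns because concat with axis=1 aligns on the index alone. If id should take part in the alignment too, one option is to move it into the index before concatenating. This is only a sketch with hypothetical values, and as far as I know it needs each (date, id) pair to be unique within its frame.

```python
import pandas as pd

# Hypothetical frames sharing only the pair ("2020-01-02", "B").
a = pd.DataFrame(
    {"id": ["A", "B"], "w": [0.5, 1.0]},
    index=["2020-01-01", "2020-01-02"],
)
b = pd.DataFrame(
    {"id": ["B", "C"], "x": [7.0, 8.0]},
    index=["2020-01-02", "2020-01-03"],
)

# Append id to the index so that alignment happens on (date, id) pairs,
# then move id back into a regular column.
c = pd.concat(
    [a.set_index("id", append=True), b.set_index("id", append=True)],
    axis=1,
).reset_index("id")
print(c)
```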

Thanks, but the example was simplified. I need a general approach where a and b may contain different indexes or ids.
I've edited my post to provide a more general solution. The structure of the two DataFrames is key to giving you the answer you want.
@Oliver thanks again, but the results are not aligned, and there shouldn't be NaN given that data.
The pandas.concat function used here concatenates based on the index (here, the dates). Note that the index of a is not the same as the index of b. concat won't give you NaN values if the indexes are exactly the same.
I didn't see that, sorry. But the problem remains: I also want to merge on id.

Thabris · Answer · 2020-05-08 20:05:26Z

To me this doesn't look like a merge problem so much as a matter of adding a column. Plain assignment seems to work:

a['x'] = b['x']

davide Over a year ago

Thanks, but the example was simplified. I need a general approach where a and b may contain different indexes or ids.
