0

Here's the problem: I know how to merge two DataFrames on indexes or on columns, but I am not able to merge them on a both indexes and columns.

I have two DataFrames and I want to merge them on indexes (which are dates) and on column id. I created some sample data to better explain my problem.

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ["A", "B"] * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ["B", "A"] * len(dates), "x": np.random.random(2 * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

Desired output:

           id    w         x
2020-01-31  A  0.5  0.226547
2020-02-28  B  0.5  0.947632
2020-01-31  A  0.5  0.428309
2020-02-28  B  0.5  0.594420

Please note that I am searching for a general solution, where a and b do not necessarily contains the same indexes or elements in id.

2
  • 1
    your example is a bit ambiguous, should the id column be more ['A','B','B','A]? to actually have an interest to merge on the index and the column id Commented May 8, 2020 at 20:22
  • you're right, I'm gonna edit it Commented May 8, 2020 at 20:28

4 Answers 4

2

IIUC, you can use set_index to append the columns, use join and then reset_index like

print(a.set_index('id', append=True)\
       .join(b.set_index('id', append=True), how='outer')\
       .reset_index('id'))
           id    w         x
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

or the opposite direction with merge:

print(a.reset_index()\
       .merge(b.reset_index(), on=['index', 'id'], how='outer')\
       .set_index('index'))
           id    w         x
index                       
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

Just to be sure that is what you want to do, let's assume a and b are like this with another id:

a = pd.DataFrame({"id": ["A", "B", 'B','A'] , "w": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         w
#2020-01-31  A  0.764141
#2020-02-28  B  0.002861
#2020-01-31  B  0.357424
#2020-02-28  A  0.909695

b = pd.DataFrame({"id": ["A", "B", 'C','A'], "x": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         x
#2020-01-31  A  0.456081
#2020-02-28  B  0.981803
#2020-01-31  C  0.867357
#2020-02-28  A  0.986028

Then the result of the method with join is:

           id         w         x
2020-01-31  A  0.764141  0.456081
2020-01-31  B  0.357424       NaN
2020-01-31  C       NaN  0.867357
2020-02-28  A  0.909695  0.986028
2020-02-28  B  0.002861  0.981803
Sign up to request clarification or add additional context in comments.

1 Comment

thanks, I hoped for a some hidden pd.merge arg, but this works!
1

Use a helper column based on cumcount, as give indexes a name to make it eaiser to merge on indexes:

a['helper'] = a.groupby([a.index, 'id']).cumcount()
b['helper'] = b.groupby([b.index, 'id']).cumcount()
a = a.rename_axis('date')
b = b.rename_axis('date')

a.merge(b, on=['date','id','helper']).drop('helper', axis=1)

Output:

           id    w         x
date                        
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  A  0.5  0.594420
2020-02-28  B  0.5  0.428309

1 Comment

thanks, but this solution does not seems to generalize well. it won't work if: ids = ["A", "B"] dates = [date(2020, 1, 31), date(2020, 2, 28)] a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)} ids.append("C") b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))} a = pd.DataFrame(a, index=dates * (len(ids) - 1)) b = pd.DataFrame(b, index=dates * len(ids))
1

You can simply add a new column using b['w'] = a['w']. This is note really a merge, but a copy from a into b.

The full code is :

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
ids = ["A", "B"]
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

b['w'] = a['w']
print(b)

EDIT: Other way to obtain the result you want (well, I'm not so sure because of the duplicate 'id' column). Please let me known the structure of the id of the two dataframe :

import pandas as pd

a = pandas.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'w'], 
    index=['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])
print(a)

b = pandas.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'x'], 
    index=['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'])
print(b)

c = pandas.concat([a, b], axis=1)
print(c)

output:

           id    w
2020-01-01  A  0.5
2020-01-02  B  1.0
2020-01-03  C  1.5
2020-01-04  D  2.0
           id    x
2020-01-02  A  0.5
2020-01-03  B  1.0
2020-01-04  C  1.5
2020-01-05  D  2.0
             id    w   id    x
2020-01-01    A  0.5  NaN  NaN
2020-01-02    B  1.0    A  0.5
2020-01-03    C  1.5    B  1.0
2020-01-04    D  2.0    C  1.5
2020-01-05  NaN  NaN    D  2.0

5 Comments

thanks, but the example was simplified. I need a general approach where a and b does contains different indexes or ids
I've edited my post to provide a more general solution. The structure of the two dataframe is a key to give you the answer you want
@Oliver thanks again, but the results are not aligned, and there shouldn't be NaN given that data
The pandas.concat function used here concatenate based on the index (here: day series). Note that the df 'a' index is not the same as 'b'. The concat function won't give you Nan values if the indexes are exactly the same.
I didn't see it, sorry. but the problem remains: I want to merge also on id
0

It doesn't seem to be a merger issue but more feeding to me. Adding seems to work

a['x'] = b['x']

1 Comment

thanks, but the example was simplified. I need a general approach where a and b does contains different indexes or ids

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.