4

I have two pandas data frames. The first is:

df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})

The second data frame is:

df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
                    "val2" : [10, 13, 16, 11, 20, 22]})

I would like to merge the two together in a way in which the row ordering from df1 is used and the values from df2 follow this ordering. Ideally, I would like it to look like this:

df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
                         "val2" : [11, 10, 20, 13, 22, 16]})

I've tried using the merge function with left_on and right_on, but I don't get the output I'm looking for. Any help would be greatly appreciated.

6
  • Maybe df2.update(df1) and then sort vals? Commented Apr 3, 2016 at 21:06
  • @PadraicCunningham, the question is how to sort them? ;) Commented Apr 3, 2016 at 21:09
  • @MaxU, it is pretty easy to do with regular python Commented Apr 3, 2016 at 21:10
  • @PadraicCunningham, will you get the desired order then? Commented Apr 3, 2016 at 21:11
  • 2
    I don't think this problem is well-defined, is it? Val1 is not unique. So it seems like the requested result depends on some implicit ordering assumption within A1 and B2 groupings. It would be a good idea to make that assumption explicit here. And I'm skeptical this is really a good overall way to handle data to be honest. Commented Apr 3, 2016 at 21:29

2 Answers 2

1

You can do it this way:

  1. sort values in df2 by ['val1', 'val2'], group it by val1 and store it as g2?
  2. add idx column to df1 which will be used in order to pick values from df2

Code:

In [176]: df1['idx'] = 1

In [177]: df1['idx'] = df1.groupby('val1')['idx'].cumsum()-1

In [178]: df1
Out[178]:
  val1  idx
0   B2    0
1   A1    0
2   B2    1
3   A1    1
4   B2    2
5   A1    2

In [179]: g2 = df2.sort_values(['val1', 'val2']).groupby('val1')

In [180]: g2.groups
Out[180]: {'A1': [0, 1, 2], 'B2': [3, 4, 5]}

In [181]: df2.iloc[g2.groups['A1'][1]]
Out[181]:
val1    A1
val2    13
Name: 1, dtype: object

In [182]: df1.apply(lambda x: df2.iloc[g2.groups[x['val1']][x['idx']]], axis=1)
Out[182]:
  val1  val2
0   B2    11
1   A1    10
2   B2    20
3   A1    13
4   B2    22
5   A1    16
Sign up to request clarification or add additional context in comments.

Comments

0

You could use groupby/cumcount to assign a unique number to each row within each group:

df1['cumcount'] = df1.groupby('val1').cumcount()
#   val1  cumcount
# 0   B2         0
# 1   A1         0
# 2   B2         1
# 3   A1         1
# 4   B2         2
# 5   A1         2

If we do the same to df2:

df2['cumcount'] = df2.groupby('val1').cumcount()
#   val1  val2  cumcount
# 0   A1    10         0
# 1   A1    13         1
# 2   A1    16         2
# 3   B2    11         0
# 4   B2    20         1
# 5   B2    22         2

then merging df1 with df2 on the common columns (val1 and cumcount) produces the desired result:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})
df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
                    "val2" : [10, 13, 16, 11, 20, 22]})
df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
                         "val2" : [11, 10, 20, 13, 22, 16]})

df1['cumcount'] = df1.groupby('val1').cumcount()
df2['cumcount'] = df2.groupby('val1').cumcount()
result = pd.merge(df1, df2, how='left')
result = result.drop('cumcount', axis=1)
print(result)
assert result.equals(df_final)

yields

  val1  val2
0   B2    11
1   A1    10
2   B2    20
3   A1    13
4   B2    22
5   A1    16

Note that merging with how='left' produces a result with the same number of rows as the first DataFrame, df1, and maintains the same order of rows as df1.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.