Replace values in one dataframe with values in second dataframe in Python

Question

I have a large dataframe (DF1) that contains a variable containing UK postcode data. Inevitably there are some typos in the data. However, after some work with regular expressions, I have created a second database that contains corrected versions of the postcode data (but only for those rows where the original postcode was incorrect) – DF2. (N.B. the index values are not necessarily consecutive.)

    id   postcode                     remark
0    1      L93AP                     Normal
2    2     LD38AH                     Normal
4    3    SO224ER                     Normal
6    4       SO21                  Too short
8    5    DN379HJ                     Normal
10   6     M21ORH  Zero replaced with O (oh)
12   7     NP745G          S replaced with 5
14   8    SE136R2          Z replaced with 2
16   9  BN251ESBN                   Too long
18  10    TD152EH                     Normal

The dataframe containing the corrected data is:

       0  1  2  3 pcCorrected
10   M21  0  R  H      M210RH
12   NP7  4  S  G      NP74SG
14  SE13  6  R  Z     SE136RZ

I want to combine the 2 databases such that the new values in the pcCorrected column in DF2 replace the old postcode values in the DF1 dataframe but, for other cells, the existing postcode values remain in tact. The final database should look like:

    id   postcode                     remark
0    1      L93AP                     Normal
2    2     LD38AH                     Normal
4    3    SO224ER                     Normal
6    4       SO21                  Too short
8    5    DN379HJ                     Normal
10   6     M210RH                     Normal
12   7     NP74SG                     Normal
14   8    SE136RZ                     Normal
16   9  BN251ESBN                   Too long
18  10    TD152EH                     Normal

The databases are quite large (>1 million rows). Does this action have a name and what is the most efficient way to do this?

jezrael · Accepted Answer · 2016-03-24 06:32:38Z

1

You can try merge by indexes , create mask by notnull and add new values by loc:

df = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
mask = pd.notnull(df['pcCorrected'])
print mask
0     False
2     False
4     False
6     False
8     False
10     True
12     True
14     True
16    False
18    False
Name: pcCorrected, dtype: bool

df.loc[mask, 'remark'] = 'Normal'
df.loc[mask, 'postcode'] = df['pcCorrected']

print df[['id','postcode','remark']]
    id   postcode     remark
0    1      L93AP     Normal
2    2     LD38AH     Normal
4    3    SO224ER     Normal
6    4       SO21  Too short
8    5    DN379HJ     Normal
10   6     M210RH     Normal
12   7     NP74SG     Normal
14   8    SE136RZ     Normal
16   9  BN251ESBN   Too long
18  10    TD152EH     Normal

edited Mar 24, 2016 at 6:32

answered Mar 24, 2016 at 6:27

jezrael

868k102 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

user1718097 Over a year ago

When I first read this answer, I was a little concerned that adding a column from one df to another based on a mask might get out of sync if dataframes had been modified. But, on re-reading, I can see that this solution is much more subtle and secure than that. Merging into a single df and then manipulating columns within that newly created df put paid to all those concerns. In a nutshell, I really like this answer. Thanks very much.

user1718097 Over a year ago

From a memory-management point-of-view, would it be better to merge DF2 into DF1 – e.g. using DF1 = DF1.merge(DF2, left_index=True, right_index=True, how='left') – rather than creating another brand new (and large) dataframe?

jezrael Over a year ago

I think better is use existing DF2. The best is test it.

Siraj S. · Accepted Answer · 2016-03-24 19:16:09Z

1

maybe a simpler method would be:

mask = df1.index.isin(df2.index)

df1.loc[mask,'postcode'] = df2['pcCorrected']

this could avoid the additional step of merging the two DataFrames

answered Mar 24, 2016 at 19:16

Siraj S.

3,7714 gold badges37 silver badges50 bronze badges

1 Comment

Gathide Over a year ago

To use index, df1 and df2 must be sorted. Is it possible to avoid sorting and use one of the columns, e.g. id column in this case?

user1718097 · Accepted Answer · 2016-03-24 06:12:04Z

0

One way is to step through each row of the dataframe containing the corrected values and then transfer to the original dataframe:

for i in DF2.index:
    DF1.ix[i,'postcode'] = DF2.ix[i,'pcCorrected']

Is there a more efficient way to do this?

answered Mar 24, 2016 at 6:12

user1718097

4,33212 gold badges53 silver badges66 bronze badges

1 Comment

Gathide Over a year ago

What is ix as used here?

Mohsin Mahmood · Accepted Answer · 2020-01-09 22:42:44Z

0

You can replace the values of first dataframe df1 with the values of second dataframe df2 by doing the following steps.

Step 1: Set index of the first dataframe (df1)

df1.set_index('id')

Step 2: Set index of the second dataframe (df2)

df2.set_index('id')

and finally update the dataframe using the following snippet —

df1.update(df2)

answered Jan 9, 2020 at 22:42

Mohsin Mahmood

3,4863 gold badges24 silver badges26 bronze badges

Collectives™ on Stack Overflow

Replace values in one dataframe with values in second dataframe in Python

4 Answers 4

3 Comments

1 Comment

1 Comment

Comments

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

1 Comment

1 Comment

Comments

Related