I have the following pandas DataFrame:

ID  COL1  COL2
123 1     ABC
123 1     CCC
123 NaN   AVV
345 2     FGG
345 NaN   FRG
345 NaN   FGT 

I need to substitute all NaN values in COL1 with the value from the same ID, in order to get this result:

ID  COL1  COL2
123 1     ABC
123 1     CCC
123 1     AVV
345 2     FGG
345 2     FRG
345 2     FGT 

I could write a for loop, but it would take a long time to execute on my dataset. Is there a conditional replace function?

  • Does df.groupby('ID').ffill().bfill() give what you need? Commented Nov 20, 2016 at 23:22
  • @Psidom: Yes, it does. Thank you. The only problem is that it takes a long time to finish the calculation for 1GB of data Commented Nov 20, 2016 at 23:33
  • Try df.sort_values(['ID', 'COL1']).ffill(), which seems to be 3 to 4 times faster than the above method. It sorts the NaN values to the end of each ID group and then uses only the ffill() method to fill missing values. Commented Nov 20, 2016 at 23:57
  • @Psidom: Could you please publish your last solution? It worked fine for me. Also, I'd appreciate it if you could explain how to extend this solution to substituting any value, not only NaN. Let's say that instead of NaN I have Not-Defined. Can I still use ffill()? Commented Nov 21, 2016 at 8:56
  • What do you mean with Not-Defined? Is it a string or null? Commented Nov 21, 2016 at 14:05

2 Answers

Starting with an example as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': list(range(10)),
                   'COL1': [np.random.choice([1, np.nan]) for _ in range(10)]})
df = pd.concat([df] * 100000).reset_index(drop=True)

df.head()

#  COL1 ID
#0  NaN  0
#1  1.0  1
#2  1.0  2
#3  NaN  3
#4  1.0  4

You can use the forward fill and backward fill methods within each group to fill missing values:

%timeit df.groupby('ID').ffill().bfill()
1 loop, best of 3: 212 ms per loop
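To see that the grouped fill produces the desired result, here is a quick check on the question's small frame (reconstructed below; assigning back to the column keeps the ID column intact):

```python
import pandas as pd
import numpy as np

# Reconstruct the question's DataFrame
df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# Forward-fill within each ID group, then backward-fill any leading NaNs
df['COL1'] = df.groupby('ID')['COL1'].ffill().bfill()
print(df)
```

Every NaN in COL1 is replaced with the value shared by its ID, matching the expected output in the question.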

Alternatively, sort the values by ID and COL1. This sorts by ID first and then by COL1 within each ID, which pushes all missing values to the end of each ID group, so a single ffill() suffices. For this example it is faster than the ffill()/bfill() approach above:

%timeit df.sort_values(['ID', 'COL1']).ffill()
10 loops, best of 3: 71.6 ms per loop
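On the question's data, the sort-then-fill trick looks like this (a sketch, not from the answer; note the sorted frame no longer has the original row order, so sort_index() is added here to restore it, and the trick assumes every ID has at least one non-NaN COL1 value):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
})

# NaN sorts last within each ID, so one forward fill covers every gap
filled = df.sort_values(['ID', 'COL1']).ffill()

# sort_index() restores the original row order
filled = filled.sort_index()
print(filled['COL1'].tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```

One caveat: if some ID contains only NaNs, ffill() would pull in the previous ID's value, whereas the grouped ffill()/bfill() version would leave those rows NaN.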

If the missing values are marked by some other string rather than NaN, first call the replace method to convert those strings to NaN. For instance, if there are empty strings in the data frame that you want to fill, you can do df.replace('', np.nan).sort_values(['ID', 'COL1']).ffill()
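For the Not-Defined case raised in the comments, the same idea applies; a minimal sketch, assuming the sentinel is the exact string 'Not-Defined':

```python
import pandas as pd
import numpy as np

# Hypothetical data where missing entries are the string 'Not-Defined'
df = pd.DataFrame({
    'ID':   [123, 123, 345, 345],
    'COL1': [1, 'Not-Defined', 2, 'Not-Defined'],
})

# Convert the sentinel to NaN, then apply the same sort-and-fill trick
out = (df.replace('Not-Defined', np.nan)
         .sort_values(['ID', 'COL1'])
         .ffill()
         .sort_index())
print(out['COL1'].tolist())  # [1, 1, 2, 2]
```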



How about using Series.isnull() to select the rows and Series.map() to do the conditional replacement?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2':['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT']},
    columns=['ID','COL1', 'COL2'])

print(df)
mapping = {123: 1, 345: 2}
df.loc[df['COL1'].isnull(), 'COL1'] = df['ID'].map(mapping)
print(df)

before:

    ID  COL1 COL2
0  123   1.0  ABC
1  123   1.0  CCC
2  123   NaN  AVV
3  345   2.0  FGG
4  345   NaN  FRG
5  345   NaN  FGT

after:

    ID  COL1 COL2
0  123   1.0  ABC
1  123   1.0  CCC
2  123   1.0  AVV
3  345   2.0  FGG
4  345   2.0  FRG
5  345   2.0  FGT

EDIT: To build mapping programmatically, you can use these two lines of code:

df_unique = df.loc[df['COL1'].notnull()].groupby('ID').nth(0)
mapping = pd.Series(df_unique['COL1'].values, index=df_unique.index).to_dict()
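A note on an equivalent shortcut (my sketch, not from the answer): GroupBy.first() already skips nulls, so the mapping can also be built in one line, and the whole approach checked end-to-end on the question's data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# first() returns the first non-null COL1 per ID
mapping = df.groupby('ID')['COL1'].first().to_dict()
print(mapping)  # {123: 1.0, 345: 2.0}

# Fill only the null rows via the ID -> COL1 mapping
df.loc[df['COL1'].isnull(), 'COL1'] = df['ID'].map(mapping)
```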

2 Comments

Your solution is quite interesting and seems flexible, if you could explain how to create mapping automatically. Thanks.
I added an edit with the automatic creation of mapping. I assume my code is slower than Psidom's solution, but hopefully it's still useful to somebody.
