I have the following pandas DataFrame:

ID  COL1  COL2
123 1     ABC
123 1     CCC
123 NaN   AVV
345 2     FGG
345 NaN   FRG
345 NaN   FGT 

I need to substitute all NaN values in COL1 with the value from the same ID, in order to get this result:

ID  COL1  COL2
123 1     ABC
123 1     CCC
123 1     AVV
345 2     FGG
345 2     FRG
345 2     FGT 

I could write a for loop, but it would take a long time to execute on my dataset. Is there a conditional replace function?

  • Does df.groupby('ID').ffill().bfill() give what you need? Commented Nov 20, 2016 at 23:22
  • @Psidom: Yes, it does. Thank you. The only problem is that it takes a long time to finish the calculation for 1GB of data Commented Nov 20, 2016 at 23:33
  • Try df.sort_values(['ID', 'COL1']).ffill(), which seems to be 3 to 4 times faster than the above method. It sorts the NaN values to the end of each ID group and then uses only the ffill() method to fill missing values. Commented Nov 20, 2016 at 23:57
  • @Psidom: Could you please publish your last solution? It worked fine for me. Also, I'd appreciate it if you could explain how to extend this solution to substituting any value, not only NaN. Let's say that instead of NaN I have Not-Defined. Can I still use ffill()? Commented Nov 21, 2016 at 8:56
  • What do you mean with Not-Defined? Is it a string or null? Commented Nov 21, 2016 at 14:05

2 Answers

Starting with an example as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': list(range(10)),
                   'COL1': [np.random.choice([1, np.nan]) for _ in range(10)]})
df = pd.concat([df] * 100000).reset_index(drop=True)

df.head()

#  COL1 ID
#0  NaN  0
#1  1.0  1
#2  1.0  2
#3  NaN  3
#4  1.0  4

You can use the forward fill and backward fill methods within each group to fill missing values:

%timeit df.groupby('ID').ffill().bfill()
1 loop, best of 3: 212 ms per loop
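To see that the grouped fill produces the desired result, here is a quick check on the question's small frame (reconstructed below; assigning back to the column keeps the ID column intact):

```python
import pandas as pd
import numpy as np

# Reconstruct the question's DataFrame
df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# Forward-fill within each ID group, then backward-fill any leading NaNs
df['COL1'] = df.groupby('ID')['COL1'].ffill().bfill()
print(df)
```

Every NaN in COL1 is replaced with the value shared by its ID, matching the expected output in the question.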

Alternatively, sort the values by ID and COL1. This sorts by ID first and then by COL1 within each ID, which pushes all missing values to the end of each ID group, so a single ffill() suffices. For this example it is faster than the ffill()/bfill() approach above:

%timeit df.sort_values(['ID', 'COL1']).ffill()
10 loops, best of 3: 71.6 ms per loop
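On the question's data, the sort-then-fill trick looks like this (a sketch, not from the answer; note the sorted frame no longer has the original row order, so sort_index() is added here to restore it, and the trick assumes every ID has at least one non-NaN COL1 value):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
})

# NaN sorts last within each ID, so one forward fill covers every gap
filled = df.sort_values(['ID', 'COL1']).ffill()

# sort_index() restores the original row order
filled = filled.sort_index()
print(filled['COL1'].tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```

One caveat: if some ID contains only NaNs, ffill() would pull in the previous ID's value, whereas the grouped ffill()/bfill() version would leave those rows NaN.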

If the missing values are marked by some other string rather than NaN, first call the replace method to convert those strings to NaN. For instance, if there are empty strings in the data frame that you want to fill, you can do df.replace('', np.nan).sort_values(['ID', 'COL1']).ffill()
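For the Not-Defined case raised in the comments, the same idea applies; a minimal sketch, assuming the sentinel is the exact string 'Not-Defined':

```python
import pandas as pd
import numpy as np

# Hypothetical data where missing entries are the string 'Not-Defined'
df = pd.DataFrame({
    'ID':   [123, 123, 345, 345],
    'COL1': [1, 'Not-Defined', 2, 'Not-Defined'],
})

# Convert the sentinel to NaN, then apply the same sort-and-fill trick
out = (df.replace('Not-Defined', np.nan)
         .sort_values(['ID', 'COL1'])
         .ffill()
         .sort_index())
print(out['COL1'].tolist())  # [1, 1, 2, 2]
```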



How about using Series.isnull() to select the rows and Series.map() to do the conditional replacement?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2':['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT']},
    columns=['ID','COL1', 'COL2'])

print(df)
mapping = {123: 1, 345: 2}
df.loc[df['COL1'].isnull(), 'COL1'] = df['ID'].map(mapping)
print(df)

before:

    ID  COL1 COL2
0  123   1.0  ABC
1  123   1.0  CCC
2  123   NaN  AVV
3  345   2.0  FGG
4  345   NaN  FRG
5  345   NaN  FGT

after:

    ID  COL1 COL2
0  123   1.0  ABC
1  123   1.0  CCC
2  123   1.0  AVV
3  345   2.0  FGG
4  345   2.0  FRG
5  345   2.0  FGT

EDIT: To build mapping programmatically, you can use these two lines of code:

df_unique = df.loc[df['COL1'].notnull()].groupby('ID').nth(0)
mapping = pd.Series(df_unique['COL1'].values, index=df_unique.index).to_dict()
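A note on an equivalent shortcut (my sketch, not from the answer): GroupBy.first() already skips nulls, so the mapping can also be built in one line, and the whole approach checked end-to-end on the question's data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':   [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# first() returns the first non-null COL1 per ID
mapping = df.groupby('ID')['COL1'].first().to_dict()
print(mapping)  # {123: 1.0, 345: 2.0}

# Fill only the null rows via the ID -> COL1 mapping
df.loc[df['COL1'].isnull(), 'COL1'] = df['ID'].map(mapping)
```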

2 Comments

Your solution is quite interesting and seems flexible, if you could explain how to create mapping automatically. Thanks.
I added an edit with the automatic creation of mapping. I assume my code is slower than Psidom's solution, but hopefully it's still useful to somebody.
