Pandas - Filter Dataframe on Multiple Criteria

Question

I have a dataframe df:

type    rec_1  rec_2  rec_3  rec_4  rec_1_outlier  rec_2_outlier  rec_3_outlier  rec_4_outlier
yellow  1      7      3      1      FALSE          TRUE           TRUE           TRUE
red     3      11     2      5      FALSE          TRUE           FALSE          FALSE
blue    5      2      1      6      TRUE           FALSE          FALSE          FALSE
green   2      9      13     9      FALSE          FALSE          TRUE           FALSE

I want a separate dataframe per type that keeps only the rec values whose _outlier flag is False. The rec columns are independent of each other, so within one row one flag may be True and another False.

So theoretically if I were to try

df_blue = (df['type']=='blue') & (df['rec_1_outlier']=='False') & (df['rec_2_outlier']=='False') & (df['rec_3_outlier']=='False') & (df['rec_4_outlier']=='False')

This might never select any rows, because the _outlier columns might never all be False in the same row.

I have also thought about doing it one column at a time like this.

df_blue_rec_1 = (df['type']=='blue') & (df['rec_1_outlier']=='False')
df_blue_rec_2 = (df['type']=='blue') & (df['rec_2_outlier']=='False')

Then just appending the separate dataframes into one.

I have this feeling like there is a better way to accomplish this.

Kosmos · Accepted Answer · 2020-05-13 17:28:13Z

You are on the right path. What you built is a boolean mask. Each comparison needs its own parentheses, because & binds more tightly than ==, like so:

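# == 'False' compares against a string; if the flags are real booleans, use ~df['rec_1_outlier'] instead
mask_blue = ((df['type'] == 'blue') &
             (df['rec_1_outlier'] == 'False') &
             (df['rec_2_outlier'] == 'False') &
             (df['rec_3_outlier'] == 'False') &
             (df['rec_4_outlier'] == 'False'))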

The mask is a boolean Series, one True/False per row, aligned with the index of your original df. Pass it to .loc to select the matching rows:

df_blue = df.loc[mask_blue,:]

You can choose which columns to carry over to df_blue by replacing the colon (:) above with a list of column names. For example:

df_blue = df.loc[mask_blue,['type','rec_1']]

This gives a df with only the columns type and rec_1.
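
As a minimal sketch of the row-mask plus column-list selection described above, the example below rebuilds the sample data from the question. It assumes the _outlier columns hold real booleans (not the strings 'True'/'False'), so "not an outlier" is written as ~flag. It filters on rec_2 rather than rec_1 only because the blue row's rec_1 is flagged as an outlier in the sample.

```python
import pandas as pd

# Sample data from the question; the flags are assumed to be real booleans.
df = pd.DataFrame({
    "type": ["yellow", "red", "blue", "green"],
    "rec_1": [1, 3, 5, 2],
    "rec_2": [7, 11, 2, 9],
    "rec_3": [3, 2, 1, 13],
    "rec_4": [1, 5, 6, 9],
    "rec_1_outlier": [False, False, True, False],
    "rec_2_outlier": [True, True, False, False],
    "rec_3_outlier": [True, False, False, True],
    "rec_4_outlier": [True, False, False, False],
})

# Each condition is parenthesized (or a unary ~), then combined with &.
mask_blue = (df["type"] == "blue") & ~df["rec_2_outlier"]

# Rows come from the mask, columns from the list.
df_blue = df.loc[mask_blue, ["type", "rec_2"]]
print(df_blue)
```

If the flags are stored as strings instead, the comparison would stay as == 'False', as in the answer.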

Update
To handle each rec_x independently, create one mask per rec_x column. Rows where the outlier flag is True get NaN in that column. The following code is an example for rec_1 and rec_2.

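# start from the blue rows so every column aligns to the same index
df_blue = pd.DataFrame(index=df.index[df['type'] == 'blue'])
mask_blue1 = ((df['type'] == 'blue') & (df['rec_1_outlier'] == 'False'))
df_blue['rec_1'] = df.loc[mask_blue1, 'rec_1']  # NaN where rec_1 is an outlier
mask_blue2 = ((df['type'] == 'blue') & (df['rec_2_outlier'] == 'False'))
df_blue['rec_2'] = df.loc[mask_blue2, 'rec_2']  # NaN where rec_2 is an outlier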
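
The per-column idea above can also be applied to all rec_x columns at once. The sketch below is an alternative, not the approach from the answer: it uses DataFrame.mask to blank out flagged values and groupby to split the result per type. It assumes the flags are real booleans and that every rec_x column has a matching rec_x_outlier column; the names rec_cols, cleaned, and frames_by_type are made up for illustration.

```python
import pandas as pd

# Sample data from the question; the flags are assumed to be real booleans.
df = pd.DataFrame({
    "type": ["yellow", "red", "blue", "green"],
    "rec_1": [1, 3, 5, 2],
    "rec_2": [7, 11, 2, 9],
    "rec_3": [3, 2, 1, 13],
    "rec_4": [1, 5, 6, 9],
    "rec_1_outlier": [False, False, True, False],
    "rec_2_outlier": [True, True, False, False],
    "rec_3_outlier": [True, False, False, True],
    "rec_4_outlier": [True, False, False, False],
})

rec_cols = ["rec_1", "rec_2", "rec_3", "rec_4"]
flags = df[[f"{col}_outlier" for col in rec_cols]]

# mask() replaces values with NaN wherever the condition is True.
# to_numpy() makes the match positional, since the flag columns are
# named rec_x_outlier rather than rec_x.
cleaned = pd.concat(
    [df[["type"]], df[rec_cols].mask(flags.to_numpy())],
    axis=1,
)

# One dataframe per type, keyed by the type name.
frames_by_type = {name: group for name, group in cleaned.groupby("type")}
df_blue = frames_by_type["blue"]
print(df_blue)
```

Because NaN is a float, the rec_x columns come back as floats after masking.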

Thanks Kosmos. Question for you, would your mask above only find rows where all the _outlier columns = False? Sometimes the row would have rec_1_outlier == False but rec_2_outlier == True. In that row I would want to take the value for rec_1_outlier but reject rec_2_outlier. Does that make sense?
No problem. Hope it works for you. I've updated the post to match your question. This solution gives NaN values where the mask is False.
