Filtering Data in pandas as per condition

Question

I have a df known as df2 as shown:

Name    Age Experience  Education
Archana 35  8           Bachelors
Sharad  39  12          Bachelors
Jitesh  30  2           Diploma
Sukanya 45  18          Bachelors
Shirish 40  15          Bachelors

I want to add a column named promotion and set it to 1 for the rows that meet all of the following conditions:

Education == Bachelors
Experience > 10
Age > 30

Hence the expected df should be:

I know that I can use np.where for this task, but I believe I would have to convert all the columns to string type because the Education column holds strings.

Is there a faster way, other than np.where, to get the same result without converting the columns?

I used

df2['prom'] = (df2['Age']>30)&(df2['experience']>10)&(df2['education' == 'Bachelors'])

But it gives me the following error:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6476/2030827498.py in <module>
      1 #df2['ELIGIBLE_FOR_DISCOUNT'] = np.where((df2['TENURE'] >= '60') & (df2['NO_OF_FAMILY_MEMBERS'] >= '4') & (df2['EMPLOYMENT_STATUS'] =='N'), 1, 0)
      2 
----> 3 df2['ELIGIBLE_FOR_DISCOUNT'] = (df2['TENURE']>60)&(df2['NO_OF_FAMILY_MEMBERS']>3)&(df2['EMPLOYMENT_STATUS' == 'N'])
      4 
      5 

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: False

try: df['promotion'] = (df['Education'].eq('Bachelors') & df['Experience'].gt(10) & df['Age'].gt(30)).astype(int)
– user7864386, Commented Feb 27, 2022 at 6:54
I recommend first checking each condition on its own to see which one produces the error. The traceback shows an isna(key) check, so I suspected that NaNs are the cause.
– keramat, Commented Feb 27, 2022 at 7:12

keramat · Accepted Answer · 2022-02-27 06:35:30Z

1

Use:

df['prom'] = (df['Age'] > 30) & (df['Experience'] > 10) & (df['Education'] == 'Bachelors')

If the Age and Experience columns are not numeric:

df['prom'] = (df['Age'].astype(int) > 30) & (df['Experience'].astype(int) > 10) & (df['Education'] == 'Bachelors')
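
For context, the KeyError: False in the question comes from where the closing bracket sits. In df2['education' == 'Bachelors'], Python first compares the two string literals, which gives False, and pandas then looks for a column labelled False. A minimal sketch, assuming the sample data from the question with the column names shown in its table:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Archana", "Sharad", "Jitesh", "Sukanya", "Shirish"],
    "Age": [35, 39, 30, 45, 40],
    "Experience": [8, 12, 2, 18, 15],
    "Education": ["Bachelors", "Bachelors", "Diploma", "Bachelors", "Bachelors"],
})

# Misplaced bracket: the comparison runs on two plain strings first.
key = "Education" == "Bachelors"   # evaluates to False

caught = None
try:
    df[key]                        # same as df[False]
except KeyError as err:
    caught = err                   # no column is labelled False

# Bracket closed before the comparison: an element-wise boolean Series.
mask = df["Education"] == "Bachelors"
```

Here mask is a boolean Series that can be combined with the other conditions using &.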


5 Comments

Huzefa Sadikot Over a year ago

I am getting the error: '>' not supported between instances of 'str' and 'int'

keramat Over a year ago

Did you use the second one? It should not produce that error, since we cast first. Maybe there are some NaNs there. Can you provide sample data?

Huzefa Sadikot Over a year ago

Yes, I used the second one, and I am getting the error posted in the edited question. There are no NaN values in the dataset. There are values marked as "NONE", though, but no blank values or NaNs.

keramat Over a year ago

So that is the reason.

Huzefa Sadikot Over a year ago

But why is a value written as the string "NONE" not the same as a blank value or NaN?
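
On that last point: the text "NONE" is an ordinary string, not a missing-value marker, so astype(int) cannot convert it and the numeric comparison fails. One possible way around this is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN; comparisons against NaN then evaluate to False. A minimal sketch, assuming hypothetical string-typed columns (the rows below are made up for illustration, not taken from the asker's dataset):

```python
import pandas as pd

# Hypothetical data: numbers stored as text, with one "NONE" placeholder.
df = pd.DataFrame({
    "Age": ["35", "39", "NONE"],
    "Experience": ["8", "12", "15"],
    "Education": ["Bachelors", "Bachelors", "Bachelors"],
})

# Entries that cannot be parsed as numbers become NaN instead of raising.
age = pd.to_numeric(df["Age"], errors="coerce")
experience = pd.to_numeric(df["Experience"], errors="coerce")

# NaN > 30 is False, so the "NONE" row simply gets 0.
df["promotion"] = (
    (age > 30) & (experience > 10) & (df["Education"] == "Bachelors")
).astype(int)
```

Whether treating "NONE" as not eligible is the right choice depends on the data; the sketch only shows how to avoid the conversion error.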

Huzefa Sadikot · Accepted Answer · 2022-02-27 07:02:35Z

1

As suggested in one of the comments use:

df['promotion'] = (df['Education'].eq('Bachelors') & df['Experience'].gt(10) & df['Age'].gt(30)).astype(int)
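
Here .eq and .gt are the method forms of == and >, so no string conversion is needed. A minimal sketch of this line applied to the sample data from the question, assuming Age and Experience are already numeric:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Archana", "Sharad", "Jitesh", "Sukanya", "Shirish"],
    "Age": [35, 39, 30, 45, 40],
    "Experience": [8, 12, 2, 18, 15],
    "Education": ["Bachelors", "Bachelors", "Diploma", "Bachelors", "Bachelors"],
})

# Combine the three element-wise conditions, then cast True/False to 1/0.
df["promotion"] = (
    df["Education"].eq("Bachelors")
    & df["Experience"].gt(10)
    & df["Age"].gt(30)
).astype(int)

# Names of the rows flagged for promotion.
promoted = df.loc[df["promotion"] == 1, "Name"].tolist()
```

With this data, Sharad, Sukanya and Shirish are flagged with 1, and Archana and Jitesh with 0.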


Pradip · Accepted Answer · 2022-02-27 07:58:31Z

1

This will handle all your fallback cases.

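def promotion_flag(row):
    # Return 1 when all three conditions hold, and 0 otherwise
    # or when a value cannot be converted to int (e.g. "NONE").
    try:
        return 1 if int(row["Age"]) > 30 and int(row["Experience"]) > 10 and str(row["Education"]) == "Bachelors" else 0
    except (ValueError, TypeError):
        return 0

df["promotion"] = df.apply(promotion_flag, axis=1)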
