61

I would like to create a new column with a numerical value based on the following conditions:

a. if gender is male & pet1==pet2, points = 5

b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5

c. all other combinations, points = 0

    gender    pet1      pet2
0   male      dog       dog
1   male      cat       cat
2   male      dog       cat
3   female    cat       squirrel
4   female    dog       dog
5   female    squirrel  cat
6   squirrel  dog       cat

I would like the end result to be as follows:

    gender    pet1      pet2      points
0   male      dog       dog       5
1   male      cat       cat       5
2   male      dog       cat       0
3   female    cat       squirrel  5
4   female    dog       dog       5
5   female    squirrel  cat       0
6   squirrel  dog       cat       0

How do I accomplish this?

0

8 Answers 8

76

numpy.select

This is a perfect case for np.select where we can create a column based on multiple conditions and it's a readable method when there are more conditions:

conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]

choices = [5,5]

df['points'] = np.select(conditions, choices, default=0)

print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Sign up to request clarification or add additional context in comments.

1 Comment

perfect, indeed
55

You can do this using np.where, the conditions use bitwise & and | for and and or with parentheses around the multiple conditions due to operator precedence. So where the condition is true 5 is returned and 0 otherwise:

In [29]:
df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Comments

30

using apply.

def f(x):
  if x['gender'] == 'male' and x['pet1'] == x['pet2']: return 5
  elif x['gender'] == 'female' and (x['pet1'] == 'cat' or x['pet1'] == 'dog'): return 5
  else: return 0

data['points'] = data.apply(f, axis=1)

Comments

7

You can also use the apply function. For example:

def myfunc(gender, pet1, pet2):
    if gender=='male' and pet1==pet2:
        myvalue=5
    elif gender=='female' and (pet1=='cat' or pet1=='dog'):
        myvalue=5
    else:
        myvalue=0
    return myvalue

And then using the apply function by setting axis=1

df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)

We get:

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Comments

6

The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like

df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + \
               ((df.a != 0) & (df.b != 1)).astype(int) * 3 + \
               ((df.a == 0) & (df.b == 1)).astype(int) * 4 + \
               ((df.a != 0) & (df.b == 1)).astype(int) * 5 

where the apply method took 25 seconds and this method above took about 18ms.

Comments

4

Writing the conditions as a string expression and evaluating it using eval() is another method to evaluate the condition and assign values to the column using numpy.where().

# evaluate the condition 
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)

If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask.1

Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is better readability (arguably).


1: For a dataframe with 105k rows, if you evaluate 4 conditions where each chain two comparisons, eval() creates a boolean mask substantially faster than chaining bitwise operators.

df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)

%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

1 Comment

Thank you very much for this method, this was precisely what I was looking for when stumbled upon this question. I knew about the other mentioned methods, but was recently fascinated by the beauty of eval() and query() syntax from the readability perspective, but failed to come up with a good way of using those for this problem. But this seems to be the way. I know that this answer was posted way later then most others, but it is still a crime that it is so down on votes...
0

One option is with case_when from pyjanitor; under the hood it uses pd.Series.mask.

The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:

# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

You could use strings for the conditions, as long as they can be evaluated by pd.eval on the parent dataframe - note that speed wise, this can be slower for small datasets:

df.case_when(
   "gender == 'male' and pet1 == pet2", 5,
   "gender == 'female' and pet2 == ['cat', 'dog']", 5,
   0,
   column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       0
4    female       dog       dog       5
5    female  squirrel       cat       5
6  squirrel       dog       cat       0

Anonymous functions are also possible, which can be handy in chained operations:

df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Comments

0

df.assign

Let's take a dummy example where we have raw length and predicted length as attributes. We want to know if the difference between these two lengths is within acceptable tolerance.

For this we create two columns as follows -

>>> import pandas as pd
>>> df = pd.DataFrame({"length":[1,2,3],"pred_length":[1.001,2,3.1]})
>>> df['absdiff'] = (df['length']-df['pred_length']).abs()
>>> df['percdiff'] = 100 * df['absdiff']/df['length']
>>> df
   length  pred_length  absdiff  percdiff
0       1        1.001    0.001  0.100000
1       2        2.000    0.000  0.000000
2       3        3.100    0.100  3.333333

Next, we can use df.assign to create a decision column within_limits based on absdiff and percdiff columns as follows -

>>> df = df.assign(is_equal=np.where((df.absdiff<0.01) & (df.percdiff<0.1),1,0))
>>> df
   length  pred_length  absdiff  percdiff  is_equal
0       1        1.001    0.001  0.100000         1
1       2        2.000    0.000  0.000000         1
2       3        3.100    0.100  3.333333         0

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.