I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.

    df = pd.DataFrame(
        [['a',],['a',1],['a',1],['a',2],['b',2],['b',2],['c',3]],
        columns=['letter','number']
    )
      letter  number
    0      a       1
    1      a       1
    2      a       1
    3      a       2
    4      b       2
    5      b       2
    6      c       3

I need a third column that is 1 if both 'a' and 2 are present in the row, and 0 otherwise. So it would be `[0,0,0,1,0,0,0]`.

How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
  • If it's just that simple condition, you don't need apply here: `df['new_column'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)` Commented Nov 26, 2018 at 20:10
  • This makes sense, but for even 3 or 4 columns with a condition, this would get unwieldy. Are there any alternatives? Commented Nov 26, 2018 at 20:16
  • There are alternatives, have you looked through the documentation? Your best bet is to try something and see if it fits your needs. Commented Nov 26, 2018 at 20:17
  • @max whether it be via apply or using boolean conditions, it will be about equally unwieldy (code-wise) but the latter will be much faster. Commented Nov 26, 2018 at 20:24

2 Answers

2

You can use apply with axis=1. Suppose you wanted to call your new column c:

df['c'] = df.apply(
    lambda row: (row['letter'] == 'a') and (row['number'] == 2),
    axis=1
).astype(int)

print(df)
#  letter  number  c
#0      a     NaN  0
#1      a     1.0  0
#2      a     1.0  0
#3      a     2.0  1
#4      b     2.0  0
#5      b     2.0  0
#6      c     3.0  0

But `apply` is slow and should be avoided if possible. In this case, it would be much better to use boolean logic operations, which are vectorized.

df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)

This has the same result as using apply above.
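As a minimal sketch of how this scales to the three-or-four-condition case raised in the comments: each condition can be written as its own boolean mask and the masks combined with `np.logical_and.reduce`. The third condition (`notna()`) is a hypothetical addition for illustration only, and the frame is rebuilt from a dict with an explicit `NaN` so that it matches the printed output above.

```python
import numpy as np
import pandas as pd

# Same data as in the printed output above; the first number is missing.
df = pd.DataFrame({
    'letter': ['a', 'a', 'a', 'a', 'b', 'b', 'c'],
    'number': [np.nan, 1, 1, 2, 2, 2, 3],
})

# Row-wise version, evaluated once per row.
via_apply = df.apply(
    lambda row: (row['letter'] == 'a') and (row['number'] == 2),
    axis=1,
).astype(int)

# Vectorized version: one boolean mask per condition.
conditions = [
    df['letter'] == 'a',
    df['number'] == 2,
    df['number'].notna(),  # hypothetical extra condition, for illustration
]

# AND all masks together element-wise, then convert True/False to 1/0.
combined = np.logical_and.reduce(conditions)
via_masks = pd.Series(combined, index=df.index).astype(int)

df['c'] = via_masks
print(df)
```

Adding another condition then means adding one line to the `conditions` list rather than extending a long `&` chain.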

1

You can also use `pd.Series.where()` or `np.where()`. If you are only interested in the int representation of the boolean values, the other answer is enough. If you want more freedom in choosing the if/else values, use `np.where()`:

import pandas as pd
import numpy as np

# create example
values = ['a', 'b', 'c']
df = pd.DataFrame()
df['letter'] = np.random.choice(values, size=10)
df['number'] = np.random.randint(1,3, size=10)

# condition
df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
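To illustrate the extra freedom in the if/else values, here is a small sketch on fixed data. The labels and the column names `label` and `kind` are made up for illustration, and `np.select` is a NumPy function not used in the answer above; it is shown as one possible way to handle more than two outcomes.

```python
import numpy as np
import pandas as pd

# Small fixed example so the result is reproducible.
df = pd.DataFrame({
    'letter': ['a', 'a', 'b', 'c'],
    'number': [1, 2, 2, 3],
})

is_match = (df['letter'] == 'a') & (df['number'] == 2)

# np.where takes the first value where the mask is True, the second otherwise.
df['label'] = np.where(is_match, 'match', 'no match')

# np.select checks the conditions in order; the first one that holds wins,
# and rows matching none of them get the default.
df['kind'] = np.select(
    [is_match, df['letter'] == 'a'],
    ['a with 2', 'a without 2'],
    default='other',
)

print(df)
```

The values passed to `np.where` need not be constants; other columns or arrays of the same length work as well.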
