Finding string over multiple columns in Pandas

Question

I am trying to find if a string exists across multiple columns. I would like to return a 1 if the string exists and 0 if it doesn't as a new series within the dataframe.

After searching the forums, I understand that str.contains could be used, but I'm searching over 100+ columns, so it isn't efficient for me to work with one series at a time.

There are some NAs within the columns if this is relevant.

Example simplified dataframe:

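import pandas as pd

# Note: 'strings_1' appears twice, so Python keeps only the second list
# ('AE', 'AC', 'AI') and the resulting frame has four columns.
d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'], 
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'], 
'strings_4': ['AV', 'AB', 'BV']}
simple_df = pd.DataFrame(data=d)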

If I am interested in finding 'AA' for example, I would like to return the following dataframe.

Example target dataframe:

d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'], 
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'], 
'strings_4': ['AV', 'AB', 'BV'], 'AA_TRUE': [1, 0, 0]}
target_df = pd.DataFrame(data=d)

Many thanks for your help.

Do you have a list of strings? You say `1 if the string exists`, but which string? – Bharath M Shetty, Commented Nov 8, 2017 at 13:44
In my example it would be AA; however, yes, there would be a list I'd be interested in. I'll likely assign them individually as I'd like multiple labels. – shbfy, Commented Nov 8, 2017 at 13:47
Next time add that in the question too. How can we guess you have a list that holds ['AA']? – Bharath M Shetty, Commented Nov 8, 2017 at 13:47

jezrael · Accepted Answer · 2017-11-08 13:50:13Z

13

To check for exact matches when the columns may mix numeric and string values, compare the NumPy array returned by values with the string, use any along axis 1 to test for at least one True per row, and finally cast to int:

simple_df['new'] = (simple_df.values == 'AA').any(axis=1).astype(int)
#or cast all values to string before comparing
#simple_df['new'] = (simple_df.astype(str) == 'AA').any(axis=1).astype(int)
print (simple_df)
  strings_1 strings_2 strings_3 strings_4  new
0        AE        BB        AA        AV    1
1        AC        BA        DD        AB    0
2        AI        AG        PP        BV    0

Detail:

print ((simple_df.values == 'AA'))
[[False False  True False False]
 [False False False False False]
 [False False False False False]]

print ((simple_df.values == 'AA').any(axis=1))
[ True False False]
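
Since the comments mention a list of strings, each needing its own label, here is a minimal sketch of the same exact-match idea applied in a loop. It swaps the array comparison for DataFrame.isin, which also treats NaN as a non-match. The targets list and the '_TRUE' column suffix are assumptions for illustration (the suffix follows the AA_TRUE column in the question).

```python
import pandas as pd

# Same data as the question, with the duplicated 'strings_1' key resolved
# to the values that actually survive in the original dict.
simple_df = pd.DataFrame({
    'strings_1': ['AE', 'AC', 'AI'],
    'strings_2': ['BB', 'BA', 'AG'],
    'strings_3': ['AA', 'DD', 'PP'],
    'strings_4': ['AV', 'AB', 'BV'],
})

# Hypothetical list of labels to look for.
targets = ['AA', 'AB']

# Remember the original columns so the new flag columns are never searched.
search_cols = list(simple_df.columns)

for target in targets:
    # isin does an exact, element-wise comparison; any(axis=1) collapses
    # each row to a single True/False, which is then cast to 1/0.
    simple_df[target + '_TRUE'] = (
        simple_df[search_cols].isin([target]).any(axis=1).astype(int)
    )

print(simple_df)
```

Fixing search_cols before the loop keeps the flag columns out of later searches, which matters once there are many labels.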

To check for a substring instead:

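# assumes every cell is a string: run this before adding the integer 'new' column, and fill NaN first
# (in pandas 2.1+ applymap is deprecated in favour of DataFrame.map)
simple_df['new'] = simple_df.applymap(lambda x: 'G' in x).any(axis=1).astype(int)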
print (simple_df)
  strings_1 strings_2 strings_3 strings_4  new
0        AE        BB        AA        AV    0
1        AC        BA        DD        AB    0
2        AI        AG        PP        BV    1
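
The question mentions NAs in the columns, and the lambda above raises a TypeError on a NaN cell because `in` cannot be applied to a float. A possible NaN-tolerant variant, sketched here under the assumption that all searched columns hold strings or missing values, applies Series.str.contains column by column. The helper name contains_any_column and the small frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing values (not the question's data).
df = pd.DataFrame({
    'strings_1': ['AE', 'AC', np.nan],
    'strings_2': ['BB', 'BA', 'AG'],
    'strings_3': ['AA', np.nan, 'PP'],
})

def contains_any_column(frame, substring):
    """Return 1 for rows where any column contains `substring`, else 0."""
    hits = frame.apply(
        # regex=False treats the substring literally;
        # na=False makes missing cells count as "no match".
        lambda col: col.str.contains(substring, regex=False, na=False)
    )
    return hits.any(axis=1).astype(int)

df['has_G'] = contains_any_column(df[['strings_1', 'strings_2', 'strings_3']], 'G')
print(df)
```

This still loops over columns internally, but each column is handled by the vectorised string method, so it should stay manageable for 100+ columns.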

edited Nov 8, 2017 at 13:50

answered Nov 8, 2017 at 13:45

jezrael


5 Comments

dondapati Over a year ago

How can I search for two substrings at once across the entire data? Here you use G; can I check for E and G at the same time?

dondapati Over a year ago

simple_df.applymap(lambda x: 'G|E' in x).any(1).astype(int) I am writing it this way, but it does not give the expected output.

jezrael Over a year ago

@dondapati - you need simple_df.applymap(lambda x: any(y in x for y in ['G','E']))

dondapati Over a year ago

Can we apply the above function to selected columns only? Just the 1st and 2nd columns.

jezrael Over a year ago

@dondapati - sure, use simple_df.iloc[:, :2].applymap(lambda x: any(y in x for y in ['G','E']))
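
On the 'G|E' attempt in the comments: Python's `in` looks for the literal three-character text 'G|E', so it never matches here. The `|` alternation only works where a regular expression is expected, for example in Series.str.contains. The sketch below combines that with a column selection. The substrings list and the 'G_or_E' column name are assumptions for illustration.

```python
import re
import pandas as pd

simple_df = pd.DataFrame({
    'strings_1': ['AE', 'AC', 'AI'],
    'strings_2': ['BB', 'BA', 'AG'],
    'strings_3': ['AA', 'DD', 'PP'],
    'strings_4': ['AV', 'AB', 'BV'],
})

# Hypothetical substrings to look for; re.escape keeps any special
# characters literal when they are joined into one regex.
substrings = ['G', 'E']
pattern = '|'.join(re.escape(s) for s in substrings)  # 'G|E'

# Restrict the search to the first two columns by position.
first_two = simple_df.iloc[:, :2]

simple_df['G_or_E'] = (
    first_two.apply(lambda col: col.str.contains(pattern, na=False))
    .any(axis=1)
    .astype(int)
)

print(simple_df)
```

Selecting by name, for example simple_df[['strings_1', 'strings_2']], works the same way and is less fragile if the column order changes.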
