
I am trying to obtain a list of columns in a DataFrame where any value in the column contains a given string. For example, in the DataFrame below I would like a list of the columns whose values contain a % character. I can accomplish this with a for loop and the Series.str.contains method, but it doesn't seem optimal, especially with a larger dataset. Is there a more efficient way to do this?

import pandas as pd

df = pd.DataFrame({'A': {0: '2019-06-01', 1: '2019-06-01', 2: '2019-06-01'},
                   'B': {0: '10', 1: '20', 2: '30'},
                   'C': {0: '10', 1: '20%', 2: '30%'},
                   'D': {0: '10%', 1: '20%', 2: '30'},
               })

DataFrame

            A   B    C    D
0  2019-06-01  10   10  10%
1  2019-06-01  20  20%  20%
2  2019-06-01  30  30%   30

Current Method

col_list = []
for col in df.columns:
    if (True in list(df[col].str.contains('%'))) is True:
        col_list.append(col)

Output

['C', 'D']
  • Why not just call df.dtypes? Commented Jun 21, 2019 at 13:47

6 Answers


stack with any

df.columns[df.stack().str.contains('%').any(level=1)]

Index(['C', 'D'], dtype='object')
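Note that the level= argument to any was deprecated in pandas 1.3 and removed in 2.0. On newer pandas the same idea can be expressed by grouping the stacked mask on the column level of the index; a minimal sketch, assuming an all-string DataFrame like the one above:

# pandas 2.0+ removed any(level=...); group the stacked mask by the column level instead
mask = df.stack().str.contains('%').groupby(level=1).any()
df.columns[mask.reindex(df.columns, fill_value=False)]

Index(['C', 'D'], dtype='object')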

comprehension

[c for c in df if df[c].str.contains('%').any()]

['C', 'D']

filter

[*filter(lambda c: df[c].str.contains('%').any(), df)]

['C', 'D']

NumPy's find

from numpy.core.defchararray import find

df.columns[(find(df.to_numpy().astype(str), '%') >= 0).any(0)]

Index(['C', 'D'], dtype='object')
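numpy.core is a private namespace in NumPy 2.0+, so on recent versions the same check is safer through the public np.char interface; a small sketch of the equivalent call:

import numpy as np

# np.char.find returns the index of '%' in each element, or -1 if absent
df.columns[(np.char.find(df.to_numpy().astype(str), '%') >= 0).any(axis=0)]

Index(['C', 'D'], dtype='object')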

1 Comment

Great thinking. Thanks. I think your first method needs a bit of adjustment to apply to more scenarios: for DataFrames with columns of float64 dtype, it wouldn't work. It seems like .str just eliminates all the float64 columns.
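A minimal workaround for that mixed-dtype case, assuming it is acceptable to cast every column to str before the check:

# astype(str) keeps float64 (or any non-object) columns from being dropped by .str
[c for c in df if df[c].astype(str).str.contains('%').any()]

['C', 'D']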

First use DataFrame.select_dtypes to filter only the object columns, which here are the string columns.

Then use DataFrame.applymap for an elementwise check of the values, with DataFrame.any to return True if at least one value per column matches, so the columns can be filtered:

c = df.columns[df.select_dtypes(object).applymap(lambda x: '%' in str(x)).any()].tolist()
print (c)
['C', 'D']

Or use Series.str.contains per column; the na parameter can be omitted if all columns contain only strings:

f = lambda x: x.str.contains('%', na=False)
c = df.columns[df.select_dtypes(object).apply(f).any()].tolist()
print (c)
['C', 'D']
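For what it's worth, DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map; on newer pandas the first filter would look roughly like this:

# identical logic, using the newer DataFrame.map instead of applymap
c = df.columns[df.select_dtypes(object).map(lambda x: '%' in str(x)).any()].tolist()
print(c)
['C', 'D']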

1 Comment

Not OP, but thanks for the succinct answer. Your first method, if I'm not mistaken, will only work if the dtypes of all the columns are str, right?

Try this:

df.columns[df.apply(lambda x: x.str.contains("\%")).any()]



Compare with replace and create a mask to index the columns accordingly:

df.loc[:,(df != df.replace('%', '', regex=True)).any()]
     C    D
0   10  10%
1  20%  20%
2  30%   30

df.columns[(df != df.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')

This avoids the need for a loop, apply, or applymap.
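This should also cope with non-string columns, since the regex replace leaves numeric values untouched, so those columns compare equal and drop out of the mask. A quick sketch with a hypothetical integer column E added:

df_mixed = df.assign(E=[1, 2, 3])   # hypothetical extra integer column
df_mixed.columns[(df_mixed != df_mixed.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')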



Let us do melt

df.melt().loc[lambda x: x.value.str.contains('%'), 'variable'].unique()
Out[556]: array(['C', 'D'], dtype=object)
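If some of the columns are not strings, .str on the melted value column would return NaN for them and the boolean mask would fail; casting to str first is one possible guard (a sketch, not part of the original answer):

# astype(str) keeps non-string values from producing NaN in the mask
df.melt().loc[lambda x: x.value.astype(str).str.contains('%'), 'variable'].unique()

array(['C', 'D'], dtype=object)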



The solution below also works on object dtypes (here it checks for an exact match rather than a substring):

[c for c in df if df[c].eq("CSS").any()]
