
I am trying to obtain a list of columns in a DataFrame where any value in the column contains a given string. For example, in the DataFrame below I would like a list of the columns whose values contain a % character. I can accomplish this with a for loop and the Series.str.contains method, but it doesn't seem optimal, especially with a larger dataset. Is there a more efficient way to do this?

import pandas as pd

df = pd.DataFrame({'A': {0: '2019-06-01', 1: '2019-06-01', 2: '2019-06-01'},
                   'B': {0: '10', 1: '20', 2: '30'},
                   'C': {0: '10', 1: '20%', 2: '30%'},
                   'D': {0: '10%', 1: '20%', 2: '30'},
               })

DataFrame

            A   B    C    D
0  2019-06-01  10   10  10%
1  2019-06-01  20  20%  20%
2  2019-06-01  30  30%   30

Current Method

col_list = []
for col in df.columns:
    if (True in list(df[col].str.contains('%'))) is True:
        col_list.append(col)

Output

['C', 'D']
  • Why not just call df.dtypes? Commented Jun 21, 2019 at 13:47

6 Answers


stack with any

df.columns[df.stack().str.contains('%').any(level=1)]

Index(['C', 'D'], dtype='object')
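Note that the level= argument to any was deprecated in pandas 1.3 and removed in 2.0. On newer pandas the same idea can be expressed by grouping the stacked mask on the column level of the index; a minimal sketch, assuming an all-string DataFrame like the one above:

# pandas 2.0+ removed any(level=...); group the stacked mask by the column level instead
mask = df.stack().str.contains('%').groupby(level=1).any()
df.columns[mask.reindex(df.columns, fill_value=False)]

Index(['C', 'D'], dtype='object')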

comprehension

[c for c in df if df[c].str.contains('%').any()]

['C', 'D']

filter

[*filter(lambda c: df[c].str.contains('%').any(), df)]

['C', 'D']

NumPy's find

from numpy.core.defchararray import find

df.columns[(find(df.to_numpy().astype(str), '%') >= 0).any(0)]

Index(['C', 'D'], dtype='object')
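numpy.core is a private namespace in NumPy 2.0+, so on recent versions the same check is safer through the public np.char interface; a small sketch of the equivalent call:

import numpy as np

# np.char.find returns the index of '%' in each element, or -1 if absent
df.columns[(np.char.find(df.to_numpy().astype(str), '%') >= 0).any(axis=0)]

Index(['C', 'D'], dtype='object')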

1 Comment

Great thinking. Thanks. I think your first method needs a bit of adjustment to apply to more scenarios: for DataFrames with columns of float64 dtype, it wouldn't work. It seems like .str just eliminates all the float64 columns.
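A minimal workaround for that mixed-dtype case, assuming it is acceptable to cast every column to str before the check:

# astype(str) keeps float64 (or any non-object) columns from being dropped by .str
[c for c in df if df[c].astype(str).str.contains('%').any()]

['C', 'D']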

First use DataFrame.select_dtypes to filter only the object columns, which here are the string columns.

Then use DataFrame.applymap for an elementwise check of the values, with DataFrame.any to return True if at least one value per column matches, so the columns can be filtered:

c = df.columns[df.select_dtypes(object).applymap(lambda x: '%' in str(x)).any()].tolist()
print (c)
['C', 'D']

Or use Series.str.contains per column; the na parameter can be omitted if all columns contain only strings:

f = lambda x: x.str.contains('%', na=False)
c = df.columns[df.select_dtypes(object).apply(f).any()].tolist()
print (c)
['C', 'D']
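For what it's worth, DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map; on newer pandas the first filter would look roughly like this:

# identical logic, using the newer DataFrame.map instead of applymap
c = df.columns[df.select_dtypes(object).map(lambda x: '%' in str(x)).any()].tolist()
print(c)
['C', 'D']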

1 Comment

Not OP, but thanks for the succinct answer. Your first method, if I'm not mistaken, will only work if the dtypes of all the columns are str, right?

Try this:

df.columns[df.apply(lambda x: x.str.contains("\%")).any()]



Compare with replace and create a mask to index the columns accordingly:

df.loc[:,(df != df.replace('%', '', regex=True)).any()]
     C    D
0   10  10%
1  20%  20%
2  30%   30

df.columns[(df != df.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')

This avoids the need for a loop, apply, or applymap.
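This should also cope with non-string columns, since the regex replace leaves numeric values untouched, so those columns compare equal and drop out of the mask. A quick sketch with a hypothetical integer column E added:

df_mixed = df.assign(E=[1, 2, 3])   # hypothetical extra integer column
df_mixed.columns[(df_mixed != df_mixed.replace('%', '', regex=True)).any()]
# Index(['C', 'D'], dtype='object')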



Let us do melt

df.melt().loc[lambda x: x.value.str.contains('%'), 'variable'].unique()
Out[556]: array(['C', 'D'], dtype=object)
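If some of the columns are not strings, .str on the melted value column would return NaN for them and the boolean mask would fail; casting to str first is one possible guard (a sketch, not part of the original answer):

# astype(str) keeps non-string values from producing NaN in the mask
df.melt().loc[lambda x: x.value.astype(str).str.contains('%'), 'variable'].unique()

array(['C', 'D'], dtype=object)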



The solution below also works on object dtypes (here it checks for an exact match rather than a substring):

[c for c in df if df[c].eq("CSS").any()]
