Select certain columns based on multiple criteria in pandas

Question

I have the following dataset:

import pandas as pd

my_df = pd.DataFrame({'id':[1,2,3,4,5],
                      'type':['corp','smb','smb','corp','mid'],
                      'sales':[34567,2190,1870,22000,10000],
                      'sales_roi':[.10,.21,.22,.15,.16],
                      'sales_pct':[.38,.05,.08,.30,.20],
                      'sales_ln':[4.2,2.1,2.0,4.1,4],
                      'cost_pct':[22000,1000,900,14000,5000],
                      'flag':[0,1,0,1,1],
                      'gibberish':['bla','ble','bla','ble','bla'],
                      'tech':['lnx','mst','mst','lnx','mc']})
my_df['type'] = pd.Categorical(my_df.type)
my_df
    id  type    sales   sales_roi   sales_pct   sales_ln    cost_pct    flag    gibberish   tech
0   1   corp    34567   0.10        0.38        4.2         22000       0       bla         lnx
1   2   smb     2190    0.21        0.05        2.1         1000        1       ble         mst
2   3   smb     1870    0.22        0.08        2.0         900         0       bla         mst
3   4   corp    22000   0.15        0.30        4.1         14000       1       ble         lnx
4   5   mid     10000   0.16        0.20        4.0         5000        1       bla         mc

I want to filter out all columns whose names end in "_pct" or "_ln", or are equal to "gibberish" or "tech". This is what I have tried:

df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') &
~my_df.columns.str.endswith('_ln') &
~my_df.columns.str.contains('gibberish','tech')]

But it returns an unwanted column ("tech"):

    id  type    sales   sales_roi   flag    tech
0   1   corp    34567   0.10        0       lnx
1   2   smb     2190    0.21        1       mst
2   3   smb     1870    0.22        0       mst
3   4   corp    22000   0.15        1       lnx
4   5   mid     10000   0.16        1       mc

This is the expected result:

    id  type    sales   sales_roi   flag
0   1   corp    34567   0.10        0   
1   2   smb     2190    0.21        1   
2   3   smb     1870    0.22        0    
3   4   corp    22000   0.15        1   
4   5   mid     10000   0.16        1

Please consider that I have to deal with hundreds of variables and this is just an example of what I need.

my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]. Put all the endswith suffixes in a single tuple.
– It_is_Chris, Commented Nov 17, 2021 at 15:22
So I needed double parentheses. Thank you very much @It_is_Chris, how can I award your answer?
– Alexis, Commented Nov 17, 2021 at 15:23

It_is_Chris · Accepted Answer · 2021-11-17 15:25:56Z

1

As written, your attempt never tests for 'tech': in str.contains('gibberish','tech') the second positional argument is case, not another pattern, so only 'gibberish' is matched. str.endswith accepts a tuple, so just put all the endings you are looking for in a single tuple and then filter:

my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]

   id  type  sales  sales_roi  flag
0   1  corp  34567       0.10     0
1   2   smb   2190       0.21     1
2   3   smb   1870       0.22     0
3   4  corp  22000       0.15     1
4   5   mid  10000       0.16     1
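
Note that endswith(('gibberish','tech')) also drops any column whose name merely ends in one of those words (a hypothetical "fintech" column, for example). If those two names must match exactly, as the question asks, one option is to combine the suffix test with Index.isin. A minimal sketch on a small stand-in frame; the helper name drop_by_name and its parameters are made up for illustration and are not part of pandas:

```python
import pandas as pd


def drop_by_name(df, suffixes, names):
    """Return df without columns that end in any suffix or equal any name.

    Hypothetical helper, not part of pandas.
    """
    cols = df.columns
    # str.endswith accepts a tuple of suffixes; isin tests exact membership.
    mask = cols.str.endswith(tuple(suffixes)) | cols.isin(list(names))
    return df.loc[:, ~mask]


# Stand-in data: one row is enough, since only the column names matter.
demo = pd.DataFrame([[1, 'corp', 0.38, 4.2, 0, 'bla', 'lnx', 7]],
                    columns=['id', 'type', 'sales_pct', 'sales_ln',
                             'flag', 'gibberish', 'tech', 'fintech'])

selected = drop_by_name(demo, suffixes=('_pct', '_ln'),
                        names=('gibberish', 'tech'))
print(list(selected.columns))  # ['id', 'type', 'flag', 'fintech']
```

Here "fintech" survives because it is only compared for equality, whereas the single-tuple endswith version would have removed it.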

answered Nov 17, 2021 at 15:25

It_is_Chris


4 Comments

Alexis Over a year ago

Now I see you didn't use loc. Anyway, it works fine. Just a question, I tried this:

df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') & ~ my_df.columns.str.endswith('_ln') & ~ my_df.columns.str.contains(('gibberish','tech'))]

but it raised an error. It wasn't just a matter of double parentheses, as I wrongly assumed. Why didn't my attempt work?

It_is_Chris Over a year ago

The issue is with this part: my_df.columns.str.contains(('gibberish','tech')). str.contains does not accept tuples, but str.endswith does; that is the difference.

Alexis Over a year ago

Thank you very much for your time and great answer @It_is_Chris. Now I understand. Have a great day.

It_is_Chris Over a year ago

You're welcome. Also, just so you are aware, str.contains treats its pattern as a regular expression by default, so the | alternation works: my_df.columns.str.contains('gibberish|tech') will do the job as well.
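
Because str.contains interprets its pattern as a regular expression by default, the whole filter can also be written as a single pattern. A minimal sketch, assuming the suffixes should be anchored at the end of the name and the two full names matched exactly; the frame demo and the variable names are illustrative only:

```python
import pandas as pd

# Stand-in frame with the same column names as in the question.
demo = pd.DataFrame(columns=['id', 'type', 'sales', 'sales_roi', 'sales_pct',
                             'sales_ln', 'cost_pct', 'flag', 'gibberish', 'tech'])

# (?:...) groups without capturing; $ anchors the suffixes at the end of the
# name, and ^...$ requires the two full names to match exactly.
pattern = r'(?:_pct|_ln)$|^(?:gibberish|tech)$'

unwanted = demo.columns.str.contains(pattern)
selected = demo.loc[:, ~unwanted]
print(list(selected.columns))  # ['id', 'type', 'sales', 'sales_roi', 'flag']
```

The non-capturing groups are a deliberate choice: they keep the pattern a pure match test rather than something that looks like an extraction.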

Bog · Answer · answered 2021-11-17 15:47 · edited by Nimantha 2021-11-18 02:46:42Z

1

I would do it like this:

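criteria = ["_pct", "_ln", "gibberish", "tech"]

for column in my_df.columns:
    # Drop the column once if its name contains any of the criteria;
    # testing with any() avoids dropping the same column twice.
    if any(criterion in column for criterion in criteria):
        my_df = my_df.drop(column, axis=1)

Of course, you can change the `in` test inside the if statement to endswith or something else of your choice.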
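
The same selection can also be computed on the column names alone, before touching the DataFrame, so that everything is dropped in a single call. A minimal sketch in plain Python; the function name columns_to_drop is hypothetical:

```python
def columns_to_drop(columns, criteria):
    """Return the names in `columns` that contain any string in `criteria`."""
    return [name for name in columns
            if any(criterion in name for criterion in criteria)]


columns = ['id', 'type', 'sales', 'sales_roi', 'sales_pct',
           'sales_ln', 'cost_pct', 'flag', 'gibberish', 'tech']
criteria = ['_pct', '_ln', 'gibberish', 'tech']

to_drop = columns_to_drop(columns, criteria)
print(to_drop)  # ['sales_pct', 'sales_ln', 'cost_pct', 'gibberish', 'tech']
```

With a DataFrame, this would be used as my_df.drop(columns=columns_to_drop(my_df.columns, criteria)).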

edited Nov 18, 2021 at 2:46

Nimantha


answered Nov 17, 2021 at 15:47

Bog


2 Comments

Alexis Over a year ago

Hi Pixelbog, thanks for the answer. Let me share my take:

df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') & ~my_df.columns.str.endswith('_ln') & ~my_df.columns.str.contains('gibberish|tech')]

With contains you can specify this or that word. Anyway, the answer from It_is_Chris is simpler than mine.

Bog Over a year ago

Hello Alexis. Great solution! Yeah, I am quite new to pandas and would never have thought of doing it that way, but it's great that I learned something new :)
