How to select all elements greater than a given value in a dataframe

Question

I have a CSV file that my Python code reads into a DataFrame using pandas.

CSV file is in following format

My code calculates the percentile and should then find all rows whose value in the 2nd column is greater than 60.

df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')

percentile = df.iloc[:, 1:2].quantile(0.99)  # Selecting 2nd column and calculating percentile

criteria = df[df.iloc[:, 1:2] >= 60.0]

While my percentile code works fine, the criteria line that should find all rows with a value greater than 60 in column 2 returns:

NaN     NaN
NaN     NaN
NaN     NaN
NaN     NaN

Can you please help me find the error?

What is frame? If I replace frame with df in the third row of your code everything basically works here...
– applesoup, Commented Jun 14, 2018 at 21:16
I don't know whether this is what you want, but you can replace the column indices 1:2 by 1: criteria = frame[frame.iloc[:, 1] >= 60.0].
– applesoup, Commented Jun 14, 2018 at 21:19

GianAnge · Accepted Answer · 2018-06-14 22:16:49Z

Just correct the condition inside criteria. Since the second column has positional index 1, you should write df.iloc[:, 1].
Example:

import pandas as pd
import numpy as np

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])

df = pd.DataFrame(b.T)  # just creating the dataframe

criteria = df[df.iloc[:, 1] >= 60]
print(criteria)

Why? The cause seems to lie in the type of the condition object. Let's inspect:

Case 1:

type( df.iloc[:,1]>= 60 )

Returns pandas.core.series.Series,
so it gives

 df[ df.iloc[:,1]>= 60 ]

 #out:
   0   1
1  2  99
3  7  63

Case 2:

type( df.iloc[:,1:2]>= 60 )

Returns a pandas.core.frame.DataFrame, and gives

df[ df.iloc[:,1:2]>= 60 ]

#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0

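Therefore the type of the condition changes how the indexing is processed. A boolean Series selects whole rows, whereas a boolean DataFrame is applied cell by cell as a mask (like df.where): the shape is kept and every cell that is not True in the mask becomes NaN, including all of column 0, which the mask does not cover.
Always keep in mind that 3 is a scalar position, while 3:4 is a slice.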
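To make the difference between the two cases concrete, here is a minimal sketch on the same toy data. The names row_mask and cell_mask are only illustrative, and the comparison with df.where reflects how pandas handles a boolean DataFrame passed to df[...].

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame(np.array([[1, 2, 3, 7], [1, 99, 20, 63]]).T)

row_mask = df.iloc[:, 1] >= 60     # boolean Series: one True/False per row
cell_mask = df.iloc[:, 1:2] >= 60  # boolean DataFrame with a single column (label 1)

rows = df[row_mask]    # row filter: only rows where the mask is True survive
cells = df[cell_mask]  # cell mask: same shape as df, NaN where the mask is not True

print(type(row_mask).__name__, rows.shape)
print(type(cell_mask).__name__, cells.shape)

# Indexing with a boolean DataFrame gives the same result as df.where(mask)
print(cells.equals(df.where(cell_mask)))
```

The shapes printed for rows and cells show the practical consequence: the Series mask drops rows, the DataFrame mask never does.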

For more information, it is always good to take a look at the official documentation on pandas indexing.

What if I want to make this comparison between the Int() of each element in my data frame and a given number?

Neroksi · Accepted Answer · 2018-06-14 21:51:18Z

People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!

The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.

df.iloc[:, 1:2] >= 60.0  # Returns a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0    # Returns a boolean Series
df.iloc[:, [1]] >= 60.0  # Returns a DataFrame with one boolean column

So correct your code by using:

criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
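If a slice is needed for some other reason, one possible workaround (not part of the original answer) is to turn the one-column boolean DataFrame back into a Series before indexing, or to reduce a multi-column mask to one value per row. A minimal sketch, with made-up data and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1, 99, 20, 63]})

frame_mask = df.iloc[:, 1:2] >= 60.0  # DataFrame with one boolean column
series_mask = frame_mask.iloc[:, 0]   # its only column, as a boolean Series

criteria = df[series_mask]            # now a row filter again

# squeeze("columns") does the same for a one-column frame
criteria_squeezed = df[frame_mask.squeeze("columns")]

# For a slice spanning several columns, collapse the mask per row,
# here keeping rows where ANY of the selected columns is >= 60
any_mask = (df.iloc[:, 0:2] >= 60.0).any(axis=1)
criteria_any = df[any_mask]

print(criteria)
```

All three variants pass a boolean Series to df[...], which is what makes pandas filter rows instead of masking cells.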

tandem · Accepted Answer · 2018-12-19 15:42:37Z

Your indexing is a bit off: you only have two columns [0, 1] and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:

criteria = df[df.iloc[:, 1] >= 60.0]

However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})

criteria = df[df['b'] >= 60.0]
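Applied to a header-less CSV like the one in the question, the same idea could look like the sketch below. The CSV content and the column names "label" and "value" are made up for illustration, since the real file layout is not shown; names= is the read_csv parameter that assigns column names when header=None.

```python
import io

import pandas as pd

# Made-up CSV bytes standing in for `body` from the question
raw = b"a,12.5\nb,75.0\nc,60.0\nd,99.9\n"

# Assign column names at read time instead of relying on positions
df = pd.read_csv(
    io.BytesIO(raw),
    header=None,
    names=["label", "value"],  # hypothetical names
    encoding="latin1",
)

# Selecting a single column by name yields a Series, so this filters rows
criteria = df[df["value"] >= 60.0]

# quantile on a Series returns a scalar that can be reused as a threshold
percentile = df["value"].quantile(0.99)
above_percentile = df[df["value"] > percentile]

print(criteria)
print(above_percentile)
```

Because df["value"] is a Series, both filters return only the matching rows rather than a NaN-filled frame.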

answered Jun 14, 2018 at 21:43 by An economist; edited Dec 19, 2018 at 15:42 by tandem
