3

I have a CSV file that my Python code reads into a pandas DataFrame.

The CSV file is in the following format:

1     1.0
2     99.0
3     20.0
7     63

My code calculates a percentile, and I also want to find all rows whose value in the 2nd column is greater than 60.

df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')

percentile = df.iloc[:, 1:2].quantile(0.99)  # Selecting 2nd column and calculating percentile

criteria = df[df.iloc[:, 1:2] >= 60.0]

While my percentile code works fine, the criteria expression, which should find all rows where column 2's value is greater than 60, returns

NaN     NaN
NaN     NaN
NaN     NaN
NaN     NaN

Can you please help me find the error?

3
  • 1
    What is frame? If I replace frame with df in the third row of your code everything basically works here... Commented Jun 14, 2018 at 21:16
  • 1
    I don't know whether this is what you want, but you can replace the column indices 1:2 by 1: criteria = frame[frame.iloc[:, 1] >= 60.0]. Commented Jun 14, 2018 at 21:19
  • 1
    @applesoup has the answer. Commented Jun 14, 2018 at 21:32

3 Answers 3

8

Just correct the condition inside criteria. Since the second column has index 1, you should write df.iloc[:, 1].
Example:

import pandas as pd
import numpy as np

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])

df = pd.DataFrame(b.T)  # just creating the dataframe

criteria = df[df.iloc[:, 1] >= 60]
print(criteria)

Why? The cause lies in the type of the condition. Let's inspect:

Case 1:

type( df.iloc[:,1]>= 60 )

This returns a pandas.core.series.Series, so it gives

 df[ df.iloc[:,1]>= 60 ]

 #out:
   0   1
1  2  99
3  7  63

Case2:

type( df.iloc[:,1:2]>= 60 )

This returns a pandas.core.frame.DataFrame and gives

df[ df.iloc[:,1:2]>= 60 ]

#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0

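So the type of the condition changes how the indexing is done: a boolean Series selects whole rows, whereas a boolean DataFrame acts as an element-wise mask that keeps the original shape and puts NaN wherever the mask is not True.
Always keep in mind that 3 is a scalar and 3:4 is a slice: with iloc, the scalar gives you a Series and the slice gives you a DataFrame.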

For more information, it is always good to take a look at the official pandas indexing documentation.
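To make the difference between the two cases concrete, here is a minimal sketch that reuses the toy data from the example above. The variable names (row_mask, cell_mask and so on) are only illustrative. It also checks that indexing with a boolean DataFrame gives the same result as DataFrame.where, which is why the shape is preserved and NaN appears.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame(np.array([[1, 2, 3, 7], [1, 99, 20, 63]]).T)

row_mask = df.iloc[:, 1] >= 60     # boolean Series: one True/False per row
cell_mask = df.iloc[:, 1:2] >= 60  # boolean DataFrame: one True/False per cell

# A Series mask drops the rows where it is False
rows = df[row_mask]

# A DataFrame mask keeps the shape and replaces every cell that is not True with NaN
masked = df[cell_mask]

# Compare the DataFrame-mask result with an explicit DataFrame.where call
same_as_where = masked.equals(df.where(cell_mask))

print(rows)
print(masked)
print(same_as_where)
```

If you only want the matching rows, the Series form is the one to use. The DataFrame form is more useful when you want to blank out individual cells.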


2 Comments

What if I want to make this comparison between the Int() of each element in my data frame and a given number?
Maybe you're looking for applymap or apply
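One possible reading of this follow-up, sketched under assumptions: the data and the threshold below are made up, and astype(int) is used here in place of the applymap/apply suggestion because, for plain numeric columns, it truncates toward zero just like int() applied to each element.

```python
import pandas as pd

# Hypothetical data with fractional values
df = pd.DataFrame({0: [1.9, 2.2, 3.7], 1: [1.0, 99.5, 60.9]})
threshold = 60  # hypothetical number to compare against

# Truncate every element to an integer, then compare element-wise
as_int = df.astype(int)
element_mask = as_int >= threshold  # boolean DataFrame, same shape as df

# To filter rows, reduce the mask to one boolean per row
rows_any = df[element_mask.any(axis=1)]

print(element_mask)
print(rows_any)
```

Here any(axis=1) keeps a row if at least one of its elements passes the test. Use all(axis=1) instead if every element must pass.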
2

People here seem to be more interested in coming up with alternative solutions than in digging into the asker's code to find out what's really wrong. I will adopt a diametrically opposed strategy!

The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.

df.iloc[:, 1:2] >= 60.0  # Returns a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0    # Returns a boolean Series
df.iloc[:, [1]] >= 60.0  # Returns a DataFrame with one boolean column

So correct your code by using:

criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
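A quick way to verify the three cases is to print the types. The sketch below uses made-up data shaped like the question's file. It also shows one way to still filter rows if all you have is the one-column boolean DataFrame: take its single column to get a Series.

```python
import pandas as pd

# Made-up data shaped like the question's two-column file
df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

by_slice = df.iloc[:, 1:2] >= 60.0  # slice   -> DataFrame
by_int = df.iloc[:, 1] >= 60.0      # integer -> Series
by_list = df.iloc[:, [1]] >= 60.0   # list    -> DataFrame

print(type(by_slice).__name__, type(by_int).__name__, type(by_list).__name__)

# Taking the single column of the one-column mask yields a Series,
# which can then be used as a row filter
criteria = df[by_slice.iloc[:, 0]]
print(criteria)
```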


2

Your indexing is a bit off: you only have two columns [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:

criteria = df[df.iloc[:, 1] >= 60.0]

However, I would consider naming the columns and referencing them by name. This helps you avoid mistakes if the structure of your df changes, e.g.:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})

criteria = df[df['b'] >= 60.0]
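If the data comes from a header-less CSV as in the question, the names can be assigned while reading. This is a sketch under assumptions: the file content below is a guess at the question's data in comma-separated form, and 'a' and 'b' are placeholder names. It also shows the percentile from the question computed on the named column.

```python
import io

import pandas as pd

# Assumed file contents: two comma-separated columns, no header row
csv_text = "1,1.0\n2,99.0\n3,20.0\n7,63\n"

# 'a' and 'b' are placeholder column names
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b"])

# Row filter by column name
criteria = df[df["b"] >= 60.0]

# Calling quantile on a single column (a Series) returns a scalar,
# which can be used as a threshold in the same way
p99 = df["b"].quantile(0.99)
above_p99 = df[df["b"] >= p99]

print(criteria)
print(above_p99)
```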
