How to efficiently select rows from a pandas DataFrame?

Question

The following table contains some keys and values:

import numpy as np
import pandas as pd

N = 100
tbl = pd.DataFrame({'key': np.random.randint(0, 10, N),
                    'y': np.random.rand(N), 'z': np.random.rand(N)})

I would like to obtain a DataFrame with one row per key, where each row holds all the fields of the record at which a specified field takes its minimal value for that key.

Since the original table is very large, I'm interested in the most efficient way.

Note: getting the minimal value of a field is simple:

tbl.groupby('key').agg(pd.Series.min)

But this takes the minimum of every field independently. I would like to know the minimum value of y for each key and which z value corresponds to it.
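To make the problem concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical, not from the original table). It shows that the per-column minimum can combine values from different rows:

```python
import pandas as pd

# Hypothetical toy data, chosen so that min(y) and min(z) sit in different rows.
toy = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'y':   [1.0, 2.0, 5.0, 4.0],
    'z':   [9.0, 3.0, 7.0, 8.0],
})

# Each column is reduced on its own, so the result mixes rows.
per_column_min = toy.groupby('key').min()
print(per_column_min)

# For key 'a' this reports y=1.0 together with z=3.0,
# but the row with y=1.0 actually has z=9.0.
```

So the aggregated frame can contain (y, z) pairs that never occur together in the source data.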

Below I post an answer to my question with my naive approach, but I suspect there are better ways.

Boris Gorelik · Accepted Answer · 2014-07-22 09:29:08Z


Here is a straightforward approach:

gr = tbl.groupby('key')
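def take_min_y(t):
    # idxmin returns the index label of the smallest y in this group
    # (in current pandas, argmin returns an integer position instead)
    ix = t.y.idxmin()
    # double brackets keep the result a one-row DataFrame
    return t.loc[[ix]]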

tbl_mins = gr.apply(take_min_y)

Is there a better way?

edited Jul 22, 2014 at 9:29

answered Jul 22, 2014 at 8:57

Boris Gorelik


4 Comments

EdChum Over a year ago

You're missing the code that does the groupby; I'm assuming gr = tbl.groupby('key')? Anyway, do you find this faster: gr.agg(pd.Series.argmin)? It takes 2.36 ms for me vs 7.2 ms for your method.

Boris Gorelik Over a year ago

I wasn't clear enough about the problem I tried to solve. I edited the original question to make it clearer.

EdChum Over a year ago

So you want the min value of y grouped by 'key' and the corresponding value of 'z', is that correct? So you're not interested in the min value of 'z'?

EdChum Over a year ago

If my previous comment is correct, does the following achieve what you want: tbl.iloc[gr.agg(pd.Series.idxmin).y]?

EdChum · Accepted Answer · 2014-07-22 10:15:10Z

Based on your edit, I believe the following is what you want:

In [107]:

tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
Out[107]:
    key         y         z
47    0  0.094841  0.221435
26    1  0.062200  0.748082
45    2  0.032497  0.160199
28    3  0.002242  0.064829
73    4  0.122438  0.723844
75    5  0.128193  0.638933
79    6  0.071833  0.952624
86    7  0.058974  0.113317
36    8  0.068757  0.611111
12    9  0.082604  0.271268

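idxmin returns, for each group, the index label of the row holding the minimum value of y; we can then use these labels to select the matching rows from the original DataFrame. Because idxmin returns labels rather than positions, .iloc works here only because tbl has the default integer index (0 to N-1), where labels and positions coincide. With any other index, .loc is the appropriate indexer.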
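For a table whose index is not the default integer range, a label-based variant might look like the sketch below. The string index (`row0`, `row1`, ...) and the seeded generator are assumptions added purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N = 100
tbl = pd.DataFrame({
    'key': rng.integers(0, 10, N),
    'y': rng.random(N),
    'z': rng.random(N),
})
# Hypothetical non-default index, to show that labels (not positions) are used.
tbl.index = [f'row{i}' for i in range(N)]

# For each key, the index label of the row with the smallest y.
min_labels = tbl.groupby('key')['y'].idxmin()

# .loc selects by label, so this does not rely on a 0..N-1 index.
tbl_mins = tbl.loc[min_labels]
print(tbl_mins)
```

This assumes the index labels are unique; with duplicate labels, .loc would return every row sharing a label.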

Timings show this method is roughly 6.5 times faster (1.08 ms vs 7.06 ms per loop):

In [108]:

%timeit tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
def take_min_y(t):
    # idxmin gives the index label; in current pandas argmin gives a position
    ix = t.y.idxmin()
    return t.loc[[ix]]

%timeit tbl_mins = gr.apply(take_min_y)
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 7.06 ms per loop
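These numbers depend on the pandas version, the machine, and the table size, so it is worth re-measuring on your own data. Below is a rough sketch of how that could be done outside IPython with the standard timeit module. The function names are hypothetical, and the sort-then-drop_duplicates variant is an extra alternative not discussed above:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N = 100
tbl = pd.DataFrame({
    'key': rng.integers(0, 10, N),
    'y': rng.random(N),
    'z': rng.random(N),
})

def via_idxmin():
    # One label lookup per group, then a single label-based selection.
    return tbl.loc[tbl.groupby('key')['y'].idxmin()]

def via_apply():
    # Python-level function call per group; columns are selected explicitly
    # so the grouping column is not passed to the function.
    grouped = tbl.groupby('key', group_keys=False)[['y', 'z']]
    return grouped.apply(lambda t: t.loc[[t['y'].idxmin()]])

def via_sort():
    # Alternative (not from the answers above): sort by y, keep the first
    # row seen for each key, then restore key order.
    return tbl.sort_values('y').drop_duplicates('key').sort_values('key')

for fn in (via_idxmin, via_apply, via_sort):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f'{fn.__name__}: {seconds * 1e3:.3f} ms per call')
```

All three select the same rows as long as there are no ties in y within a key; which one is fastest may change with table size and the number of groups.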
