The following table contains some keys and values:

import numpy as np
import pandas as pd

N = 100
tbl = pd.DataFrame({'key': np.random.randint(0, 10, N),
                    'y': np.random.rand(N), 'z': np.random.rand(N)})

I would like to obtain a DataFrame in which each row contains a key and all the fields that correspond to the minimal value of a specified field.

Since the original table is very large, I'm interested in the most efficient way.

Note: getting the minimal value of each field is simple:

tbl.groupby('key').agg(pd.Series.min)

But this takes the minimum of every field independently. I would like to know the minimum value of y for each key, and which z value belongs to that same row.

Below I post an answer to my own question with my naive approach, but I suspect there are better ways.

2 Answers

Here is a straightforward approach:

gr = tbl.groupby('key')
def take_min_y(t):
    ix = t.y.idxmin()  # index label of the min-y row (argmin is positional in current pandas)
    return t.loc[[ix]]

tbl_mins = gr.apply(take_min_y)

Is there a better way?

4 Comments

You're missing the code that does the groupby; I'm assuming gr = tbl.groupby('key')? Anyway, do you find this faster: gr.agg(pd.Series.argmin)? It takes 2.36 ms for me vs 7.2 ms for your method.
I wasn't clear enough about the problem I tried to solve. I edited the original question to make it clearer.
So you want the min value of y grouped by 'key' and the corresponding value of 'z', is that correct? So you're not interested in the min value of 'z'?
If my previous comment is correct, does the following achieve what you want: tbl.iloc[gr.agg(pd.Series.idxmin).y]?

Based on your edit, I believe the following is what you want (gr is tbl.groupby('key'), as in the other answer):

In [107]:

tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
Out[107]:
    key         y         z
47    0  0.094841  0.221435
26    1  0.062200  0.748082
45    2  0.032497  0.160199
28    3  0.002242  0.064829
73    4  0.122438  0.723844
75    5  0.128193  0.638933
79    6  0.071833  0.952624
86    7  0.058974  0.113317
36    8  0.068757  0.611111
12    9  0.082604  0.271268

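idxmin returns, for each group, the index label of the row holding the minimum value; we then use these labels to select the matching rows from the original DataFrame. Because idxmin returns labels rather than positions, iloc gives the right rows here only because tbl has the default integer index; loc is the safer choice in general.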
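
Since idxmin returns index labels, a label-based lookup with loc keeps working when the frame does not have the default integer index. Here is a minimal sketch of that variant; the fixed seed and the shifted index are illustrative assumptions, not part of the original example:

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the fixed seed is an assumption
rng = np.random.default_rng(0)
N = 100
tbl = pd.DataFrame({'key': rng.integers(0, 10, N),
                    'y': rng.random(N),
                    'z': rng.random(N)})

# Hypothetical non-default index, so that labels and positions differ
tbl.index = np.arange(N) + 1000

# For each key, the index label of the row with the smallest y
min_labels = tbl.groupby('key')['y'].idxmin()

# loc selects by label, so this works for any index
tbl_mins = tbl.loc[min_labels]
```

With this shifted index, tbl.iloc[min_labels] would raise an out-of-bounds error, because the labels (1000 and up) are not valid positions.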

Timings show this method is roughly 6.5 times faster (1.08 ms vs 7.06 ms per loop):

In [108]:

%timeit tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
def take_min_y(t):
    ix = t.y.idxmin()  # index label of the min-y row
    return t.loc[[ix]]

%timeit tbl_mins = gr.apply(take_min_y)
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 7.06 ms per loop
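
A quick way to check that the fast and slow approaches pick the same rows is to build both results and compare them. The sketch below is an illustration under assumptions: it uses a fixed seed, and it replaces gr.apply with an explicit loop over the groups (a stand-in for the per-group approach, since the way groupby.apply handles the grouping column varies across pandas versions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
N = 100
tbl = pd.DataFrame({'key': rng.integers(0, 10, N),
                    'y': rng.random(N),
                    'z': rng.random(N)})

# Approach 1: one grouped idxmin call, then a single label lookup
fast = tbl.loc[tbl.groupby('key')['y'].idxmin()]

# Approach 2: visit each group in Python and keep its min-y row
slow = pd.concat([g.loc[[g['y'].idxmin()]] for _, g in tbl.groupby('key')])

# Both results should contain the same rows, in key order
same = fast.equals(slow)
```

The per-group version runs a Python-level step for every key, which is the likely reason it is slower than the single grouped idxmin call.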
