I have a pandas DataFrame and I'd like to return, for each row, the names of the columns with the three highest values. For example:

import numpy as np
import pandas as pd

a = np.array([[2., 1., 0., 5., 4.], [6., 10., 7., 1., 3.]])
df = pd.DataFrame(a, columns=['A', 'B', 'C', 'D', 'E'])

Gives:

   A   B  C  D  E
0  2   1  0  5  4
1  6  10  7  1  3

For each row, I want to add three new columns holding the names of the columns with the three highest values:

   A   B  C  D  E First Second Third
0  2   1  0  5  4     D      E     A
1  6  10  7  1  3     B      C     A

I've gotten as far as using argpartition to get the indices of the top three columns in each row:

inx = np.argpartition(df.values, -3, axis=1)[:, -3:]

These indices then need to be sorted:

sorted_inx = np.sort(inx, axis=1)  # ndarray.sort() sorts in place and returns None

It isn't clear how I would then take these column indices, get the names, and populate them back into df as three new columns.

1 Answer

While Ed's answer works perfectly well, and apply can be essential in some cases, I try to avoid apply in pandas as much as possible and work entirely with vectorized array operations instead, as this usually results in much better performance.

In this case, if you get the indices of the top three values by applying numpy's argsort to each row, the resulting indices can be combined with the data frame's columns attribute to get the result you're looking for.

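# .values turns the column Index into a plain array; recent pandas versions reject 2-D indexing on an Index
pd.concat((df, pd.DataFrame(df.columns.values[np.argsort(df.values, axis=1)[:, -3:][:, ::-1]],
          columns=['First', 'Second', 'Third'])), axis=1)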

   A   B  C  D  E First Second Third
0  2   1  0  5  4     D      E     A
1  6  10  7  1  3     B      C     A
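
Broken into steps, the one-liner above can be sketched roughly as follows. The intermediate names (order, top3, names, result) are only illustrative and do not appear in the original answer, and the column Index is converted to a plain NumPy array on the assumption that the installed pandas version does not allow 2-D indexing on an Index.

```python
import numpy as np
import pandas as pd

a = np.array([[2., 1., 0., 5., 4.], [6., 10., 7., 1., 3.]])
df = pd.DataFrame(a, columns=['A', 'B', 'C', 'D', 'E'])

# Column positions of each row, ordered from smallest to largest value.
order = np.argsort(df.values, axis=1)

# Keep the last three positions (the three largest values) and reverse
# them so that the largest value comes first.
top3 = order[:, -3:][:, ::-1]

# Map positions to column names with NumPy fancy indexing.
# (Assumption: a plain array is used instead of the Index itself.)
names = df.columns.values[top3]

# Attach the names as three new columns, aligned on the original index.
result = pd.concat(
    (df, pd.DataFrame(names, columns=['First', 'Second', 'Third'], index=df.index)),
    axis=1,
)
print(result)
```

Passing index=df.index to the new frame is a small precaution: with the default RangeIndex it changes nothing, but it keeps the rows aligned if df has a different index.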

For the given example the performance improvement is small (about 2.7x), because fixed overhead dominates:

>>> %timeit pd.concat((df, pd.DataFrame(df.columns[np.argsort(df.values, axis=1)[:, -3:][:, ::-1]], columns=['First', 'Second', 'Third'])), axis=1)
100 loops, best of 3: 1.33 ms per loop

>>> %timeit df.apply(lambda x: pd.Series(x.sort_values(ascending=False).index[:3]), axis=1)
100 loops, best of 3: 3.55 ms per loop

When you scale the problem up, however, the improvement becomes substantial, with the apply method taking over 2,000x longer for only 20,000 rows:

a = np.array([[2., 1., 0., 5., 4.], [6., 10., 7., 1., 3.]] * 10000)
df = pd.DataFrame(a, columns=['A', 'B', 'C', 'D', 'E'])

>>> %timeit pd.concat((df, pd.DataFrame(df.columns[np.argsort(df.values, axis=1)[:, -3:][:, ::-1]], columns=['First', 'Second', 'Third'])), axis=1)
100 loops, best of 3: 4.14 ms per loop

>>> %timeit df.apply(lambda x: pd.Series(x.sort_values(ascending=False).index[:3]), axis=1)
1 loops, best of 3: 9.47 s per loop
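
The question started from argpartition, and the same vectorized idea can be sketched with it: partition to find the top three positions in each row, then sort only those three. This is an untimed sketch, not part of the measurements above. The names k, part, top_vals, top_idx and top_names are illustrative, and np.take_along_axis assumes NumPy 1.15 or later. Whether it beats a full argsort probably depends on how many columns the frame has.

```python
import numpy as np
import pandas as pd

a = np.array([[2., 1., 0., 5., 4.], [6., 10., 7., 1., 3.]])
df = pd.DataFrame(a, columns=['A', 'B', 'C', 'D', 'E'])

values = df.values
k = 3  # number of top columns to keep (illustrative parameter)

# Positions of the k largest values per row, in no guaranteed order.
part = np.argpartition(values, -k, axis=1)[:, -k:]

# Look up the values at those positions and order just those k, descending.
top_vals = np.take_along_axis(values, part, axis=1)
order = np.argsort(-top_vals, axis=1)
top_idx = np.take_along_axis(part, order, axis=1)

# Map positions to column names via a plain NumPy array of names.
top_names = df.columns.values[top_idx]
print(top_names)
```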

3 Comments

Can you explain a little what this part is achieving: [:, ::-1]? I can guess that it's grabbing all rows, but the -1 in this case isn't clear.
All it's doing is creating a view of the data with each row in reverse order, since argsort sorts ascending. The first part, [:, tells numpy to operate across all rows, while the second part, ::-1], tells it to take all the elements in each row and reverse their order.
Thank you! That makes sense
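
The slicing discussed in these comments can be seen in isolation in a small sketch. The array idx below is simply what argsort returns for the two example rows, and the names last_three and reversed_rows are illustrative.

```python
import numpy as np

# argsort output for the two example rows (ascending order of value).
idx = np.array([[2, 1, 0, 4, 3],
                [3, 4, 0, 2, 1]])

# All rows, last three columns: positions of the three largest values.
last_three = idx[:, -3:]

# All rows, columns taken with step -1: each row reversed, so the
# position of the largest value comes first.
reversed_rows = last_three[:, ::-1]

print(last_three)
print(reversed_rows)
```

Both slices are views onto idx, so no data is copied at this step.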