Pandas multiindex sort

Question

In Pandas 0.19 I have a large dataframe with a Multiindex of the following form

          C0     C1     C2
A   B
bar one   4      2      4
    two   1      3      2
foo one   9      7      1
    two   2      1      3

I want to sort bar and foo (and many more double lines as them) according to "two" to get the following:

          C0     C1     C2
A   B
bar one   4      4      2
    two   1      2      3
foo one   7      9      1
    two   1      2      3

I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks

Ted Petrou · Accepted Answer · 2017-04-05 04:28:05Z

Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them. It then sets this order for each row of the original dataframe. It then unravels this order (after adding a constant to offset each row) and the original dataframe values. It then reorders all the original values based on this unraveled, offset and argsorted array before creating a new dataframe with the intended sort order.

rows, cols = df.shape
df_a = np.argsort(df.xs('two', level=1))
order = df_a.reindex(df.index.droplevel(-1)).values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)

Output

         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3

Some Speed tests

# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])

#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop

#Ted
1000 loops, best of 3: 5 ms per loop

Scott Boston · Accepted Answer · 2017-04-04 16:30:50Z

Here is a solution, albeit klugdy:

Input dataframe:

         C0  C1  C2
A   B              
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3

Custom sorting function:

def sortit(x):
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two',axis=1,inplace=True)
    x.columns = xcolumns
    return x

df.groupby(level=0).apply(sortit)

Output:

         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3

Collectives™ on Stack Overflow

Pandas multiindex sort

2 Answers 2

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Related