529

Suppose I have a dataframe with columns a, b and c. I want to sort the dataframe by column b in ascending order, and by column c in descending order. How do I do this?

2

5 Answers

922

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values. sort was completely removed in the 0.20.0 release. The arguments (and results) remain the same:

df.sort_values(['a', 'b'], ascending=[True, False])
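Applied to the columns named in the question (b ascending, then c descending), a minimal sketch could look like this. The sample values are made up; only the column names come from the question.

```python
import pandas as pd

# Made-up sample data; only the column names a, b, c come from the question.
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 1, 2, 1],
    'c': [10, 20, 30, 40],
})

# Sort by b ascending; rows with equal b are ordered by c descending.
result = df.sort_values(['b', 'c'], ascending=[True, False])
print(result)
```

The ascending list is matched position by position with the list of column names, so both lists must have the same length.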

In pandas versions before 0.17.0, you can use the ascending argument of sort:

df.sort(['a', 'b'], ascending=[True, False])

For example:

In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])

In [12]: df1.sort(['a', 'b'], ascending=[True, False])
Out[12]:
   a  b
2  1  4
7  1  3
1  1  2
3  1  2
4  3  2
6  4  4
0  4  3
9  4  3
5  4  1
8  4  1

As commented by @renadeen

Sort isn't in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call.

that is, if you want to reuse df1 as a sorted DataFrame (shown here with sort_values, since sort has been removed):

df1 = df1.sort_values(['a', 'b'], ascending=[True, False])

or

df1.sort_values(['a', 'b'], ascending=[True, False], inplace=True)
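A small check of that behaviour, using sort_values and made-up data: the call returns a sorted copy and leaves the original alone, unless inplace=True is passed, in which case the DataFrame is reordered and the call returns None.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [2, 1, 1], 'b': [5, 3, 4]})

# Without inplace=True the original is untouched and a sorted copy is returned.
sorted_copy = df1.sort_values(['a', 'b'], ascending=[True, False])
original_order = df1.index.tolist()      # row order of df1 is unchanged
copy_order = sorted_copy.index.tolist()  # row order of the sorted copy

# With inplace=True the DataFrame itself is reordered and the call returns None.
ret = df1.sort_values(['a', 'b'], ascending=[True, False], inplace=True)
inplace_order = df1.index.tolist()
```

Because the in-place call returns None, writing df1 = df1.sort_values(..., inplace=True) would overwrite df1 with None.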

5 Comments

Sort isn't in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call.
@renadeen very good point, I've updated my answer with that comment.
@Snoozer Yeah, I don't think sort's ever going to go away (mainly as it's used extensively in Wes' book), but there have been some big changes in calling sort. Thanks! .. I really need to automate going through all my 1000s of pandas answers for deprecations!
Is there a way for the sort to be 1,3,4 instead of 1,1,1,1,3,4,4,4,4 ?
I was using tuples, that is why it failed with me. Feels like tuples should be allowed
93

As of pandas 0.17.0, DataFrame.sort() is deprecated and set to be removed in a future version of pandas. The way to sort a dataframe by its values is now DataFrame.sort_values.

As such, the answer to your question would now be

df.sort_values(['b', 'c'], ascending=[True, False], inplace=True)

17

For large dataframes of numeric data, you may see a significant performance improvement via numpy.lexsort, which performs an indirect sort using a sequence of keys:

import pandas as pd
import numpy as np

np.random.seed(0)

df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
df1 = pd.concat([df1]*100000)

def pdsort(df1):
    return df1.sort_values(['a', 'b'], ascending=[True, False])

def lex(df1):
    arr = df1.values
    return pd.DataFrame(arr[np.lexsort((-arr[:, 1], arr[:, 0]))])

assert (pdsort(df1).values == lex(df1).values).all()

%timeit pdsort(df1)  # 193 ms per loop
%timeit lex(df1)     # 143 ms per loop

One peculiarity is that numpy.lexsort takes its keys in reverse order of priority: the last key is the primary one, so (-arr[:, 1], arr[:, 0]) sorts by series a first and uses series b to break ties. We negate series b to reflect that we want this series in descending order.

Be aware that this negation trick only works with numeric values, while pd.DataFrame.sort_values works with either string or numeric values. Negating a column of strings to pass it to np.lexsort will give: TypeError: bad operand type for unary -: 'str'.
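To make the key order concrete, here is a tiny sketch with made-up arrays. The names a and b are only illustrative stand-ins for the two columns.

```python
import numpy as np

a = np.array([2, 1, 1, 2])
b = np.array([5, 3, 4, 6])

# Keys are passed least-significant first: the LAST key (a) is the primary key.
# Negating b makes the tie-break within equal values of a descending.
order = np.lexsort((-b, a))

print(order)                # positions that put the data in sorted order
print(a[order], b[order])   # a ascending, b descending within each a
```

np.lexsort returns positions rather than sorted data, which is why the answer indexes the array with the result.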

7

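sort_values has a stable sorting option, which can be invoked by passing kind='stable'. Note that when chaining sorts this way, the columns must be sorted in reverse order of priority (least significant column first, primary column last) for the stable sort to give the correct result.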

So the following two methods produce the same output, i.e. df1 and df2 are equivalent.

df = pd.DataFrame(np.random.randint(10, size=(100,2)), columns=['a', 'b'])

df1 = df.sort_values(['a', 'b'], ascending=[True, False])  # sort by 'a' then 'b'

df2 = (
    df
    .sort_values('b', ascending=False)                     # sort by 'b' first
    .sort_values('a', ascending=True, kind='stable')       # then by 'a'
)

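# Compare row by row in position order; df1.eq(df2) aligns on index labels
# first, so it would pass regardless of the order of the rows.
assert (df1.to_numpy() == df2.to_numpy()).all()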
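The same equivalence on a small, deterministic frame (made-up values), where the resulting row order can be read off the index labels:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [1, 2, 3, 4]})

# Single call: a ascending, then b descending.
one_call = df.sort_values(['a', 'b'], ascending=[True, False])

# Chained calls: least-significant key first, then a stable sort on the primary key.
two_calls = (
    df
    .sort_values('b', ascending=False)
    .sort_values('a', ascending=True, kind='stable')
)

print(one_call.index.tolist())
print(two_calls.index.tolist())
```

The stable second sort keeps rows with equal a in the order the first sort left them, which is what preserves the descending order of b.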

This is especially useful if you need a bit more involved sorting key.

Say, given df below, you want to sort by 'date' and 'value' but treat 'date' as datetime values even though they are strings. A straightforward sort_values on both columns would compare the dates as strings and produce the wrong result; however, calling sort_values twice, with the relevant sorting key on the 'date' step, produces the correct output.

df = pd.DataFrame({'date': ['10/1/2024', '10/1/2024', '2/23/2024'], 'value': [0, 1, 0]})

df1 = df.sort_values(['date', 'value'], ascending=[True, False])  # <--- wrong output

df2 = (
    df
    .sort_values('value', ascending=False)
    .sort_values('date', ascending=True, kind='stable', key=pd.to_datetime) 
)  # <--- correct output

N.B. We can get the same output by assigning a new datetime column and using it as a sort-by column, but IMO the stable sort with the sorting key is much cleaner.

df3 = df.assign(dummy=pd.to_datetime(df['date'])).sort_values(['dummy', 'value'], ascending=[True, False]).drop(columns='dummy')
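To see the difference concretely, the sketch below runs the string sort and the two-step stable sort on the same df and records the resulting row order. The last variant is an assumption of mine rather than part of the answer: it relies on sort_values applying key to each sort column separately and passing each one as a Series carrying the column name, so a single call can convert only the 'date' column.

```python
import pandas as pd

df = pd.DataFrame({'date': ['10/1/2024', '10/1/2024', '2/23/2024'],
                   'value': [0, 1, 0]})

# Plain string comparison puts '10/1/2024' before '2/23/2024'.
as_strings = df.sort_values(['date', 'value'], ascending=[True, False])

# Two-step stable sort, parsing the dates only for the comparison.
two_step = (
    df
    .sort_values('value', ascending=False)
    .sort_values('date', ascending=True, kind='stable', key=pd.to_datetime)
)

# Alternative (assumption, not from the answer): one call whose key converts
# only the 'date' column and passes every other column through unchanged.
one_call = df.sort_values(
    ['date', 'value'],
    ascending=[True, False],
    key=lambda s: pd.to_datetime(s) if s.name == 'date' else s,
)

print(as_strings.index.tolist())
print(two_step.index.tolist())
print(one_call.index.tolist())
```

In all three cases the 'date' column itself stays a string column; the key only affects how rows are compared.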

0

For those who come here with a DataFrame that has multi-level (MultiIndex) columns, identify the column to sort by with a tuple whose elements correspond to each level:

d = {}
d['first_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                         data=[[10, 0.89, 0.98, 0.31],
                                               [20, 0.34, 0.78, 0.34]]).set_index('idx')
d['second_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                          data=[[10, 0.29, 0.63, 0.99],
                                                [20, 0.23, 0.26, 0.98]]).set_index('idx')

df = pd.concat(d, axis=1)
df.sort_values(('second_level', 'b'))
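Sorting by several such columns works the same way as in the flat case, as far as I can tell: pass a list of tuples plus one ascending flag per tuple. The outer container should be a list, because a bare tuple is read as a single column label. A reduced sketch of the frame above (columns a and b only):

```python
import pandas as pd

d = {
    'first_level': pd.DataFrame({'a': [0.89, 0.34], 'b': [0.98, 0.78]}, index=[10, 20]),
    'second_level': pd.DataFrame({'a': [0.29, 0.23], 'b': [0.63, 0.26]}, index=[10, 20]),
}
df = pd.concat(d, axis=1)

# One column: a single tuple naming every level.
by_one = df.sort_values(('second_level', 'b'))

# Several columns: a LIST of tuples, with one ascending flag per tuple.
by_two = df.sort_values(
    [('first_level', 'a'), ('second_level', 'b')],
    ascending=[True, False],
)

print(by_one.index.tolist())
print(by_two.index.tolist())
```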
