Numpy sort ndarray on multiple columns

Question

I get an ndarray by reading it from a file, like this:

my_data = np.genfromtxt(input_file, delimiter='\t', skip_header=0)

Example input (parsed)

[[   2.    1.    2.    0.]
 [   2.    2.  100.    0.]
 [   2.    3.  100.    0.]
 [   3.    1.    2.    0.]
 [   3.    2.    4.    0.]
 [   3.    3.    6.    0.]
 [   4.    1.    2.    0.]
 [   4.    2.    4.    0.]
 [   4.    3.    6.    0.]]

Longer example input (unparsed).

The first 2 columns are supposed to be int, while the last 2 columns are supposed to be float, but that's what I get. Suggestions are welcome.

The main problem is, I'm trying to sort it, using Numpy, so that rows get ordered giving precedence to the numbers on second column first, and on the first column next.

Example of desired output

[[   2.    1.    2.    0.]
 [   3.    1.    2.    0.]
 [   4.    1.    2.    0.]
 [   2.    2.  100.    0.]
 [   3.    2.    4.    0.]
 [   4.    2.    4.    0.]
 [   2.    3.  100.    0.]
 [   3.    3.    6.    0.]
 [   4.    3.    6.    0.]]

I'm aware of this answer, it works for sorting rows on a single column.

I tried sorting on the second column, since the first one is already sorted, but it's not enough. On occasion, the first column gets reordered too, badly.

new_data = my_data[my_data[:, 1].argsort()]
print(new_data)

#output
[[   2.    1.    2.    0.]
 [   4.    1.    2.    0.] #ouch
 [   3.    1.    2.    0.] #ouch
 [   2.    2.  100.    0.]
 [   3.    2.    4.    0.]
 [   4.    2.    4.    0.]
 [   2.    3.  100.    0.]
 [   3.    3.    6.    0.]
 [   4.    3.    6.    0.]]

I've also checked this question

The answer mentions

The problem here is that np.lexsort or np.sort do not work on arrays of dtype object. To get around that problem, you could sort the rows_list before creating order_list:

import operator
rows_list.sort(key=operator.itemgetter(0,1,2))

But there is no key parameter in the sort method of ndarray, and merging fields is not an option in my case.

Also, I don't have a header, so, if I try to sort using the order parameter, I get an error.

ValueError: Cannot specify order when the array has no fields.

I'd rather sort in place or at least obtain a result of the same type ndarray. Then I want to save it to a file.

How do I do this without messing up the datatypes?

Brainor · Accepted Answer · 2023-05-17 13:37:11Z

numpy ndarray sort by the 1st, 2nd or 3rd column:

>>> a = np.array([[1,30,200], [2,20,300], [3,10,100]])

>>> a
array([[  1,  30, 200],         
       [  2,  20, 300],          
       [  3,  10, 100]])

>>> a[a[:,2].argsort()]           #sort by the 3rd column ascending
array([[  3,  10, 100],
       [  1,  30, 200],
       [  2,  20, 300]])

>>> a[a[:,2].argsort()][::-1]     #sort by the 3rd column descending
array([[  2,  20, 300],
       [  1,  30, 200],
       [  3,  10, 100]])

>>> a[a[:,1].argsort()]        #sort by the 2nd column ascending
array([[  3,  10, 100],
       [  2,  20, 300],
       [  1,  30, 200]])

To explain what is going on here: argsort() returns an array of integer indices that would sort its parent array. Indexing the parent with those indices yields the sorted result: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

>>> x = np.array([15, 30, 4, 80, 6])
>>> np.argsort(x)
array([2, 4, 0, 1, 3])
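
As a small illustration of the point above (the variable names `order` and `sorted_x` are mine, not from the answer), indexing the array with the result of argsort gives the sorted values:

```python
import numpy as np

x = np.array([15, 30, 4, 80, 6])

# argsort returns the positions of the elements in ascending order of value
order = np.argsort(x)

# indexing with those positions rearranges x into sorted order
sorted_x = x[order]

print(order)
print(sorted_x)
```

The same idea is what makes `a[a[:, 2].argsort()]` reorder whole rows: the index array is computed from one column and then applied to the row axis.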

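Sort by column 1, then by column 2, then by column 3:

According to the np.lexsort documentation, the last key in the sequence is the primary sort key, so the columns are passed in reverse order of priority.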

>>> a = np.array([[2,30,200], [1,30,200], [1,10,200]])

>>> a
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))]
array([[  1,  10, 200],
       [  1,  30, 200],
       [  2,  30, 200]])

Same as above but reversed:

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))][::-1]
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])

Is it possible to do something like this a[a[:,2].argsort()[::-1]] instead of this a[a[:,2].argsort()][::-1]? Would it be more efficient?
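
On the question in the comment above: a minimal sketch suggesting that both spellings select the same rows (the names `idx`, `reversed_after` and `reversed_before` are only illustrative). It checks equality, not speed; for timing you would have to measure on your own data.

```python
import numpy as np

a = np.array([[1, 30, 200], [2, 20, 300], [3, 10, 100]])
idx = a[:, 2].argsort()

# index with the sorted positions, then reverse the resulting rows
reversed_after = a[idx][::-1]

# reverse the index array first, then index once
reversed_before = a[idx[::-1]]

print(np.array_equal(reversed_after, reversed_before))
```

One difference worth knowing: `a[idx][::-1]` is a reversed view of a temporary array, whereas `a[idx[::-1]]` is a freshly allocated array in the final order.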

Jim · Accepted Answer · 2020-10-02 16:31:51Z

With np.lexsort you can sort based on several columns simultaneously. The columns that you want to sort by need to be passed in reverse. That means np.lexsort((col_b,col_a)) first sorts by col_a, and then by col_b:

my_data = np.array([[   2.,    1.,    2.,    0.],
                    [   2.,    2.,  100.,    0.],
                    [   2.,    3.,  100.,    0.],
                    [   3.,    1.,    2.,    0.],
                    [   3.,    2.,    4.,    0.],
                    [   3.,    3.,    6.,    0.],
                    [   4.,    1.,    2.,    0.],
                    [   4.,    2.,    4.,    0.],
                    [   4.,    3.,    6.,    0.]])

ind = np.lexsort((my_data[:,0],my_data[:,1]))
my_data[ind]

result:

array([[  2.,   1.,   2.,   0.],
       [  3.,   1.,   2.,   0.],
       [  4.,   1.,   2.,   0.],
       [  2.,   2., 100.,   0.],
       [  3.,   2.,   4.,   0.],
       [  4.,   2.,   4.,   0.],
       [  2.,   3., 100.,   0.],
       [  3.,   3.,   6.,   0.],
       [  4.,   3.,   6.,   0.]])
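
Because the reversed key order is easy to get wrong, one option is to wrap it in a small helper. This is only a sketch: `sort_rows` is a hypothetical name, not a NumPy function, and the unsorted `data` array is made up for the example.

```python
import numpy as np

def sort_rows(arr, cols):
    """Return the rows of a 2-D array sorted by `cols`, highest priority first."""
    # np.lexsort treats the LAST key as the primary one,
    # so the columns are handed over in reversed priority order.
    keys = tuple(arr[:, c] for c in reversed(cols))
    return arr[np.lexsort(keys)]

data = np.array([[2., 3., 100., 0.],
                 [4., 1.,   2., 0.],
                 [3., 2.,   4., 0.],
                 [2., 1.,   2., 0.],
                 [3., 1.,   2., 0.]])

# sort by the second column first, then by the first column
result = sort_rows(data, [1, 0])
print(result)
```

The result is a new ndarray of the same dtype; the input array is left untouched.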

If you know that your first column is already sorted, you can use:

ind = my_data[:,1].argsort(kind='stable')
my_data[ind]

This makes sure that the existing order is preserved for rows with equal keys. The default sort kind ('quicksort') does not guarantee that, though it is generally faster.
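
To illustrate why stability matters here, the following sketch (with randomly generated data, which is my assumption and not from the question) sorts by the secondary key first and then stably by the primary key, and compares the result with a single np.lexsort call:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 4, size=(20, 2))

# pass 1: order rows by the secondary key (column 0)
step1 = data[data[:, 0].argsort(kind='stable')]

# pass 2: stable sort by the primary key (column 1) keeps the
# column-0 order among rows that share the same column-1 value
two_pass = step1[step1[:, 1].argsort(kind='stable')]

# one call: lexsort with the primary key listed last
one_pass = data[np.lexsort((data[:, 0], data[:, 1]))]

print(np.array_equal(two_pass, one_pass))
```

If the second pass used an unstable sort, the column-0 order inside each group would no longer be guaranteed, which is the effect shown in the question.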

Is the my_data you used the same as the one used in the other examples here? If so, please paste it as the input to complete your answer. Thanks.

Agostino · Accepted Answer · 2015-03-30 20:33:59Z

Import letting Numpy guess the type and sorting in place:

import numpy as np

# let numpy guess the type with dtype=None
my_data = np.genfromtxt(infile, dtype=None, names=["a", "b", "c", "d"])

# access columns by name
print(my_data["b"]) # column 1

# sort in place by column "b" (column 1) first, then by column "a" (column 0)
my_data.sort(order=["b", "a"])

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", my_data, fmt="%d\t%d\t%.6f\t%.6f")

Alternatively, specifying the input format and sorting to a new array:

import numpy as np

# tell numpy the first 2 columns are int and the last 2 are floats
my_data = np.genfromtxt(infile, dtype=[('a', '<i8'), ('b', '<i8'), ('x', '<f8'), ('d', '<f8')])

# access columns by name
print(my_data["b"]) # column 1

# get the indices to sort the array using lexsort
# the last element of the tuple (column 1) is used as the primary key
ind = np.lexsort((my_data["a"], my_data["b"]))

# create a new, sorted array
sorted_data = my_data[ind]

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", sorted_data, fmt="%d\t%d\t%.6f\t%.6f")

Output:

2   1   2.000000    0.000000
3   1   2.000000    0.000000
4   1   2.000000    0.000000
2   2   100.000000  0.000000
3   2   4.000000    0.000000
4   2   4.000000    0.000000
2   3   100.000000  0.000000
3   3   6.000000    0.000000
4   3   6.000000    0.000000
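
For trying this without a file on disk, here is a self-contained sketch that reads from and writes to in-memory buffers. The three-row `text` sample, the buffer `out` and the field name `c` are assumptions made for the example; a real run would pass file names as in the snippets above.

```python
import io
import numpy as np

# tab-separated sample standing in for the input file (no header line)
text = "2\t1\t2.0\t0.0\n2\t2\t100.0\t0.0\n3\t1\t2.0\t0.0\n"

# first two columns as int, last two as float; field names are supplied here
dtype = [('a', '<i8'), ('b', '<i8'), ('c', '<f8'), ('d', '<f8')]
my_data = np.genfromtxt(io.StringIO(text), delimiter='\t', dtype=dtype)

# in-place sort: field "b" is the primary key, field "a" breaks ties
my_data.sort(order=['b', 'a'])

# write tab-separated output into a string buffer instead of a file
out = io.StringIO()
np.savetxt(out, my_data, fmt="%d\t%d\t%.6f\t%.6f")
print(out.getvalue())
```

Because the field names come from the dtype, the `order` parameter works even though the input has no header row.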

lee · Accepted Answer · 2020-05-12 13:27:55Z

this method works for any numpy array:

import numpy as np

my_data = [[   2.,    1.,    2.,    0.],
           [   2.,    2.,  100.,    0.],
           [   2.,    3.,  100.,    0.],
           [   3.,    1.,    2.,    0.],
           [   3.,    2.,    4.,    0.],
           [   3.,    3.,    6.,    0.],
           [   4.,    1.,    2.,    0.],
           [   4.,    2.,    4.,    0.],
           [   4.,    3.,    6.,    0.]]
my_data = np.array(my_data)
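# build a record array whose fields are the sort keys, primary key first
# (np.rec.fromarrays is the public alias of np.core.records.fromarrays)
r = np.rec.fromarrays([my_data[:,1],my_data[:,0]],names='a,b')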
my_data = my_data[r.argsort()]
print(my_data)

Result:

[[  2.   1.   2.   0.]
 [  3.   1.   2.   0.]
 [  4.   1.   2.   0.]
 [  2.   2. 100.   0.]
 [  3.   2.   4.   0.]
 [  4.   2.   4.   0.]
 [  2.   3. 100.   0.]
 [  3.   3.   6.   0.]
 [  4.   3.   6.   0.]]

Your input and output look the same, what is being sorted here?
Whoops, I forgot to swap [my_data[:,0],my_data[:,1]] in my code snippet to match the 1,0 order asked for. Thanks, updated.

Roger V. · Accepted Answer · 2024-06-07 12:47:03Z

For completeness: a possible solution is to use pandas, which has a built-in option for such sorting:

import numpy as np
import pandas as pd

df = pd.DataFrame([[   2.,    1.,    2.,    0.],
                    [   2.,    2.,  100.,    0.],
                    [   2.,    3.,  100.,    0.],
                    [   3.,    1.,    2.,    0.],
                    [   3.,    2.,    4.,    0.],
                    [   3.,    3.,    6.,    0.],
                    [   4.,    1.,    2.,    0.],
                    [   4.,    2.,    4.,    0.],
                    [   4.,    3.,    6.,    0.]], columns=['col1', 'col2', 'col3', 'col4'])

This then can be sorted in the desired order using

df.sort_values(by=['col2', 'col1'])

with the result

   col1  col2   col3  col4
0   2.0   1.0    2.0   0.0
3   3.0   1.0    2.0   0.0
6   4.0   1.0    2.0   0.0
1   2.0   2.0  100.0   0.0
4   3.0   2.0    4.0   0.0
7   4.0   2.0    4.0   0.0
2   2.0   3.0  100.0   0.0
5   3.0   3.0    6.0   0.0
8   4.0   3.0    6.0   0.0

The sorted result can be converted to a NumPy array with:

df.sort_values(by=['col2', 'col1']).to_numpy()

Remark:
I find this solution handy, if manipulating a single dataset and already using pandas. However, if such sorting has to be done repeatedly (e.g., in an MCMC in a loop), it is slow, and I fell upon this thread while looking for a faster option (and preferably native numpy). I am now using the np.lexsort solution proposed in another answer.
