NumPy selecting specific column index per row by using a list of indexes

Question

I'm struggling to select the specific columns per row of a NumPy matrix.

Suppose I have the following matrix which I would call X:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

I also have a list of column indexes per every row which I would call Y:

[1, 0, 2]

I need to get the values:

[2]
[4]
[9]

Instead of a list with indexes Y, I can also produce a matrix with the same shape as X where every column is a bool / int in the range 0-1 value, indicating whether this is the required column.

[0, 1, 0]
[1, 0, 0]
[0, 0, 1]

I know this can be done with iterating over the array and selecting the column values I need. However, this will be executed frequently on big arrays of data and that's why it has to run as fast as it can.

I was thus wondering if there is a better solution?

Is the answer better for you? stackoverflow.com/a/17081678/5046896 — GoingMyWay
– GoingMyWay, Commented Jan 17, 2020 at 12:17

wjandrea · Accepted Answer · 2022-07-30 16:28:12Z

166

If you've got a boolean array you can do direct selection based on that like so:

>>> a = np.array([True, True, True, False, False])
>>> b = np.array([1,2,3,4,5])
>>> b[a]
array([1, 2, 3])

To go along with your initial example you could do the following:

>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> b = np.array([[False,True,False],[True,False,False],[False,False,True]])
>>> a[b]
array([2, 4, 9])

You can also add in an arange and do direct selection on that, though depending on how you're generating your boolean array and what your code looks like YMMV.

>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> a[np.arange(len(a)), [1,0,2]]
array([2, 4, 9])

edited Jul 30, 2022 at 16:28

wjandrea

33.8k10 gold badges69 silver badges105 bronze badges

answered May 2, 2014 at 19:47

Slater Victoroff

22k23 gold badges92 silver badges149 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Gerrit-K Over a year ago

+1 for the example using arange . This was particularly useful to me for retrieving different blocks from multiple matrices (so basically the 3D case of this example)

marcotama Over a year ago

Hi, could you explain why we have to use arange instead of :? I know your way works and mine doesn't, but I would like to understand why.

Slater Victoroff Over a year ago

@tamzord because it's a numpy array and not a vanilla python list, so the : syntax doesn't work the same way.

marcotama Over a year ago

@SlaterTyranus, thanks for responding. My understanding, after some reading, is that mixing : with advanced indexing means: "for every sub-space along :, apply the given advanced indexing". Is my understanding correct?

Slater Victoroff Over a year ago

@tamzord explain what you mean by "sub-space"

|

Ashwini Chaudhary · Accepted Answer · 2014-05-02 19:50:04Z

59

You can do something like this:

In [7]: a = np.array([[1, 2, 3],
   ...: [4, 5, 6],
   ...: [7, 8, 9]])

In [8]: lst = [1, 0, 2]

In [9]: a[np.arange(len(a)), lst]
Out[9]: array([2, 4, 9])

More on indexing multi-dimensional arrays: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays

answered May 2, 2014 at 19:50

Ashwini Chaudhary

252k60 gold badges478 silver badges519 bronze badges

4 Comments

MadmanLee Over a year ago

struggling to understand why the arange is needed instead of simply ':' or range.

GoingMyWay Over a year ago

@MadmanLee Hi, using : will output multiple len(a) times of the results, instead, indicating the index of each row will print the anticipated results.

GoingMyWay Over a year ago

I think this is the exactly the right and elegant way to solve this problem.

spawn Jan 6 at 20:22

@GoingMyWay: not any more -- I prefer the np.take_along_axis approach of this answer

hpaulj · Accepted Answer · 2020-04-15 16:51:03Z

Recent numpy versions have added a take_along_axis (and put_along_axis) that does this indexing cleanly.

In [101]: a = np.arange(1,10).reshape(3,3)                                                             
In [102]: b = np.array([1,0,2])                                                                        
In [103]: np.take_along_axis(a, b[:,None], axis=1)                                                     
Out[103]: 
array([[2],
       [4],
       [9]])

It operates in the same way as:

In [104]: a[np.arange(3), b]                                                                           
Out[104]: array([2, 4, 9])

but with different axis handling. It's especially aimed at applying the results of argsort and argmax.

mbpaulus · Accepted Answer · 2018-11-10 11:56:38Z

8

A simple way might look like:

In [1]: a = np.array([[1, 2, 3],
   ...: [4, 5, 6],
   ...: [7, 8, 9]])

In [2]: y = [1, 0, 2]  #list of indices we want to select from matrix 'a'

range(a.shape[0]) will return array([0, 1, 2])

In [3]: a[range(a.shape[0]), y] #we're selecting y indices from every row
Out[3]: array([2, 4, 9])

edited Nov 10, 2018 at 11:56

mbpaulus

7,7313 gold badges35 silver badges41 bronze badges

answered May 30, 2018 at 9:19

Dhaval Mayatra

811 silver badge4 bronze badges

Comments

Kei Minagawa · Accepted Answer · 2014-05-05 01:54:46Z

3

You can do it by using iterator. Like this:

np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)

Time:

N = 1000
X = np.zeros(shape=(N, N))
Y = np.arange(N)

#@Aशwini चhaudhary
%timeit X[np.arange(len(X)), Y]
10000 loops, best of 3: 30.7 us per loop

#mine
%timeit np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
1000 loops, best of 3: 1.15 ms per loop

#mine
%timeit np.diag(X.T[Y])
10 loops, best of 3: 20.8 ms per loop

edited May 5, 2014 at 1:54

answered May 2, 2014 at 23:56

Kei Minagawa

4,5515 gold badges31 silver badges43 bronze badges

2 Comments

user2379410 Over a year ago

OP mentioned it should run fast on large arrays, so your benchmarks are not very representative. I'm curious how your last method performs for (much) larger arrays!

Kei Minagawa Over a year ago

@moarningsun: Updated. np.diag(X.T[Y]) is so slow... But np.diag(X.T) is so fast(10us). I don't know why.

Rainald62 · Accepted Answer · 2022-03-05 20:39:41Z

The answer from hpaulj using take_along_axis should be the accepted one.

Here is a derived version with an N-dim index array:

>>> arr = np.arange(20).reshape((2,2,5))
>>> idx = np.array([[1,0],[2,4]])
>>> np.take_along_axis(arr, idx[...,None], axis=-1)
array([[[ 1],
        [ 5]],

       [[12],
        [19]]])

Note that the selection operation is ignorant about the shapes. I used this to refine a possibly vector-valued argmax result from histogram by fitting parabolas:

def interpol(arr):
    i = np.argmax(arr, axis=-1)
    a = lambda Δ: np.squeeze(np.take_along_axis(arr, i[...,None]+Δ, axis=-1), axis=-1)
    frac = .5*(a(1) - a(-1)) / (2*a(0) - a(-1) - a(1)) # |frac| < 0.5
    return i + frac

Note the squeeze to remove the dimension of size 1 resulting in the same shape of i and frac, the integer and fractional part of the peak position.

I'm quite sure that it is possible to avoid the lambda, but would the interpolation formula still look nice?

Thomas Devoogdt · Accepted Answer · 2018-05-30 15:16:19Z

Another clever way is to first transpose the array and index it thereafter. Finally, take the diagonal, its always the right answer.

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Y = np.array([1, 0, 2, 2])

np.diag(X.T[Y])

Step by step:

Original arrays:

>>> X
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

>>> Y
array([1, 0, 2, 2])

Transpose to make it possible to index it right.

>>> X.T
array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  6,  9, 12]])

Get rows in the Y order.

>>> X.T[Y]
array([[ 2,  5,  8, 11],
       [ 1,  4,  7, 10],
       [ 3,  6,  9, 12],
       [ 3,  6,  9, 12]])

The diagonal should now become clear.

>>> np.diag(X.T[Y])
array([ 2,  4,  9, 12]

This technically works and looks very elegant. However, I find that this approach completely explodes when you’re dealing with large arrays. In my case, NumPy swallowed 30GB of swap and filled my SSD. I recommend using the advanced indexing approach instead.

Collectives™ on Stack Overflow

NumPy selecting specific column index per row by using a list of indexes

7 Answers 7

6 Comments

4 Comments

Comments

Comments

2 Comments

Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

6 Comments

4 Comments

Comments

Comments

2 Comments

Comments

1 Comment

Linked

Related