Vectorized matrix selection based on array

Question

I have a S x n array DATA with data. I have a (S x 1) array ARRAY with integer values <=n. For each row i in DATA, I want to

DATA[i, ARRAY[i]:] = np.nan

Here's how I'm doing it right now

from numpy.random import poisson as poissonN
from numpy.random import uniform
import numpy as np
S = 1000
n = 8
DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
ARRAY = poissonN(1, S).reshape((-1, 1))

for i, draw in enumerate(ARRAY):
    DATA[i, draw:] = np.nan

There must be a vectorized analogue to this which is more efficient if S goes in the hundred thousands, right? Whatever meshing I try, it won't work out - or is equally slow to this iterative approach.

There might be a fancy index trick to make this work (but I don't know it). If one doesn't show up, writing it up in Cython might help speed things up if this code snippet becomes a significant bottleneck. — mgilson
– mgilson, Commented Jun 14, 2016 at 20:04

Divakar · Accepted Answer · 2016-06-14 20:37:39Z

You can use NumPy broadcasting and boolean indexing -

DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan

Explanation

Let's take a sample case with S = 5 and n=4 and create DATA and ARRAY.

In [288]: S = 5
     ...: n = 4
     ...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
     ...: ARRAY = poissonN(1, S).reshape((-1, 1))
     ...: 

In [289]: DATA
Out[289]: 
array([[ 0.54235747,  0.01309313,  0.62664698,  0.92081697],
       [ 0.17877576,  0.36536259,  0.91874957,  0.81924979],
       [ 0.7518459 ,  0.73218436,  0.99685998,  0.26435871],
       [ 0.73130257,  0.77123956,  0.10437601,  0.09296549],
       [ 0.804398  ,  0.78675381,  0.71066382,  0.87481544]])

In [290]: ARRAY
Out[290]: 
array([[1],
       [1],
       [0],
       [2],
       [1]])

Now, run the loopy code and see what happens -

In [291]: for i, draw in enumerate(ARRAY):
     ...:     DATA[i, draw:] = np.nan
     ...:     

In [292]: DATA
Out[292]: 
array([[ 0.54235747,         nan,         nan,         nan],
       [ 0.17877576,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [ 0.73130257,  0.77123956,         nan,         nan],
       [ 0.804398  ,         nan,         nan,         nan]])

Now, with the proposed solution we are creating a boolean array of the same shape as DATA such that it covers all NaN elements as True and rest as False.

For the same, we are using broadcasting as shown here -

In [293]: ARRAY <= np.arange(DATA.shape[1])
Out[293]: 
array([[False,  True,  True,  True],
       [False,  True,  True,  True],
       [ True,  True,  True,  True],
       [False, False,  True,  True],
       [False,  True,  True,  True]], dtype=bool)

Thus, with boolean indexing we can set all those positions as NaNs in DATA. Let's create another instance of random elements and test out the NaNs with our proposed approach -

In [294]: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))

In [295]: DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan

In [296]: DATA
Out[296]: 
array([[ 0.87061908,         nan,         nan,         nan],
       [ 0.69237094,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [ 0.04257803,  0.82311917,         nan,         nan],
       [ 0.00723291,         nan,         nan,         nan]])

Please note the non-Nan values are different, because we have re-created DATA. The important thing to notice is that we have correctly set NaNs.

Runtime test

In [297]: # Inputs
     ...: S = 1000
     ...: n = 8
     ...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
     ...: ARRAY = poissonN(1, S).reshape((-1, 1))
     ...: 

In [298]: DATAc = DATA.copy() # Make copy for testing proposed ans

In [299]: def org_app(DATA,ARRAY):
     ...:     for i, draw in enumerate(ARRAY):
     ...:         DATA[i, draw:] = np.nan
     ...:         

In [301]: %timeit org_app(DATA,ARRAY)
100 loops, best of 3: 4.99 ms per loop

In [302]: %timeit DATAc[ARRAY <= np.arange(DATAc.shape[1])] = np.nan
10000 loops, best of 3: 94.1 µs per loop

In [305]: np.allclose(np.isnan(DATA),np.isnan(DATAc))
Out[305]: True

Collectives™ on Stack Overflow

Vectorized matrix selection based on array

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related