
Using Python 2.7, scipy 1.0.0-3

Apparently I have a misunderstanding of how the numpy where function is supposed to operate or there is a known bug in its operation. I'm hoping someone can tell me which and explain a work-around to suppress the annoying warning that I am trying to avoid. I'm getting the same behavior when I use the pandas Series where().

To keep it simple, I'll use a numpy array as my example. Say I want to apply np.log() to the array, but only where the value is a valid input, i.e., myArray>0.0. Where the function should not be applied, I want the output to be a flag value of -999.9:

import numpy as np
myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
np.where(myArray>0.0, np.log(myArray), -999.9)

I expected numpy.where() to not complain about the 0.0 value in the array since the condition is False there, yet it does and it appears to actually execute for that False condition:

-c:2: RuntimeWarning: divide by zero encountered in log 
array([  0.00000000e+00,  -2.87682072e-01,  -6.93147181e-01,
        -1.38629436e+00,  -9.99900000e+02])

The numpy documentation states:

If x and y are given and input arrays are 1-D, where is equivalent to: [xv if c else yv for (c,xv,yv) in zip(condition,x,y)]

I beg to differ with this statement since

[np.log(val) if val>0.0 else -999.9 for val in myArray]

provides no warning at all:

[0.0, -0.2876820724517809, -0.69314718055994529, -1.3862943611198906, -999.9] 

So, is this a known bug? I don't want to suppress the warning for my entire code.

3 Answers

5

You can have the log evaluated at the relevant places only using its optional where parameter

np.where(myArray>0.0, np.log(myArray, where=myArray>0.0), -999.9)

or more efficiently

mask = myArray > 0.0
np.where(mask, np.log(myArray, where=mask), -999.9)

or if you find the "double where" ugly

np.log(myArray, where=myArray>0.0, out=np.full(myArray.shape, -999.9))

Any one of those three should suppress the warning.
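As a rough check of that claim, the sketch below (assuming Python 3 and a numpy version that provides np.errstate) turns divide-by-zero into an exception inside a with block, so any evaluation of log(0) would fail loudly. The names my_array, v1, v2 and unmasked_raised are illustrative only. np.errstate is scoped to the block, so the rest of the program keeps its normal warning behaviour.

```python
import numpy as np

my_array = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
mask = my_array > 0.0

# Inside this block only, a divide-by-zero raises FloatingPointError
# instead of emitting a RuntimeWarning.
with np.errstate(divide="raise"):
    # "Double where": log is skipped at masked-out positions,
    # and np.where fills those positions with the flag value.
    v1 = np.where(mask, np.log(my_array, where=mask), -999.9)

    # Single call: masked-out positions keep the pre-filled flag value.
    v2 = np.log(my_array, where=mask, out=np.full(my_array.shape, -999.9))

    # For contrast, the unmasked call does evaluate log(0).
    try:
        np.log(my_array)
        unmasked_raised = False
    except FloatingPointError:
        unmasked_raised = True

print(v1)
print(np.array_equal(v1, v2), unmasked_raised)
```

If either masked variant had evaluated log(0), the block would have raised before reaching the print calls.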


2

This behavior of where follows from how Python evaluates function calls. The line in question is an ordinary Python expression that calls a couple of numpy functions.

What happens in this expression?

np.where(myArray>0.0, np.log(myArray), -999.9)

The interpreter first evaluates all the arguments of the function, and then passes the results to where. Effectively then:

cond = myArray>0.0
A = np.log(myArray)
B = -999.9
np.where(cond, A, B)

The warning is produced in the 2nd line, not in the 4th.

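The 4th line is equivalent to the following, provided the scalar B is first broadcast to the shape of A (zip cannot iterate over a bare float):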

[xv if c else yv for (c,xv,yv) in zip(cond, A, B)]

or

[A[i] if c else B for i,c in enumerate(cond)]
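The difference between a function call and a conditional expression can be seen without numpy at all. The sketch below assumes Python 3; noisy and pick are hypothetical helpers written for this illustration, with pick standing in for any ordinary function such as np.where.

```python
calls = []

def noisy(label, value):
    """Record that this argument expression was evaluated, then return the value."""
    calls.append(label)
    return value

def pick(cond, a, b):
    """An ordinary function: a and b are already computed when it runs."""
    return a if cond else b

# Function call: both branch arguments are evaluated before pick() is entered.
calls.clear()
eager = pick(True, noisy("a", 1), noisy("b", 2))
eager_calls = list(calls)

# Conditional expression: only the selected branch is evaluated.
calls.clear()
lazy = noisy("a", 1) if True else noisy("b", 2)
lazy_calls = list(calls)

print(eager, eager_calls)  # 1 ['a', 'b']
print(lazy, lazy_calls)    # 1 ['a']
```

Both forms return the same value; only the function call pays for the branch it does not use.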

np.where is most often used with one argument, where it is a synonym for np.nonzero. We don't see this three-argument form that often on SO. It isn't that useful, in part because it doesn't save on calculations.

Masked assignment is more common, especially if there are more than 2 alternatives.

In [123]: mask = myArray>0
In [124]: out = np.full(myArray.shape, np.nan)
In [125]: out[mask] = np.log(myArray[mask])
In [126]: out
Out[126]: array([ 0.        , -0.28768207, -0.69314718, -1.38629436,         nan])

Paul Panzer showed how to do the same with the where parameter of log. That feature isn't being used as much as it could be.

In [127]: np.log(myArray, where=mask, out=out)
Out[127]: array([ 0.        , -0.28768207, -0.69314718, -1.38629436,         nan])
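For the case of more than two alternatives, a minimal sketch of masked assignment might look like this. The input array with a negative entry and the choice of NaN for negative inputs are assumptions made for the illustration; the -999.9 flag is borrowed from the question.

```python
import numpy as np

# Hypothetical input: positive values, a zero, and a negative value.
values = np.array([1.0, 0.5, 0.0, -2.0])

positive = values > 0
zero = values == 0

# Start from the default for everything not covered below (negative inputs).
out = np.full(values.shape, np.nan)

# log only ever receives the valid inputs, so no warning is triggered.
out[positive] = np.log(values[positive])

# Zeros get the flag value used in the question.
out[zero] = -999.9

print(out)
```

Each extra case is one more mask and one more assignment, which tends to read more clearly than nesting several np.where calls.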


1

This is not a bug. See this related answer to a similar question. The example in the docs is misleading, but that answer looks at it in detail.

The issue is that a conditional expression (x if c else y) is language syntax that evaluates only the branch it selects, while numpy.where is a regular function. A conditional expression can therefore short-circuit, whereas a function call cannot, because all of its arguments are evaluated before the call is made.

In other words, the arguments of numpy.where are calculated before the Boolean array is processed.

You may think this is inefficient: why build 2 separate arrays and then use a 3rd Boolean array to decide which item to choose? Surely that's double the work / double the memory?

However, this inefficiency is more than offset by the vectorisation provided by numpy functions acting on an entire array, e.g. np.log(arr).


Consider the example provided in the docs:

If x and y are given and input arrays are 1-D, where is equivalent to:

    [xv if c else yv for (c,xv,yv) in zip(condition,x,y)]

Notice the inputs are arrays. Try running:

c = np.array([0])

result = [xv if c else yv for (c, xv, yv) in zip(c==0, np.array([1]), np.log(c))]

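You will notice that this emits the same RuntimeWarning (divide by zero encountered in log): np.log(c) is evaluated over the whole array before the comprehension starts iterating, so the condition never gets a chance to skip it.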
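To observe this programmatically, the sketch below (assuming Python 3) wraps the same comprehension in warnings.catch_warnings and records whatever numpy emits. The names arr, cond, caught and messages are illustrative; the loop variable is renamed so that it does not shadow the array.

```python
import warnings
import numpy as np

arr = np.array([0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # zip() receives fully built arrays: np.log(arr) has already run over
    # every element before the first loop iteration picks a branch.
    result = [xv if cond else yv
              for (cond, xv, yv) in zip(arr == 0, np.array([1]), np.log(arr))]

messages = [str(w.message) for w in caught]
print(result)
print(messages)
```

The comprehension picks the value from the first array, yet the divide-by-zero message is still recorded, because it comes from building the third argument of zip rather than from the selection itself.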
