66

I have a rank-1 numpy.array that I want to make a boxplot of. However, I want to exclude all values equal to zero. Currently, I do this by looping over the array and copying each value to a new array if it is not equal to zero. However, as the array consists of 86 000 000 values and I have to do this multiple times, this takes a lot of patience.

Is there a more intelligent way to do this?

7 Answers

143

For a NumPy array a, you can use

a[a != 0]

to extract the values not equal to zero.
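As a minimal sketch of what that expression does (the sample values below are made up for illustration): `a != 0` builds a boolean array, and indexing with it copies only the selected elements into a new array, leaving `a` untouched.

```python
import numpy as np

# Illustrative data, not taken from the question
a = np.array([0.0, 1.5, 0.0, -2.0, 3.0])

mask = a != 0        # boolean array, True where the value is nonzero
nonzero = a[mask]    # new 1-D array holding the values 1.5, -2.0 and 3.0

print(nonzero)
```

The comparison and the selection both run in compiled NumPy code, which is why this tends to be far faster than a Python-level loop.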

3 Comments

Thank you very much, this is indeed much (!) faster. Can something similar be done on a higher-rank NumPy array or matrix? There the problem is that the dimensions will no longer match properly ...
@rubae: If a has higher dimension, the result will be a flattened (one-dimensional) array. It would also be possible to remove columns or rows that are all zero.
...where a is a np.array. This will not work on built-in Python arrays.
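To illustrate the higher-dimensional case from the comments, here is a small sketch on a made-up 2-D array. The names `keep_rows`, `keep_cols` and `trimmed` are only for this example.

```python
import numpy as np

# Illustrative 2-D array with one all-zero row and one all-zero column
a = np.array([[0, 1, 0],
              [0, 0, 0],
              [0, 2, 3]])

# Boolean indexing on a 2-D array returns a flattened 1-D result
flat = a[a != 0]

# Keep only rows / columns that contain at least one nonzero value
keep_rows = (a != 0).any(axis=1)
keep_cols = (a != 0).any(axis=0)
trimmed = a[keep_rows][:, keep_cols]

print(flat)
print(trimmed)
```

Note that `trimmed` can still contain zeros; only rows and columns that are entirely zero are dropped, so the result stays rectangular.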
41

This is a case where you want to use masked arrays: they keep the shape of your array, and they are automatically recognized by all numpy and matplotlib functions.

import numpy as np
import matplotlib.pyplot as plt

X = np.random.randn(1000, 5)  # randn needs integer dimensions, so 1000 rather than 1e3
X[np.abs(X) < .1] = 0  # some zeros
X = np.ma.masked_equal(X, 0)
plt.boxplot(X)  # masked values are not plotted

# other functionalities of masked arrays
X.compressed()  # get normal array with masked values removed
X.mask  # get a boolean array of the mask
X.mean()  # it automatically discards masked values
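In case the plotting call in your matplotlib version does not skip masked entries, one possible workaround (a sketch, not part of the original answer) is to compress each column separately and pass the resulting sequence of vectors to the plotting function. The name `columns` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
X[np.abs(X) < 0.1] = 0            # some zeros
Xm = np.ma.masked_equal(X, 0)     # same shape as X, zeros masked

# One plain 1-D array per column, with the masked (zero) entries removed.
# The columns may end up with different lengths, hence a list of arrays.
columns = [Xm[:, j].compressed() for j in range(Xm.shape[1])]

# The list can then be handed to a function that accepts a sequence of
# vectors, for example plt.boxplot(columns).
print([len(c) for c in columns])
```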

19

I decided to compare the runtime of the different approaches mentioned here. I've used my library simple_benchmark for this.

The boolean indexing with array[array != 0] seems to be the fastest (and shortest) solution.

For smaller arrays the MaskedArray approach is very slow compared to the other approaches; for larger arrays, however, it is as fast as the boolean indexing approach. For moderately sized arrays there is not much difference between the approaches.

Here is the code I've used:

from simple_benchmark import BenchmarkBuilder

import numpy as np

bench = BenchmarkBuilder()

@bench.add_function()
def boolean_indexing(arr):
    return arr[arr != 0]

@bench.add_function()
def integer_indexing_nonzero(arr):
    return arr[np.nonzero(arr)]

@bench.add_function()
def integer_indexing_where(arr):
    return arr[np.where(arr != 0)]

@bench.add_function()
def masked_array(arr):
    return np.ma.masked_equal(arr, 0)

@bench.add_arguments('array size')
def argument_provider():
    for exp in range(3, 25):
        size = 2**exp
        arr = np.random.random(size)
        arr[arr < 0.1] = 0  # add some zeros
        yield size, arr

r = bench.run()
r.plot()

1 Comment

The benchmark for masked_array is built incorrectly: np.ma.masked_equal(arr, 0) does not return a filtered array. It should be m = np.ma.masked_equal(arr, 0); return arr[~m.mask]
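Following up on that comment, here is a sketch of what a masked-array variant that actually returns the filtered values could look like. The function names are hypothetical and are not part of the benchmark above; `np.ma.getmaskarray` is used so that the mask is always a full boolean array, even when nothing is masked.

```python
import numpy as np

def masked_array_compressed(arr):
    # Mask the zeros, then drop the masked entries
    return np.ma.masked_equal(arr, 0).compressed()

def masked_array_via_mask(arr):
    # Variant suggested in the comment: index the original array
    # with the inverted mask
    m = np.ma.masked_equal(arr, 0)
    return arr[~np.ma.getmaskarray(m)]

arr = np.array([0.0, 0.5, 0.0, 0.9])  # illustrative data
print(masked_array_compressed(arr))
print(masked_array_via_mask(arr))
```

Both variants do more work than building the mask alone, so their timings would presumably come out higher than those of the masked_array function in the benchmark.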
5

You can index with a Boolean array. For a NumPy array A:

res = A[A != 0]

You can use Boolean array indexing as above, bool type conversion, np.nonzero, or np.where. Here's some performance benchmarking:

# Python 3.7, NumPy 1.14.3

import numpy as np

np.random.seed(0)

A = np.random.randint(0, 5, 10**8)

%timeit A[A != 0]          # 768 ms
%timeit A[A.astype(bool)]  # 781 ms
%timeit A[np.nonzero(A)]   # 1.49 s
%timeit A[np.where(A)]     # 1.58 s
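The `%timeit` lines above need IPython. A rough equivalent with the standard-library `timeit` module might look like the sketch below; the array is smaller than in the answer (10**6 instead of 10**8 elements, an arbitrary choice to keep the run short), so the absolute numbers will differ.

```python
import timeit

import numpy as np

np.random.seed(0)
A = np.random.randint(0, 5, 10**6)  # smaller than the 10**8 used above

candidates = {
    "A[A != 0]": lambda: A[A != 0],
    "A[A.astype(bool)]": lambda: A[A.astype(bool)],
    "A[np.nonzero(A)]": lambda: A[np.nonzero(A)],
    "A[np.where(A)]": lambda: A[np.where(A)],
}

reference = A[A != 0]
for label, fn in candidates.items():
    # All four expressions should select exactly the same values
    assert np.array_equal(fn(), reference)
    # Best of 3 repeats, 3 calls each, reported per call
    seconds = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{label:20s} {seconds * 1e3:8.2f} ms")
```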

4

I would suggest simply using NaN for cases like this, where you want to ignore some values but still keep the procedure as statistically meaningful as possible. So

In []: X = randn(1000, 5)
In []: X[abs(X) < .1] = NaN
In []: isnan(X).sum(0)
Out[]: array([82, 84, 71, 81, 73])
In []: boxplot(X)

3 Comments

Ah, the use of NaN indeed seems more appropriate here, thank you. This way I no longer need to copy my data to a new array of a different size; I can keep the original array and with it each value's location in the array. Thank you!
Do you perhaps know a way to loop this using a list comprehension? I.e. I have a dictionary a where a[k] is a NumPy array, so I wanted to do [a[k][abs(a[k])<.1]=float('NaN') for k in data], but this seems to fail in the loop, whereas executing just the command inside the loop seems to work ...
@rubae: I think you should ask a separate question about this list comprehension issue. Unfortunately it is no longer so straightforward to figure out what you are actually aiming for :(. As far as I can guess: don't get fooled by the list comprehension; perhaps you are only looking for something simple like this: for k in data: a[k][abs(a[k])< .1]= NaN?
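The list comprehension in the comment fails because an assignment is a statement, and a comprehension can only contain expressions. A sketch of two ways around this, using a made-up dictionary `data` of arrays:

```python
import numpy as np

# Illustrative dictionary of arrays (hypothetical data)
data = {
    "x": np.array([0.05, 1.0, -0.02, 2.5]),
    "y": np.array([3.0, 0.0, -4.0]),
}

# Option 1: a dict comprehension that builds new arrays with np.where,
# leaving the originals untouched
cleaned = {k: np.where(np.abs(v) < 0.1, np.nan, v) for k, v in data.items()}

# Option 2: a plain for loop that modifies each array in place
for k in data:
    arr = data[k]
    arr[np.abs(arr) < 0.1] = np.nan

print(cleaned)
print(data)
```

The arrays must have a float dtype for this to work, since NaN cannot be stored in an integer array.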
4

A simple line of code can get you an array that excludes all '0' values:

np.argwhere(*array*)

example:

import numpy as np
array = [0, 1, 0, 3, 4, 5, 0]
array2 = np.argwhere(array)
print(array2)

[[1]
 [3]
 [4]
 [5]]

3 Comments

np.argwhere returns the indexes of the nonzero elements only
So this array, by sheer luck, happens to appear to satisfy the question, but is misleading. From the result of argwhere you could reconstitute the non-zero array, but it's an additional step.
Actually, np.argwhere() doesn't return the list with the zeros excluded; it returns the indices of the nonzero elements.
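A small sketch that makes the difference visible, using values chosen so that the indices and the values cannot be confused (the data is made up for this example):

```python
import numpy as np

array = np.array([0, 10, 0, 30, 40, 50, 0])

# argwhere gives the positions of the nonzero entries, one row per
# element, so the result has shape (4, 1)
indices = np.argwhere(array)

# The extra step mentioned above: use the positions to fetch the values
values = array[indices.ravel()]

# Boolean indexing gets the same values directly
same = array[array != 0]

print(indices.ravel())
print(values)
```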
1

[i for i in Array if i != 0.0] if the numbers are float or [i for i in Array if i != 0] if the numbers are int.

1 Comment

Your solution will likely be less efficient than NumPy. To handle both types at once you could do [i for i in Array if i > 0] (note that this also drops negative values).
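For what it's worth, a quick check on a made-up mixed list suggests that `i != 0` already handles ints and floats alike, whereas `i > 0` changes the result as soon as negative numbers are present:

```python
# Illustrative list mixing ints and floats, including a negative value
values = [0, 1.5, -2, 0.0, 3]

not_zero = [i for i in values if i != 0]   # keeps the negative value
positive = [i for i in values if i > 0]    # silently drops -2

print(not_zero)
print(positive)
```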
