
This is based on a question asked in October 2018.

Consider the following code: three simple functions that count the non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).

import numpy as np

def f_1(arr):
    return np.sum(arr > 0)

def f_2(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_3(arr):
    return np.count_nonzero(arr)

if __name__ == '__main__':

    data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
    print(f_1(data))
    print(f_2(data))
    print(f_3(data))

Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):

%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So f_2() is faster than f_1() and f_3(), yet this is not the case with smaller data. The question is: why? Is it NumPy, Python, or something else?

  • You're allocating an 8 GB array, so you're implicitly testing caching behavior: a 3D NumPy array of shape 1000 × 1000 × 1000 with dtype np.int64 has 10^9 elements at 8 bytes each. Whether it fits, is cached, etc. depends on your particular machine, and it won't fit in RAM on many desktops or laptops. If you don't want to test caching, use dimensions 100 × 100 × 100 instead, which is only 8 MB. Commented Jun 28, 2019 at 23:38
  • Also, to make this deterministic, set your random seed before the np.random.randint call (see the sketch after these comments). Commented Jun 28, 2019 at 23:40
  • @Barker you seem to have missed the difference between seconds and milliseconds. Commented Jun 29, 2019 at 0:32
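For reference, the seeding suggestion is a one-liner (a minimal sketch; the legacy np.random.seed API is shown because the question uses the legacy np.random.randint):

import numpy as np

np.random.seed(42)  # any fixed seed makes the data reproducible across runs
data = np.random.randint(0, 10, (1_000, 1_000, 1_000))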

3 Answers


This is due to memory access and caching. Each of these functions is doing two things; take the first one as an example:

np.sum(arr > 0)

It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array the same shape as arr. Then, it sums this array.

Straightforward, right? Well, with np.sum(arr > 0) that intermediate array is large. When it is too large to fit in the CPU cache, performance drops: by the time the processor starts executing the sum, most of the array's elements have been evicted from the cache and must be reloaded from main memory.

Since f_2 iterates over the first dimension, it deals with smaller sub-arrays. The same comparison and sum are done, but this time the intermediate array fits in the cache. It's created, used, and destroyed without ever leaving the cache, which is much faster.
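To make the size difference concrete, here is a quick check of the temporaries each approach creates (an illustrative sketch; a boolean temporary uses one byte per element):

import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))

full_mask = data > 0            # temporary created by f_1: one byte per element
print(full_mask.nbytes)         # 1_000_000_000 bytes (~1 GB), far larger than any CPU cache

slice_mask = data[0, :, :] > 0  # temporary for a single f_2 iteration
print(slice_mask.nbytes)        # 1_000_000 bytes (~1 MB), small enough to stay cache-resident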

Now, you would think that f_3 would be the fastest (using a built-in method and all), but looking at the source code shows that it uses the following operations:

a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)

a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.
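Note that copy=False is only a request: astype can skip the copy only when the dtype (and memory layout) already match, so converting an integer array still allocates a full-size boolean temporary. A small illustrative check:

import numpy as np

a = np.arange(10)
b = a.astype(np.bool_, copy=False)
print(b is a)              # False: a new bool array was allocated anyway
print(a.nbytes, b.nbytes)  # e.g. 80 vs 10 with a default int64 dtype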

Conclusions

Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).

Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old python code or datatypes in combination with numpy.

If you frequently find yourself hand-writing for loops for better performance, you may want to take a look at numexpr, which automates some of this. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
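For example, a minimal sketch (the exact expression is my own assumption, not from the question; numexpr evaluates the whole expression in cache-sized chunks, so no full-size boolean temporary is materialized):

import numexpr as ne
import numpy as np

def f_ne(arr):
    # where(...) maps the comparison to integers, since numexpr's
    # sum reduction needs a numeric operand
    return ne.evaluate("sum(where(arr > 0, 1, 0))").item()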


7 Comments

  • Good catch. Indeed, by removing the > 0 test numpy is the fastest
  • Thank you for the reply. Can you explain which cache you mean? I ask because data is about 3.7 GB, data > 0 is about 0.9 GB, and data[0, :, :] is about 3.8 MB.
  • @Poolka, it's the CPU cache that causes this. I don't know exact numbers off the top of my head, but typically it holds a few MB (here's a page I googled that has some better numbers: makeuseof.com/tag/what-is-cpu-cache).
  • Any thoughts on the dependence on the array axis that @BlackBear noticed in his answer? The running time increases tenfold, which is very confusing to me.
  • @Poolka, Numpy does the correct thing with regards to memory layout here, and each of the options given in the question accesses memory in the same way, so that's not the issue.

It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from the memory in blocks which are then cached; if an algorithm manages to use data from a block that is in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
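You can read the layout of a C-ordered array directly from its strides (a small illustrative example; the shape is reduced so it is cheap to create):

import numpy as np

a = np.zeros((100, 100, 100), dtype=np.int64)  # C order: the last axis is contiguous
print(a.strides)                               # (80000, 800, 8)
print(a[0, :, :].flags['C_CONTIGUOUS'])        # True:  one contiguous block, cache-friendly
print(a[:, :, 0].flags['C_CONTIGUOUS'])        # False: consecutive elements are 800 bytes apart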

Consider these variations, which only differ in which axis we are iterating on:

def f_2_0(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans

And the results on my laptop:

%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can see that f_2_1 is almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation of how exactly caching affects the timing is in the other answer.

2 Comments

  • Thank you for the answer. I see you edited the answer and referred to another one. However, about that strong dependence on the array axis you noticed: would it be correct to call it a numpy issue?
  • @Poolka it's a numpy issue in the sense that data > 0 creates a copy, not a view (as opposed to indexing). It is not a problem of iteration, though: if you run the same test without the > 0 part, you will see that numpy is the fastest (albeit not by much; my results are 500 ms, 537 ms, 945 ms, and 10.8 s)

Let's remove the temporary array completely

As @user2699 already mentioned in his answer, allocating and writing to a large array that doesn't fit in cache can slow down the process quite a lot. To show this behavior I have written two small functions using Numba (a JIT compiler).

In compiled languages (C, Fortran, ...) you normally avoid temporary arrays. In interpreted Python (without Cython or Numba) you often want to call a compiled function on a larger chunk of data (vectorization) because loops in interpreted code are extremely slow. But this can also have a few downsides, such as temporary arrays and bad cache usage.

Function without temporary array allocation

import numba as nb
import numpy as np

@nb.njit(fastmath=True, parallel=False)
def f_4(arr):
    sum = 0
    # with parallel=False, nb.prange behaves like an ordinary range
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                if arr[i,j,k]>0:
                    sum+=1
    return sum

With temporary array

Please note that if you turn on parallelization (parallel=True), the compiler not only tries to parallelize the code but also enables other optimizations, such as loop fusion.

@nb.njit(fastmath=True, parallel=False)
def f_5(arr):
    return np.sum(arr>0)
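To reproduce the parallel timings below, the same functions are recompiled with parallel=True; only the decorator argument changes. For example (f_5_parallel is a hypothetical name for illustration):

# Same body as f_5; with parallel=True Numba can fuse the comparison
# and the reduction, so the boolean temporary is optimized away
@nb.njit(fastmath=True, parallel=True)
def f_5_parallel(arr):
    return np.sum(arr>0)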

Timings

%timeit f_1(data)
1.65 s ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.27 s ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
1.99 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_4(data) #parallel=False
216 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data) #parallel=True
121 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=False
1.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=True, the temporary array is automatically optimized away
146 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
