Usually I'm able to match Numba's performance when using Cython. However, in this example I have failed to do so: Numba is about 4 times faster than my Cython version.

Here is the Cython version:

%%cython -c=-march=native -c=-O3
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i]>0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output 

And here is the Numba version:

import numba as nb
import numpy as np
@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i]>0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

When tested, the Cython version is on par with numpy's where, but clearly slower than Numba:

#Python3.6 + Cython 0.28.3 + gcc-7.2
import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)

assert (cy_where(data)==nb_where(data)).all()
assert (np.where(data>0.5,2*data, data)==nb_where(data)).all()

%timeit cy_where(data)       # 179ms
%timeit nb_where(data)       # 49ms (!!)
%timeit np.where(data>0.5,2*data, data)  # 278 ms

What is the reason for Numba's performance and how can it be matched when using Cython?


As suggested by @max9111, I eliminated the stride by using a contiguous memory view, which doesn't improve the performance much:

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where_cont(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    cdef double[::1] view = output  # view as contiguous!
    for i in range(n):
        if df[i]>0.5:
            view[i] = 2.0*df[i]
        else:
            view[i] = df[i]
    return output 

%timeit cy_where_cont(data)   #  165 ms
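As an aside (not part of the original code), the conditional doubling can also be written branch-free in pure NumPy; replacing the branch with arithmetic is a common trick to help auto-vectorizers. A sketch, assuming float input:

```python
import numpy as np

def np_where_branchless(df):
    # df * (1 + mask) doubles exactly the elements where df > 0.5,
    # so it matches np.where(df > 0.5, 2*df, df) for float arrays.
    return df * (1.0 + (df > 0.5))
```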
  • Does this cdef double[::1] output = np.empty(n, dtype=np.float64) improve the performance? It looks like cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64) causes strided memory access afterwards which often prevents SIMD-vectorization. (I looked that up in the html generated with the -a flag, but have no gcc available right now.) Commented Aug 27, 2018 at 22:10
  • @max9111 If SIMD-vectorization is the reason for the speed-up, then one should probably use a contiguous memory view as you suggested. In this case it didn't change much (see my edit). Maybe this is a missed optimization in gcc? Commented Aug 28, 2018 at 4:09
  • Roughly equivalent in godbolt - godbolt.org/z/h_qNbH - it does seem like clang does a lot 'more' - some of that is just loop unrolling, but its overall vectorization strategy is different too. Commented Aug 28, 2018 at 14:35

2 Answers


This seems to be completely driven by optimizations that LLVM is able to make. If I compile the Cython example with clang, performance between the two examples is identical. For what it's worth, MSVC on Windows shows a similar performance discrepancy to Numba.

$ CC=clang ipython
<... setup code>

In [7]: %timeit cy_where(data)
   ...: %timeit nb_where(data)

30.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.2 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4 Comments

Which clang version do you use?
This was with clang 6.0 on ubuntu 18.04
I'm impressed that the newer clang version gets this performance even when nobody tells it that the data is contiguous (older clang versions (3.8) weren't able to do it).
I don't understand why gcc is not able to match clang's performance - coding in pure C I cannot see much difference. But even when using the C code verbatim in Cython there is a huge difference between gcc and clang.

Interestingly, compiling the original NumPy code with pythran, using clang as a backend, yields the same performance as the Numba version.

import numpy as np
#pythran export work(float64[])

def work(df):
    return np.where(df > 0.5, 2 * df, df)

Compiled with

CXX=clang++ CC=clang pythran pythran_work.py -O3 -march=native

and the benchmark session:

import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)
import numba_work, pythran_work

%timeit numba_work.work(data)
12.7 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pythran_work.work(data)
12.7 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Comment

Do you think this is due to pythran or clang vs gcc?
