Usually I'm able to match Numba's performance when using Cython. However, in this example I have failed to do so: Numba is about 4 times faster than my Cython version.

Here is the Cython version:

%%cython -c=-march=native -c=-O3
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i]>0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output 

And here is the Numba version:

import numba as nb
import numpy as np
@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i]>0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

When tested, the Cython version is on par with numpy's where, but clearly slower than Numba:

#Python3.6 + Cython 0.28.3 + gcc-7.2
import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)

assert (cy_where(data)==nb_where(data)).all()
assert (np.where(data>0.5,2*data, data)==nb_where(data)).all()

%timeit cy_where(data)       # 179ms
%timeit nb_where(data)       # 49ms (!!)
%timeit np.where(data>0.5,2*data, data)  # 278 ms

What is the reason for Numba's performance and how can it be matched when using Cython?


As suggested by @max9111, I eliminated the stride by using a contiguous memory view, which doesn't improve the performance much:

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where_cont(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    cdef double[::1] view = output  # view as contiguous!
    for i in range(n):
        if df[i]>0.5:
            view[i] = 2.0*df[i]
        else:
            view[i] = df[i]
    return output 

%timeit cy_where_cont(data)   #  165 ms
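As an aside (not part of the original code), the conditional doubling can also be written branch-free in pure NumPy; replacing the branch with arithmetic is a common trick to help auto-vectorizers. A sketch, assuming float input:

```python
import numpy as np

def np_where_branchless(df):
    # df * (1 + mask) doubles exactly the elements where df > 0.5,
    # so it matches np.where(df > 0.5, 2*df, df) for float arrays.
    return df * (1.0 + (df > 0.5))
```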
  • Does this cdef double[::1] output = np.empty(n, dtype=np.float64) improve the performance? It looks like cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64) causes strided memory access afterwards which often prevents SIMD-vectorization. (I looked that up in the html generated with the -a flag, but have no gcc available right now.) Commented Aug 27, 2018 at 22:10
  • @max9111 If SIMD-vectorization is the reason for the speed-up, then one should probably use a contiguous memory view as you suggested. In this case it didn't change much (see my edit). Maybe this is a missed optimization in gcc? Commented Aug 28, 2018 at 4:09
  • Roughly equivalent in godbolt - godbolt.org/z/h_qNbH - it does seem like clang does a lot 'more' - some of that is just loop unrolling, but its overall vectorization strategy is different too. Commented Aug 28, 2018 at 14:35

2 Answers


This seems to be completely driven by optimizations that LLVM is able to make. If I compile the Cython example with clang, performance between the two examples is identical. For what it's worth, MSVC on Windows shows a similar performance discrepancy to Numba.

$ CC=clang ipython
<... setup code>

In [7]: %timeit cy_where(data)
   ...: %timeit nb_where(data)

30.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.2 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4 Comments

Which clang version do you use?
This was with clang 6.0 on ubuntu 18.04
I'm impressed that the newer clang version gets this performance even when nobody tells it that the data is contiguous (older clang versions (3.8) weren't able to do it).
I don't understand why gcc is not able to match clang's performance - coding in pure C I cannot see much difference. But even when using the C code verbatim in Cython there is a huge difference between gcc and clang.

Interestingly, compiling the original NumPy code with pythran, using clang as a backend, yields the same performance as the Numba version.

import numpy as np
#pythran export work(float64[])

def work(df):
    return np.where(df > 0.5, 2 * df, df)

Compiled with

CXX=clang++ CC=clang pythran pythran_work.py -O3 -march=native

and the benchmark session:

import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)
import numba_work, pythran_work

%timeit numba_work.work(data)
12.7 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pythran_work.work(data)
12.7 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Comment

Do you think this is due to pythran or clang vs gcc?
