pandas much slower than numpy?

Question

The code below suggests that pandas may be much slower than numpy, at least in the specific case of the function clip(). What is surprising is that a round trip from pandas to numpy and back to pandas, with the calculation performed in numpy, is still much faster than doing it in pandas.

Shouldn't the pandas function have been implemented in this roundabout way?

In [49]: arr = np.random.randn(1000, 1000)

In [50]: df=pd.DataFrame(arr)

In [51]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 8.18 ms per loop

In [52]: %timeit df.clip_lower(0)
1 loops, best of 3: 344 ms per loop

In [53]: %timeit pd.DataFrame(np.clip(df.values, 0, None))
100 loops, best of 3: 8.4 ms per loop

That's expected, since pandas performs a lot of data checks, transformations, and other work on top of the clipping itself.
– alko, Commented Nov 7, 2013 at 11:09
I too was caught off guard when I discovered all the pandas overhead that @alko lists. Indexing is another area where pandas diverges from numpy. Check out Sofia Heisler's PyCon 2017 talk "No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency" for commentary on the consequences of pandas overhead and a number of detailed comparisons, not all of them numpy vs pandas. The examples that pull in .values do make that direct comparison, though.
– jxramos, Commented Dec 7, 2017 at 4:28

Jeff · Accepted Answer · 2013-11-07 15:21:14Z

11

In master/0.13 (to be released very shortly), this is much faster (though still slightly slower than native numpy because of the handling of alignment, dtypes, and NaNs).

In 0.12 it was applying per column, so this was a relatively expensive operation.
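To get a feel for why per-column application is costly, here is a rough sketch that contrasts one Python-level call per column with a single call on the whole frame. This is not the actual pandas 0.12 code; the two helper names are made up for illustration, and the frame size is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).standard_normal((200, 300)))


def clip_lower_per_column(frame, lower):
    # Hypothetical helper: one Python-level call per column, so the
    # fixed overhead is paid once for every column in the frame.
    return frame.apply(lambda col: col.clip(lower=lower))


def clip_lower_whole_frame(frame, lower):
    # Hypothetical helper: a single call on the whole frame.
    return frame.clip(lower=lower)


per_column = clip_lower_per_column(df, 0)
whole_frame = clip_lower_whole_frame(df, 0)

# Both approaches should produce the same values; only the cost differs.
assert np.array_equal(per_column.to_numpy(), whole_frame.to_numpy())
```

Timing the two helpers with %timeit on a wide frame should show the gap, though the exact ratio will depend on the pandas version.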

In [4]: arr = np.random.randn(1000, 1000)

In [5]: df=pd.DataFrame(arr)

In [6]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 6.62 ms per loop

In [7]: %timeit df.clip_lower(0)
100 loops, best of 3: 12.9 ms per loop
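One thing the plain round trip from the question gives up is the labels: pd.DataFrame(np.clip(df.values, 0, None)) comes back with a default index and default columns. If you do want to go through numpy, a minimal sketch that keeps the labels could look like the following. The helper name clip_lower_via_numpy is made up here, and it assumes an all-numeric frame.

```python
import numpy as np
import pandas as pd


def clip_lower_via_numpy(df, lower):
    """Clip a numeric DataFrame from below in NumPy, keeping its labels."""
    # Hypothetical helper, assuming every column is numeric.
    clipped = np.clip(df.to_numpy(), lower, None)
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)


df = pd.DataFrame(
    [[-1.0, 2.0], [np.nan, -3.0]],
    index=["a", "b"],
    columns=["x", "y"],
)

result = clip_lower_via_numpy(df, 0)
expected = df.clip(lower=0)

# Same labels, same values, and the NaN is left in place by both routes.
pd.testing.assert_frame_equal(result, expected)
```

For mixed dtypes the conversion to a single array would upcast or fail, which is part of the bookkeeping the pandas method does for you.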

edited Nov 7, 2013 at 15:21

answered Nov 7, 2013 at 12:39


Muhammad Yasirroni · Accepted Answer · 2022-02-20 13:49:10Z

In my benchmark, np.maximum is the fastest, whether it operates on a DataFrame or on a numpy array.

arr = np.random.randn(1000, 1000)

df = pd.DataFrame(arr)

%%timeit
np.clip(arr, 0, None)
# 4.55 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.clip(lower=0.0)
# 5.62 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
np.maximum(arr, 0)
# 4.53 ms ± 9.23 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
np.maximum(df, 0)
# 4.65 ms ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
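If you want to reproduce this outside IPython, the same comparison can be sketched with the standard timeit module. The repeat counts and the seed below are arbitrary choices, not taken from the benchmark above, and the numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = rng.standard_normal((1000, 1000))
df = pd.DataFrame(arr)

# The four variants timed above, wrapped as zero-argument callables.
candidates = {
    "np.clip(arr, 0, None)": lambda: np.clip(arr, 0, None),
    "df.clip(lower=0.0)": lambda: df.clip(lower=0.0),
    "np.maximum(arr, 0)": lambda: np.maximum(arr, 0),
    "np.maximum(df, 0)": lambda: np.maximum(df, 0),
}

reference = np.clip(arr, 0, None)
timings = {}
for label, func in candidates.items():
    # Check that every variant agrees on the values before comparing speed.
    assert np.array_equal(np.asarray(func()), reference)
    # Best of 3 repeats of 10 calls each, stored per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[label] = best * 1e3

# Print the variants from fastest to slowest.
for label, ms in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label:<24} {ms:8.2f} ms")
```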
