The code below suggests that pandas may be much slower than numpy, at least in the specific case of clip(). Surprisingly, making a round trip from pandas to numpy and back to pandas, with the calculation done in numpy, is still much faster than doing it in pandas directly.

Shouldn't the pandas function have been implemented in this roundabout way?

In [49]: arr = np.random.randn(1000, 1000)

In [50]: df=pd.DataFrame(arr)

In [51]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 8.18 ms per loop

In [52]: %timeit df.clip_lower(0)
1 loops, best of 3: 344 ms per loop

In [53]: %timeit pd.DataFrame(np.clip(df.values, 0, None))
100 loops, best of 3: 8.4 ms per loop
  • It's OK, since pandas performs a lot of data checks, transformations, and other work on top of the clipping. Commented Nov 7, 2013 at 11:09
  • I too was caught off guard when I discovered all the pandas overhead that @alko enumerates. Indexing is another area where pandas diverges from numpy. Check out Sofia Heisler's talk "No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency" (PyCon 2017) for commentary on the ramifications of pandas overhead and a number of detailed comparisons, not all of them numpy vs pandas. The examples that pull in .values, however, do make that direct comparison. Commented Dec 7, 2017 at 4:28

2 Answers

In master/0.13 (to be released very shortly), this is much faster, though still slightly slower than native numpy because of the handling of alignment, dtypes, and NaNs.

In 0.12 the clip was applied column by column, so it was a relatively expensive operation.

In [4]: arr = np.random.randn(1000, 1000)

In [5]: df=pd.DataFrame(arr)

In [6]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 6.62 ms per loop

In [7]: %timeit df.clip_lower(0)
100 loops, best of 3: 12.9 ms per loop
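
For readers on a recent pandas: `clip_lower` was deprecated in 0.24 and removed in 1.0, and `df.clip(lower=0)` is the replacement. The sketch below is only an illustration, not the pandas internals. It checks that the whole-frame clip, the numpy round trip from the question, and a column-by-column clip (a rough stand-in for the per-column behaviour described above; the use of `apply` here is my assumption, not how 0.12 was written) all give the same result, so the difference between them is purely one of speed.

```python
import numpy as np
import pandas as pd

# Smaller than the 1000x1000 frame in the timings, to keep the check quick.
rng = np.random.default_rng(0)
arr = rng.standard_normal((200, 200))
df = pd.DataFrame(arr)

# Whole-frame clip with the current API (replacement for clip_lower).
clipped_frame = df.clip(lower=0)

# Round trip from the question: compute in numpy, wrap the result again.
clipped_roundtrip = pd.DataFrame(
    np.clip(df.to_numpy(), 0, None), index=df.index, columns=df.columns
)

# Column-by-column clip: an illustrative stand-in for per-column application.
clipped_per_column = df.apply(lambda col: col.clip(lower=0))

same_as_roundtrip = clipped_frame.equals(clipped_roundtrip)
same_as_per_column = clipped_frame.equals(clipped_per_column)
print(same_as_roundtrip, same_as_per_column)
```

Note that the round trip has to pass `index` and `columns` back explicitly; with a non-default index, `pd.DataFrame(np.clip(df.values, 0, None))` would silently drop the labels.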

In my benchmark, np.maximum is the fastest, whether it operates on a DataFrame or on a numpy array.

arr = np.random.randn(1000, 1000)

df = pd.DataFrame(arr)

%%timeit
np.clip(arr, 0, None)
# 4.55 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.clip(lower=0.0)
# 5.62 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
np.maximum(arr, 0)
# 4.53 ms ± 9.23 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
np.maximum(df, 0)
# 4.65 ms ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
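
The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library `timeit` module can be used instead. The helper name `best_ms` and the repeat counts are my own choices, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.default_rng(0).standard_normal((1000, 1000))
df = pd.DataFrame(arr)

# The four expressions timed above, keyed by a label for printing.
candidates = {
    "np.clip(arr, 0, None)": lambda: np.clip(arr, 0, None),
    "df.clip(lower=0.0)": lambda: df.clip(lower=0.0),
    "np.maximum(arr, 0)": lambda: np.maximum(arr, 0),
    "np.maximum(df, 0)": lambda: np.maximum(df, 0),
}


def best_ms(fn, repeat=3, number=5):
    """Best average time per call, in milliseconds (hypothetical helper)."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number * 1e3


timings = {label: best_ms(fn) for label, fn in candidates.items()}
for label, ms in timings.items():
    print(f"{label:24s} {ms:8.2f} ms")

# np.maximum applied to a DataFrame returns a DataFrame, not a bare array.
result = np.maximum(df, 0)
```

Because `np.maximum(df, 0)` hands back a DataFrame with the original labels, it can stand in for `df.clip(lower=0.0)` without a manual round trip.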
