Since pandas uses NumPy under the hood, I was curious why plain NumPy code (509 ms) is about 12x faster than the same operation on a DataFrame (6.38 s) in the example below.

import numpy as np
import pandas as pd

# function with numpy arrays (writes results into the module-level df defined below)
def f_np(freq, asd):
    for f in np.arange(21.,2000.,1.):
        fi = freq/f
        gi =  (1+fi**2) / ((1-fi**2)**2 + fi**2) * asd
        df['fi'] = fi
        df['gi'] = gi
        # process each df ...

# function with dataframe
def f_df(df):
    for f in np.arange(21.,2000.,1.):
        df['fi'] = df.Freq/f
        df['gi'] = (1+df.fi**2) / ((1-df.fi**2)**2 + df.fi**2) * df.ASD
        # process each df ...


freq =  np.arange(20., 2000., .1)
asd = np.ones(len(freq))
df = pd.DataFrame({'Freq':freq, 'ASD':asd})    

%timeit f_np(freq, asd)
%timeit f_df(df)

509 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.38 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Generally, creating a pandas Series/DataFrame carries a small overhead for naming/indexing. Also, df['fi'] = someSeries has significant overhead due to index alignment. The same goes for all the arithmetic operators. All of that adds up. Commented Jun 2, 2020 at 17:29
  • Adding .values to the DataFrame columns in the f_df function (for example, df.fi.values**2) speeds things up considerably (only 1.5x slower than the NumPy version), but I would have thought that pandas would handle that optimization (reducing to a NumPy array) behind the scenes. Commented Jun 2, 2020 at 20:14
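
The .values idea from the comment above can be sketched roughly as follows. This is only an illustration under assumptions: the function name f_df_arrays and its f_values parameter are hypothetical, the columns are the Freq/ASD columns from the question, and to_numpy() is used in place of .values (for these float columns they give the same array).

```python
import numpy as np
import pandas as pd

def f_df_arrays(df, f_values):
    """Hypothetical variant of f_df: extract the columns as NumPy arrays once,
    do the arithmetic on plain arrays, and only write the results back to df."""
    freq = df["Freq"].to_numpy()  # extracted once, outside the loop
    asd = df["ASD"].to_numpy()
    for f in f_values:
        fi = freq / f
        gi = (1 + fi**2) / ((1 - fi**2) ** 2 + fi**2) * asd
        df["fi"] = fi  # assigning an ndarray needs no index alignment
        df["gi"] = gi
        # process each df ...
    return df

freq = np.arange(20.0, 2000.0, 0.1)
df = pd.DataFrame({"Freq": freq, "ASD": np.ones(len(freq))})
result = f_df_arrays(df, np.arange(21.0, 24.0, 1.0))
```

The arithmetic inside the loop never touches a Series, so the per-operation index handling mentioned in the first comment is avoided; only the two column assignments go through pandas.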

1 Answer

Are you sure the difference in speed comes from operating on a DataFrame in this specific case? I think it comes from the fact that the first example computes fi and gi as local variables and then assigns them to the columns, while the second example does not. When I assigned local variables in both functions, the timings were similar.

import pandas as pd
import numpy as np
# function with numpy arrays
def f_np(freq, asd):
    for f in np.arange(21.,2000.,1.):
        fi = freq/f
        gi =  (1+fi**2) / ((1-fi**2)**2 + fi**2) * asd
        df['fi'] = fi
        df['gi'] = gi
        # process each df ...

# function with dataframe
def f_df(df):
    for f in np.arange(21.,2000.,1.):
        fi = freq/f
        gi =  (1+fi**2) / ((1-fi**2)**2 + fi**2) * asd
        df['fi'] = fi
        df['gi'] = gi
        # process each df ...


freq =  np.arange(20., 2000., .1)
asd = np.ones(len(freq))
df = pd.DataFrame({'Freq':freq, 'ASD':asd})    

%timeit f_np(freq, asd)
%timeit f_df(df)
#562 ms ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#569 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3 Comments

As written, there is no difference between your f_np and f_df in the code.
But f_df is meant to operate on a pandas DataFrame and f_np on a NumPy array, so isn't the difference, and ultimately the question, more about these later lines of code: freq = np.arange(20., 2000., .1); asd = np.ones(len(freq)); df = pd.DataFrame({'Freq':freq, 'ASD':asd})?
The whole point was to show the difference when calculating with a dataframe column as opposed to a strict numpy array. In your second function the dataframe that is passed in is not used in the calculations, only in assignment.
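
To compare a calculation on DataFrame columns against the same calculation on plain arrays, as this comment describes, one option is to time a single evaluation of the expression in each form. The sketch below is only illustrative: the helper names gi_series and gi_array are hypothetical, and the timings will vary by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

freq = np.arange(20.0, 2000.0, 0.1)
df = pd.DataFrame({"Freq": freq, "ASD": np.ones(len(freq))})
asd = df["ASD"].to_numpy()

def gi_series(df, f):
    # Every operator here works on a pandas Series (index handling included).
    fi = df["Freq"] / f
    return (1 + fi**2) / ((1 - fi**2) ** 2 + fi**2) * df["ASD"]

def gi_array(freq, asd, f):
    # Same formula on plain NumPy arrays.
    fi = freq / f
    return (1 + fi**2) / ((1 - fi**2) ** 2 + fi**2) * asd

# Time 200 evaluations of each form with the standard-library timeit module.
t_series = timeit.timeit(lambda: gi_series(df, 21.0), number=200)
t_array = timeit.timeit(lambda: gi_array(freq, asd, 21.0), number=200)
print(f"Series: {t_series:.4f} s, ndarray: {t_array:.4f} s for 200 calls")
```

Unlike the answer's f_df, gi_series actually uses the DataFrame columns in the arithmetic, which is the case the question asks about.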
