I'm facing a huge bottleneck where I apply a method to each row of a Pandas DataFrame. The execution time is on the order of 15-20 minutes.

Now, the code I use is as follows:

# requires: import datetime as dt
def FillTarget(self, df):
    backup = df.copy()

    # Distinct values present in the target column (NaN included at this point).
    target = list(set(df['ACTL_CNTRS_BY_DAY']))
    # Keep only rows that have a non-null target value.
    df = df[~df['ACTL_CNTRS_BY_DAY'].isnull()]
    tmp = df[df['ACTL_CNTRS_BY_DAY'].isin(target)]
    # Reduce to one (date, value) pair per date.
    tmp = tmp[['APPT_SCHD_ARVL_D', 'ACTL_CNTRS_BY_DAY']]
    tmp.drop_duplicates(subset='APPT_SCHD_ARVL_D', inplace=True)
    t1 = dt.datetime.now()
    backup['ACTL_CNTRS_BY_DAY'] = backup.apply(self.ImputeTargetAcrossSameDate, args=(tmp,), axis=1)
    # backup['ACTL_CNTRS_BY_DAY'] = self.compute_(tmp, backup)
    t2 = dt.datetime.now()
    # NB: .microseconds is only the sub-second component of the timedelta;
    # (t2 - t1).total_seconds() would give the full elapsed time.
    print("Time for the bottleneck is ", (t2 - t1).microseconds)

    print("step f")

    return backup

And the ImputeTargetAcrossSameDate() method is as follows:

def ImputeTargetAcrossSameDate(self, x, tmp):
    # Rows in the lookup frame matching this row's date.
    ret = tmp[tmp['APPT_SCHD_ARVL_D'] == x['APPT_SCHD_ARVL_D']]
    ret = ret['ACTL_CNTRS_BY_DAY']

    if ret.empty:
        # No value recorded anywhere for this date: default to 0.
        r = 0
    else:
        # Take the (single) recorded value for this date.
        r = ret.values
        r = r[0]

    return r

Is there any way to optimize this apply() call to reduce the overall time? Note that I'll have to run this process on a DataFrame holding two years of data. Running it on 15 days of data took 15-20 minutes, and on one month of data it had been running for more than 45 minutes before I had to force-stop the process, so on the full dataset this will be a huge problem.

Also note that I came across a few examples at http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html on introducing numba to optimize code, and the following is my numba implementation:

The statement that calls the numba method:

backup['ACTL_CNTRS_BY_DAY'] = self.compute_(tmp, backup)

The compute method using numba:

# requires: import numba; import numpy as np
@numba.jit
def compute_(self, df1, df2):
    n = len(df2)
    result = np.empty(n, dtype='float64')
    for i in range(n):
        # Row-wise lookup of df2's date against the (date, value) frame df1.
        d = df2.iloc[i]
        result[i] = self.apply_ImputeTargetAcrossSameDate_method(
            df1['APPT_SCHD_ARVL_D'].values, df1['ACTL_CNTRS_BY_DAY'].values,
            d['APPT_SCHD_ARVL_D'], d['ACTL_CNTRS_BY_DAY'])
    return result

This is a wrapper method that replaces Pandas' apply() and calls the impute method on each row. The impute method using numba is as follows:

@numba.jit
def apply_ImputeTargetAcrossSameDate_method(self, df1col1, df1col2, df2col1, df2col2):

    dd = np.datetime64(df2col1)

    # Positions in the lookup dates that match this row's date.
    idx1 = np.where(df1col1 == dd)[0]

    if idx1.size == 0:
        idx1 = idx1  # no match: keep the empty index array
    else:
        idx1 = idx1[0]

    val = df1col2[idx1]

    if val.size == 0:
        # No value for this date: default to 0.
        r = 0
    else:
        r = val

    return r

I ran both the normal apply() method and the numba method on 5 days of data, and these were my results:

With Numba:
749805 microseconds

With DF.apply():
484603 microseconds

As you can see, numba is slower, which should not happen, so if I'm missing something, let me know so that I can optimize this piece of code.
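
One caveat about these figures: timedelta.microseconds is only the sub-second component of the elapsed time, not the total duration, so both measurements may understate the real runtime. A minimal sketch of the difference:

import datetime as dt
import time

t1 = dt.datetime.now()
time.sleep(2.5)                # stand-in for the real work
t2 = dt.datetime.now()

delta = t2 - t1
print(delta.microseconds)      # ~500000: the fractional part only
print(delta.total_seconds())   # ~2.5: the full elapsed time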

Thanks in advance

Edit 1: As requested, a data snippet (the top 20 rows) is added below.

Before:

    APPT_SCHD_ARVL_D  ACTL_CNTRS_BY_DAY
919       2020-11-17                NaN
917       2020-11-17                NaN
916       2020-11-17                NaN
915       2020-11-17                NaN
918       2020-11-17                NaN
905       2014-06-01                NaN
911       2014-06-01                NaN
913       2014-06-01                NaN
912       2014-06-01                NaN
910       2014-06-01                NaN
914       2014-06-01                NaN
908       2014-06-01                NaN
906       2014-06-01                NaN
909       2014-06-01                NaN
907       2014-06-01                NaN
898       2014-05-29                NaN
892       2014-05-29                NaN
893       2014-05-29                NaN
894       2014-05-29                NaN
895       2014-05-29                NaN

After:

APPT_SCHD_ARVL_D  ACTL_CNTRS_BY_DAY
919       2020-11-17                0.0
917       2020-11-17                0.0
916       2020-11-17                0.0
915       2020-11-17                0.0
918       2020-11-17                0.0
905       2014-06-01                0.0
911       2014-06-01                0.0
913       2014-06-01                0.0
912       2014-06-01                0.0
910       2014-06-01                0.0
914       2014-06-01                0.0
908       2014-06-01                0.0
906       2014-06-01                0.0
909       2014-06-01                0.0
907       2014-06-01                0.0
898       2014-05-29                0.0
892       2014-05-29                0.0
893       2014-05-29                0.0
894       2014-05-29                0.0
895       2014-05-29                0.0

What does the method do? In the data example above, you can see that some dates are repeated and the values against them are NaN. If all rows sharing a date have the value NaN, the method replaces them with 0. But in some cases, for example 2014-05-29, there may be 10 rows with the same date and only one row for that date carrying a value (say, 10). The method then populates all values against that particular date with 10 instead of NaN (see the sketch after the example below).

Example:

898       2014-05-29                NaN
892       2014-05-29                NaN
893       2014-05-29                NaN
894       2014-05-29                10
895       2014-05-29                NaN

The above shall become:

898       2014-05-29                10
892       2014-05-29                10
893       2014-05-29                10
894       2014-05-29                10
895       2014-05-29                10
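
To make the expected output concrete, here is the rule expressed directly on a small reproducible frame (a sketch with made-up test data matching the example above; the groupby/transform is used only to state the rule, not offered as a tuned solution):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'APPT_SCHD_ARVL_D': ['2014-05-29'] * 5,
    'ACTL_CNTRS_BY_DAY': [np.nan, np.nan, np.nan, 10, np.nan],
}, index=[898, 892, 893, 894, 895])

# Broadcast the first non-null value within each date to all rows of
# that date, then fill dates that have no value at all with 0.
df['ACTL_CNTRS_BY_DAY'] = (
    df.groupby('APPT_SCHD_ARVL_D')['ACTL_CNTRS_BY_DAY']
      .transform('first')
      .fillna(0.0)
)
print(df)   # every 2014-05-29 row now carries 10.0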
  • Instead of ~df['ACTL_CNTRS_BY_DAY'].isnull() you can use df['ACTL_CNTRS_BY_DAY'].notnull(), just as a minor improvement. Commented Oct 14, 2016 at 10:44
  • That statement is not the bottleneck; rather, the apply() call for each row is. Commented Oct 14, 2016 at 10:46
  • I know, that was just a minor thing I noticed. Can you give an example of what your data looks like before and how it should look after? Commented Oct 14, 2016 at 10:49
  • Please have a look at the sample data and how the data is supposed to look after the method call. Commented Oct 14, 2016 at 11:06
  • Added a second solution. Commented Oct 17, 2016 at 6:37

1 Answer

This is a bit of a rushed solution because I'm about to leave for the weekend, but it works.

Input Dataframe:

index    APPT_SCHD_ARVL_D  ACTL_CNTRS_BY_DAY
919       2020-11-17                NaN
917       2020-11-17                NaN
916       2020-11-17                NaN
915       2020-11-17                NaN
918       2020-11-17                NaN
905       2014-06-01                NaN
911       2014-06-01                NaN
913       2014-06-01                NaN
912       2014-06-01                NaN
910       2014-06-01                NaN
914       2014-06-01                NaN
908       2014-06-01                NaN
906       2014-06-01                NaN
909       2014-06-01                NaN
907       2014-06-01                NaN
898       2014-05-29                NaN
892       2014-05-29                NaN
893       2014-05-29                NaN
894       2014-05-29                10
895       2014-05-29                NaN
898       2014-05-29                NaN

The code:

# (.loc used in place of the now-removed .ix)
# Dates that have at least one non-null value.
tt = df[pd.notnull(df.ACTL_CNTRS_BY_DAY)].APPT_SCHD_ARVL_D.unique()
# The rows carrying those values.
vv = df[pd.notnull(df.ACTL_CNTRS_BY_DAY)]
for i, _ in df.iterrows():
    if df.loc[i, "APPT_SCHD_ARVL_D"] in tt:
        df.loc[i, "ACTL_CNTRS_BY_DAY"] = vv[vv.APPT_SCHD_ARVL_D == df.loc[i, "APPT_SCHD_ARVL_D"]]["ACTL_CNTRS_BY_DAY"].values[0]
# Dates with no value at all become 0.0.
df = df.fillna(0.0)

Basically there is no need to apply a function. What I'm doing here is:

  • Get all unique dates with a value that is not null. -> tt
  • Create a dataframe of only the non-null values. -> vv
  • Iterate over all rows and test if the date in each row is present in tt.
  • If true, take the value from vv where the date in df is the same and assign it to df.
  • Then fill all other null values with 0.0.

Iterating over rows isn't a fast thing, but I hope it's faster than your old code. If I had more time I would think of a solution without iteration, maybe on Monday.

EDIT: Solution without iteration using pd.merge() instead:

dg = (
    df[pd.notnull(df.ACTL_CNTRS_BY_DAY)]
      .groupby("APPT_SCHD_ARVL_D").first()["ACTL_CNTRS_BY_DAY"]
      .to_frame().reset_index()
)
df = (
    pd.merge(df, dg, on="APPT_SCHD_ARVL_D", how="outer")
      .rename(columns={"ACTL_CNTRS_BY_DAY_y": "ACTL_CNTRS_BY_DAY"})
      .drop("ACTL_CNTRS_BY_DAY_x", axis=1)
      .fillna(0.0)
)

Your data implies that there is at most one non-null value in ACTL_CNTRS_BY_DAY per date, so I'm using first() in the groupby to pick the only value that exists.
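
For the sample input above, dg comes out as a one-row lookup frame, since 2014-05-29 is the only date with a non-null value:

print(dg)
#   APPT_SCHD_ARVL_D  ACTL_CNTRS_BY_DAY
# 0       2014-05-29               10.0

The merge then broadcasts that value to every row sharing the date, and fillna(0.0) covers the dates that never had a value.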
