I'm facing a huge bottleneck where I apply a method() to each row in Pandas DataFrame. The execution time is in sorts of 15-20 minutes.
Now, the code I use is as follows:
def FillTarget(self, df):
backup = df.copy()
target = list(set(df['ACTL_CNTRS_BY_DAY']))
df = df[~df['ACTL_CNTRS_BY_DAY'].isnull()]
tmp = df[df['ACTL_CNTRS_BY_DAY'].isin(target)]
tmp = tmp[['APPT_SCHD_ARVL_D', 'ACTL_CNTRS_BY_DAY']]
tmp.drop_duplicates(subset='APPT_SCHD_ARVL_D', inplace=True)
t1 = dt.datetime.now()
backup['ACTL_CNTRS_BY_DAY'] = backup.apply(self.ImputeTargetAcrossSameDate,args=(tmp, ), axis=1)
# backup['ACTL_CNTRS_BY_DAY'] = self.compute_(tmp, backup)
t2 = dt.datetime.now()
print("Time for the bottleneck is ", (t2-t1).microseconds)
print("step f")
return backup
And, the method ImputeTargetAcrossSameDate() method is as follows:
def ImputeTargetAcrossSameDate(self, x, tmp):
ret = tmp[tmp['APPT_SCHD_ARVL_D'] == x['APPT_SCHD_ARVL_D']]
ret = ret['ACTL_CNTRS_BY_DAY']
if ret.empty:
r = 0
else:
r = ret.values
r = r[0]
return r
Is there any way to optimize this apply() call to reduce the overall time. Note that, I'll have to run this process on DataFrame which stores data for 2 years. I was running it for 15 days, and it took me 15-20minutes, while when ran for 1 month of data, it was executing for more than 45 minutes, after which I had to force stop the process, thus while running on full dataset, it ll be huge problem.
Also note that, I came across few examples http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html for introducing numba to optimize code, and the following is my numba implementation:
Statement to call numba method:
backup['ACTL_CNTRS_BY_DAY'] = self.compute_(tmp, backup)
Compute method of numba:
@numba.jit
def compute_(self, df1, df2):
n = len(df2)
result = np.empty(n, dtype='float64')
for i in range(n):
d = df2.iloc[i]
result[i] = self.apply_ImputeTargetAcrossSameDate_method(df1['APPT_SCHD_ARVL_D'].values, df1['ACTL_CNTRS_BY_DAY'].values,
d['APPT_SCHD_ARVL_D'], d['ACTL_CNTRS_BY_DAY'])
return result
This is wrapper method which replaces Pandas' apply to call the Impute method on each row. The impute method using numba is as follows:
@numba.jit
def apply_ImputeTargetAcrossSameDate_method(self, df1col1, df1col2, df2col1, df2col2):
dd = np.datetime64(df2col1)
idx1 = np.where(df1col1 == dd)[0]
if idx1.size == 0:
idx1 = idx1
else:
idx1 = idx1[0]
val = df1col2[idx1]
if val.size == 0:
r = 0
else:
r = val
return r
I ran the normal apply() method as well as numba() method for data having time period of 5 days, and following were my results:
With Numba:
749805 microseconds
With DF.apply()
484603 microseconds.
As you can see numba is slower, which should not happen, so in case I'm missing something, lemme know so that I can optimize this piece of code.
Thanks in advance
Edit 1 As requested, the data snipped (head of top 20 rows) is added as following: Before:
APPT_SCHD_ARVL_D ACTL_CNTRS_BY_DAY
919 2020-11-17 NaN
917 2020-11-17 NaN
916 2020-11-17 NaN
915 2020-11-17 NaN
918 2020-11-17 NaN
905 2014-06-01 NaN
911 2014-06-01 NaN
913 2014-06-01 NaN
912 2014-06-01 NaN
910 2014-06-01 NaN
914 2014-06-01 NaN
908 2014-06-01 NaN
906 2014-06-01 NaN
909 2014-06-01 NaN
907 2014-06-01 NaN
898 2014-05-29 NaN
892 2014-05-29 NaN
893 2014-05-29 NaN
894 2014-05-29 NaN
895 2014-05-29 NaN
After:
APPT_SCHD_ARVL_D ACTL_CNTRS_BY_DAY
919 2020-11-17 0.0
917 2020-11-17 0.0
916 2020-11-17 0.0
915 2020-11-17 0.0
918 2020-11-17 0.0
905 2014-06-01 0.0
911 2014-06-01 0.0
913 2014-06-01 0.0
912 2014-06-01 0.0
910 2014-06-01 0.0
914 2014-06-01 0.0
908 2014-06-01 0.0
906 2014-06-01 0.0
909 2014-06-01 0.0
907 2014-06-01 0.0
898 2014-05-29 0.0
892 2014-05-29 0.0
893 2014-05-29 0.0
894 2014-05-29 0.0
895 2014-05-29 0.0
What the method does? In the above data example, you can see some dates are repeated, and values against them is NaN. If all the rows having same date has value NaN, it replaces them with 0. But there are some cases, lets say for example: 2014-05-29 where there will be 10 rows having same date, and only 1 row against that date where there will be some value. (Lets say 10). Then the method() shall populate all values against that particular date with 10 instead of NaNs.
Example:
898 2014-05-29 NaN
892 2014-05-29 NaN
893 2014-05-29 NaN
894 2014-05-29 10
895 2014-05-29 NaN
The above shall become:
898 2014-05-29 10
892 2014-05-29 10
893 2014-05-29 10
894 2014-05-29 10
895 2014-05-29 10
~df['ACTL_CNTRS_BY_DAY'].isnull()you can usedf['ACTL_CNTRS_BY_DAY'].notnull(), just as a minor improvement.