I have a pandas.DataFrame object called df, and I want to interpolate its missing values in parallel. This is what I do:
import dask.dataframe as dd

def func(df):
    # interpolate along each row of the partition
    return df.interpolate(method='linear', axis=1)

ddf = dd.from_pandas(df, npartitions=8)  # partitions split along the index
res = ddf.map_partitions(func)
res2 = res.compute()
The outcome is:
print(res2)
0 None
1 None
2 None
3 None
4 None
5 None
6 None
7 None
dtype: object
and
type(res)
dask.dataframe.core.Series
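For reference, a self-contained frame to reproduce this with (the shape and NaN placement below are made up for illustration; they are not my real data):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real df: rows with a band of gaps to fill.
df = pd.DataFrame(np.random.rand(1000, 1000))
df.iloc[:, 100:200] = np.nan  # blank out columns so axis=1 interpolation has work to do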
Edit 1: After following @mdurant's suggestion, I've changed the function to this:
def func(df):
    return df.interpolate(method='linear', axis=1, inplace=True)
and now the result is as expected.
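A quick sanity check (df_before is a hypothetical untouched copy of df, saved before any inplace=True call mutated it):

import pandas as pd

# df_before: hypothetical copy of the original frame, taken before
# any in-place interpolation mutated df.
expected = df_before.interpolate(method='linear', axis=1)
pd.testing.assert_frame_equal(res2, expected)  # raises if the results differ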
However, I still have some newbie questions about this code. The benchmarks below show that the non-parallel version is faster than the parallel one.
Non parallel:
%time df.interpolate(method='linear', axis=1, inplace=True)
Interpolating missing values.
CPU times: user 19.5 s, sys: 162 ms, total: 19.7 s
Wall time: 19.8 s
Parallel:
res = ddf.map_partitions(func)
%time res2 = res.compute()
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
CPU times: user 29.1 s, sys: 2.3 s, total: 31.4 s
Wall time: 26.5 s
res.visualize()
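One caveat when repeating these timings: inplace=True mutates df, so a second pass would find no NaNs left to interpolate. A minimal sketch that times each variant on a fresh copy (reusing df and func from above):

import time
import dask.dataframe as dd

# Fresh copies, so both variants see the same missing values.
df_serial, df_parallel = df.copy(), df.copy()

t0 = time.perf_counter()
df_serial.interpolate(method='linear', axis=1, inplace=True)
print('serial  :', time.perf_counter() - t0)

ddf2 = dd.from_pandas(df_parallel, npartitions=8)
t0 = time.perf_counter()
res2 = ddf2.map_partitions(func).compute()
print('parallel:', time.perf_counter() - t0)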
This interpolation is a row-wise operation (it interpolates along axis=1), so each chunk (thread) should run without penalty, since chunking happens along the index; see the check below.

func in the parallel benchmark is the same as in Edit 1.
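To confirm that the chunks really do split along the index (so no row is ever divided across partitions), the divisions can be inspected directly, and the threaded scheduler can be requested explicitly (the scheduler= keyword assumes a reasonably recent Dask release):

print(ddf.npartitions)  # 8
print(ddf.divisions)    # index values at which the partitions split

# Force the threaded scheduler; each thread receives whole rows, so
# axis=1 interpolation never crosses a partition boundary.
res2 = ddf.map_partitions(func).compute(scheduler='threads')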