I have a pandas.DataFrame object called df, and I want to interpolate its missing values in parallel. This is what I do:
import dask.dataframe as dd

def func(df):
    # interpolate along each row of the partition
    return df.interpolate(method='linear', axis=1)

ddf = dd.from_pandas(df, npartitions=8)  # partitions split along the index
res = ddf.map_partitions(func)
res2 = res.compute()
The outcome is:
print(res2)
0 None
1 None
2 None
3 None
4 None
5 None
6 None
7 None
dtype: object
and
type(res)
dask.dataframe.core.Series
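For reference, a self-contained frame to reproduce this with (the shape and NaN placement below are made up for illustration; they are not my real data):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real df: rows with a band of gaps to fill.
df = pd.DataFrame(np.random.rand(1000, 1000))
df.iloc[:, 100:200] = np.nan  # blank out columns so axis=1 interpolation has work to do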
Edit 1: After following @mdurant's suggestion, I've changed the function to this:
def func(df):
    return df.interpolate(method='linear', axis=1, inplace=True)
and now the result is as expected.
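A quick sanity check (df_before is a hypothetical untouched copy of df, saved before any inplace=True call mutated it):

import pandas as pd

# df_before: hypothetical copy of the original frame, taken before
# any in-place interpolation mutated df.
expected = df_before.interpolate(method='linear', axis=1)
pd.testing.assert_frame_equal(res2, expected)  # raises if the results differ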
However, I still have some newbie questions about this code. The benchmarks below show that the non-parallel version is faster than the parallel one.
Non parallel:
%time df.interpolate(method='linear', axis=1, inplace=True)
Interpolating missing values.
CPU times: user 19.5 s, sys: 162 ms, total: 19.7 s
Wall time: 19.8 s
Parallel:
res = ddf.map_partitions(func)
%time res2 = res.compute()
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
Interpolating missing values.Interpolating missing values.
CPU times: user 29.1 s, sys: 2.3 s, total: 31.4 s
Wall time: 26.5 s
res.visualize()
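One caveat when repeating these timings: inplace=True mutates df, so a second pass would find no NaNs left to interpolate. A minimal sketch that times each variant on a fresh copy (reusing df and func from above):

import time
import dask.dataframe as dd

# Fresh copies, so both variants see the same missing values.
df_serial, df_parallel = df.copy(), df.copy()

t0 = time.perf_counter()
df_serial.interpolate(method='linear', axis=1, inplace=True)
print('serial  :', time.perf_counter() - t0)

ddf2 = dd.from_pandas(df_parallel, npartitions=8)
t0 = time.perf_counter()
res2 = ddf2.map_partitions(func).compute()
print('parallel:', time.perf_counter() - t0)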
This interpolation is a row-wise operation (it interpolates along axis=1), so each chunk (thread) should run without penalty, since chunking happens along the index; see the check below.

func in the parallel benchmark is the same as in Edit 1.
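To confirm that the chunks really do split along the index (so no row is ever divided across partitions), the divisions can be inspected directly, and the threaded scheduler can be requested explicitly (the scheduler= keyword assumes a reasonably recent Dask release):

print(ddf.npartitions)  # 8
print(ddf.divisions)    # index values at which the partitions split

# Force the threaded scheduler; each thread receives whole rows, so
# axis=1 interpolation never crosses a partition boundary.
res2 = ddf.map_partitions(func).compute(scheduler='threads')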