
Everything I can find indicates that Dask's map_partitions should return a Dask DataFrame. But the following code snippet and the corresponding output (using logzero) show otherwise. (Note: calc_delta returns an np.array of floats.)

352         logger.debug(type(self.dd))
353         self.dd = self.dd.map_partitions(
354             lambda df: df.assign(
355                 duration1=lambda r: calc_delta(r['a'], r['b'])
356                 , duration2=lambda r: calc_delta(r['a'], r['c'])
357             )
358         ).compute(scheduler='processes')
359         logger.debug(type(self.dd))

[D 200316 19:19:28 exploratory:352] <class 'dask.dataframe.core.DataFrame'>

[D 200316 19:19:43 exploratory:359] <class 'pandas.core.frame.DataFrame'>

All the guidance I've found (after lots of hacking) suggests that this is the way to add computed columns to a partitioned Dask DataFrame. But that can't be right if it doesn't actually return a Dask DataFrame.

What am I missing?

1 Answer


Isn't it because you are calling compute?

Maybe this:

self.dd.map_partitions(
    lambda df: df.assign(
        duration1=lambda r: calc_delta(r['a'], r['b'])
        , duration2=lambda r: calc_delta(r['a'], r['c'])
    )
)

actually returns a Dask DataFrame. But then you call compute, which triggers execution and materializes the result, hence the pandas DataFrame, no?


1 Comment

Thank you @ava_punksmash -- I was evidently mistaken about what compute is meant for. I thought it was only how lazy execution was triggered, which I thought I'd confirmed in my early trial-and-error iterations. I've removed the compute call, and things work as expected now. Thank you!
