
Everything I can find indicates that Dask's map_partitions should return a Dask DataFrame. But the following code snippet and the corresponding output (using logzero) show otherwise. (Note: calc_delta returns an np.array of floats.)

352         logger.debug(type(self.dd))
353         self.dd = self.dd.map_partitions(
354             lambda df: df.assign(
355                 duration1=lambda r: calc_delta(r['a'], r['b'])
356                 , duration2=lambda r: calc_delta(r['a'], r['c'])
357             )
358         ).compute(scheduler='processes')
359         logger.debug(type(self.dd))

[D 200316 19:19:28 exploratory:352] <class 'dask.dataframe.core.DataFrame'>

[D 200316 19:19:43 exploratory:359] <class 'pandas.core.frame.DataFrame'>

All the guidance I've found (after lots of hacking) suggests that this is the way to add computed columns to a partitioned Dask DataFrame. But that can't be right if it doesn't actually return a Dask DataFrame.

What am I missing?

1 Answer


Isn't it because you are calling compute?

Maybe this:

self.dd.map_partitions(
    lambda df: df.assign(
        duration1=lambda r: calc_delta(r['a'], r['b'])
        , duration2=lambda r: calc_delta(r['a'], r['c'])
    )
)

actually returns a Dask DataFrame. But then you call compute, which triggers execution and materializes the result, hence the pandas DataFrame, no?


1 Comment

Thank you @ava_punksmash -- I was evidently mistaken about what compute is meant for. I thought it was only how lazy execution was triggered, which I thought I'd confirmed in my early trial-and-error iterations. I've removed the compute call, and things work as expected now. Thank you!
