I have a Dask DataFrame with 2,438 partitions, each about 1.1 GB, for a total of roughly 7 billion rows. I want to group by multiple columns and aggregate one of the columns:
agg = {'total_x': 'sum'}
df_s = df_s.map_partitions(
    lambda dff: dff.groupby(['y', 'z', 'a', 'b', 'c']).agg(agg),
    meta=pd.DataFrame({'y': 'int', 'z': 'int', 'a': 'int', 'b': 'int',
                       'c': 'object', 'total_x': 'f64'})
)
I get the error "If using all scalar values, you must pass an index".
How do I resolve that? I have 160 GB of RAM and 24 workers; is this computation even possible in that environment?
If not, what is another feasible way?
pd.DataFrame({'y': 'int', 'z': 'int', 'a': 'int', 'b': 'int', 'c': 'object', 'total_x': 'f64'}) is what raises the error you're seeing: pandas refuses to build a DataFrame from all scalar values unless you pass an index. Make sure to always include the full traceback when asking about errors; there's a ton of useful info in there, including which module is raising the error! You can fix this issue by passing a dictionary instead of a DataFrame as meta.
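For example, here is a minimal sketch of that fix, with meta given as a dict of {column name: dtype}. The dtypes are assumptions (adjust them to your actual data), and note that pandas spells the float dtype 'f8'/'float64', not 'f64':

# df_s is the Dask DataFrame from the question.
# Reproducing your error: pandas can't build a DataFrame from all
# scalar values without an index, e.g.
#   pd.DataFrame({'y': 'int'})  # ValueError: If using all scalar values, ...

agg = {'total_x': 'sum'}

# meta as a plain dict of {column: dtype}; as_index=False keeps the
# group keys as regular columns, so each partition's output matches
# the meta below.
df_s = df_s.map_partitions(
    lambda dff: dff.groupby(['y', 'z', 'a', 'b', 'c'], as_index=False).agg(agg),
    meta={'y': 'i8', 'z': 'i8', 'a': 'i8', 'b': 'i8',
          'c': 'object', 'total_x': 'f8'},
)

One caveat on the approach itself: a groupby inside map_partitions only combines rows within each partition, so keys that span partitions will appear more than once in the result; a plain df_s.groupby(['y', 'z', 'a', 'b', 'c']).agg(agg) lets Dask handle that global combination for you.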