2

I have a dask dataframe with 2438 partitions ,each partition is 1.1GB a total of 7B rows I want to do a groupby on multiple columns and aggregate one of the columns

agg = {'total_x':'sum'}
df_s = df_s.map_partitions(lambda dff: dff.groupby(['y', 'z', 'a', 'b','c']).agg(agg) , meta=pd.DataFrame({'y':'int','z':'int', 'a':'int', 'b':'int','c':'object' ,'total_x':'f64'}))

I get the error If using all scalar values, you must pass an index

How do I resolve that ? I am have RAM of 160 GB RAM and 24 workers ,is that computation even possible with that environment ?

If not ,which is anther feasible way ?

1
  • 2
    the error is because your "meta" object is not specified correctly. if you just run your meta code: pd.DataFrame({'y':'int','z':'int', 'a':'int', 'b':'int','c':'object' ,'total_x':'f64'}) this returns the error you're seeing. Make sure to always include the full traceback when asking about errors - there's ton of useful info in there, including which modules are the ones raising the errors! you can fix this issue by just passing a dictionary instead of a dataframe. Commented Mar 31, 2022 at 8:16

1 Answer 1

1

As suggested by @Michael Delgado, there is a problem with meta definition. This should fix the definition of meta:

import pandas as pd

dtypes = {
    "y": "int",
    "z": "int",
    "a": "int",
    "b": "int",
    "c": "object",
    "total_x": "f64",
}
meta = pd.DataFrame(columns=dtypes.keys())

Then, this meta can be passed as a kwarg. See the reproducible example below:

import dask
import pandas as pd

dtypes = {"name": "str", "x": "f64"}
meta = pd.DataFrame(columns=dtypes.keys())


agg = {"x": "sum"}
ddf = dask.datasets.timeseries().map_partitions(
    lambda df: df.groupby(["name"], as_index=False).agg(agg), meta=meta
)

ddf.head()
Sign up to request clarification or add additional context in comments.

3 Comments

Thank you for the response .A follow up ,if I run ddf.dtypes from the above example ,the name and x columns both are objects ,yet the dtypes are specified as "str" and "f64"
That's a good question, dask always represents 'str' as an 'object', not sure why...
Note also that in this case ,X which was defined as "f64" also was an object

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.