It looks like len(set(...)) beats both np.unique(...).size and pd.Series.nunique when applied naively per group -- here's an example with a large number of groups, where we compute the number of unique values of one column while grouping by another:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
...                    'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')
>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop
>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop
>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop
>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:
>>> g = df.groupby(['a', 'b'])
>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is a LOT faster apparently.
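For what it's worth, a quick sanity check (mine, not part of the original timings) that the two approaches agree on this data -- counting distinct b per a is the same as counting unique (a, b) pairs for each a:

>>> fast = df.groupby(['a', 'b']).b.first().groupby(level=0).size()
>>> slow = df.groupby('a').b.nunique()
>>> (fast == slow).all()
True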
Wonder if something similar could be done in GroupBy.nunique since it's quite a common use case?
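In case it helps, here is a rough sketch (my assumption about one possible approach, not an actual pandas patch) of the same trick packaged as a standalone helper. One caveat: unlike nunique(dropna=True), drop_duplicates treats NaN as a regular value, so NaN handling would still need care in a real implementation.

import numpy as np
import pandas as pd

def grouped_nunique(df, key, col):
    # Hypothetical helper, not pandas internals: keep one row per unique
    # (key, col) pair, then count pairs per key. This avoids a Python-level
    # set()/np.unique() call for every group.
    return df.drop_duplicates([key, col]).groupby(key).size()

df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                   'b': np.random.randint(10, size=100000)})
result = grouped_nunique(df, 'a', 'b')
# Should match df.groupby('a').b.nunique() for NaN-free data like this.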