It looks like len(set(...)) beats both np.unique(...).size and pd.Series.nunique when applied naively per group -- here's an example with a large number of groups, where we compute the number of unique values of one column while grouping by another:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
...                    'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')
>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop
>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop
>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop
>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:
>>> g = df.groupby(['a', 'b'])
>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is a LOT faster apparently.
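For what it's worth, a quick sanity check (mine, not part of the original timings) that the two approaches agree on this data -- counting distinct b per a is the same as counting unique (a, b) pairs for each a:

>>> fast = df.groupby(['a', 'b']).b.first().groupby(level=0).size()
>>> slow = df.groupby('a').b.nunique()
>>> (fast == slow).all()
True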
Wonder if something similar could be done in GroupBy.nunique since it's quite a common use case?
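In case it helps, here is a rough sketch (my assumption about one possible approach, not an actual pandas patch) of the same trick packaged as a standalone helper. One caveat: unlike nunique(dropna=True), drop_duplicates treats NaN as a regular value, so NaN handling would still need care in a real implementation.

import numpy as np
import pandas as pd

def grouped_nunique(df, key, col):
    # Hypothetical helper, not pandas internals: keep one row per unique
    # (key, col) pair, then count pairs per key. This avoids a Python-level
    # set()/np.unique() call for every group.
    return df.drop_duplicates([key, col]).groupby(key).size()

df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                   'b': np.random.randint(10, size=100000)})
result = grouped_nunique(df, 'a', 'b')
# Should match df.groupby('a').b.nunique() for NaN-free data like this.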