@behzadnouri
Contributor

closes #10804

In [1]: np.random.seed(2718281)

In [2]: n = 500000

In [3]: u = int(0.1*n)

In [4]: arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]

In [5]: ts = pd.Series(arr).astype('category')

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 82.7 ms per loop

on branch:

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 31.3 ms per loop
@jreback
Contributor

jreback commented Aug 21, 2015

I think the solution in the issue is faster than this, no?

@behzadnouri
Contributor Author

I get

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False)
10 loops, best of 3: 28.3 ms per loop

but this does not check for nulls, and the index is not categorical.
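To make those two caveats concrete, here is a small sketch on toy data (not the benchmark series above). It rewrites the one-liner with `sort_values`, which replaced the older `order` method, and the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

# Toy categorical series with one missing value (illustrative data only).
ts = pd.Series(["a", "b", "a", None, "c", "a"]).astype("category")

cats = ts.cat.categories
# Counts keyed by integer code; a missing value is stored as code -1.
by_code = ts.cat.codes.value_counts()

# The workaround: look up each category's position (0..k-1) in the code counts.
alt = (
    pd.Series(np.arange(len(cats)), index=cats)
    .map(by_code)
    .sort_values(ascending=False)
)

# Code -1 is never looked up, so the null is dropped without any check,
# and the result is indexed by the plain categories Index.
builtin = ts.value_counts(dropna=False)
print(type(alt.index).__name__, type(builtin.index).__name__)
print(int(alt.sum()), int(builtin.sum()))
```

The built-in method keeps a categorical index and can report the null count, which the lookup-based version cannot do without extra handling.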

@behzadnouri
Contributor Author

with dropna=False the branch performs better:

In [9]: %timeit ts.value_counts(dropna=True)
10 loops, best of 3: 32.3 ms per loop

In [10]: %timeit ts.value_counts(dropna=False)
10 loops, best of 3: 25.7 ms per loop

In [11]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
10 loops, best of 3: 29.5 ms per loop
@jreback
Contributor

jreback commented Aug 21, 2015

wow, this does even better!

In [7]: %timeit ts.value_counts(dropna=True)
100 loops, best of 3: 11 ms per loop

In [8]: %timeit ts.value_counts(dropna=False)
100 loops, best of 3: 9.53 ms per loop

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
100 loops, best of 3: 17.4 ms per loop
@jreback added the Performance (memory or execution speed performance) and Categorical (Categorical Data Type) labels Aug 21, 2015
@jreback added this to the 0.17.0 milestone Aug 21, 2015
@jreback
Contributor

jreback commented Aug 21, 2015

ping when green

@jorisvandenbossche
Member

Maybe worth adding a benchmark?

@behzadnouri
Contributor Author

I will add benchmark later today

@behzadnouri
Contributor Author

added the benchmark, all green.

Contributor

These should have only 1 action per timing function (so make 2 functions)

Contributor Author

why should it be only 1 action?

Member

You get a timing per function. So if you want to track performance of both with dropna True and False, it has to be in two functions.
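A minimal sketch of what "one action per timing function" could look like, assuming an asv-style benchmark class where each `time_*` method is reported as its own timing. The class and method names are hypothetical and are not taken from the benchmark added in this PR. The data setup mirrors the session in the PR description:

```python
import numpy as np
import pandas as pd


class CategoricalValueCounts:
    """Hypothetical benchmark class: one timed action per method."""

    def setup(self):
        # Same data as the timing session in the PR description.
        np.random.seed(2718281)
        n = 500000
        u = int(0.1 * n)
        arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]
        self.ts = pd.Series(arr).astype("category")

    def time_value_counts_dropna(self):
        # Timed separately, so a regression with dropna=True is visible alone.
        self.ts.value_counts(dropna=True)

    def time_value_counts_keepna(self):
        # Separate timing for the dropna=False path.
        self.ts.value_counts(dropna=False)
```

With both calls in a single method, only their combined time would be tracked, and a slowdown in one path could be masked by the other.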

Contributor Author

added separate calls

jreback added a commit that referenced this pull request Aug 22, 2015
PERF: uses bincount instead of hash table in categorical value counts
@jreback merged commit 1cf18cd into pandas-dev:master Aug 22, 2015
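For readers unfamiliar with the idea in the commit message, the following is a rough sketch of counting categorical values with `np.bincount` over the integer codes rather than hashing the values. It is not the code merged in this PR. The function name `value_counts_bincount` is hypothetical, and it leaves results in category order instead of sorting by count:

```python
import numpy as np
import pandas as pd


def value_counts_bincount(cat: pd.Categorical, dropna: bool = True) -> pd.Series:
    """Illustrative only: count categories via np.bincount on the codes."""
    codes = np.asarray(cat.codes)
    ncat = len(cat.categories)
    mask = codes == -1  # code -1 marks a missing value

    # Count valid codes only; minlength keeps unobserved categories at 0.
    counts = np.bincount(codes[~mask].astype(np.intp), minlength=ncat)
    ix = np.arange(ncat)

    if not dropna and mask.any():
        # Append one extra slot for the missing values.
        counts = np.append(counts, mask.sum())
        ix = np.append(ix, -1)

    # Rebuild a categorical index so the result keeps the category dtype.
    index = pd.CategoricalIndex(
        pd.Categorical.from_codes(ix, categories=cat.categories, ordered=cat.ordered)
    )
    return pd.Series(counts, index=index)


cat = pd.Categorical(["a", "b", "a", None, "c", "a"], categories=["a", "b", "c", "d"])
print(value_counts_bincount(cat))
print(value_counts_bincount(cat, dropna=False))
```

Because the codes are already small non-negative integers, a single pass of `np.bincount` is enough, which is plausibly where the speedup over a hash table comes from.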
@jreback
Contributor

jreback commented Aug 22, 2015

thank you sir!

@behzadnouri deleted the cat-val-cnt branch August 22, 2015 20:28

Labels: Categorical (Categorical Data Type), Performance (memory or execution speed performance)

3 participants