Pandas DataFrame and numpy standard deviation are different

Question

Simply asking: why are these standard deviations different?

>>> import numpy
>>> import pandas as pd
>>>
>>> arr = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562,
...        563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566,
...        554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578,
...        405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486,
...        668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437, 483,
...        529, 363, 711, 543]
>>> elements = numpy.asarray(arr)
>>> arr_D = {"A":arr}
>>> df = pd.DataFrame(arr_D)
>>>
>>> print(numpy.std(elements, axis=0))
118.51857760182034
>>> print(numpy.std(df['A']))
118.5185776018204
>>> print(df['A'].std(axis=0))
119.15407050904474

Is it a problem with my understanding of the topic? As far as I know, pandas uses numpy, so the DataFrame std and the numpy std of the same column should be the same.

Is it a bug?

Sarthak Kumar · Accepted Answer · 2020-06-24 11:23:03Z

2

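pandas normalizes by n - 1 by default (ddof=1, the sample standard deviation, commonly called the unbiased estimate), while numpy normalizes by n by default (ddof=0, the population standard deviation). Neither result is incorrect; they just use different defaults.
To make numpy use the same normalization as pandas, pass ddof=1 to std: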

>>> import numpy
>>> import pandas

>>> df = pandas.DataFrame(numpy.random.rand(100))

>>> numpy.std(df[0])  # numpy default: ddof=0, divides by n (biased)
0.2877601644414916

>>> numpy.std(df[0], ddof=1)  # ddof=1, divides by n - 1 (unbiased)
0.2892098469889083

>>> df[0].std()  # pandas default: ddof=1, matches numpy.std with ddof=1
0.2892098469889083
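The adjustment also works in the other direction: the pandas `std` method accepts a `ddof` argument, so passing `ddof=0` should reproduce numpy's default. The following is a minimal sketch on a small made-up dataset (the values and variable names are illustrative, not taken from the question):

```python
import numpy
import pandas as pd

# Illustrative data (an assumption, not the question's array).
data = [2, 4, 4, 4, 5, 5, 7, 9]
values = numpy.asarray(data)
series = pd.Series(data)

pop_np = numpy.std(values)           # numpy default, ddof=0: divide by n
samp_np = numpy.std(values, ddof=1)  # divide by n - 1
pop_pd = series.std(ddof=0)          # pandas told to divide by n
samp_pd = series.std()               # pandas default, ddof=1: divide by n - 1

print(pop_np, pop_pd)    # the two ddof=0 results agree
print(samp_np, samp_pd)  # the two ddof=1 results agree
```

Which `ddof` to use depends on whether the data is the whole population (`ddof=0`) or a sample from a larger one (`ddof=1`); what matters for comparing libraries is that both calls use the same value.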

edited Jun 24, 2020 at 11:23

answered Jun 24, 2020 at 11:09

Sarthak Kumar


Ehsan · Accepted Answer · 2020-06-24 11:09:52Z

2

numpy computes the biased std by default and pandas the unbiased one. In other words, numpy divides by n (the number of elements) and pandas divides by n-1. Rescale the pandas result by sqrt((n-1)/n) to see that the two match:

print(df['A'].std(axis=0) / numpy.sqrt(len(arr)) * numpy.sqrt(len(arr) - 1))
# 118.51857760182033
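To make the n versus n-1 difference concrete, here is a minimal sketch that computes both versions from the definition in plain Python. The helper `std_with_ddof` and the sample data are hypothetical, written only for illustration; this is not how numpy or pandas implement `std` internally.

```python
import math

def std_with_ddof(values, ddof=0):
    """Standard deviation that divides the squared deviations by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))

# Illustrative data (an assumption, not the question's array).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

biased = std_with_ddof(data, ddof=0)    # divides by n, like numpy's default
unbiased = std_with_ddof(data, ddof=1)  # divides by n - 1, like pandas' default

# Rescaling the n - 1 version by sqrt((n - 1) / n) recovers the n version.
rescaled = unbiased * math.sqrt((n - 1) / n)

print(biased, unbiased, rescaled)
```

The two results differ only by the factor sqrt(n / (n-1)), which is why the gap shrinks as the number of elements grows.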

answered Jun 24, 2020 at 11:09

Ehsan
