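While running an experiment today, I noticed that the standard deviation returned by the pandas std() function differed from the value I had calculated by hand. After checking the documentation, I found that pandas' std() and numpy's std() use different defaults. The experiment went as follows: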
import pandas as pd
import numpy as np

# Three rows, six numeric columns; column C holds the values 3, 1, 2.
df = pd.DataFrame([[1,2,3,4,5,6],[6,4,1,2,3,5],[2,43,2,2,7,9]],index=['one','two','three'],columns=['A','B','C','D','E','F'])
print(df)
# Column-wise standard deviation with the pandas default (ddof=1).
print(df.std())
# Standard deviation of 1, 2, 3 with the numpy default (ddof=0).
l1 = np.array([1,2,3])
print(l1.std())
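The results show that the std computed by pandas does not match the one computed by numpy.
Working by hand, the values 1, 2, 3 have a mean of 2 and a variance of ((1-2)^2+(2-2)^2+(3-2)^2)/3 = 2/3 ≈ 0.666667, so the std is sqrt(2/3) ≈ 0.816497 and certainly not 1. The numpy result therefore matches the population formula we know.
So what is pandas doing? The difference comes from the divisor: by default pandas computes the sample standard deviation, dividing the sum of squared deviations by N-1 rather than N (Bessel's correction). For column C (the values 3, 1, 2, i.e. the same set as 1, 2, 3) with N = 3, the variance is ((1-2)^2+(2-2)^2+(3-2)^2)/2 = 1, so the std is 1.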
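The two hand calculations above can be checked with a few lines of plain Python. This is only a sketch: the helper name `std_by_hand` and its `ddof` argument are made up for illustration and simply mirror the divisor N - ddof used in the text.

```python
import math


def std_by_hand(values, ddof=0):
    """Standard deviation with divisor N - ddof (illustrative helper, not a library API)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


values = [1, 2, 3]
population_std = std_by_hand(values, ddof=0)  # divides by N = 3, like numpy's default
sample_std = std_by_hand(values, ddof=1)      # divides by N - 1 = 2, like pandas' default
print(population_std, sample_std)
```

The first number corresponds to the numpy result and the second to the pandas result for column C.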
To make pandas compute the population standard deviation, set ddof to 0; the pandas default is ddof=1:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1,2,3,4,5,6],[6,4,1,2,3,5],[2,43,2,2,7,9]],index=['one','two','three'],columns=['A','B','C','D','E','F'])
print(df)
# ddof=0 divides by N, which matches the numpy default.
print(df.std(ddof=0))
l1 = np.array([1,2,3])
print(l1.std())
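The adjustment also works in the other direction, since numpy's std() accepts a ddof argument as well. The following minimal sketch, using the values of column C, compares both libraries under both settings; the variable names are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

values = [3, 1, 2]  # the values of column C in the DataFrame above
series = pd.Series(values)
array = np.array(values)

pandas_default = series.std()            # pandas default: ddof=1 (divide by N - 1)
numpy_default = array.std()              # numpy default: ddof=0 (divide by N)
pandas_population = series.std(ddof=0)   # pandas forced to divide by N
numpy_sample = array.std(ddof=1)         # numpy forced to divide by N - 1

print(pandas_default, numpy_sample)
print(pandas_population, numpy_default)
```

Once ddof is the same on both sides, the two libraries should agree up to floating-point rounding.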
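Note that the summary produced by describe() in pandas also contains a std row. That value matches the unadjusted std (ddof=1), and describe() takes no ddof argument, so it cannot be changed there.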
import pandas as pd
import numpy as np

df = pd.DataFrame([[1,2,3,4,5,6],[6,4,1,2,3,5],[2,43,2,2,7,9]],index=['one','two','three'],columns=['A','B','C','D','E','F'])
print(df)
# Population standard deviation (ddof=0).
print(df.std(ddof=0))
# The std row of describe() still uses ddof=1.
print(df.describe())
l1 = np.array([1,2,3])
print(l1.std())
Output:
       A   B  C  D  E  F
one    1   2  3  4  5  6
two    6   4  1  2  3  5
three  2  43  2  2  7  9
A     2.160247
B    18.873850
C     0.816497
D     0.942809
E     1.632993
F     1.699673
dtype: float64
              A          B    C         D    E         F
count  3.000000   3.000000  3.0  3.000000  3.0  3.000000
mean   3.000000  16.333333  2.0  2.666667  5.0  6.666667
std    2.645751  23.115651  1.0  1.154701  2.0  2.081666
min    1.000000   2.000000  1.0  2.000000  3.0  5.000000
25%    1.500000   3.000000  1.5  2.000000  4.0  5.500000
50%    2.000000   4.000000  2.0  2.000000  5.0  6.000000
75%    4.000000  23.500000  2.5  3.000000  6.0  7.500000
max    6.000000  43.000000  3.0  4.000000  7.0  9.000000
0.816496580927726
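Because describe() offers no ddof option, one possible workaround is to build the summary first and then overwrite its std row with the population value. This is just a sketch of that idea, not something the pandas documentation prescribes; the variable name `summary` is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [6, 4, 1, 2, 3, 5], [2, 43, 2, 2, 7, 9]],
    index=["one", "two", "three"],
    columns=["A", "B", "C", "D", "E", "F"],
)

summary = df.describe()              # the std row here uses ddof=1
summary.loc["std"] = df.std(ddof=0)  # replace it with the population std, aligned by column
print(summary)
```

All other rows of the summary (count, mean, min, quartiles, max) are left untouched.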