Statistics, Week 7: Verifying the Normal, Chi-Square, and t Distributions with Python

rungedu

Article link: https://blog.csdn.net/long636/article/details/103530853

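Statistics, Week 7

I. Review

Last week we covered the normal, chi-square, and t distributions. But how do we choose among them?

🛰 Normal distribution

🛰 Chi-square distribution

🛰 t distribution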
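
As a quick reminder of how the three distributions relate, the following sketch builds chi-square and t samples from standard normal draws and compares their moments with theory. The degrees of freedom `k`, the sample size `n`, and all variable names are illustrative assumptions, not part of the course material.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 5            # degrees of freedom (illustrative choice)
n = 100_000      # number of simulated values (illustrative choice)

# Chi-square(k): sum of squares of k independent standard normals
z = rng.standard_normal((n, k))
chi2_sim = (z ** 2).sum(axis=1)

# t(k): a standard normal divided by sqrt(independent chi-square(k) / k)
z0 = rng.standard_normal(n)
t_sim = z0 / np.sqrt(chi2_sim / k)

# Compare simulated moments with the theoretical ones from scipy.stats
print("chi2 mean:  ", chi2_sim.mean(), "theory:", stats.chi2(k).mean())
print("t variance: ", t_sim.var(), "theory:", stats.t(k).var())
```

The chi-square distribution only takes non-negative values and is right-skewed, while the t distribution is symmetric with heavier tails than the normal; both facts are useful when guessing which one might describe a data column.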

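II. Practice

1. Scenario: the Titanic data set, focusing on Age, Fare (the ticket price), and Embarked (the port of embarkation). We want to check whether the data follow a normal distribution, a t distribution, or a chi-square distribution.

The data look like this: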

ID	Age	Fare	Embarked
1	22	7.25	S
2	38	71.2833	C
3	26	7.925	S
4	35	53.1	S
5	35	8.05	S
6	54	51.8625	S
7	2	21.075	S

…

# coding=utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy import stats  # used by the tests further down

# Use a font that can render Chinese labels
plt.rcParams['font.sans-serif'] = ['SimHei']

df = pd.read_excel('d:\\excel\\tj-week7_data.xlsx')
print(df.head())

embark = df.groupby(['Embarked'])
# print(embark)

# Summary statistics for each port of embarkation
embark_basic = df.groupby(['Embarked']).agg(['count', 'min', 'max', 'median', 'mean', 'var', 'std'])
print(embark_basic)
age_basic = embark_basic['Age']
fare_basic = embark_basic['Fare']
# print(age_basic)
# print(fare_basic)

# Histogram of Age with a KDE curve
# (distplot was deprecated in seaborn 0.11; histplot with kde=True replaces it)
sbn.set_palette('hls')
sbn.histplot(df['Age'], color='r', bins=10, kde=True, stat='density')
plt.title('Age')
plt.xlim(-10, 80)
plt.grid(True)
plt.show()

[Figure: histogram of Age with KDE curve]

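Judging from the plot, the distribution of Age looks fairly close to a normal distribution.

Testing for a normal distribution

Compute the test statistics: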

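##### Normality tests from scipy.stats (requires: from scipy import stats)
# Note: kstest with 'norm' and no args compares against the standard normal N(0, 1)
ks_test = stats.kstest(df['Age'], 'norm')
shapiro_test = stats.shapiro(df['Age'])
normaltest_test = stats.normaltest(df['Age'], axis=0)
print('ks_test: ', ks_test)
print('shapiro_test: ', shapiro_test)
print('normaltest_test: ', normaltest_test)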

ks_test:  KstestResult(statistic=0.9649422367998306, pvalue=0.0)
shapiro_test:  (0.9815102219581604, 7.906476895414016e-08)
normaltest_test:  NormaltestResult(statistic=18.12938011101228, pvalue=0.00011567916063448067)

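All three p-values are below 0.05, so we reject the hypothesis, formed subjectively from the plot, that Age is close to normal, and conclude that Age does not follow a normal distribution. Note that the K-S result here compares Age with the standard normal N(0, 1), because no mean or standard deviation was passed to kstest; its statistic of 0.96 therefore mostly reflects the difference in location and scale. The Shapiro-Wilk and normaltest results are the more meaningful evidence.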
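
To make the point about `kstest` concrete, here is a minimal sketch on synthetic data (an assumption standing in for the Age column, since the Excel file is not available here). It compares the default call with one that passes the estimated mean and standard deviation through `args`. Because those parameters are estimated from the same sample, the second p-value tends to be too large, so it is only a rough guide.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for an age-like column (assumption, not the Titanic data)
rng = np.random.default_rng(42)
x = rng.normal(loc=30, scale=14, size=700)

# Without args: compared against N(0, 1), so location and scale alone force a rejection
ks_standard = stats.kstest(x, 'norm')

# With args: compared against N(mean, std) estimated from the sample
ks_fitted = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

print('vs N(0, 1):     ', ks_standard)
print('vs N(mean, std):', ks_fitted)
```

Even though `x` is drawn from a normal distribution, the first call rejects normality outright, which is the same pattern as the K-S output above.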

Appendix: kstest, shapiro, normaltest

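##### Test for normality with kstest, shapiro, and normaltest from scipy.stats

ks_test = stats.kstest(df['Age'], 'norm')

# scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', method='auto')
# (older SciPy releases name the last parameter mode instead of method)
# kstest is a general goodness-of-fit test: besides normality, it can test other distributions
# rvs: the data to test
# cdf: the reference distribution; 'norm' alone means the standard normal N(0, 1),
#      so pass args=(mean, std) to test against a normal with other parameters
# The result holds the test statistic first and the p-value second

shapiro_test = stats.shapiro(df['Age'])

# scipy.stats.shapiro is dedicated to normality testing
# It is not suited to samples larger than 5000; the p-value may be inaccurate
# scipy.stats.shapiro(x): only the data x needs to be passed
# The result holds the test statistic first and the p-value second

normaltest_test = stats.normaltest(df['Age'], axis=0)

# scipy.stats.normaltest(a, axis=0, nan_policy='propagate')
# a: the data to test; axis=0 tests along axis 0, i.e. each column of a 2-D array separately
# axis=None tests the flattened data as a whole
# nan_policy: how to handle NaN values in the input
# The result holds the test statistic first and the p-value second

print('ks_test: ', ks_test)
print('shapiro_test: ', shapiro_test)
print('normaltest_test: ', normaltest_test)

# scipy.stats.anderson (the Anderson-Darling test) is a modified version of the K-S test
# scipy.stats.anderson(x, dist='norm')
# anderson returns three values: the statistic, the critical values, and the significance levels
# Each critical value corresponds to one significance level
# For the normality test the significance levels are 15%, 10%, 5%, 2.5%, 1%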

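# Plot the fitted normal curve against the data
age = df['Age']
plt.figure()
age.plot(kind='kde')

# Fit a normal distribution to the raw data;
# norm.fit returns the estimated mean (loc) and standard deviation (scale)
M_S = stats.norm.fit(age)

# Frozen normal distribution with the fitted parameters
normalDistribution = stats.norm(M_S[0], M_S[1])

# Draw the fitted normal density between its 1st and 99th percentiles
x = np.linspace(normalDistribution.ppf(0.01), normalDistribution.ppf(0.99), 100)
plt.plot(x, normalDistribution.pdf(x), c='orange')
plt.xlabel('Age about Titanic')
plt.title('Age on NormalDistribution', size=20)
plt.legend(['age', 'NormDistribution'])
plt.show()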

[Figure: KDE of Age with the fitted normal density]
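
The appendix comments mention `scipy.stats.anderson` without showing how to read its output. A minimal sketch, using clearly non-normal synthetic data (an assumption, not the Titanic data), that pairs each critical value with its significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.exponential(scale=10, size=500)   # clearly non-normal synthetic data (assumption)

res = stats.anderson(x, dist='norm')
print('statistic:', res.statistic)

# Each critical value pairs with a significance level given in percent.
# Normality is rejected at a level when the statistic exceeds that critical value.
for crit, sig in zip(res.critical_values, res.significance_level):
    verdict = 'reject' if res.statistic > crit else 'fail to reject'
    print(f'{sig:>4}% level: critical value {crit:.3f} -> {verdict} normality')
```

Unlike the other tests, `anderson` reports no p-value for the normal case here; the decision comes from comparing the statistic with the critical values.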

Testing for a t distribution

# coding=utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy import stats

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs on the axes correctly

df = pd.read_excel('d:\\excel\\tj-week7_data.xlsx')
print(df.head())

embark = df.groupby(['Embarked'])
# print(embark)

embark_basic = df.groupby(['Embarked']).agg(['count', 'min', 'max', 'median', 'mean', 'var', 'std'])
print(embark_basic)
age_basic = embark_basic['Age']
fare_basic = embark_basic['Fare']

age = df['Age']

np.random.seed(1)
# Fit a t distribution to Age; t.fit returns (df, loc, scale)
ks = stats.t.fit(age)

df1 = ks[0]
loc = ks[1]
scale = ks[2]
# Draw a random sample from the fitted t distribution and
# compare it with Age using the two-sample K-S test
ks2 = stats.t.rvs(df=df1, loc=loc, scale=scale, size=len(age))
result = stats.ks_2samp(age, ks2)

print(result)

Ks_2sampResult(statistic=0.08286516853932585, pvalue=0.014103597072570409)

Since p < 0.05, we can also reject the hypothesis that Age follows a t distribution. Plotting the fitted t distribution:

# Plot the fitted t distribution against the data
plt.figure()
age.plot(kind='kde')
TDistribution = stats.t(ks[0], ks[1], ks[2])
x = np.linspace(TDistribution.ppf(0.01), TDistribution.ppf(0.99), 100)
plt.plot(x, TDistribution.pdf(x), c='orange')
plt.xlabel('age about Titanic')
plt.title('age on TDistribution', size=20)
plt.legend(['age', 'TDistribution'])
plt.show()

[Figure: KDE of Age with the fitted t density]

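
Because the two-sample approach draws a random sample from the fitted distribution, its p-value depends on the seed. The sketch below repeats the draw with several seeds on synthetic right-skewed data (an assumption standing in for Age) to show how much the p-value can move; the seeds and the data-generating parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.gamma(shape=4.0, scale=7.5, size=700)   # synthetic, right-skewed data (assumption)

# Fit a t distribution once; t.fit returns (df, loc, scale)
df_t, loc_t, scale_t = stats.t.fit(x)

# Repeat the simulated two-sample K-S test with different seeds
p_values = []
for seed in range(5):
    sim = stats.t.rvs(df=df_t, loc=loc_t, scale=scale_t,
                      size=len(x), random_state=seed)
    p_values.append(stats.ks_2samp(x, sim).pvalue)

print('p-values across seeds:', np.round(p_values, 4))
```

If the p-values straddle 0.05 across seeds, a single run is not enough to support a conclusion in either direction.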
3. Testing for a chi-square distribution

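# Does Age follow a chi-square distribution?
# chi2.fit returns (df, loc, scale)
chi_S = stats.chi2.fit(age)
df_chi = chi_S[0]
loc_chi = chi_S[1]
scale_chi = chi_S[2]
# Draw a random sample from the fitted chi-square distribution and compare it with Age
chi2 = stats.chi2.rvs(df=df_chi, loc=loc_chi, scale=scale_chi, size=len(age))
result_x = stats.ks_2samp(age, chi2)
print(result_x)

# Result below: p > 0.05, so we cannot reject the hypothesis that Age follows
# a (shifted and scaled) chi-square distribution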

Ks_2sampResult(statistic=0.058988764044943826, pvalue=0.16233843312998728)

We can now plot the fitted distribution against the data to compare the two:

# Plot the fitted chi-square distribution against the data
plt.figure()
age.plot(kind='kde')
chiDistribution = stats.chi2(chi_S[0], chi_S[1], chi_S[2])
x = np.linspace(chiDistribution.ppf(0.01), chiDistribution.ppf(0.99), 100)
plt.plot(x, chiDistribution.pdf(x), c='orange')
plt.xlabel('age  about  Titanic')
plt.title('age  on  chi-square_distribution', size=20)
plt.legend(['age', 'chi-square_distribution'])
plt.show()
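
The three checks can also be run in one loop with the one-sample `kstest`, which compares the data with the fitted distribution directly and involves no random draw. This is a sketch on synthetic data (an assumption standing in for Age), and the `candidates` and `results` names are illustrative. Since the parameters are estimated from the same sample, the p-values lean high and are best read as a relative comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=4.0, scale=7.5, size=700)   # synthetic stand-in for Age (assumption)

candidates = {'norm': stats.norm, 't': stats.t, 'chi2': stats.chi2}
results = {}
for name, dist in candidates.items():
    params = dist.fit(x)   # shape parameters (if any), followed by loc and scale
    results[name] = stats.kstest(x, name, args=params)
    print(f'{name:>5}: statistic={results[name].statistic:.4f}, '
          f'pvalue={results[name].pvalue:.4f}')
```

A smaller K-S statistic means the fitted curve tracks the empirical distribution more closely, which gives a simple way to rank the three candidates on the same data.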