卡方检验python程序_Python从零开始第二章（1）卡方检验(python)

最新推荐文章于 2023-01-26 14:17:47 发布

weixin_39800387

最新推荐文章于 2023-01-26 14:17:47 发布

阅读量341

点赞数

文章标签：卡方检验python程序

如果我们想确定两个独立分类数据组的统计显着性，会发生什么？这是卡方检验独立性有用的地方。

Chi-Square检验

我们将在1994年查看人口普查数据。具体来说，我们对“性别和“每周工作时间”之间的关系感兴趣。在我们的案例中，每个人只能有一个“性别”，且只有一个工作时间类别。为了这个例子，我们将使用pandas将数字列'每周小时'转换为一个分类列。然后我们将'sex'和'hours_per_week_categories'分配给新的数据帧。# -*- coding: utf-8 -*-

"""

Created on Sun Feb 3 19:24:55 2019

@author: czh

"""

# In[*]

import matplotlib.pyplot as plt

import numpy as np

import math

import seaborn as sns

import pandas as pd

%matplotlib inline

import os

os.chdir('D:\\train')

# In[*]

cols = ['age', 'workclass', 'fnlwg', 'education', 'education-num',

'marital-status','occupation','relationship', 'race','sex',

'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

data = pd.read_csv('census.csv', names=cols,sep=', ')

# In[*]

#Create a column for work hour categories.

def process_hours(df):

cut_points = [0,9,19,29,39,49,1000]

label_names = ["0-9","10-19","20-29","30-39","40-49","50+"]

df["hours_per_week_categories"] = pd.cut(df["hours-per-week"],

cut_points,labels=label_names)

return df

# In[*]

data = process_hours(data)

workhour_by_sex = data[['sex', 'hours_per_week_categories']]

workhour_by_sex.head()sex hours_per_week_categories

0 Male 40-49

1 Male 10-19

2 Male 40-49

3 Male 40-49

4 Female 40-49

查看workhour_by_sex['sex'].value_counts()

Out[31]:

Male 21790

Female 10771

Name: sex, dtype: int64workhour_by_sex['hours_per_week_categories'].value_counts()

Out[33]:

40-49 18336

50+ 6462

30-39 3667

20-29 2392

10-19 1246

0-9 458

Name: hours_per_week_categories, dtype: int64原假设

回想一下，我们有兴趣知道'sex'和'hours_per_week_categories'之间是否存在关系。为此，我们必须使用卡方检验。但首先，让我们陈述我们的零假设和另类假设。H0：性别与每周工作小时数没有统计学上的显着关系.H0：性别与每周工作小时数之间没有统计学上的显着关系。

H1：性别和每周工作小时数之间存在统计学上的显着关系.

下一步是将数据格式化为频率计数表。这称为列联表，我们可以通过在pandas中使用pd.crosstab（）函数来实现。contingency_table = pd.crosstab(

workhour_by_sex['sex'],

workhour_by_sex['hours_per_week_categories'],

margins = True

)

contingency_table

Out[34]:

hours_per_week_categories 0-9 10-19 20-29 30-39 40-49 50+ All

sex

Female 235 671 1287 1914 5636 1028 10771

Male 223 575 1105 1753 12700 5434 21790

All 6462 1246 18336 3667 458 2392 32561

该表中的每个单元表示频率计数。例如，表格中“男性”行和“10 -19”列的交集将表示从我们的样本数据集中每周工作10-19小时的男性人数。 “全部”行和“50 +”列的交叉点表示每周工作50小时以上的人员总数。# In[*]

#Assigns the frequency values

malecount = contingency_table.iloc[0][0:6].values

femalecount = contingency_table.iloc[1][0:6].values

#Plots the bar chart

fig = plt.figure(figsize=(10, 5))

sns.set(font_scale=1.8)

categories = ["0-9","10-19","20-29","30-39","40-49","50+"]

p1 = plt.bar(categories, malecount, 0.55, color='#d62728')

p2 = plt.bar(categories, femalecount, 0.55, bottom=malecount)

plt.legend((p2[0], p1[0]), ('Male', 'Female'))

plt.xlabel('Hours per Week Worked')

plt.ylabel('Count')

plt.show()

image.png

上图显示了人口普查中的样本数据。如果性别与每周工作小时数之间确实没有关系。然后，数据将显示每个时间类别的“男性”和“女性”之间的均匀比率。例如，如果5％的女性工作50+小时，我们预计工作50小时以上的男性的百分比相同。

使用Scipy进行卡方检验

现在我们已经完成了所有计算，现在是时候寻找捷径了。f_obs = np.array([contingency_table.iloc[0][0:6].values,

contingency_table.iloc[1][0:6].values])

f_obs

from scipy import stats

stats.chi2_contingency(f_obs)[0:3]

Out[38]: (2287.190943926107, 0.0, 5)

p值= ~0，自由度= 5。

结论

如果p值<0.05，我们可以拒绝零假设。 “性别”和“每周工作时间”之间肯定存在某种关系。我们不知道这种关系是什么，但我们知道这两个变量并不是彼此独立的。

weixin_39800387

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
卡方检验python程序_Python从零开始第二章（1）卡方检验(python)

如果我们想确定两个独立分类数据组的统计显着性，会发生什么？这是卡方检验独立性有用的地方。Chi-Square检验我们将在1994年查看人口普查数据。具体来说，我们对“性别和“每周工作时间”之间的关系感兴趣。在我们的案例中，每个人只能有一个“性别”，且只有一个工作时间类别。为了这个例子，我们将使用pandas将数字列'每周小时'转换为一个分类列。然后我们将'sex'和'hours_per_week_...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。