Step 1: Proving Markov's inequality
For a nonnegative random variable $X$ and any $\epsilon > 0$,
$$
P\left(X\geqslant \epsilon\right) \leqslant \frac{E\left(X\right)}{\epsilon}\tag{1.1}
$$
Proof: let $f$ be the density of the nonnegative random variable $X$. Dropping the part of the integral over $[0, \epsilon)$ and then using $x \geqslant \epsilon$ on the remaining range,
$$
E(X)=\int_{0}^\infty xf(x)\,dx\geqslant \int_{\epsilon}^\infty xf(x)\,dx \geqslant \epsilon \int_{\epsilon}^\infty f(x)\,dx=\epsilon P(X \geqslant\epsilon)\tag{1.2}
$$
Dividing both sides by $\epsilon$ gives (1.1).
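As a quick sanity check of (1.1), the sketch below compares the exact tail of an exponential distribution with the Markov bound. The choice of distribution and the helper names `markov_bound` and `exponential_tail` are illustrative assumptions, not part of the original derivation.

```python
import math

def markov_bound(mean, eps):
    """Right-hand side of Markov's inequality, E(X) / eps."""
    return mean / eps

def exponential_tail(eps, rate=1.0):
    """Exact P(X >= eps) for X ~ Exponential(rate), a nonnegative variable."""
    return math.exp(-rate * eps)

mean = 1.0  # E(X) = 1 / rate for rate = 1
for eps in (0.5, 1.0, 2.0, 5.0):
    tail = exponential_tail(eps)
    bound = markov_bound(mean, eps)
    print(f"eps={eps}: P(X>=eps)={tail:.4f} <= bound={bound:.4f}")
```

The bound is loose for small $\epsilon$ (it can exceed 1) and only becomes informative once $\epsilon$ is larger than the mean.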
Step 2: Proving Hoeffding's lemma
Let $x$ be a random variable with $E(x)=0$ and $P(x\in[a,b])=1$. Then for any real $\lambda$,
$$
E(e^{\lambda x})\leqslant e^{\frac{\lambda^2(b-a)^2}{8}}\tag{2.1}
$$
Proof:
By convexity, for $b > a$ the graph of a convex function $f$ on $[a,b]$ lies on or below the chord through $(a, f(a))$ and $(b, f(b))$, so for every $x\in[a,b]$,
$$
f(x) \leqslant f(a) + \frac{f(b)-f(a)}{b-a}(x-a)\tag{2.2}
$$
Let
$$
f(x)=e^{\lambda x}\tag{2.3}
$$
Substituting gives
$$
\begin{aligned} e^{\lambda x}&\leqslant e^{\lambda a}+\frac{e^{\lambda b}- e^{\lambda a}}{b-a}(x-a) \\ &= \frac{b-a}{b-a}e^{\lambda a}+\frac{x-a}{b-a}(e^{\lambda b}-e^{\lambda a}) \\ &= \frac{x-a}{b-a}e^{\lambda b}+\frac{b-x}{b-a}e^{\lambda a}\end{aligned}\tag{2.4}
$$
Taking expectations on both sides and using $E(x)=0$ (so $a \leqslant 0 \leqslant b$; if $a=0$ then $x=0$ almost surely and the lemma is trivial, so assume $a<0$), we get
$$
\begin{aligned} E(e^{\lambda x}) &\leqslant \frac{-a}{b-a}e^{\lambda b} + \frac{b}{b-a}e^{\lambda a} \\ &=\frac{-a}{b-a}e^{\lambda a}\left(-\frac{b}{a}+e^{\lambda(b-a)}\right)\end{aligned}\tag{2.5}
$$
Let
$$
q=-\frac{a}{b-a},\quad h=\lambda(b-a)\tag{2.6}
$$
$$
\begin{aligned} \frac{-a}{b-a}e^{\lambda a}\left(-\frac{b}{a}+e^{\lambda(b-a)}\right)&=qe^{-qh}\left(\frac{1}{q}-1+e^{h}\right)\\ &= e^{-qh+\ln(1-q+qe^{h})}\end{aligned}\tag{2.7}
$$
Let $M(h)=-qh+\ln(1-q+qe^{h})$, so that
$$
E(e^{\lambda x})\leqslant e^{M(h)}\tag{2.8}
$$
Expand $M$ in a Taylor series around $0$ with the Lagrange form of the remainder: for some $\xi$ between $0$ and $h$,
$$
M(h)=M(0)+M^{'}(0)h+\frac{M^{''}(\xi)}{2!}h^{2}
$$
Here $M(0)=0$ and $M^{'}(0)=-q+q=0$. Writing $u=\frac{qe^{\xi}}{1-q+qe^{\xi}}$, the second derivative is $M^{''}(\xi)=u(1-u)\leqslant\frac{1}{4}$, therefore
$$
M(h)\leqslant \frac{1}{2}\cdot\frac{1}{4}h^{2}=\frac{1}{8}\lambda^2(b-a)^2
$$
and hence
$$
E(e^{\lambda x})\leqslant e^{\frac{\lambda^2(b-a)^2}{8}}\tag{2.9}
$$
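The lemma can be checked numerically on the two-point distribution that makes the chord bound (2.5) an equality. The following is a minimal sketch under that assumption; the function names and the values of `a`, `b` and the grid of `lam` are illustrative choices.

```python
import math

def mgf_two_point(lam, a, b):
    """E[exp(lam * x)] for the mean-zero variable taking values a < 0 < b."""
    p_b = -a / (b - a)  # P(x = b), chosen so that E(x) = 0; this is q in (2.6)
    return (1 - p_b) * math.exp(lam * a) + p_b * math.exp(lam * b)

def hoeffding_lemma_bound(lam, a, b):
    """Right-hand side of (2.1)."""
    return math.exp(lam ** 2 * (b - a) ** 2 / 8)

a, b = -1.0, 3.0
lams = [k / 4 for k in range(-12, 13)]  # lambda from -3 to 3
gaps = [hoeffding_lemma_bound(l, a, b) - mgf_two_point(l, a, b) for l in lams]
print(min(gaps) >= 0)
```

The gap is zero at $\lambda = 0$, where both sides equal 1, and positive elsewhere on the grid.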
Step 3: Proving Hoeffding's inequality
Let $x_1, x_2, \ldots, x_n$ be independent random variables with $P(x_{i}\in [a_{i}, b_{i}])=1$, and let $\overline{x}=\frac{x_1+x_2+\cdots+x_n}{n}$. Then for any $t>0$ the following inequalities hold:
$$
\begin{aligned} P(\overline{x}-E(\overline{x})\geqslant t)&\leqslant e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}} \\ P(|\overline{x}-E(\overline{x})|\geqslant t)&\leqslant 2e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\end{aligned}\tag{3.1}
$$
Proof:
Let $x_1,x_2,\ldots,x_n$ be $n$ mutually independent random variables with $P(x_i\in[a_i, b_i])=1$, and set
$$
S_n=\sum\limits_{i=1}^{n}x_i
$$
For any $s>0$, applying Markov's inequality to the nonnegative variable $e^{s(S_n-E(S_n))}$ and then using independence gives
$$
\begin{aligned} P(S_n-E(S_n)\geqslant t) &= P(e^{s(S_n-E(S_n))}\geqslant e^{st})\\ &\leqslant e^{-st}E(e^{s(S_n-E(S_n))})\\ &=e^{-st}\prod_{i=1}^{n}E(e^{s(x_i-E(x_i))}) \end{aligned}\tag{3.2}
$$
Each $x_i-E(x_i)$ has mean zero and lies in an interval of length $b_i-a_i$, so Hoeffding's lemma bounds every factor of the product:
$$
\begin{aligned} P(S_n-E(S_n)\geqslant t)&\leqslant e^{-st}\prod_{i=1}^{n}e^{\frac{s^2(b_i-a_i)^2}{8}}\\ &= e^{-st+\frac{1}{8}s^2\sum\limits_{i=1}^{n}(b_i-a_i)^2} \end{aligned}\tag{3.3}
$$
Let
$$
f(s)=-st+\frac{1}{8}s^2\sum\limits_{i=1}^{n}(b_i-a_i)^2\tag{3.4}
$$
Setting $f^{'}(s)=0$ to minimize the exponent gives
$$
s=\frac{4t}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}
$$
Substituting this $s$ yields the bound for $S_n$. Replacing $t$ by $nt$ (because $S_n=n\overline{x}$) gives the bound for $\overline{x}$, and applying the same argument to $-x_i$ and adding the two tails gives the two-sided version:
$$
\begin{aligned} P(S_n-E(S_n)\geqslant t)&\leqslant e^{\frac{-2t^2}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\\ P(\overline{x}-E(\overline{x})\geqslant t)&\leqslant e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\\ P(|\overline{x}-E(\overline{x})|\geqslant t)&\leqslant 2e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\end{aligned}\tag{3.5}
$$
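To see how tight the one-sided bound is in a concrete case, the sketch below compares it with the exact tail of a Bernoulli sample mean, computed from the binomial distribution. The Bernoulli setting (every $b_i-a_i=1$), the parameter values, and the helper names are assumptions made for illustration.

```python
import math
from scipy import stats

def hoeffding_bound(n, t, width=1.0):
    """One-sided bound exp(-2 n^2 t^2 / sum (b_i - a_i)^2) with all widths equal."""
    return math.exp(-2 * n ** 2 * t ** 2 / (n * width ** 2))

def exact_bernoulli_tail(n, p, t):
    """Exact P(mean - p >= t) for n i.i.d. Bernoulli(p) variables."""
    k = math.ceil(n * (p + t) - 1e-12)  # smallest count whose mean is >= p + t
    return stats.binom.sf(k - 1, n, p)  # P(S_n >= k)

n, p = 100, 0.5
for t in (0.05, 0.1, 0.2):
    print(t, exact_bernoulli_tail(n, p, t), hoeffding_bound(n, t))
```

In each row the exact tail probability sits below the Hoeffding bound, which depends only on `n`, `t` and the interval widths, not on `p`.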
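(2) Contingency table analysis of the adult.data dataset
Overview: analyze the association between the education and race columns of the UCI Adult dataset. The required computations are:
- Covariance
- Contingency table
- Expected frequencies
- Chi-square statistic
- p-value
- Confidence level
- Association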
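# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# Read the data
'''
Note: the downloaded data was lightly preprocessed first. The raw file has no column names and is not in CSV format, so it was converted accordingly.
'''
adult_data = pd.read_csv('C:/Users/LENVOV/Desktop/数据挖掘作业/adult_data.csv')
# Inspect the data: first 3 rows and last 3 rows
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
pd.concat([adult_data.head(3), adult_data.tail(3)])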
index | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
Formula for the mean
$$
\overline{x} = \frac{\sum\limits_{i=1}^{n}x_{i}}{n}
$$
# Mean of the education column (relative frequency of each category, most frequent first)
education_value = adult_data['education'].value_counts(dropna = False, normalize = True)
print("education's mean:\n", np.array(education_value).reshape(-1, 1))
education's mean:
[[0.32250238]
[0.22391818]
[0.16446055]
[0.05291607]
[0.04244341]
[0.03608612]
[0.03276926]
[0.02865391]
[0.01983969]
[0.01768987]
[0.01578576]
[0.01329812]
[0.01268389]
[0.01022696]
[0.00515955]
[0.00156629]]
# Mean of the race column (relative frequency of each category, most frequent first)
race_value = adult_data['race'].value_counts(dropna = False, normalize = True)
print("race's mean:\n", np.array(race_value).reshape(-1, 1))
race's mean:
[[0.85427352]
[0.095943 ]
[0.03190934]
[0.0095513 ]
[0.00832284]]
Formula for the sample mean
$$
\begin{aligned}\hat{\mu} &=\frac{1}{n}\sum\limits_{i=1}^{n}x_i\\ &=\frac{1}{n}\left(\begin{array}{c}\sum_{i=1}^{m_1}n_{i}^{1}e_{1i}\\ \sum_{j=1}^{m_2}n_{j}^{2}e_{2j}\end{array}\right)\end{aligned}
$$
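The formula can be read as follows, assuming $e_{1i}$ and $e_{2j}$ denote one-hot indicator vectors for the categories and $n_i^1$, $n_j^2$ their counts: averaging the one-hot vectors gives the relative frequency of each category. A minimal sketch on a hypothetical toy column (not the Adult data):

```python
import pandas as pd

# Hypothetical toy column standing in for adult_data['education']
toy = pd.Series(['HS-grad', 'Bachelors', 'HS-grad', 'Masters', 'HS-grad', 'Bachelors'])

one_hot = pd.get_dummies(toy).astype(float)   # one indicator column per category
mean_of_indicators = one_hot.mean(axis=0)     # sample mean of the one-hot vectors
relative_freq = toy.value_counts(normalize=True).sort_index()

print(mean_of_indicators)
print(relative_freq)
```

Both printouts list the same proportions, which is why `value_counts(normalize=True)` can serve as the mean vector of a categorical attribute. Note the `sort_index()` call: without it, `value_counts` orders categories by frequency rather than by label.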
edu_mean = np.array(education_value).reshape(-1, 1)
race_mean = np.array(race_value).reshape(-1, 1)
print('Sample mean:\n', np.vstack((edu_mean, race_mean)))
Sample mean:
[[0.32250238]
[0.22391818]
[0.16446055]
[0.05291607]
[0.04244341]
[0.03608612]
[0.03276926]
[0.02865391]
[0.01983969]
[0.01768987]
[0.01578576]
[0.01329812]
[0.01268389]
[0.01022696]
[0.00515955]
[0.00156629]
[0.85427352]
[0.095943 ]
[0.03190934]
[0.0095513 ]
[0.00832284]]
# Joint probabilities computed from the contingency table
edu_race_crosstab = pd.crosstab(adult_data['education'], adult_data['race'])
edu_race_crosstab = np.array(edu_race_crosstab)
P_edu_race = edu_race_crosstab / len(adult_data)
print(P_edu_race)
[[4.91385400e-04 3.99250637e-04 4.08464114e-03 2.76404287e-04
2.34022297e-02]
[4.29962225e-04 6.44943337e-04 4.69887288e-03 3.07115875e-04
3.00052210e-02]
[1.53557937e-04 2.76404287e-04 2.14981112e-03 4.29962225e-04
1.02883818e-02]
[1.22846350e-04 1.53557937e-04 4.91385400e-04 2.76404287e-04
4.11535272e-03]
[6.14231750e-05 5.52808575e-04 6.44943337e-04 3.99250637e-04
8.56853291e-03]
[2.76404287e-04 3.37827462e-04 1.71984890e-03 5.22096987e-04
1.69835079e-02]
[1.53557937e-04 2.76404287e-04 2.73333129e-03 2.45692700e-04
1.23767698e-02]
[2.45692700e-04 8.90636037e-04 3.28613986e-03 2.45692700e-04
2.81011025e-02]
[5.83520162e-04 1.16704032e-03 3.43969780e-03 1.84269525e-04
3.70688861e-02]
[6.44943337e-04 8.87564878e-03 1.01348239e-02 1.01348239e-03
1.43791653e-01]
[9.21347624e-05 8.59924449e-04 3.37827462e-04 6.14231750e-05
1.13325758e-02]
[3.65467891e-03 6.94081877e-03 3.60554037e-02 2.39550382e-03
2.73455975e-01]
[1.53557937e-04 2.70261970e-03 2.64119652e-03 2.14981112e-04
4.72037100e-02]
[0.00000000e+00 1.84269525e-04 1.53557937e-04 6.14231750e-05
1.16704032e-03]
[6.14231750e-05 1.25917509e-03 4.60673812e-04 1.22846350e-04
1.57857560e-02]
[2.42621541e-03 6.38801020e-03 2.29108443e-02 1.56629096e-03
1.90626824e-01]]
Formula for the covariance matrix
$$
\varSigma=\left(\begin{matrix}\varSigma_{11}& \varSigma_{12}\\ \varSigma_{21}& \varSigma_{22}\\\end{matrix}\right)=\left(\begin{matrix}P_{1}-p_{1}p_{1}^{T}& P_{12}-p_{1}p_{2}^{T}\\ P_{12}^{T}-p_{2}p_{1}^{T}& P_{2}-p_{2}p_{2}^{T}\end{matrix}\right)
$$
where $P_{1}=diag\left(p_1\right)$, $P_{2}=diag\left(p_2\right)$, and $\varSigma_{12}$ is defined as follows:
$$
\begin{aligned}\varSigma_{12}&=E[(X_1-\mu_{1})(X_2-\mu_{2})^T] \\ &= E[X_1X_{2}^{T}]-E[X_1]E[X_2]^{T}\\ &= P_{12}-\mu_{1}\mu_{2}^{T}\\ &=\left(\begin{matrix}p_{11}-p_{1}^{1}p_{1}^{2}& p_{12}-p_{1}^{1}p_{2}^{2}&\cdots&p_{1m_{2}}-p_{1}^{1}p_{m_{2}}^{2}\\ p_{21}-p_{2}^{1}p_{1}^{2}&p_{22}-p_{2}^{1}p_{2}^{2}&\cdots&p_{2m_{2}}-p_{2}^{1}p_{m_{2}}^{2}\\ \vdots&\vdots&\ddots&\vdots\\ p_{m_{1}1}-p_{m_{1}}^{1}p_{1}^{2}&p_{m_{1}2}-p_{m_1}^{1}p_{2}^{2}&\cdots&p_{m_{1}m_{2}}-p_{m_{1}}^{1}p_{m_{2}}^{2}\end{matrix}\right) \end{aligned}
$$
# Compute sigama12
# NOTE: value_counts() orders categories by frequency, whereas pd.crosstab orders
# them alphabetically; this subtraction is only meaningful when the two orderings agree.
sigama12 = P_edu_race - np.array(education_value).reshape(-1, 1) * np.array(race_value)
# Compute sigama21
sigama21 = sigama12.T
# Compute sigama11
P1 = np.diag(np.array(education_value))
sigama11 = P1 - np.array(education_value).reshape(-1, 1) * np.array(education_value)
# Compute sigama22
P2 = np.diag(np.array(race_value))
sigama22 = P2 - np.array(race_value) * np.array(race_value).reshape(-1, 1)
sigama1112 = np.append(sigama11, sigama12, axis = 1)
sigama2122 = np.append(sigama21, sigama22, axis = 1)
# Covariance matrix
sigama = np.append(sigama1112, sigama2122, axis = 0)
print('Covariance matrix:\n', sigama)
Covariance matrix:
[[ 2.18494595e-01 -7.22141474e-02 -5.30389191e-02 -1.70655570e-02
-1.36881020e-02 -1.16378581e-02 -1.05681656e-02 -9.24095454e-03
-6.39834580e-03 -5.70502660e-03 -5.09094387e-03 -4.28867451e-03
-4.09058331e-03 -3.29821850e-03 -1.66396609e-03 -5.05132563e-04
-2.75013857e-01 -3.05425950e-02 -6.20619677e-03 -2.80391389e-03
2.07180939e-02]
[-7.22141474e-02 1.73778831e-01 -3.68257080e-02 -1.18488692e-02
-9.50385218e-03 -8.08033742e-03 -7.33763406e-03 -6.41613175e-03
-4.44246636e-03 -3.96108455e-03 -3.53471781e-03 -2.97769030e-03
-2.84015264e-03 -2.29000201e-03 -1.15531633e-03 -3.50721028e-04
-1.90857413e-01 -2.08384389e-02 -2.44620846e-03 -1.83159471e-03
2.81415857e-02]
[-5.30389191e-02 -3.68257080e-02 1.37413278e-01 -8.70260524e-03
-6.98026723e-03 -5.93474240e-03 -5.38925119e-03 -4.71243801e-03
-3.26284561e-03 -2.90928649e-03 -2.59613412e-03 -2.18701571e-03
-2.08599882e-03 -1.68193125e-03 -8.48541893e-04 -2.57593075e-04
-1.40340735e-01 -1.55024342e-02 -3.09801641e-03 -1.14085045e-03
8.91960292e-03]
[-1.70655570e-02 -1.18488692e-02 -8.70260524e-03 5.01159553e-02
-2.24593846e-03 -1.90953523e-03 -1.73402050e-03 -1.51625223e-03
-1.04983809e-03 -9.36078547e-04 -8.35320092e-04 -7.03684047e-04
-6.71181319e-04 -5.41170410e-04 -2.73022910e-04 -8.28819547e-05
-4.50819468e-02 -4.92336807e-03 -1.19713129e-03 -2.29013123e-04
3.67494077e-03]
[-1.36881020e-02 -9.50385218e-03 -6.98026723e-03 -2.24593846e-03
4.06419705e-02 -1.53161793e-03 -1.39083943e-03 -1.21616981e-03
-8.42063984e-04 -7.50818661e-04 -6.70001374e-04 -5.64417500e-04
-5.38347408e-04 -4.34067038e-04 -2.18988776e-04 -6.64787356e-05
-3.61968613e-02 -3.51933986e-03 -7.09397962e-04 -6.13929925e-06
8.21528316e-03]
[-1.16378581e-02 -8.08033742e-03 -5.93474240e-03 -1.90953523e-03
-1.53161793e-03 3.47839076e-02 -1.18251543e-03 -1.03400834e-03
-7.15937179e-04 -6.38358847e-04 -5.69646610e-04 -4.79877397e-04
-4.57712159e-04 -3.69051208e-04 -1.86187997e-04 -5.65213562e-05
-3.05510084e-02 -3.12438267e-03 5.68364799e-04 1.77427540e-04
1.66831689e-02]
[-1.05681656e-02 -7.33763406e-03 -5.38925119e-03 -1.73402050e-03
-1.39083943e-03 -1.18251543e-03 3.16954392e-02 -9.38967574e-04
-6.50131889e-04 -5.79684161e-04 -5.17287602e-04 -4.35769517e-04
-4.15641595e-04 -3.35129906e-04 -1.69074547e-04 -5.13262018e-05
-2.78403563e-02 -2.86757717e-03 1.68768572e-03 -6.72964914e-05
1.21040364e-02]
[-9.24095454e-03 -6.41613175e-03 -4.71243801e-03 -1.51625223e-03
-1.21616981e-03 -1.03400834e-03 -9.38967574e-04 2.78328645e-02
-5.68484585e-04 -5.06884088e-04 -4.52323648e-04 -3.81043073e-04
-3.63442932e-04 -2.93042364e-04 -1.47841192e-04 -4.48803620e-05
-2.42325847e-02 -1.85850614e-03 2.37181249e-03 -2.79895076e-05
2.78626206e-02]
[-6.39834580e-03 -4.44246636e-03 -3.26284561e-03 -1.04983809e-03
-8.42063984e-04 -7.15937179e-04 -6.50131889e-04 -5.68484585e-04
1.94460724e-02 -3.50961545e-04 -3.13184434e-04 -2.63830467e-04
-2.51644302e-04 -2.02899643e-04 -1.02363784e-04 -3.10747201e-05
-1.63649978e-02 -7.36438609e-04 2.80662654e-03 -5.22533690e-06
3.69037636e-02]
[-5.70502660e-03 -3.96108455e-03 -2.90928649e-03 -9.36078547e-04
-7.50818661e-04 -6.38358847e-04 -5.79684161e-04 -5.06884088e-04
-3.50961545e-04 1.73769427e-02 -2.79248040e-04 -2.35242026e-04
-2.24376344e-04 -1.80913614e-04 -9.12717330e-05 -2.77074904e-05
-1.44670479e-02 7.17842918e-03 9.57035166e-03 8.44521024e-04
1.43644423e-01]
[-5.09094387e-03 -3.53471781e-03 -2.59613412e-03 -8.35320092e-04
-6.70001374e-04 -5.69646610e-04 -5.17287602e-04 -4.52323648e-04
-3.13184434e-04 -2.79248040e-04 1.55365659e-02 -2.09920836e-04
-2.00224723e-04 -1.61440273e-04 -8.14473451e-05 -2.47250869e-05
-1.33932185e-02 -6.54608324e-04 -1.65885582e-04 -8.93513745e-05
1.12011935e-02]
[-4.28867451e-03 -2.97769030e-03 -2.18701571e-03 -7.03684047e-04
-5.64417500e-04 -4.79877397e-04 -4.35769517e-04 -3.81043073e-04
-2.63830467e-04 -2.35242026e-04 -2.09920836e-04 1.31212775e-02
-1.68671800e-04 -1.35999296e-04 -6.86122576e-05 -2.08287211e-05
-7.70555060e-03 5.66495750e-03 3.56310696e-02 2.26848947e-03
2.73345297e-01]
[-4.09058331e-03 -2.84015264e-03 -2.08599882e-03 -6.71181319e-04
-5.38347408e-04 -4.57712159e-04 -4.15641595e-04 -3.63442932e-04
-2.51644302e-04 -2.24376344e-04 -2.00224723e-04 -1.68671800e-04
1.25230047e-02 -1.29717574e-04 -6.54431002e-05 -1.98666554e-05
-1.06819497e-02 1.48568967e-03 2.23646211e-03 9.38334685e-05
4.70981440e-02]
[-3.29821850e-03 -2.29000201e-03 -1.68193125e-03 -5.41170410e-04
-4.34067038e-04 -3.69051208e-04 -3.35129906e-04 -2.93042364e-04
-2.02899643e-04 -1.80913614e-04 -1.61440273e-04 -1.35999296e-04
-1.29717574e-04 1.01223679e-02 -5.27664706e-05 -1.60183929e-05
-8.73661992e-03 -7.96935560e-04 -1.72777557e-04 -3.62576129e-05
1.08192298e-03]
[-1.66396609e-03 -1.15531633e-03 -8.48541893e-04 -2.73022910e-04
-2.18988776e-04 -1.86187997e-04 -1.69074547e-04 -1.47841192e-04
-1.02363784e-04 -9.12717330e-05 -8.14473451e-05 -6.86122576e-05
-6.54431002e-05 -5.27664706e-05 5.13292577e-03 -8.08135136e-06
-4.34624093e-03 7.64152702e-04 2.96036086e-04 7.35659524e-05
1.57428139e-02]
[-5.05132563e-04 -3.50721028e-04 -2.57593075e-04 -8.28819547e-05
-6.64787356e-05 -5.65213562e-05 -5.13262018e-05 -4.48803620e-05
-3.10747201e-05 -2.77074904e-05 -2.47250869e-05 -2.08287211e-05
-1.98666554e-05 -1.60183929e-05 -8.08135136e-06 1.56383769e-03
1.08817452e-03 6.23773554e-03 2.28608650e-02 1.55133084e-03
1.90613788e-01]
[-2.75013857e-01 -1.90857413e-01 -1.40340735e-01 -4.50819468e-02
-3.61968613e-02 -3.05510084e-02 -2.78403563e-02 -2.42325847e-02
-1.63649978e-02 -1.44670479e-02 -1.33932185e-02 -7.70555060e-03
-1.06819497e-02 -8.73661992e-03 -4.34624093e-03 1.08817452e-03
1.24490275e-01 -8.19615635e-02 -2.72593036e-02 -8.15942581e-03
-7.10998198e-03]
[-3.05425950e-02 -2.08384389e-02 -1.55024342e-02 -4.92336807e-03
-3.51933986e-03 -3.12438267e-03 -2.86757717e-03 -1.85850614e-03
-7.36438609e-04 7.17842918e-03 -6.54608324e-04 5.66495750e-03
1.48568967e-03 -7.96935560e-04 7.64152702e-04 6.23773554e-03
-8.19615635e-02 8.67379402e-02 -3.06147773e-03 -9.16380725e-04
-7.98518252e-04]
[-6.20619677e-03 -2.44620846e-03 -3.09801641e-03 -1.19713129e-03
-7.09397962e-04 5.68364799e-04 1.68768572e-03 2.37181249e-03
2.80662654e-03 9.57035166e-03 -1.65885582e-04 3.56310696e-02
2.23646211e-03 -1.72777557e-04 2.96036086e-04 2.28608650e-02
-2.72593036e-02 -3.06147773e-03 3.08911335e-02 -3.04775792e-04
-2.65576333e-04]
[-2.80391389e-03 -1.83159471e-03 -1.14085045e-03 -2.29013123e-04
-6.13929925e-06 1.77427540e-04 -6.72964914e-05 -2.79895076e-05
-5.22533690e-06 8.44521024e-04 -8.93513745e-05 2.26848947e-03
9.38334685e-05 -3.62576129e-05 7.35659524e-05 1.55133084e-03
-8.15942581e-03 -9.16380725e-04 -3.04775792e-04 9.46007630e-03
-7.94939745e-05]
[ 2.07180939e-02 2.81415857e-02 8.91960292e-03 3.67494077e-03
8.21528316e-03 1.66831689e-02 1.21040364e-02 2.78626206e-02
3.69037636e-02 1.43644423e-01 1.12011935e-02 2.73345297e-01
4.70981440e-02 1.08192298e-03 1.57428139e-02 1.90613788e-01
-7.10998198e-03 -7.98518252e-04 -2.65576333e-04 -7.94939745e-05
8.25357054e-03]]
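One way to keep the marginals aligned with the joint probabilities is to take them from the contingency table itself, so that rows and columns share one ordering. The sketch below does this on hypothetical toy data; the variable names and the use of `np.outer` and `np.block` are illustrative choices rather than the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the education and race columns
toy = pd.DataFrame({
    'education': ['HS-grad', 'HS-grad', 'Bachelors', 'HS-grad', 'Masters', 'Bachelors'],
    'race':      ['White',   'Black',   'White',     'White',   'Black',   'White'],
})

counts = pd.crosstab(toy['education'], toy['race'])  # labels sorted alphabetically
P12 = counts.to_numpy() / len(toy)                   # joint probabilities
p1 = P12.sum(axis=1)   # education marginal, in the same row order as P12
p2 = P12.sum(axis=0)   # race marginal, in the same column order as P12

sigma12 = P12 - np.outer(p1, p2)
sigma11 = np.diag(p1) - np.outer(p1, p1)
sigma22 = np.diag(p2) - np.outer(p2, p2)
sigma = np.block([[sigma11, sigma12], [sigma12.T, sigma22]])

# value_counts() uses frequency order, which need not match the crosstab order
print(list(toy['education'].value_counts().index))
print(list(counts.index))
print(sigma12.sum(axis=1))  # each row sums to ~0 when the marginals are aligned
```

A useful diagnostic is that every row and column of $\varSigma_{12}$ sums to zero when the marginals match the joint table. Large entries of one sign across a whole row suggest the orderings are mixed up.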
# Contingency table in matrix form, including row and column totals (margins)
Crosstab = pd.crosstab(adult_data['education'], adult_data['race'], margins = True)
# Observed frequency of each pair of values
print('Contingency table:\n', Crosstab)
Crosstab_1 = pd.crosstab(adult_data['education'], adult_data['race'], margins = False)
Ct = np.array(Crosstab_1)
Ct_1 = Ct / len(adult_data)
Contingency table:
race Amer-Indian-Eskimo Asian-Pac-Islander Black Other \
education
10th 16 13 133 9
11th 14 21 153 10
12th 5 9 70 14
1st-4th 4 5 16 9
5th-6th 2 18 21 13
7th-8th 9 11 56 17
9th 5 9 89 8
Assoc-acdm 8 29 107 8
Assoc-voc 19 38 112 6
Bachelors 21 289 330 33
Doctorate 3 28 11 2
HS-grad 119 226 1174 78
Masters 5 88 86 7
Preschool 0 6 5 2
Prof-school 2 41 15 4
Some-college 79 208 746 51
All 311 1039 3124 271
race White All
education
10th 762 933
11th 977 1175
12th 335 433
1st-4th 134 168
5th-6th 279 333
7th-8th 553 646
9th 403 514
Assoc-acdm 915 1067
Assoc-voc 1207 1382
Bachelors 4682 5355
Doctorate 369 413
HS-grad 8904 10501
Masters 1537 1723
Preschool 38 51
Prof-school 514 576
Some-college 6207 7291
All 27816 32561
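# Column totals (one per race category)
Ct_row = np.sum(Ct, axis = 0)
# Row totals (one per education category)
Ct_column = np.sum(Ct, axis = 1)
Ct_column_T = Ct_column.reshape(-1, 1)
# Expected frequency of each pair of values under independence
e = (Ct_column_T * Ct_row) / len(adult_data)
print('Expected frequency of each pair of values:\n', e)
Expected frequency of each pair of values: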
[[8.91136636e+00 2.97714137e+01 8.95148183e+01 7.76520991e+00
7.97037192e+02]
[1.12227819e+01 3.74934738e+01 1.12733024e+02 9.77933724e+00
1.00377138e+03]
[4.13571451e+00 1.38167440e+01 4.15433187e+01 3.60378981e+00
3.69900433e+02]
[1.60461902e+00 5.36076902e+00 1.61184239e+01 1.39823715e+00
1.43517951e+02]
[3.18058413e+00 1.06258100e+01 3.19490188e+01 2.77150579e+00
2.84473081e+02]
[6.17014219e+00 2.06134332e+01 6.19791775e+01 5.37655477e+00
5.51860692e+02]
[4.90937011e+00 1.64014004e+01 4.93147016e+01 4.27793987e+00
4.39096588e+02]
[1.01912411e+01 3.40472651e+01 1.02371180e+02 8.88047050e+00
9.11509843e+02]
[1.31999017e+01 4.40987070e+01 1.32593225e+02 1.15021652e+01
1.18060600e+03]
[5.11472314e+01 1.70874512e+02 5.13774761e+02 4.45688093e+01
4.57463469e+03]
[3.94468843e+00 1.31785572e+01 3.96244587e+01 3.43733301e+00
3.52814963e+02]
[1.00298240e+02 3.35079973e+02 1.00749744e+03 8.73981450e+01
8.97072621e+03]
[1.64568963e+01 5.49797918e+01 1.65309788e+02 1.43402537e+01
1.47191327e+03]
[4.87116489e-01 1.62737631e+00 4.89309296e+00 4.24464851e-01
4.35679494e+01]
[5.50155094e+00 1.83797795e+01 5.52631676e+01 4.79395596e+00
4.92061546e+02]
[6.96385553e+01 2.32650994e+02 6.99520408e+02 6.06818280e+01
6.22850822e+03]]
Formula for the chi-square statistic
$$
\chi^{2}=\sum\limits_{i=1}^{m_{1}}\sum\limits_{j=1}^{m_{2}}\frac{(n_{ij}-e_{ij})^{2}}{e_{ij}}
$$
where $n_{ij}$ are the observed frequencies and $e_{ij}$ are the expected frequencies.
# Chi-square statistic X_2: quantifies the gap between observed and expected frequencies
X_2 = 0 # initialize
for i in range(0, 16):
    for j in range(0, 5):
        k = ((Ct[i, j] - e[i, j])**2) / e[i, j]
        X_2 = X_2 + k
print('Chi-square statistic:', X_2)
Chi-square statistic: 730.6712962254584
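The double loop can be cross-checked against `scipy.stats.chi2_contingency`, which computes the expected frequencies, the statistic, the degrees of freedom and the p-value in one call. A minimal sketch on a hypothetical 3 x 2 table (the counts are made up for illustration and are not from the Adult data):

```python
import numpy as np
from scipy import stats

# Hypothetical 3 x 2 table of observed counts
observed = np.array([[20, 30],
                     [25, 25],
                     [35, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Vectorized form of the double sum over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# correction=False disables the Yates continuity correction,
# which SciPy would otherwise apply only when there is 1 degree of freedom
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(chi2_manual, chi2_scipy, dof_manual, dof_scipy)
```

Deriving the table shape from `observed.shape` instead of hard-coding 16 and 5 keeps the computation valid if the set of categories changes.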
Formula for the degrees of freedom
$$
q = \left(m_{1}-1\right)\left(m_{2}-1\right)
$$
# Degrees of freedom
q = (16 - 1) * (5 - 1)
print('Degrees of freedom:', q)
Degrees of freedom: 60
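Formula for the p-value, where $\theta$ follows the chi-square distribution with $q$ degrees of freedom, $F$ is its cumulative distribution function, and $z$ is the observed statistic:
$$
p\left(z\right)=P\left(\theta\geqslant z\right)=1-F\left(z\right)
$$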
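# Compute the p-value
# (1 - cdf rounds to 0.0 here because the cdf is numerically indistinguishable from 1)
F = stats.chi2.cdf(X_2, q)
p_value = 1-F
print('p-value:', p_value)
p-value: 0.0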
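# The computed p-value is 0.0; choose a = 0.1 > p-value
a = 0.1
# ppf (the inverse CDF) returns the critical value; pdf would only return a density
z = stats.chi2.ppf(1 - a, q)
print('Critical value of the statistic at significance level a = 0.1:', z)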
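The critical value of a chi-square test is a quantile, so it comes from the inverse CDF (`ppf`) rather than from the density (`pdf`). The sketch below, which assumes the 60 degrees of freedom and the statistic reported earlier, also uses the survival function `sf` so that the tiny p-value is not lost to rounding in `1 - cdf`.

```python
from scipy import stats

df = 60                         # degrees of freedom of the 16 x 5 table
alpha = 0.1                     # significance level
statistic = 730.6712962254584   # chi-square statistic reported earlier

critical_value = stats.chi2.ppf(1 - alpha, df)  # quantile at 1 - alpha
p_value = stats.chi2.sf(statistic, df)          # P(theta >= statistic), computed directly

print(critical_value)              # roughly 74.4
print(statistic > critical_value)  # True means the independence hypothesis is rejected
print(p_value)
```

Comparing the statistic with the critical value and comparing the p-value with `alpha` are two views of the same decision rule, so they always agree.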
# Confidence level
confidence_level = 1 - a
print('Confidence level:', confidence_level)
Confidence level: 0.9
# Hypothesis test
if p_value < 0.1:
    print('The two variables are associated')
else:
    print('The two variables are independent')
The two variables are associated