demp1

Data Mining: Assignment 1
- (1) Prove Hoeffding's inequality

Step 1: Prove Markov's inequality

For a nonnegative random variable $X$ with density $f$ and any $\epsilon > 0$,
$$P\left(X\geqslant \epsilon\right) \leqslant \frac{E\left(X\right)}{\epsilon}\tag{1.1}$$
Proof:
$$E(X)=\int_{0}^\infty xf(x)dx\geqslant \int_{\epsilon}^\infty xf(x)dx \geqslant \epsilon \int_{\epsilon}^\infty f(x)dx=\epsilon P(X \geqslant\epsilon)\tag{1.2}$$
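As a quick numerical sanity check of (1.1), the sketch below compares the empirical tail probability with the Markov bound. The choice of Exponential(1) samples, the sample size, and the helper name `markov_check` are illustrative assumptions, not part of the assignment.

```python
import random

def markov_check(samples, eps):
    """Return (empirical P(X >= eps), Markov bound E[X] / eps) for nonnegative samples."""
    n = len(samples)
    tail = sum(1 for x in samples if x >= eps) / n
    bound = (sum(samples) / n) / eps
    return tail, bound

random.seed(0)
# Exponential(1) draws are nonnegative with mean 1 (an assumed example distribution).
samples = [random.expovariate(1.0) for _ in range(100_000)]

for eps in (0.5, 1.0, 2.0, 4.0):
    tail, bound = markov_check(samples, eps)
    print(f"eps={eps}: P(X >= eps) ~ {tail:.4f} <= E[X]/eps ~ {bound:.4f}")
```

The bound is loose for small $\epsilon$ (it can exceed 1), which is expected: Markov's inequality uses only the mean.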

Step 2: Prove Hoeffding's lemma

For a random variable $x$ with $P(x\in[a,b])=1$ and $E(x)=0$, and any real $\lambda$,
$$E(e^{\lambda x})\leqslant e^{\frac{\lambda^2(b-a)^2}{8}}\tag{2.1}$$
Proof:
For a convex function $f$ and $b > a$, every $x\in[a,b]$ satisfies the chord inequality
$$f(x) \leqslant f(a) + \frac{f(b)-f(a)}{b-a}(x-a)\tag{2.2}$$
Take
$$f(x)=e^{\lambda x}\tag{2.3}$$
Substituting gives
$$\begin{aligned} e^{\lambda x}&\leqslant e^{\lambda a}+\frac{e^{\lambda b}- e^{\lambda a}}{b-a}(x-a) \\ &= \frac{(b-a)}{b-a}e^{\lambda a}+\frac{(x-a)}{b-a}(e^{\lambda b}-e^{\lambda a}) \\ &= \frac{x-a}{b-a}e^{\lambda b}+\frac{b-x}{b-a}e^{\lambda a}\end{aligned}\tag{2.4}$$
Taking expectations on both sides and using $E(x)=0$ gives
$$\begin{aligned} E(e^{\lambda x}) &\leqslant \frac{-a}{b-a}e^{\lambda b} + \frac{b}{b-a}e^{\lambda a} \\ &=\frac{-a}{b-a}e^{\lambda a}\left(-\frac{b}{a}+e^{\lambda(b-a)}\right)\end{aligned}\tag{2.5}$$
Let
$$q=-\frac{a}{b-a},\quad h=\lambda(b-a)\tag{2.6}$$
Then $\lambda a=-qh$ and $-\frac{b}{a}=\frac{1}{q}-1$, so
$$\begin{aligned} \frac{-a}{b-a}e^{\lambda a}\left(-\frac{b}{a}+e^{\lambda(b-a)}\right)&=qe^{-qh}\left(\frac{1}{q}-1+e^{h}\right)\\ &= e^{-qh+\ln(1-q+qe^{h})}\end{aligned}\tag{2.7}$$
Let
$$M(h)=-qh+\ln(1-q+qe^{h}),\quad\text{so that}\quad E(e^{\lambda x})\leqslant e^{M(h)}\tag{2.8}$$
Expand $M$ around $0$ with the Lagrange form of the remainder: for some $\xi$ between $0$ and $h$,
$$\begin{aligned} M(h)&=M(0)+M^{'}(0)h+\frac{M^{''}(\xi)}{2!}h^{2} \\ M(0)&=0,\qquad M^{'}(h)=-q+\frac{qe^{h}}{1-q+qe^{h}},\qquad M^{'}(0)=0 \\ M^{''}(\xi)&=\frac{qe^{\xi}}{1-q+qe^{\xi}}\left(1-\frac{qe^{\xi}}{1-q+qe^{\xi}}\right)\leqslant \frac{1}{4} \\ \therefore M(h)&\leqslant \frac{1}{8}h^{2}=\frac{1}{8}\lambda^2(b-a)^2 \\ \therefore E(e^{\lambda x})&\leqslant e^{\frac{\lambda^2(b-a)^2}{8}}\end{aligned}\tag{2.9}$$
The bound on $M^{''}$ uses $u(1-u)\leqslant\frac{1}{4}$ for every real $u$.
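The lemma can be checked numerically on the zero-mean two-point distribution on $\{a, b\}$, for which $E(e^{\lambda x})$ equals $e^{M(h)}$ exactly. The sketch below is illustrative only; the values of `a`, `b`, and the function names are assumptions.

```python
import math

def mgf_two_point(lam, a, b):
    """E[exp(lam * X)] for X taking values a and b with E[X] = 0 (requires a < 0 < b)."""
    q = -a / (b - a)                     # P(X = b); the q of (2.6)
    return (1 - q) * math.exp(lam * a) + q * math.exp(lam * b)

def hoeffding_lemma_bound(lam, a, b):
    """Right-hand side of (2.1)."""
    return math.exp(lam ** 2 * (b - a) ** 2 / 8)

a, b = -1.0, 3.0                         # assumed example interval
for lam in (-2.0, -0.5, 0.0, 0.5, 2.0):
    lhs = mgf_two_point(lam, a, b)
    rhs = hoeffding_lemma_bound(lam, a, b)
    print(f"lambda={lam:+.1f}: E[e^(lambda X)] = {lhs:.4f} <= {rhs:.4f}")
```

At $\lambda = 0$ both sides equal 1, so the inequality is tight there.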

Step 3: Prove Hoeffding's inequality

Let $x_1, x_2, ..., x_n$ be independent random variables with $P(x_{i}\in [a_{i}, b_{i}])=1$, and let $\overline{x}=\frac{x_1+x_2+...+x_n}{n}$. Then for every $t>0$ the following inequalities hold:
$$\begin{aligned} P(\overline{x}-E(\overline{x})\geqslant t)&\leqslant e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}} \\ P(|\overline{x}-E(\overline{x})|\geqslant t)&\leqslant 2e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\end{aligned}\tag{3.1}$$
Proof:
Let $x_1,x_2,...,x_n$ be $n$ mutually independent random variables with $P(x_i\in[a_i, b_i])=1$, and set
$$S_n=\sum\limits_{i=1}^{n}x_i$$
For any $s>0$,
$$\begin{aligned} P(S_n-E(S_n)\geqslant t) &= P(e^{s(S_n-E(S_n))}\geqslant e^{st})\\ &\leqslant e^{-st}E(e^{s(S_n-E(S_n))})\\ &=e^{-st}\prod_{i=1}^{n}E(e^{s(x_i-E(x_i))}) \end{aligned}\tag{3.2}$$
The inequality is Markov's inequality applied to $e^{s(S_n-E(S_n))}$, and the last equality uses independence. Each $x_i-E(x_i)$ has mean zero and lies in an interval of width $b_i-a_i$, so Hoeffding's lemma gives
$$\begin{aligned} P(S_n-E(S_n)\geqslant t)&\leqslant e^{-st}\prod_{i=1}^{n}e^{\frac{s^2(b_i-a_i)^2}{8}}\\ &= e^{-st+\frac{1}{8}s^2\sum\limits_{i=1}^{n}(b_i-a_i)^2} \end{aligned}\tag{3.3}$$
Let
$$f(s)=-st+\frac{1}{8}s^2\sum\limits_{i=1}^{n}(b_i-a_i)^2\tag{3.4}$$
Setting $f^{'}(s)=0$ gives the minimizer, which is positive for $t>0$. Substituting it back, then replacing $t$ by $nt$ (since $\overline{x}-E(\overline{x})\geqslant t$ is the same event as $S_n-E(S_n)\geqslant nt$), and finally applying the same bound to $-x_i$ and adding the two tails:
$$\begin{aligned} s&=\frac{4t}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}\\ P(S_n-E(S_n)\geqslant t)&\leqslant e^{\frac{-2t^2}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\\ P(\overline{x}-E(\overline{x})\geqslant t)&\leqslant e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\\ P(|\overline{x}-E(\overline{x})|\geqslant t)&\leqslant 2e^{-\frac{2t^{2}n^{2}}{\sum\limits_{i=1}^{n}(b_i-a_i)^2}}\end{aligned}\tag{3.5}$$
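To see how the two-sided bound in (3.5) compares with actual tail frequencies, here is a small simulation sketch. The Bernoulli(0.3) variables, the sample size `n = 50`, the trial count, and the helper names are all assumptions chosen for illustration.

```python
import math
import random

def hoeffding_bound(t, widths):
    """Two-sided bound 2 * exp(-2 t^2 n^2 / sum_i (b_i - a_i)^2) from (3.5)."""
    n = len(widths)
    return 2 * math.exp(-2 * t ** 2 * n ** 2 / sum(w ** 2 for w in widths))

def empirical_tail(n, p, t, trials, rng):
    """Fraction of trials in which |mean - p| >= t for n Bernoulli(p) draws."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= t:
            hits += 1
    return hits / trials

rng = random.Random(0)
n, p = 50, 0.3
for t in (0.05, 0.1, 0.2):
    emp = empirical_tail(n, p, t, trials=5000, rng=rng)
    bound = hoeffding_bound(t, [1.0] * n)    # every x_i lies in [0, 1]
    print(f"t={t}: empirical {emp:.4f} <= bound {bound:.4f}")
```

For small `t` the bound exceeds 1 and carries no information; it only becomes useful once $t^2 n$ is large enough.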

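  • (2) Contingency-table analysis of the adult.data dataset
    Overview: analyze the association between the education and race columns of the UCI adult dataset. The following quantities are required:
  • Covariance
  • Contingency table
  • Expected frequencies
  • Chi-square statistic
  • p-value
  • Confidence level
  • Association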

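# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# Read the data
'''
Note: the downloaded data was lightly preprocessed first. The original file has no column names and is not in CSV format, so it was converted accordingly.
'''
adult_data = pd.read_csv('C:/Users/LENVOV/Desktop/数据挖掘作业/adult_data.csv')
# Inspect the data: first 3 rows and last 3 rows
# (pd.concat is used because DataFrame.append no longer exists in pandas 2.x)
pd.concat([adult_data.head(3), adult_data.tail(3)])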
       age         workclass  fnlwgt  education  education-num      marital-status         occupation   relationship
0       39         State-gov   77516  Bachelors             13       Never-married       Adm-clerical  Not-in-family
1       50  Self-emp-not-inc   83311  Bachelors             13  Married-civ-spouse    Exec-managerial        Husband
2       38           Private  215646    HS-grad              9            Divorced  Handlers-cleaners  Not-in-family
32558   58           Private  151910    HS-grad              9             Widowed       Adm-clerical      Unmarried
32559   22           Private  201490    HS-grad              9       Never-married       Adm-clerical      Own-child
32560   52      Self-emp-inc  287927    HS-grad              9  Married-civ-spouse    Exec-managerial           Wife

        race     sex  capital-gain  capital-loss  hours-per-week  native-country  class
0      White    Male          2174             0              40   United-States  <=50K
1      White    Male             0             0              13   United-States  <=50K
2      White    Male             0             0              40   United-States  <=50K
32558  White  Female             0             0              40   United-States  <=50K
32559  White    Male             0             0              20   United-States  <=50K
32560  White  Female         15024             0              40   United-States   >50K

Mean formula
$$\overline{x} = \frac{\sum\limits_{i=1}^{n}x_{i}}{n}$$

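# Mean of the education column: the relative frequency of each category
# (value_counts() sorts the categories by frequency, most common first)
education_value = adult_data['education'].value_counts(dropna = False, normalize = True)
print("education's mean:\n", np.array(education_value).reshape(-1, 1))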
education's mean:
 [[0.32250238]
 [0.22391818]
 [0.16446055]
 [0.05291607]
 [0.04244341]
 [0.03608612]
 [0.03276926]
 [0.02865391]
 [0.01983969]
 [0.01768987]
 [0.01578576]
 [0.01329812]
 [0.01268389]
 [0.01022696]
 [0.00515955]
 [0.00156629]]
# Mean of the race column: the relative frequency of each category
race_value = adult_data['race'].value_counts(dropna = False, normalize = True)
print("race's mean:\n", np.array(race_value).reshape(-1, 1))
race's mean:
 [[0.85427352]
 [0.095943  ]
 [0.03190934]
 [0.0095513 ]
 [0.00832284]]

Sample mean formula
$$\begin{aligned}\hat{\mu} &=\frac{1}{n}\sum\limits_{i=1}^{n}x_i\\ &=\frac{1}{n}\left(\begin{array}{c}\sum_{i=1}^{m_1}n_{i}^{1}e_{1i}\\ \sum_{j=1}^{m_2}n_{j}^{2}e_{2j}\end{array}\right)\end{aligned}$$
Here each observation $x_i$ is the stacked one-hot encoding of the two attributes, $n_{i}^{1}$ and $n_{j}^{2}$ are the category counts of the first and second attribute, and $e_{1i}$, $e_{2j}$ are the corresponding unit vectors.
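The formula says that the sample mean of one-hot encoded observations is just the vector of relative frequencies. A minimal sketch on a made-up six-element sample (the values below are hypothetical and not read from adult.data):

```python
import numpy as np

# Hypothetical toy sample of a single categorical attribute.
values = np.array(["HS-grad", "Bachelors", "HS-grad", "Masters", "HS-grad", "Bachelors"])

# np.unique returns the sorted categories and, with return_inverse=True,
# the category index of every observation.
categories, codes = np.unique(values, return_inverse=True)

# One-hot encode: row k is the unit vector e_i of observation k's category.
one_hot = np.eye(len(categories))[codes]

# The mean of the one-hot rows equals the relative frequencies n_i / n.
mu_hat = one_hot.mean(axis=0)
print(dict(zip(categories.tolist(), mu_hat.tolist())))
```

This is the same quantity that `value_counts(normalize=True)` reports, up to the ordering of the categories.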

edu_mean =  np.array(education_value).reshape(-1, 1)
race_mean = np.array(race_value).reshape(-1, 1)
print('Sample mean:\n', np.vstack((edu_mean, race_mean)))
Sample mean:
 [[0.32250238]
 [0.22391818]
 [0.16446055]
 [0.05291607]
 [0.04244341]
 [0.03608612]
 [0.03276926]
 [0.02865391]
 [0.01983969]
 [0.01768987]
 [0.01578576]
 [0.01329812]
 [0.01268389]
 [0.01022696]
 [0.00515955]
 [0.00156629]
 [0.85427352]
 [0.095943  ]
 [0.03190934]
 [0.0095513 ]
 [0.00832284]]
# Joint probabilities from the contingency table
# (pd.crosstab orders rows and columns by category label)
edu_race_crosstab = pd.crosstab(adult_data['education'], adult_data['race'])
edu_race_crosstab = np.array(edu_race_crosstab)
P_edu_race = edu_race_crosstab / len(adult_data)
print(P_edu_race)
[[4.91385400e-04 3.99250637e-04 4.08464114e-03 2.76404287e-04
  2.34022297e-02]
 [4.29962225e-04 6.44943337e-04 4.69887288e-03 3.07115875e-04
  3.00052210e-02]
 [1.53557937e-04 2.76404287e-04 2.14981112e-03 4.29962225e-04
  1.02883818e-02]
 [1.22846350e-04 1.53557937e-04 4.91385400e-04 2.76404287e-04
  4.11535272e-03]
 [6.14231750e-05 5.52808575e-04 6.44943337e-04 3.99250637e-04
  8.56853291e-03]
 [2.76404287e-04 3.37827462e-04 1.71984890e-03 5.22096987e-04
  1.69835079e-02]
 [1.53557937e-04 2.76404287e-04 2.73333129e-03 2.45692700e-04
  1.23767698e-02]
 [2.45692700e-04 8.90636037e-04 3.28613986e-03 2.45692700e-04
  2.81011025e-02]
 [5.83520162e-04 1.16704032e-03 3.43969780e-03 1.84269525e-04
  3.70688861e-02]
 [6.44943337e-04 8.87564878e-03 1.01348239e-02 1.01348239e-03
  1.43791653e-01]
 [9.21347624e-05 8.59924449e-04 3.37827462e-04 6.14231750e-05
  1.13325758e-02]
 [3.65467891e-03 6.94081877e-03 3.60554037e-02 2.39550382e-03
  2.73455975e-01]
 [1.53557937e-04 2.70261970e-03 2.64119652e-03 2.14981112e-04
  4.72037100e-02]
 [0.00000000e+00 1.84269525e-04 1.53557937e-04 6.14231750e-05
  1.16704032e-03]
 [6.14231750e-05 1.25917509e-03 4.60673812e-04 1.22846350e-04
  1.57857560e-02]
 [2.42621541e-03 6.38801020e-03 2.29108443e-02 1.56629096e-03
  1.90626824e-01]]

Covariance matrix formula
$$\varSigma=\left(\begin{matrix}\varSigma_{11}& \varSigma_{12}\\ \varSigma_{21}& \varSigma_{22}\\\end{matrix}\right)=\left(\begin{matrix}P_{1}-p_{1}p_{1}^{T}& P_{12}-p_{1}p_{2}^{T}\\ P_{12}^{T}-p_{2}p_{1}^{T}& P_{2}-p_{2}p_{2}^{T}\end{matrix}\right)$$
where $P_{1}=diag\left(p_1\right)$, $P_{2}=diag\left(p_2\right)$, $p_1$ and $p_2$ are the marginal probability vectors of the two attributes, and $P_{12}$ is the matrix of joint probabilities.

$\varSigma_{12}$ is defined as follows:

$$\begin{aligned}\varSigma_{12}&=E[(X_1-\mu_{1})(X_2-\mu_{2})^T] \\ &= E[X_1X_{2}^{T}]-E[X_1]E[X_2]^{T}\\ &= P_{12}-\mu_{1}\mu_{2}^{T}\\ &=\left(\begin{matrix}p_{11}-p_{1}^{1}p_{1}^{2}& p_{12}-p_{1}^{1}p_{2}^{2}&\cdots&p_{1m_{2}}-p_{1}^{1}p_{m_{2}}^{2}\\ p_{21}-p_{2}^{1}p_{1}^{2}&p_{22}-p_{2}^{1}p_{2}^{2}&\cdots&p_{2m_{2}}-p_{2}^{1}p_{m_{2}}^{2}\\ \vdots&\vdots&\ddots&\vdots\\ p_{m_{1}1}-p_{m_{1}}^{1}p_{1}^{2}&p_{m_{1}2}-p_{m_1}^{1}p_{2}^{2}&\cdots&p_{m_{1}m_{2}}-p_{m_{1}}^{1}p_{m_{2}}^{2}\end{matrix}\right) \end{aligned}$$

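# Compute sigama12
# Caution: education_value and race_value come from value_counts(), which orders
# categories by frequency, while the rows and columns of P_edu_race follow
# pd.crosstab's label order. The marginals therefore need .sort_index() to line up
# with P_edu_race before the off-diagonal blocks can be read as covariances.
sigama12 = P_edu_race - np.array(education_value).reshape(-1, 1) * np.array(race_value)
# Compute sigama21
sigama21 = sigama12.T
# Compute sigama11
P1 = np.diag(np.array(education_value))
sigama11 = P1 - np.array(education_value).reshape(-1, 1) * np.array(education_value)
# Compute sigama22
P2 = np.diag(np.array(race_value))
sigama22 = P2 - np.array(race_value) * np.array(race_value).reshape(-1, 1)

sigama1112 = np.append(sigama11, sigama12, axis = 1)
sigama2122 = np.append(sigama21, sigama22, axis = 1)
# Covariance matrix
sigama = np.append(sigama1112, sigama2122, axis = 0)
print('Covariance matrix:\n', sigama)
Covariance matrix: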
 [[ 2.18494595e-01 -7.22141474e-02 -5.30389191e-02 -1.70655570e-02
  -1.36881020e-02 -1.16378581e-02 -1.05681656e-02 -9.24095454e-03
  -6.39834580e-03 -5.70502660e-03 -5.09094387e-03 -4.28867451e-03
  -4.09058331e-03 -3.29821850e-03 -1.66396609e-03 -5.05132563e-04
  -2.75013857e-01 -3.05425950e-02 -6.20619677e-03 -2.80391389e-03
   2.07180939e-02]
 [-7.22141474e-02  1.73778831e-01 -3.68257080e-02 -1.18488692e-02
  -9.50385218e-03 -8.08033742e-03 -7.33763406e-03 -6.41613175e-03
  -4.44246636e-03 -3.96108455e-03 -3.53471781e-03 -2.97769030e-03
  -2.84015264e-03 -2.29000201e-03 -1.15531633e-03 -3.50721028e-04
  -1.90857413e-01 -2.08384389e-02 -2.44620846e-03 -1.83159471e-03
   2.81415857e-02]
 [-5.30389191e-02 -3.68257080e-02  1.37413278e-01 -8.70260524e-03
  -6.98026723e-03 -5.93474240e-03 -5.38925119e-03 -4.71243801e-03
  -3.26284561e-03 -2.90928649e-03 -2.59613412e-03 -2.18701571e-03
  -2.08599882e-03 -1.68193125e-03 -8.48541893e-04 -2.57593075e-04
  -1.40340735e-01 -1.55024342e-02 -3.09801641e-03 -1.14085045e-03
   8.91960292e-03]
 [-1.70655570e-02 -1.18488692e-02 -8.70260524e-03  5.01159553e-02
  -2.24593846e-03 -1.90953523e-03 -1.73402050e-03 -1.51625223e-03
  -1.04983809e-03 -9.36078547e-04 -8.35320092e-04 -7.03684047e-04
  -6.71181319e-04 -5.41170410e-04 -2.73022910e-04 -8.28819547e-05
  -4.50819468e-02 -4.92336807e-03 -1.19713129e-03 -2.29013123e-04
   3.67494077e-03]
 [-1.36881020e-02 -9.50385218e-03 -6.98026723e-03 -2.24593846e-03
   4.06419705e-02 -1.53161793e-03 -1.39083943e-03 -1.21616981e-03
  -8.42063984e-04 -7.50818661e-04 -6.70001374e-04 -5.64417500e-04
  -5.38347408e-04 -4.34067038e-04 -2.18988776e-04 -6.64787356e-05
  -3.61968613e-02 -3.51933986e-03 -7.09397962e-04 -6.13929925e-06
   8.21528316e-03]
 [-1.16378581e-02 -8.08033742e-03 -5.93474240e-03 -1.90953523e-03
  -1.53161793e-03  3.47839076e-02 -1.18251543e-03 -1.03400834e-03
  -7.15937179e-04 -6.38358847e-04 -5.69646610e-04 -4.79877397e-04
  -4.57712159e-04 -3.69051208e-04 -1.86187997e-04 -5.65213562e-05
  -3.05510084e-02 -3.12438267e-03  5.68364799e-04  1.77427540e-04
   1.66831689e-02]
 [-1.05681656e-02 -7.33763406e-03 -5.38925119e-03 -1.73402050e-03
  -1.39083943e-03 -1.18251543e-03  3.16954392e-02 -9.38967574e-04
  -6.50131889e-04 -5.79684161e-04 -5.17287602e-04 -4.35769517e-04
  -4.15641595e-04 -3.35129906e-04 -1.69074547e-04 -5.13262018e-05
  -2.78403563e-02 -2.86757717e-03  1.68768572e-03 -6.72964914e-05
   1.21040364e-02]
 [-9.24095454e-03 -6.41613175e-03 -4.71243801e-03 -1.51625223e-03
  -1.21616981e-03 -1.03400834e-03 -9.38967574e-04  2.78328645e-02
  -5.68484585e-04 -5.06884088e-04 -4.52323648e-04 -3.81043073e-04
  -3.63442932e-04 -2.93042364e-04 -1.47841192e-04 -4.48803620e-05
  -2.42325847e-02 -1.85850614e-03  2.37181249e-03 -2.79895076e-05
   2.78626206e-02]
 [-6.39834580e-03 -4.44246636e-03 -3.26284561e-03 -1.04983809e-03
  -8.42063984e-04 -7.15937179e-04 -6.50131889e-04 -5.68484585e-04
   1.94460724e-02 -3.50961545e-04 -3.13184434e-04 -2.63830467e-04
  -2.51644302e-04 -2.02899643e-04 -1.02363784e-04 -3.10747201e-05
  -1.63649978e-02 -7.36438609e-04  2.80662654e-03 -5.22533690e-06
   3.69037636e-02]
 [-5.70502660e-03 -3.96108455e-03 -2.90928649e-03 -9.36078547e-04
  -7.50818661e-04 -6.38358847e-04 -5.79684161e-04 -5.06884088e-04
  -3.50961545e-04  1.73769427e-02 -2.79248040e-04 -2.35242026e-04
  -2.24376344e-04 -1.80913614e-04 -9.12717330e-05 -2.77074904e-05
  -1.44670479e-02  7.17842918e-03  9.57035166e-03  8.44521024e-04
   1.43644423e-01]
 [-5.09094387e-03 -3.53471781e-03 -2.59613412e-03 -8.35320092e-04
  -6.70001374e-04 -5.69646610e-04 -5.17287602e-04 -4.52323648e-04
  -3.13184434e-04 -2.79248040e-04  1.55365659e-02 -2.09920836e-04
  -2.00224723e-04 -1.61440273e-04 -8.14473451e-05 -2.47250869e-05
  -1.33932185e-02 -6.54608324e-04 -1.65885582e-04 -8.93513745e-05
   1.12011935e-02]
 [-4.28867451e-03 -2.97769030e-03 -2.18701571e-03 -7.03684047e-04
  -5.64417500e-04 -4.79877397e-04 -4.35769517e-04 -3.81043073e-04
  -2.63830467e-04 -2.35242026e-04 -2.09920836e-04  1.31212775e-02
  -1.68671800e-04 -1.35999296e-04 -6.86122576e-05 -2.08287211e-05
  -7.70555060e-03  5.66495750e-03  3.56310696e-02  2.26848947e-03
   2.73345297e-01]
 [-4.09058331e-03 -2.84015264e-03 -2.08599882e-03 -6.71181319e-04
  -5.38347408e-04 -4.57712159e-04 -4.15641595e-04 -3.63442932e-04
  -2.51644302e-04 -2.24376344e-04 -2.00224723e-04 -1.68671800e-04
   1.25230047e-02 -1.29717574e-04 -6.54431002e-05 -1.98666554e-05
  -1.06819497e-02  1.48568967e-03  2.23646211e-03  9.38334685e-05
   4.70981440e-02]
 [-3.29821850e-03 -2.29000201e-03 -1.68193125e-03 -5.41170410e-04
  -4.34067038e-04 -3.69051208e-04 -3.35129906e-04 -2.93042364e-04
  -2.02899643e-04 -1.80913614e-04 -1.61440273e-04 -1.35999296e-04
  -1.29717574e-04  1.01223679e-02 -5.27664706e-05 -1.60183929e-05
  -8.73661992e-03 -7.96935560e-04 -1.72777557e-04 -3.62576129e-05
   1.08192298e-03]
 [-1.66396609e-03 -1.15531633e-03 -8.48541893e-04 -2.73022910e-04
  -2.18988776e-04 -1.86187997e-04 -1.69074547e-04 -1.47841192e-04
  -1.02363784e-04 -9.12717330e-05 -8.14473451e-05 -6.86122576e-05
  -6.54431002e-05 -5.27664706e-05  5.13292577e-03 -8.08135136e-06
  -4.34624093e-03  7.64152702e-04  2.96036086e-04  7.35659524e-05
   1.57428139e-02]
 [-5.05132563e-04 -3.50721028e-04 -2.57593075e-04 -8.28819547e-05
  -6.64787356e-05 -5.65213562e-05 -5.13262018e-05 -4.48803620e-05
  -3.10747201e-05 -2.77074904e-05 -2.47250869e-05 -2.08287211e-05
  -1.98666554e-05 -1.60183929e-05 -8.08135136e-06  1.56383769e-03
   1.08817452e-03  6.23773554e-03  2.28608650e-02  1.55133084e-03
   1.90613788e-01]
 [-2.75013857e-01 -1.90857413e-01 -1.40340735e-01 -4.50819468e-02
  -3.61968613e-02 -3.05510084e-02 -2.78403563e-02 -2.42325847e-02
  -1.63649978e-02 -1.44670479e-02 -1.33932185e-02 -7.70555060e-03
  -1.06819497e-02 -8.73661992e-03 -4.34624093e-03  1.08817452e-03
   1.24490275e-01 -8.19615635e-02 -2.72593036e-02 -8.15942581e-03
  -7.10998198e-03]
 [-3.05425950e-02 -2.08384389e-02 -1.55024342e-02 -4.92336807e-03
  -3.51933986e-03 -3.12438267e-03 -2.86757717e-03 -1.85850614e-03
  -7.36438609e-04  7.17842918e-03 -6.54608324e-04  5.66495750e-03
   1.48568967e-03 -7.96935560e-04  7.64152702e-04  6.23773554e-03
  -8.19615635e-02  8.67379402e-02 -3.06147773e-03 -9.16380725e-04
  -7.98518252e-04]
 [-6.20619677e-03 -2.44620846e-03 -3.09801641e-03 -1.19713129e-03
  -7.09397962e-04  5.68364799e-04  1.68768572e-03  2.37181249e-03
   2.80662654e-03  9.57035166e-03 -1.65885582e-04  3.56310696e-02
   2.23646211e-03 -1.72777557e-04  2.96036086e-04  2.28608650e-02
  -2.72593036e-02 -3.06147773e-03  3.08911335e-02 -3.04775792e-04
  -2.65576333e-04]
 [-2.80391389e-03 -1.83159471e-03 -1.14085045e-03 -2.29013123e-04
  -6.13929925e-06  1.77427540e-04 -6.72964914e-05 -2.79895076e-05
  -5.22533690e-06  8.44521024e-04 -8.93513745e-05  2.26848947e-03
   9.38334685e-05 -3.62576129e-05  7.35659524e-05  1.55133084e-03
  -8.15942581e-03 -9.16380725e-04 -3.04775792e-04  9.46007630e-03
  -7.94939745e-05]
 [ 2.07180939e-02  2.81415857e-02  8.91960292e-03  3.67494077e-03
   8.21528316e-03  1.66831689e-02  1.21040364e-02  2.78626206e-02
   3.69037636e-02  1.43644423e-01  1.12011935e-02  2.73345297e-01
   4.70981440e-02  1.08192298e-03  1.57428139e-02  1.90613788e-01
  -7.10998198e-03 -7.98518252e-04 -2.65576333e-04 -7.94939745e-05
   8.25357054e-03]]
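The block formula for the covariance matrix can be cross-checked on a small table. The sketch below uses a hypothetical 3 x 2 table of counts (not the adult.data table) and takes both marginals from the joint table itself, so their category order matches its rows and columns. It then compares the result with the covariance of explicitly one-hot encoded observations.

```python
import numpy as np

# Hypothetical 3 x 2 contingency table (rows: attribute 1, columns: attribute 2).
counts = np.array([[10, 5],
                   [ 4, 6],
                   [ 1, 4]])
n = counts.sum()

P12 = counts / n              # joint probabilities
p1 = P12.sum(axis=1)          # row marginals, in the table's row order
p2 = P12.sum(axis=0)          # column marginals, in the table's column order

sigma11 = np.diag(p1) - np.outer(p1, p1)
sigma12 = P12 - np.outer(p1, p2)
sigma22 = np.diag(p2) - np.outer(p2, p2)
sigma = np.block([[sigma11, sigma12],
                  [sigma12.T, sigma22]])

# Cross-check: expand the table into one observation per count and one-hot encode.
rows, cols = np.indices(counts.shape)
r = np.repeat(rows.ravel(), counts.ravel())   # attribute-1 code of each observation
c = np.repeat(cols.ravel(), counts.ravel())   # attribute-2 code of each observation
X = np.hstack([np.eye(counts.shape[0])[r], np.eye(counts.shape[1])[c]])
sigma_check = np.cov(X, rowvar=False, bias=True)   # bias=True divides by n

print(np.allclose(sigma, sigma_check))
```

Every row and column of `sigma12` sums to zero, which is a convenient check that the marginals and the joint table are aligned.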
# Contingency table in matrix form (with margins)
Crosstab = pd.crosstab(adult_data['education'], adult_data['race'], margins = True)
# Observed frequency of each pair of values
print('Contingency table:\n',Crosstab)
Crosstab_1 = pd.crosstab(adult_data['education'], adult_data['race'],margins = False)
Ct = np.array(Crosstab_1)
Ct_1=Ct/len(adult_data)
Contingency table:
 race            Amer-Indian-Eskimo   Asian-Pac-Islander   Black   Other  \
education                                                                 
 10th                           16                   13     133       9   
 11th                           14                   21     153      10   
 12th                            5                    9      70      14   
 1st-4th                         4                    5      16       9   
 5th-6th                         2                   18      21      13   
 7th-8th                         9                   11      56      17   
 9th                             5                    9      89       8   
 Assoc-acdm                      8                   29     107       8   
 Assoc-voc                      19                   38     112       6   
 Bachelors                      21                  289     330      33   
 Doctorate                       3                   28      11       2   
 HS-grad                       119                  226    1174      78   
 Masters                         5                   88      86       7   
 Preschool                       0                    6       5       2   
 Prof-school                     2                   41      15       4   
 Some-college                   79                  208     746      51   
All                            311                 1039    3124     271   

race            White    All  
education                     
 10th             762    933  
 11th             977   1175  
 12th             335    433  
 1st-4th          134    168  
 5th-6th          279    333  
 7th-8th          553    646  
 9th              403    514  
 Assoc-acdm       915   1067  
 Assoc-voc       1207   1382  
 Bachelors       4682   5355  
 Doctorate        369    413  
 HS-grad         8904  10501  
 Masters         1537   1723  
 Preschool         38     51  
 Prof-school      514    576  
 Some-college    6207   7291  
All             27816  32561  
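# Column totals (one per race category; summing over axis 0 collapses the rows)
Ct_row = np.sum(Ct, axis = 0)
# Row totals (one per education level)
Ct_column = np.sum(Ct, axis = 1)
Ct_column_T = Ct_column.reshape(-1, 1)
# Expected frequency of each pair of values under independence
e = (Ct_column_T * Ct_row) / len(adult_data)
print('Expected frequency of each pair of values:\n', e)
Expected frequency of each pair of values: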
 [[8.91136636e+00 2.97714137e+01 8.95148183e+01 7.76520991e+00
  7.97037192e+02]
 [1.12227819e+01 3.74934738e+01 1.12733024e+02 9.77933724e+00
  1.00377138e+03]
 [4.13571451e+00 1.38167440e+01 4.15433187e+01 3.60378981e+00
  3.69900433e+02]
 [1.60461902e+00 5.36076902e+00 1.61184239e+01 1.39823715e+00
  1.43517951e+02]
 [3.18058413e+00 1.06258100e+01 3.19490188e+01 2.77150579e+00
  2.84473081e+02]
 [6.17014219e+00 2.06134332e+01 6.19791775e+01 5.37655477e+00
  5.51860692e+02]
 [4.90937011e+00 1.64014004e+01 4.93147016e+01 4.27793987e+00
  4.39096588e+02]
 [1.01912411e+01 3.40472651e+01 1.02371180e+02 8.88047050e+00
  9.11509843e+02]
 [1.31999017e+01 4.40987070e+01 1.32593225e+02 1.15021652e+01
  1.18060600e+03]
 [5.11472314e+01 1.70874512e+02 5.13774761e+02 4.45688093e+01
  4.57463469e+03]
 [3.94468843e+00 1.31785572e+01 3.96244587e+01 3.43733301e+00
  3.52814963e+02]
 [1.00298240e+02 3.35079973e+02 1.00749744e+03 8.73981450e+01
  8.97072621e+03]
 [1.64568963e+01 5.49797918e+01 1.65309788e+02 1.43402537e+01
  1.47191327e+03]
 [4.87116489e-01 1.62737631e+00 4.89309296e+00 4.24464851e-01
  4.35679494e+01]
 [5.50155094e+00 1.83797795e+01 5.52631676e+01 4.79395596e+00
  4.92061546e+02]
 [6.96385553e+01 2.32650994e+02 6.99520408e+02 6.06818280e+01
  6.22850822e+03]]

Chi-square statistic formula
$$\chi^{2}=\sum\limits_{i=1}^{m_{1}}\sum\limits_{j=1}^{m_{2}}\frac{(n_{ij}-e_{ij})^{2}}{e_{ij}}$$
where $n_{ij}$ is the observed frequency and $e_{ij}$ is the expected frequency.

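# Chi-square statistic X_2: quantifies the difference between the observed
# and expected frequency of each pair of values
X_2 = 0 # initialize
for i in range(0, 16):      # 16 education levels
    for j in range(0, 5):   # 5 race categories
        k = ((Ct[i, j] - e[i, j])**2) / e[i, j]
        X_2 = X_2 + k
print('Chi-square statistic:', X_2)
Chi-square statistic: 730.6712962254584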
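The same statistic can be written as a single array expression and compared against SciPy's built-in test. The sketch below runs on a hypothetical 2 x 3 table so that it is self-contained; it does not reproduce the adult.data numbers.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 table of observed counts.
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected counts under independence: e_ij = (row total i) * (column total j) / n.
expected = np.outer(row_totals, col_totals) / n

# Chi-square statistic without explicit loops.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# SciPy's test returns the statistic, p-value, degrees of freedom and expected counts.
# correction=False turns off Yates' continuity correction (it only affects 2 x 2 tables).
stat, p, dof, exp_freq = stats.chi2_contingency(observed, correction=False)

print('manual:', chi2_stat, 'scipy:', stat, 'dof:', dof)
```

Avoiding the hard-coded `range(0, 16)` and `range(0, 5)` bounds means the expression keeps working if the table shape changes.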

Degrees of freedom formula
$$q = \left(m_{1}-1\right)\left(m_{2}-1\right)$$

# Degrees of freedom
q = (16 - 1) * (5 - 1)
print('Degrees of freedom:', q)
Degrees of freedom: 60

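p-value formula
$$p\left(z\right)=P\left(\theta\geqslant z\right)=1-F\left(z\right)$$
where $z$ is the observed statistic, $\theta$ is a chi-square random variable with the degrees of freedom computed above, and $F$ is its cumulative distribution function.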

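# Compute the p-value
# (for a statistic this large the cdf rounds to 1.0, so 1 - F prints as 0.0)
F = stats.chi2.cdf(X_2, q)
p_value = 1-F
print('p-value:', p_value)
p-value: 0.0
# The computed p-value is 0.0; choose a = 0.1 > p-value
a = 0.1 
# Critical value: the (1 - a) quantile of the chi-square distribution (inverse cdf)
z = stats.chi2.ppf(1 - a, q)
print('Critical value of the statistic at significance level a = 0.1:', round(z, 3))
Critical value of the statistic at significance level a = 0.1: 74.397
# Confidence level
confidence_level = 1 - a
print('Confidence level:', confidence_level)
Confidence level: 0.9
# Hypothesis test
if p_value < a:
    print('The two attributes are associated')
else:
    print('The two attributes are independent')
The two attributes are associated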
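Because `1 - cdf` loses all precision for a statistic this far in the tail, a hedged alternative is SciPy's survival function `stats.chi2.sf`, which evaluates the upper tail directly. The sketch below reuses the statistic and degrees of freedom reported above; the variable names are illustrative.

```python
from scipy import stats

chi2_stat = 730.6712962254584   # chi-square statistic reported above
dof = 60                        # (16 - 1) * (5 - 1)
alpha = 0.1

# Upper-tail probability P(theta >= chi2_stat), computed without forming 1 - cdf.
p_value = stats.chi2.sf(chi2_stat, dof)

# Equivalent decision rule: compare the statistic with the (1 - alpha) quantile.
critical = stats.chi2.ppf(1 - alpha, dof)

print('p-value:', p_value)
print('reject independence:', chi2_stat > critical)
```

Both decision rules agree: the p-value is far below `alpha`, and the statistic is far above the critical value.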