chi-square test (the principle used in C4.5's CVP pruning),
also called the chi-square statistic,
also called the chi-square goodness-of-fit test.
Here is the contingency table notation: an $r\times s$ table with cell counts $X_{ij}$, row totals $N_{i\cdot}$, column totals $N_{\cdot j}$, and grand total $n$.
The target is to prove:

$$\sum_{i=1}^{r}\sum_{j=1}^{s}\frac{\left[X_{ij}-N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)\right]^2}{N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)}\sim\chi^2[(r-1)(s-1)]\quad ①$$
Note:
the left side above is a "discrete" statistic;
the right side above is a "continuous" distribution.
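As a quick numerical sketch of statistic ①, the following pure-Python helper (my own illustration, not from the original post; the example table values are made up) computes the statistic and its degrees of freedom for a contingency table:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic of equation (1) for an r x s table X_ij."""
    r, s = len(table), len(table[0])
    n = sum(map(sum, table))                                      # grand total
    row = [sum(t) for t in table]                                 # N_{i.}
    col = [sum(table[i][j] for i in range(r)) for j in range(s)]  # N_{.j}
    stat = sum((table[i][j] - row[i] * col[j] / n) ** 2
               / (row[i] * col[j] / n)
               for i in range(r) for j in range(s))
    dof = (r - 1) * (s - 1)                                       # (r-1)(s-1)
    return stat, dof

stat, dof = chi_square_statistic([[10, 20], [30, 40]])
print(stat, dof)  # 50/63 ≈ 0.7937, with 1 degree of freedom
```

Comparing `stat` against the $\chi^2[(r-1)(s-1)]$ quantiles is exactly the test used for pruning decisions.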
----------------------------------------------
Let's review the concepts of the "Multi-dimensional Normal Distribution", according to [1]:
$$X\sim N(\mu,\Sigma)$$
$$\mu=[E[X_1],E[X_2],\dots,E[X_s]]^T$$
$$\Sigma := [\operatorname{Cov}[X_i,X_j];\ 1\le i,j\le s]$$
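To connect these definitions to what follows, here is a small numpy sketch (my own example, not from [1]): for a one-hot indicator vector $X$ of a categorical variable with probabilities $p$, the definitions give $\mu=p$ and $\Sigma=\operatorname{diag}(p)-pp^T$, the same covariance shape that appears as $\Sigma^*$ below.

```python
import numpy as np

# Illustrative sketch (assumed example): X is the one-hot indicator of a
# categorical variable with P(category k) = p[k].  Compute mu = E[X] and
# Sigma = [Cov[X_i, X_j]] by direct expectation over the three outcomes.
p = np.array([0.2, 0.3, 0.5])
outcomes = np.eye(3)                          # one-hot outcomes e_1, e_2, e_3

mu = sum(p[k] * outcomes[k] for k in range(3))                    # E[X]
second = sum(p[k] * np.outer(outcomes[k], outcomes[k]) for k in range(3))
sigma = second - np.outer(mu, mu)             # Cov = E[X X^T] - mu mu^T

assert np.allclose(mu, p)                                  # mu equals p
assert np.allclose(sigma, np.diag(p) - np.outer(p, p))     # diag(p) - p p^T
```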
-----------------------------------------------------------------------------------------------
For a single row $i$:

$$\sum_{j=1}^{s}\frac{\left[X_{ij}-N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)\right]^2}{N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)}
= N_{i\cdot}\sum_{j=1}^{s}\frac{\left[\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right]^2}{\frac{N_{\cdot j}}{n}}$$

$$= N_{i\cdot}\left\{\sum_{j=1}^{s-1}\frac{\left[\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right]^2}{\frac{N_{\cdot j}}{n}}+\frac{\left[\frac{X_{is}}{N_{i\cdot}}-\frac{N_{\cdot s}}{n}\right]^2}{\frac{N_{\cdot s}}{n}}\right\}$$

$$= N_{i\cdot}\left\{\sum_{j=1}^{s-1}\frac{\left[\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right]^2}{\frac{N_{\cdot j}}{n}}+\frac{\left[\sum_{j=1}^{s-1}\left(\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right)\right]^2}{\frac{N_{\cdot s}}{n}}\right\}$$

where the last step uses $\frac{X_{is}}{N_{i\cdot}}-\frac{N_{\cdot s}}{n}=-\sum_{j=1}^{s-1}\left(\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right)$, since both $\frac{X_{ij}}{N_{i\cdot}}$ and $\frac{N_{\cdot j}}{n}$ sum to 1 over $j$.
Let's set

$$p^*=\left(\frac{N_{\cdot 1}}{n},\dots,\frac{N_{\cdot(s-1)}}{n}\right)^T$$

$$\overline{X}^*=\left(\frac{X_{i1}}{N_{i\cdot}},\dots,\frac{X_{i(s-1)}}{N_{i\cdot}}\right)^T$$
So,

$$N_{i\cdot}\sum_{j=1}^{s}\frac{\left[\frac{X_{ij}}{N_{i\cdot}}-\frac{N_{\cdot j}}{n}\right]^2}{\frac{N_{\cdot j}}{n}}
=N_{i\cdot}(\overline{X}^*-p^*)^T(\Sigma^*)^{-1}(\overline{X}^*-p^*)$$
where

$$\Sigma^*=\begin{bmatrix}p_1&0&\cdots&0\\0&p_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&p_{s-1}\end{bmatrix}-\begin{bmatrix}p_1\\p_2\\\vdots\\p_{s-1}\end{bmatrix}\begin{bmatrix}p_1\\p_2\\\vdots\\p_{s-1}\end{bmatrix}^T$$

with $p_j=\frac{N_{\cdot j}}{n}$.
According to the Sherman-Morrison formula:

$$(\Sigma^*)^{-1}=\begin{bmatrix}\frac{1}{p_1}&0&\cdots&0\\0&\frac{1}{p_2}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\frac{1}{p_{s-1}}\end{bmatrix}+\frac{1}{p_s}\begin{bmatrix}1&1&\cdots&1\\1&1&\cdots&1\\\vdots&\vdots&\ddots&\vdots\\1&1&\cdots&1\end{bmatrix}$$

(the rank-one correction enters with a plus sign, since $1-\sum_{j=1}^{s-1}p_j=p_s$).
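A quick numerical check of this inversion step (a sketch using an arbitrary made-up probability vector with $s=4$), confirming that the correction term carries the $+\frac{1}{p_s}$ coefficient:

```python
import numpy as np

# Check the Sherman-Morrison closed form against a direct numerical inverse.
# p_full is an arbitrary probability vector with s = 4 categories.
p_full = np.array([0.1, 0.2, 0.3, 0.4])       # p_1 .. p_s, sums to 1
p, p_s = p_full[:-1], p_full[-1]

sigma_star = np.diag(p) - np.outer(p, p)      # Sigma* = diag(p) - p p^T
closed_form = np.diag(1.0 / p) + np.ones((3, 3)) / p_s

assert np.allclose(closed_form, np.linalg.inv(sigma_star))
print("closed-form inverse matches numpy.linalg.inv")
```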
Let's set

$$Y_i=\sqrt{N_{i\cdot}}\,(\Sigma^*)^{-\frac{1}{2}}(\overline{X}^*-p^*)\quad ②$$
According to [3]:
------------------------the following are from Wikipedia-------------------------------
$$\begin{bmatrix}X_{1(1)}\\\vdots\\X_{1(k)}\end{bmatrix}+\begin{bmatrix}X_{2(1)}\\\vdots\\X_{2(k)}\end{bmatrix}+\cdots+\begin{bmatrix}X_{n(1)}\\\vdots\\X_{n(k)}\end{bmatrix}=\begin{bmatrix}\sum_{i=1}^{n}X_{i(1)}\\\vdots\\\sum_{i=1}^{n}X_{i(k)}\end{bmatrix}=\sum_{i=1}^{n}\mathbf{X}_i$$
and the average is
$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i=\frac{1}{n}\begin{bmatrix}\sum_{i=1}^{n}X_{i(1)}\\\vdots\\\sum_{i=1}^{n}X_{i(k)}\end{bmatrix}=\begin{bmatrix}\bar{X}_{i(1)}\\\vdots\\\bar{X}_{i(k)}\end{bmatrix}=\bar{\mathbf{X}}_n$$
and therefore
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\mathbf{X}_i-\operatorname{E}(X_i)\right]=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\mathbf{X}_i-\boldsymbol{\mu})=\sqrt{n}\left(\overline{\mathbf{X}}_n-\boldsymbol{\mu}\right).$$
The multivariate central limit theorem states that
$$\sqrt{n}\left(\overline{\mathbf{X}}_n-\boldsymbol{\mu}\right)\ \xrightarrow{D}\ N_k(0,\boldsymbol{\Sigma})$$
------------------------the above are from wikipedia-------------------------------
So, for ②, we can get (in the large-sample limit)

$$Y_i\sim N_{s-1}(\mathbf{0},I_{s-1})\quad ③$$
where

$$\mathbf{0}=[0,0,\dots,0]^T$$
and $I_{s-1}$ is the $(s-1)\times(s-1)$ identity matrix.
then for ①:

$$\sum_{i=1}^{r}\sum_{j=1}^{s}\frac{\left[X_{ij}-N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)\right]^2}{N_{i\cdot}\left(\frac{N_{\cdot j}}{n}\right)}=\sum_{i=1}^{r}Y_i^TY_i$$
Because of ③,

$$\sum_{i=1}^{r}Y_i^TY_i\sim\chi^2[(s-1)(r-1)]$$

(strictly, the $r$ vectors $Y_i$ are not mutually independent, because the column totals $N_{\cdot j}$ are shared across all rows; accounting for this dependence reduces the degrees of freedom from $r(s-1)$ to $(r-1)(s-1)$; see [2] for a rigorous proof).
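As a Monte Carlo sanity check of this result (my own sketch; the row/column probabilities and sample sizes are made up), we can generate tables whose row and column labels are truly independent and verify that the average value of statistic ① is close to the $\chi^2$ mean, $(r-1)(s-1)$:

```python
import random

random.seed(0)

def statistic(table):
    """Pearson chi-square statistic of equation (1)."""
    r, s = len(table), len(table[0])
    n = sum(map(sum, table))
    row = [sum(t) for t in table]
    col = [sum(table[i][j] for i in range(r)) for j in range(s)]
    return sum((table[i][j] - row[i] * col[j] / n) ** 2
               / (row[i] * col[j] / n)
               for i in range(r) for j in range(s))

r, s, n = 3, 4, 200
row_p, col_p = [0.2, 0.3, 0.5], [0.1, 0.2, 0.3, 0.4]

vals = []
for _ in range(1000):
    table = [[0] * s for _ in range(r)]
    # draw row and column labels independently, so H0 holds exactly
    for i, j in zip(random.choices(range(r), row_p, k=n),
                    random.choices(range(s), col_p, k=n)):
        table[i][j] += 1
    vals.append(statistic(table))

mean = sum(vals) / len(vals)
print(mean)  # close to (r-1)(s-1) = 6
```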
The chi-square statistic was introduced by Pearson [8].
References:
[1]https://en.wikipedia.org/wiki/Multivariate_normal_distribution
[2] "Seven different proofs for the Pearson independence test"
[3]https://en.wikipedia.org/wiki/Central_limit_theorem
[4]https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2003/lecture-notes/lec23.pdf
[5]https://arxiv.org/pdf/1808.09171.pdf
[6]https://www.math.utah.edu/~davar/ps-pdf-files/Chisquared.pdf
[7]http://personal.psu.edu/drh20/asymp/fall2006/lectures/ANGELchpt07.pdf
[8]https://download.csdn.net/download/appleyuchi/10834144