Data Mining and Analysis: Course Notes
- Reference: *Data Mining and Analysis* by Mohammed J. Zaki and Wagner Meira Jr.
Contents
- Course Notes (Table of Contents)
- Course Notes (Chapter 1)
- Course Notes (Chapter 2)
- Course Notes (Chapter 5)
- Course Notes (Chapter 7)
- Course Notes (Chapter 14)
- Course Notes (Chapter 15)
- Course Notes (Chapter 20)
- Course Notes (Chapter 21)
Chapter 7: Dimensionality Reduction
PCA: Principal Component Analysis
7.1 Background
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$$
Objects: $\mathbf{x}_{1}^T,\cdots,\mathbf{x}_n^T \in \mathbb{R}^d$. For any $\mathbf{x} \in \mathbb{R}^d$, write $\mathbf{x}=(x_1,\cdots,x_d)^T= \sum\limits_{i=1}^{d}x_i \mathbf{e}_i$,
where $\mathbf{e}_i=(0,\cdots,1,\cdots,0)^T\in\mathbb{R}^d$ is the $i$-th standard basis vector.
Now let $\{\mathbf{u}_i\}_{i=1}^d$ be another orthonormal basis, so that $\mathbf{x}=\sum\limits_{i=1}^{d}a_i \mathbf{u}_i$ with $a_i \in \mathbb{R}$ and $\mathbf{u}_i^T \mathbf{u}_j =\begin{cases} 1, & i=j\\ 0, & i\ne j \end{cases}$
For any $r$ with $1\le r\le d$: $\mathbf{x}=\underbrace{a_1 \mathbf{u}_1+\cdots+a_r \mathbf{u}_r}_{\text{projection}}+ \underbrace{a_{r+1} \mathbf{u}_{r+1}+\cdots+a_d \mathbf{u}_d}_{\text{error}}$
The first $r$ terms form the projection; the remaining terms are the projection error.
Goal: given $D$, find the optimal basis $\{\mathbf{u}_i\}_{i=1}^d$ such that the projection of $D$ onto the subspace spanned by the first $r$ basis vectors is the "best approximation" of $D$, i.e., the projection error is minimal.
7.2 Principal Component Analysis
7.2.1 Best line (1-dimensional) approximation
(First principal component: $r=1$)
Goal: find $\mathbf{u}_1$, written simply as $\mathbf{u}=(u_1,\cdots,u_d)^T$.
Assumptions: $||\mathbf{u}||=1$, i.e. $\mathbf{u}^T\mathbf{u}=1$, and the data are centered: $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^n\mathbf{x}_i=\mathbf{0} \in \mathbb{R}^{d}$.
For each $\mathbf{x}_i$ $(i=1,\cdots,n)$, the projection of $\mathbf{x}_i$ along $\mathbf{u}$ is:
$$\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{u}^{T} \mathbf{x}_{i}}{\mathbf{u}^{T} \mathbf{u}}\right) \mathbf{u}=\left(\mathbf{u}^{T} \mathbf{x}_{i}\right) \mathbf{u}=a_{i} \mathbf{u},\quad a_{i}=\mathbf{u}^{T} \mathbf{x}_{i}$$
Since $\hat{\boldsymbol{\mu}}=\mathbf{0}$, the projection of $\hat{\boldsymbol{\mu}}$ onto $\mathbf{u}$ is $0$, so the mean of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ is $0$: projection commutes with the mean,
$$Proj(mean(D))=mean(Proj(D))$$
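The projection formula can be checked numerically. A minimal NumPy sketch on synthetic data (the array names `X`, `u`, `a`, `X_proj` are my own, not from the text):

```python
import numpy as np

# Synthetic centered data: mean(D) = 0 after subtracting the column means.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)        # unit vector, so u^T u = 1

a = X @ u                        # scalar coordinates a_i = u^T x_i
X_proj = np.outer(a, u)          # projected points x_i' = a_i u

# Projection commutes with the mean: the projected points average to 0.
print(np.allclose(X_proj.mean(axis=0), 0.0))
```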
Consider the sample variance of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ along the direction $\mathbf{u}$:
$$\begin{aligned} \sigma_{\mathbf{u}}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(a_{i}-\mu_{\mathbf{u}}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{u}^{T} \mathbf{x}_{i}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{u}^{T}\left(\mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T}\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \end{aligned}$$
where $\mathbf{\Sigma}$ is the sample covariance matrix.
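The identity $\sigma_{\mathbf{u}}^{2}=\mathbf{u}^T\mathbf{\Sigma}\mathbf{u}$ can be verified numerically. A short sketch with synthetic data (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)           # centered data

Sigma = (X.T @ X) / len(X)       # sample covariance with the 1/n convention
u = rng.normal(size=4)
u /= np.linalg.norm(u)           # arbitrary unit direction

a = X @ u                        # a_i = u^T x_i, with mean 0
var_direct = np.mean(a ** 2)     # (1/n) sum (a_i - mu_u)^2, mu_u = 0
var_quad = u @ Sigma @ u         # u^T Sigma u
print(np.isclose(var_direct, var_quad))
```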
Objective:
$$\begin{array}{ll} \max\limits_{\mathbf{u}} & \mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \\ \text{s.t.} & \mathbf{u}^T\mathbf{u}-1=0 \end{array}$$
Applying the method of Lagrange multipliers:
$$\max \limits_{\mathbf{u}} J(\mathbf{u})=\mathbf{u}^{T} \Sigma \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)$$
Setting the partial derivative to zero:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{u}} J(\mathbf{u}) &=\mathbf{0} \\ \frac{\partial}{\partial \mathbf{u}}\left(\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)\right) &=\mathbf{0} \\ 2 \mathbf{\Sigma} \mathbf{u}-2 \lambda \mathbf{u} &=\mathbf{0} \\ \mathbf{\Sigma} \mathbf{u} &=\lambda \mathbf{u} \end{aligned}$$
Note that:
$$\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}=\mathbf{u}^{T} \lambda \mathbf{u}=\lambda$$
Hence the solution of the optimization problem takes $\lambda$ to be the largest eigenvalue of $\mathbf{\Sigma}$, and $\mathbf{u}$ to be a unit eigenvector associated with $\lambda$.
Question: does the $\mathbf{u}$ that maximizes $\sigma_{\mathbf{u}}^{2}$ also minimize the projection error?
Define the mean squared error (MSE):
$$\begin{aligned} MSE(\mathbf{u}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right)^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} \mathbf{x}_{i}^{\prime}+\left(\mathbf{x}_{i}^{\prime}\right)^{T} \mathbf{x}_{i}^{\prime}\right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}+\left[(\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right]^{T} \left[ (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right] \right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{x}_{i}^{T} \mathbf{u}+(\mathbf{u}^{T} \mathbf{x}_{i})(\mathbf{x}_{i}^{T} \mathbf{u})\mathbf{u}^{T}\mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{x}_{i}\mathbf{x}_{i}^{T} \mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}\\ &= var(D)-\sigma_{\mathbf{u}}^{2} \end{aligned}$$
The above shows that:
$$var(D)=\sigma_{\mathbf{u}}^{2}+MSE$$
Geometric meaning of $\mathbf{u}$: the direction of the line in $\mathbb{R}^d$ along which the projected data have maximal variance and, since $var(D)$ is fixed, simultaneously minimal MSE.
$\mathbf{u}$ is called the first principal component.
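The decomposition $var(D)=\sigma_{\mathbf{u}}^{2}+MSE$ holds for every unit direction, not only the first principal component. A minimal numerical sketch, assuming synthetic centered data (names are my own):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 4))
X = X - X.mean(axis=0)
n = len(X)
Sigma = (X.T @ X) / n
var_D = np.trace(Sigma)          # var(D) = (1/n) sum ||x_i||^2 for centered data

u = rng.normal(size=4)
u /= np.linalg.norm(u)           # any unit direction

X_proj = np.outer(X @ u, u)      # x_i' = (u^T x_i) u
mse = np.mean(np.sum((X - X_proj) ** 2, axis=1))
sigma_u2 = u @ Sigma @ u
print(np.isclose(var_D, sigma_u2 + mse))
```

Maximizing $\sigma_{\mathbf{u}}^{2}$ therefore minimizes the MSE, since their sum is constant.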
7.2.2 Best 2-dimensional approximation
(Second principal component: $r=2$)
Assume $\mathbf{u}_1$ has been found, i.e. the eigenvector corresponding to the largest eigenvalue of $\mathbf{\Sigma}$.
Goal: find $\mathbf{u}_2$, written simply as $\mathbf{v}$, such that $\mathbf{v}^{T} \mathbf{u}_{1}=0$ and $\mathbf{v}^{T} \mathbf{v} =1$.
Consider the variance of the projections of $\mathbf{x}_{i}$ along $\mathbf{v}$:
$$\begin{array}{ll} \max\limits_{\mathbf{v}} & \sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} \\ \text{s.t.} & \mathbf{v}^T\mathbf{v}-1=0\\ & \mathbf{v}^{T} \mathbf{u}_{1}=0 \end{array}$$
Define:
$$J(\mathbf{v})=\mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v}-\alpha\left(\mathbf{v}^{T} \mathbf{v}-1\right)-\beta\left(\mathbf{v}^{T} \mathbf{u}_{1}-0\right)$$
Taking the partial derivative with respect to $\mathbf{v}$:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}-\beta \mathbf{u}_{1}=\mathbf{0}$$
Multiplying both sides on the left by $\mathbf{u}_{1}^{T}$:
$$\begin{aligned} 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-2 \alpha \mathbf{u}_{1}^{T}\mathbf{v}-\beta \mathbf{u}_{1}^{T}\mathbf{u}_{1} &=0 \\ 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-\beta &= 0\\ 2 \mathbf{v}^{T}\Sigma \mathbf{u}_{1}-\beta &= 0\\ 2 \mathbf{v}^{T}\lambda_1 \mathbf{u}_{1}-\beta &= 0\\ \beta &= 0 \end{aligned}$$
Substituting back into the original equation:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}=\mathbf{0} \quad\Rightarrow\quad \Sigma \mathbf{v}=\alpha \mathbf{v}$$
So $\mathbf{v}$ is also an eigenvector of $\mathbf{\Sigma}$.
Since $\sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} =\alpha$, $\alpha$ should be the second largest eigenvalue of $\mathbf{\Sigma}$, with $\mathbf{v}$ the corresponding unit eigenvector.
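Because $\mathbf{\Sigma}$ is symmetric, its eigenvectors are automatically orthogonal, so the constraint $\mathbf{v}^T\mathbf{u}_1=0$ is satisfied by the second eigenvector. A quick check on synthetic data (a sketch; names are my own):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / len(X)

lam, U = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]        # reorder to descending
lam, U = lam[order], U[:, order]
u1, v = U[:, 0], U[:, 1]

print(np.isclose(v @ u1, 0.0))            # constraint v^T u_1 = 0
print(np.allclose(Sigma @ v, lam[1] * v)) # Sigma v = alpha v, alpha = lambda_2
```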
Question 1: does the $\mathbf{v}$ obtained above (i.e. $\mathbf{u}_2$), taken together with $\mathbf{u}_1$, maximize the total variance of the projection of $D$ onto $span\{\mathbf{u}_1, \mathbf{u}_2 \}$?
Let $\mathbf{x}_i=\underbrace{a_{i1} \mathbf{u}_1+a_{i2}\mathbf{u}_2}_{\text{projection}}+\cdots$
Then the coordinates of the projection of $\mathbf{x}_i$ onto $span\{\mathbf{u}_1, \mathbf{u}_2 \}$ are $\mathbf{a}_{i}=(a_{i1},a_{i2})^T=(\mathbf{u}_1^{T}\mathbf{x}_i,\mathbf{u}_2^{T}\mathbf{x}_i)^{T}$.
Let $\mathbf{U}_{2}=\left(\begin{array}{cc} \mid & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} \\ \mid & \mid \end{array}\right)$; then $\mathbf{a}_{i}=\mathbf{U}_{2}^{T} \mathbf{x}_{i}$.
The total projected variance is:
$$\begin{aligned} \operatorname{var}(\mathbf{A}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{a}_{i}-\mathbf{0}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right)^{T}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left( \mathbf{u}_{1}\mathbf{u}_{1}^T + \mathbf{u}_{2}\mathbf{u}_{2}^T \right) \mathbf{x}_{i}\\ &=\mathbf{u}_{1}^T\mathbf{\Sigma} \mathbf{u}_{1} + \mathbf{u}_{2}^T\mathbf{\Sigma} \mathbf{u}_{2}\\ &= \lambda_1 +\lambda_2 \end{aligned}$$
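The result $\operatorname{var}(\mathbf{A})=\lambda_1+\lambda_2$ is easy to confirm numerically. A sketch with synthetic data (names are my own):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / len(X)

lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]        # descending eigenvalues
lam, U = lam[order], U[:, order]

U2 = U[:, :2]                        # columns u_1, u_2
A = X @ U2                           # a_i = U_2^T x_i
total_var = np.mean(np.sum(A ** 2, axis=1))   # (1/n) sum ||a_i||^2
print(np.isclose(total_var, lam[0] + lam[1]))
```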
Question 2: is the mean squared error minimal?
Here,
$$\mathbf{x}_{i}^{\prime}=\mathbf{U}_{2}\mathbf{U}_{2}^{T} \mathbf{x}_{i}$$
$$\begin{aligned} MSE &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &= var(D) - \lambda_1 - \lambda_2 \end{aligned}$$
Conclusions:
- The sum of the first $r$ eigenvalues of $\mathbf{\Sigma}$, $\lambda_1+\cdots+\lambda_r$ $(\lambda_1\ge\cdots\ge\lambda_r)$, gives the maximal total projected variance;
- $var(D)-\sum\limits_{i=1}^r \lambda_i$ gives the minimal MSE;
- The eigenvectors $\mathbf{u}_{1},\cdots,\mathbf{u}_{r}$ corresponding to $\lambda_1,\cdots,\lambda_r$ span the $r$-dimensional principal subspace.
7.2.3 Generalization
For $\Sigma_{d\times d}$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and centered data:
$\sum\limits_{i=1}^r\lambda_i$: maximal total projected variance;
$var(D)-\sum\limits_{i=1}^r\lambda_i$: minimal MSE.
In practice, to choose an appropriate $r$, compare the ratio $\frac{\sum\limits_{i=1}^r\lambda_i}{var(D)}$ against a given threshold $\alpha$.
Algorithm 7.1 (PCA):
Input: $D$, $\alpha$
Output: $A$ (the reduced-dimensional data)
- $\boldsymbol{\mu} = \frac{1}{n}\sum\limits_{i=1}^n\mathbf{x}_i$;
- $\mathbf{Z}=\mathbf{D}-\mathbf{1}\cdot \boldsymbol{\mu} ^T$;
- $\mathbf{\Sigma}=\frac{1}{n}\mathbf{Z}^T\mathbf{Z}$;
- $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \longleftarrow$ the eigenvalues of $\mathbf{\Sigma}$ (in descending order);
- $\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_d \longleftarrow$ the corresponding eigenvectors of $\mathbf{\Sigma}$ (orthonormal);
- Compute $\frac{\sum\limits_{i=1}^r\lambda_i}{var(D)}$ and choose the smallest $r$ for which this ratio exceeds $\alpha$;
- $\mathbf{U}_r=(\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_r)$;
- $A=\{\mathbf{a}_i\mid\mathbf{a}_i=\mathbf{U}_r^T\mathbf{x}_i, i=1,\cdots,n\}$.
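Algorithm 7.1 can be sketched directly in NumPy. This is a minimal implementation under the note's conventions ($1/n$ covariance, centered coordinates); the function and variable names are my own:

```python
import numpy as np

def pca(D, alpha):
    """Sketch of Algorithm 7.1: project D onto its first r principal components."""
    n = len(D)
    mu = D.mean(axis=0)                 # sample mean
    Z = D - mu                          # centered data Z = D - 1 mu^T
    Sigma = (Z.T @ Z) / n               # sample covariance (1/n convention)
    lam, U = np.linalg.eigh(Sigma)      # eigendecomposition (ascending)
    order = np.argsort(lam)[::-1]       # sort eigenpairs in descending order
    lam, U = lam[order], U[:, order]
    frac = np.cumsum(lam) / lam.sum()   # var(D) = sum of eigenvalues = trace(Sigma)
    r = int(np.searchsorted(frac, alpha) + 1)  # smallest r with ratio >= alpha
    Ur = U[:, :r]
    A = Z @ Ur                          # rows a_i = U_r^T x_i
    return A, lam, r

# usage on synthetic data with very unequal column scales
rng = np.random.default_rng(2)
D = rng.normal(size=(300, 5)) * np.array([3.0, 2.0, 1.0, 0.2, 0.1])
A, lam, r = pca(D, alpha=0.95)
print(A.shape, r)
```

The variance of each retained coordinate equals the corresponding eigenvalue, which is a convenient sanity check.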
7.3 Kernel PCA
$\phi:\mathcal{I}\to \mathcal{F}\subseteq \mathbb{R}^d$
$K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$
$K(\mathbf{x}_i,\mathbf{x}_j)=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$
Given: $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{n\times n}$ and $\mathbf{\Sigma}_{\phi}=\frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$.
Objects: $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)\in \mathbb{R}^d$. Assume $\frac{1}{n}\sum\limits_{i=1}^{n}\phi(\mathbf{x}_i)=\mathbf{0}$, i.e. $\mathbf{K} \to \hat{\mathbf{K}}$ has already been centered.
Goal: find $\mathbf{u}$ and $\lambda$ such that $\mathbf{\Sigma}_{\phi}\mathbf{u}=\lambda\mathbf{u}$.
$$\begin{aligned} \frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)[\phi(\mathbf{x}_i)^T\mathbf{u}] &=\lambda\mathbf{u}\\ \sum\limits_{i=1}^n\left[\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}\right] \phi(\mathbf{x}_i)&=\mathbf{u} \end{aligned}$$
So $\mathbf{u}$ is a linear combination of all the mapped data points.
Let $c_i=\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}$; then $\mathbf{u}=\sum\limits_{i=1}^nc_i \phi(\mathbf{x}_i)$. Substituting back into the original equation:
$$\begin{aligned} \left(\frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T}\right)\left(\sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{j}\right)\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(\phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \end{aligned}$$
Note that here $\mathbf{K}=\hat{\mathbf{K}}$ has already been centered.
For every $k$ $(1\le k\le n)$, multiply both sides on the left by $\phi^T(\mathbf{x}_{k})$:
$$\begin{aligned} \sum_{i=1}^{n}\left(\phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(K(\mathbf{x}_k, \mathbf{x}_i) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} K(\mathbf{x}_k, \mathbf{x}_i) \end{aligned}$$
Let $\mathbf{K}_{i}=\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{1}\right), K\left(\mathbf{x}_{i}, \mathbf{x}_{2}\right), \cdots, K\left(\mathbf{x}_{i}, \mathbf{x}_{n}\right)\right)^{T}$ (the $i$-th row of the kernel matrix, so that $\mathbf{K}=\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}$), and let $\mathbf{c}=(c_1,c_2,\cdots,c_n)^T$. Then:
$$\begin{aligned} \sum_{i=1}^{n}K(\mathbf{x}_k, \mathbf{x}_i) \mathbf{K}^T_i\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c},\quad k=1,2,\cdots,n \\ \mathbf{K}^T_k\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c}\\ \mathbf{K}^T_k\mathbf{K}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c} \end{aligned}$$
That is,
$$\mathbf{K}^2\mathbf{c}=n\lambda \mathbf{K}\mathbf{c}$$
Assuming $\mathbf{K}^{-1}$ exists:
$$\begin{aligned} \mathbf{K}^2\mathbf{c}&=n\lambda \mathbf{K}\mathbf{c}\\ \mathbf{K}\mathbf{c}&=n\lambda \mathbf{c}\\ \mathbf{K}\mathbf{c}&= \eta\mathbf{c},\quad \eta=n\lambda \end{aligned}$$
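The relation $\eta = n\lambda$ between kernel-matrix eigenvalues and covariance eigenvalues can be checked with a linear kernel, where $\phi$ is the identity and both spectra are directly computable. A sketch on synthetic centered data (names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)            # phi = identity, already centered

K = X @ X.T                       # linear kernel: K_ij = x_i^T x_j (n x n)
Sigma = (X.T @ X) / len(X)        # feature-space covariance (d x d)

eta = np.sort(np.linalg.eigvalsh(K))[::-1]      # eigenvalues of K, descending
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # eigenvalues of Sigma, descending

# The nonzero eigenvalues satisfy lambda = eta / n.
print(np.allclose(eta[:4] / len(X), lam))
```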
Conclusion: $\frac{\eta_1}{n}\ge\frac{\eta_2}{n}\ge\cdots\ge\frac{\eta_n}{n}$ give the projected variances of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ in the feature space; the total variance captured by the first $r$ components is $\sum\limits_{i=1}^{r}\frac{\eta_i}{n}$, where $\eta_1\ge\eta_2\ge\cdots\ge\eta_n$ are the eigenvalues of $\mathbf{K}$.
Question: can we compute the projections of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ onto the principal directions (i.e. the reduced-dimensional data)?
Let $\mathbf{u}_1,\cdots,\mathbf{u}_d$ be the eigenvectors of $\mathbf{\Sigma}_{\phi}$; then $\phi(\mathbf{x}_j)=a_1\mathbf{u}_1+\cdots+a_d\mathbf{u}_d$, where for $k=1,2,\cdots,d$:
$$\begin{aligned} a_k &= \phi(\mathbf{x}_j)^T\mathbf{u}_k\\ &= \phi(\mathbf{x}_j)^T\sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} K(\mathbf{x}_j,\mathbf{x}_i) \end{aligned}$$
Algorithm 7.2: Kernel PCA ($\mathcal{F}\subseteq \mathbb{R}^d$)
Input: $K$, $\alpha$
Output: $A$ (the projection coordinates of the reduced-dimensional data)
- $\hat{\mathbf{K}} :=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right)$;
- $\eta_1,\eta_2,\cdots,\eta_d \longleftarrow$ the eigenvalues of $\hat{\mathbf{K}}$, keeping only the first $d$;
- $\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_d \longleftarrow$ the corresponding eigenvectors of $\hat{\mathbf{K}}$ (orthonormal);
- $\mathbf{c}_i \leftarrow \frac{1}{\sqrt{\eta_i}}\cdot \mathbf{c}_i$, $i=1,\cdots,d$;
- Choose the smallest $r$ such that $\frac{\sum\limits_{i=1}^r\frac{\eta_i}{n}}{\sum\limits_{i=1}^d\frac{\eta_i}{n}}\ge \alpha$;
- $\mathbf{C}_r=(\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_r)$;
- $A=\{\mathbf{a}_i\mid\mathbf{a}_i=\mathbf{C}_r^T\mathbf{K}_i, i=1,\cdots,n\}$.
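Algorithm 7.2 can be sketched in NumPy as follows. This is a minimal implementation under the note's conventions (double-centering, eigenvectors rescaled by $1/\sqrt{\eta_i}$); the function and variable names are my own, and small or negative numerical eigenvalues are dropped as an implementation choice:

```python
import numpy as np

def kernel_pca(K, alpha):
    """Sketch of Algorithm 7.2: projection coordinates from a kernel matrix."""
    n = len(K)
    ones = np.ones((n, n)) / n
    # Double centering: (I - 1/n 1) K (I - 1/n 1)
    Khat = K - ones @ K - K @ ones + ones @ K @ ones
    eta, C = np.linalg.eigh(Khat)
    order = np.argsort(eta)[::-1]          # descending eigenvalues
    eta, C = eta[order], C[:, order]
    pos = eta > 1e-9                       # keep numerically positive eigenvalues
    eta, C = eta[pos], C[:, pos]
    C = C / np.sqrt(eta)                   # rescale c_i <- c_i / sqrt(eta_i)
    frac = np.cumsum(eta) / eta.sum()      # captured-variance ratio
    r = int(np.searchsorted(frac, alpha) + 1)
    A = Khat @ C[:, :r]                    # rows a_i = C_r^T K_i
    return A, r

# usage with a linear kernel (chosen here only for illustration)
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
K = X @ X.T
A, r = kernel_pca(K, alpha=0.9)
print(A.shape, r)
```

With a linear kernel this reproduces ordinary PCA scores, which makes it a convenient correctness check before switching to a nonlinear kernel.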