Linear Discriminant Analysis (LDA), like PCA, is a widely used dimensionality reduction method, with one key difference: LDA is supervised, whereas PCA is unsupervised. This article covers LDA from three angles: derivation of the principle, a Python implementation, and a case study.
Principle Derivation
The Binary Case
LDA is first and foremost built for classification: we look for a projection direction $w$ such that, after projection, the samples separate as well as possible according to their original class labels. Suppose the samples belong to two classes $C_1$ and $C_2$, with class means $u_1=\frac{1}{N_1}\sum_{x \in C_1}x$ and $u_2=\frac{1}{N_2}\sum_{x \in C_2}x$. We want the between-class distance after projection to be as large as possible and the within-class distance to be as small as possible, so the optimization problem can be written as:
$$\begin{aligned} \max J(w)&=\frac{\|w^T(u_1-u_2)\|_2^2}{D_1+D_2} \\ \text{s.t.}\quad w^Tw&=1 \end{aligned}$$
where $D_1$ and $D_2$ are defined as:
$$\begin{aligned} D_1&=\sum_{x \in C_1}(w^Tx-w^Tu_1)^2 =\sum_{x \in C_1}w^T(x-u_1)(x-u_1)^Tw \\ D_2&=\sum_{x \in C_2}(w^Tx-w^Tu_2)^2 =\sum_{x \in C_2}w^T(x-u_2)(x-u_2)^Tw \end{aligned}$$
The objective can therefore be rewritten as:
$$\begin{aligned} \max J(w)&=\frac{w^T(u_1-u_2)(u_1-u_2)^Tw}{w^T\sum_{x \in C_i}(x-u_i)(x-u_i)^Tw} \\ \text{s.t.}\quad w^Tw&=1 \end{aligned}$$
We define the between-class scatter matrix $S_B=(u_1-u_2)(u_1-u_2)^T$ and the within-class scatter matrix $S_W=\sum_{x \in C_i}(x-u_i)(x-u_i)^T$, so the problem simplifies to:
$$\max J(w)=\frac{w^TS_Bw}{w^TS_Ww}$$
To maximize $J(w)$, we take the partial derivative with respect to $w$ and set it to zero:
$$\frac{\partial J(w)}{\partial w}=\frac{S_Bw(w^TS_Ww)-S_Ww(w^TS_Bw)}{(w^TS_Ww)^2}=0$$
From this we obtain:
$$\begin{aligned} S_Bw(w^TS_Ww)-S_Ww(w^TS_Bw)&=0 \\ S_Bw&=J(w)S_Ww \end{aligned}$$
Rearranging gives:
$$S_W^{-1}S_Bw=J(w)w$$
Since $J(w)$ is just a scalar, maximizing the objective amounts to finding the largest eigenvalue (and the corresponding eigenvector) of the matrix $S_W^{-1}S_B$.
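This eigenvalue characterization is easy to sanity-check numerically. The sketch below (with an arbitrary random positive-definite $S_W$ and a rank-one $S_B$, as in the two-class case) confirms that $J$ evaluated at the top eigenvector of $S_W^{-1}S_B$ equals the top eigenvalue, and that random directions do not beat it:

```python
import numpy as np

rng = np.random.default_rng(2)
# arbitrary positive-definite S_W and rank-one S_B (as in the two-class case)
A = rng.normal(size=(3, 3))
SW = A @ A.T + 3 * np.eye(3)
d = rng.normal(size=3)
SB = np.outer(d, d)

J = lambda w: (w @ SB @ w) / (w @ SW @ w)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
top = np.argmax(eigvals.real)
w_star = eigvecs[:, top].real

# J at the top eigenvector equals the top eigenvalue ...
print(np.isclose(J(w_star), eigvals.real[top]))  # True
# ... and no random direction does better
best_random = max(J(rng.normal(size=3)) for _ in range(1000))
print(J(w_star) >= best_random)
```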
Furthermore, in the two-class case, since $S_B=(u_1-u_2)(u_1-u_2)^T$, we have:
$$S_Bw=(u_1-u_2)(u_1-u_2)^Tw$$
Because $(u_1-u_2)^Tw$ is a scalar, $S_Bw$ always points in the direction of $u_1-u_2$. If we only care about the direction of $w$ and not its length, we obtain $w=S_W^{-1}(u_1-u_2)$.
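A quick numerical check of this closed form (a minimal sketch; the two synthetic Gaussian classes below are arbitrary): the direction $w=S_W^{-1}(u_1-u_2)$ should be an eigenvector of $S_W^{-1}S_B$ and should separate the projected class means.

```python
import numpy as np

rng = np.random.default_rng(0)
# two arbitrary synthetic classes in 2-D
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[4.0, 2.0], scale=1.0, size=(100, 2))

u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter S_W and between-class scatter S_B
SW = (X1 - u1).T @ (X1 - u1) + (X2 - u2).T @ (X2 - u2)
SB = np.outer(u1 - u2, u1 - u2)

# closed-form direction w = S_W^{-1}(u_1 - u_2), normalized
w = np.linalg.solve(SW, u1 - u2)
w /= np.linalg.norm(w)

# w is an eigenvector of S_W^{-1} S_B: S_W^{-1} S_B w is parallel to w
v = np.linalg.solve(SW, SB @ w)
print(np.isclose(abs(v @ w), np.linalg.norm(v)))  # True (parallel)
# and the projected class means are separated
print((u1 - u2) @ w > 0)  # True
```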
Extension to Multiple Classes
The analysis above covers the two-class case; how do we handle more classes? The within-class scatter matrix $S_W$ extends naturally, but the between-class scatter matrix $S_B$ was defined from the squared distance between two class means and does not generalize directly. We therefore introduce the global scatter matrix:
$$S_T=\sum_{i=1}^n(x_i-u)(x_i-u)^T$$
where $u$ is the mean of all samples.
If we define the global scatter as the sum of the within-class and between-class scatter, i.e. $S_T=S_B+S_W$, then the between-class scatter matrix can be written as:
$$\begin{aligned} S_B&=S_T-S_W \\ &=\sum_{i=1}^n(x_i-u)(x_i-u)^T-\sum_{x \in C_i}(x-u_i)(x-u_i)^T \\ &=\sum_{j=1}^N\Big(\sum_{x \in C_j}(x-u)(x-u)^T-\sum_{x \in C_j}(x-u_j)(x-u_j)^T\Big) \\ &=\sum_{j=1}^N\sum_{x \in C_j}\big(xx^T-xu^T-ux^T+uu^T-xx^T+xu_j^T+u_jx^T-u_ju_j^T\big) \\ &=\sum_{j=1}^N m_j\big(u_ju_j^T-u_ju^T-uu_j^T+uu^T\big) \\ &=\sum_{j=1}^N m_j(u_j-u)(u_j-u)^T \end{aligned}$$
where $m_j$ is the number of samples in class $C_j$ (the step collapsing the inner sum uses $\sum_{x \in C_j}x=m_ju_j$). From this we can see that the between-class scatter is a weighted sum of distances from each class center to the global center. We thus arrive at the LDA procedure for multi-class data:
- Compute the mean vector $u_j$ of each class in the dataset, as well as the overall mean vector $u$.
- Compute the within-class scatter matrix $S_W$ and the global scatter matrix $S_T$, and obtain the between-class scatter matrix $S_B=S_T-S_W$.
- Perform an eigendecomposition of the matrix $S_W^{-1}S_B$ and sort the eigenvalues in descending order.
- Take the eigenvectors $w_1,w_2,...,w_d$ corresponding to the $d$ largest eigenvalues and project the $n$-dimensional samples down to $d$ dimensions.
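The scatter identity used above, $S_T=S_B+S_W$, can be verified numerically (a small sketch with arbitrary random data and labels):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = np.repeat([0, 1, 2], 20)   # three arbitrary classes
u = X.mean(axis=0)             # global mean

ST = (X - u).T @ (X - u)       # global scatter
SW = np.zeros((3, 3))
SB = np.zeros((3, 3))
for c in np.unique(y):
    Xc = X[y == c]
    uc = Xc.mean(axis=0)
    SW += (Xc - uc).T @ (Xc - uc)             # within-class part
    SB += len(Xc) * np.outer(uc - u, uc - u)  # m_j (u_j - u)(u_j - u)^T

print(np.allclose(ST, SB + SW))  # True
```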
Python Implementation
Binary Classification
```python
import numpy as np


def calculate_covariance_matrix(X, Y=None):
    """Sample covariance of X (or cross-covariance of X and Y)."""
    if Y is None:
        Y = X
    n_samples = np.shape(X)[0]
    covariance_matrix = (1 / (n_samples - 1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0))
    return np.array(covariance_matrix, dtype=float)


class LDA():
    def __init__(self):
        self.w = None
        self.threshold = None

    def fit(self, X, y):
        # split the samples by class
        X1 = X[y == 0]
        X2 = X[y == 1]
        # within-class scatter (proportional to the sum of the class covariances)
        cov1 = calculate_covariance_matrix(X1)
        cov2 = calculate_covariance_matrix(X2)
        cov_tot = cov1 + cov2
        # closed-form direction: w = S_W^{-1}(u_1 - u_2)
        mean1 = X1.mean(0)
        mean2 = X2.mean(0)
        mean_diff = np.atleast_1d(mean1 - mean2)
        self.w = np.linalg.pinv(cov_tot).dot(mean_diff)
        # decision threshold: projection of the midpoint between the two means
        self.threshold = 0.5 * (mean1 + mean2).dot(self.w)

    def transform(self, X, y):
        self.fit(X, y)
        return X.dot(self.w)

    def predict(self, X):
        # w points from u_2 toward u_1 (same direction as u_1 - u_2), so class-0
        # samples project to larger values than class-1 samples; classify by
        # comparing each projection against the projected midpoint of the means
        y_pred = []
        for sample in X:
            h = sample.dot(self.w)
            y_pred.append(0 if h > self.threshold else 1)
        return y_pred
```
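A quick end-to-end sketch of the two-class procedure (self-contained: the fit logic is restated inline, the two-blob dataset is arbitrary, and the decision threshold is taken at the projected midpoint of the two class means):

```python
import numpy as np
from sklearn.datasets import make_blobs

# two arbitrary, well-separated blobs
X, y = make_blobs(n_samples=200, centers=[[0, 0], [5, 3]],
                  cluster_std=1.0, random_state=0)

X1, X2 = X[y == 0], X[y == 1]
u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - u1).T @ (X1 - u1) + (X2 - u2).T @ (X2 - u2)

# closed-form direction and midpoint threshold
w = np.linalg.pinv(SW) @ (u1 - u2)
threshold = 0.5 * (u1 + u2) @ w

# class 0 projects above the threshold, class 1 below
y_pred = (X @ w <= threshold).astype(int)
print("accuracy:", (y_pred == y).mean())
```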
Multi-class
```python
import numpy as np
import matplotlib.pyplot as plt


class MultiClassLDA():
    """Multi-class LDA; uses calculate_covariance_matrix defined above."""

    def __init__(self, solver="svd"):
        self.solver = solver

    def _calculate_scatter_matrices(self, X, y):
        n_features = np.shape(X)[1]
        labels = np.unique(y)
        # Within-class scatter matrix:
        # SW = sum{ (X_for_class - mean_of_X_for_class)^2 }
        #    <=> (n_samples_for_class - 1) * covar(X_for_class)
        SW = np.zeros((n_features, n_features))
        for label in labels:
            _X = X[y == label]
            SW += (len(_X) - 1) * calculate_covariance_matrix(_X)
        # Between-class scatter matrix:
        # SB = sum{ n_samples_for_class * (mean_for_class - total_mean)^2 }
        total_mean = np.mean(X, axis=0)
        SB = np.zeros((n_features, n_features))
        for label in labels:
            _X = X[y == label]
            _mean = np.mean(_X, axis=0)
            SB += len(_X) * ((_mean - total_mean).reshape(-1, 1)).dot((_mean - total_mean).reshape(-1, 1).T)
        # equivalently, SB can be computed via the global scatter:
        # ST = (len(X) - 1) * calculate_covariance_matrix(X)
        # SB = ST - SW
        return SW, SB

    def transform(self, X, y, n_components):
        SW, SB = self._calculate_scatter_matrices(X, y)
        # Determine SW^-1 * SB by calculating the inverse of SW
        A = np.linalg.inv(SW).dot(SB)
        # A is generally not symmetric, so use eig rather than eigh; its
        # eigenvalues are real because A is similar to a symmetric matrix
        eigenvalues, eigenvectors = np.linalg.eig(A)
        eigenvalues, eigenvectors = eigenvalues.real, eigenvectors.real
        # Sort eigenvalues in descending order and keep the eigenvectors
        # of the n_components largest ones
        idx = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[idx][:n_components]
        eigenvectors = eigenvectors[:, idx][:, :n_components]
        # Project the data onto the eigenvectors
        return X.dot(eigenvectors)

    def plot_in_2d(self, X, y, title=None):
        X_transformed = self.transform(X, y, n_components=2)
        plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=y)
        if title:
            plt.title(title)
        plt.show()
```
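As a concrete check of the multi-class machinery on real data (a sketch using scikit-learn's bundled iris dataset, $K=3$ classes and 4 features): since $S_B$ is a sum of $K$ rank-one terms around the global mean, its rank is at most $K-1$, so $S_W^{-1}S_B$ has at most $K-1$ non-zero eigenvalues. This is exactly why LDA can produce at most $K-1$ output dimensions.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
u = X.mean(axis=0)

# scatter matrices, accumulated class by class
SW = np.zeros((n_features, n_features))
SB = np.zeros((n_features, n_features))
for c in np.unique(y):
    Xc = X[y == c]
    uc = Xc.mean(axis=0)
    SW += (Xc - uc).T @ (Xc - uc)
    SB += len(Xc) * np.outer(uc - u, uc - u)

eigvals = np.linalg.eig(np.linalg.inv(SW) @ SB)[0].real
eigvals = np.sort(eigvals)[::-1]
# only K - 1 = 2 of the 4 eigenvalues are (numerically) non-zero
print(np.round(eigvals, 4))
```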
Case Study
We can use the scikit-learn API to see LDA in action: randomly generate 3 classes of samples with 4 features each, apply LDA, and visualize the two projected dimensions:
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt


# construct the dataset
def make_data(centers=3, cluster_std=[1.0, 3.0, 2.5], n_samples=150, n_features=2):
    X, y = make_blobs(n_samples=n_samples, n_features=n_features, centers=centers, cluster_std=cluster_std)
    return X, y


# 3 clusters in 4 dimensions (centers must stay an integer here so that
# make_blobs draws 4-dimensional cluster centers on its own)
X, y = make_data(centers=3, cluster_std=[1.0, 3.0, 2.5], n_samples=150, n_features=4)
print(X.shape)
sk_lda = LinearDiscriminantAnalysis(n_components=2)
sk_lda.fit(X, y)
transformed = sk_lda.transform(X)
plt.subplots(figsize=(10, 8))
plt.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.Set1)
plt.title("sklearn's official implementation")
plt.show()
```
![](https://img-blog.csdnimg.cn/dd9175147d764ed6afe14b20ad527454.png#pic_center)
Advantages
- LDA can exploit class-label prior knowledge during dimensionality reduction, whereas unsupervised methods such as PCA cannot.
- When the class-discriminative information lies in the sample means rather than the variances, LDA tends to outperform PCA.
Disadvantages
- Like PCA, LDA is not well suited to reducing the dimensionality of non-Gaussian samples.
- LDA can reduce the dimensionality to at most K−1, where K is the number of classes; it cannot be used when more than K−1 output dimensions are needed (although some improved LDA variants work around this limit).
- When the class-discriminative information lies in the variances rather than the means, LDA performs poorly.
- LDA may overfit the data.