Author: wiz
Date: 5th June, 2019
Name: The principle of PCA
Version: 1.0
Recently I have been studying Transfer Learning, which includes the TCA (Transfer Component Analysis) algorithm (I will probably write a follow-up post on it). After reading it, I did not fully understand the final quadratic-form optimization. I had heard that PCA and TCA are extremely similar, and after careful study this turned out to be true, so I put this article together.
Friendly reminder: this article is a purely mathematical derivation, with no code (to be added in a later update).
Problem Background
In machine learning, we usually make predictions from large amounts of data. The data contains many features, which provide rich training material for prediction, but these features may be correlated with one another, which increases the complexity of the analysis. We therefore need an effective dimensionality-reduction method that preserves the main information after the data is compressed.
Derivation Process
1. What is PCA?
PCA (Principal Component Analysis) is a dimensionality-reduction method commonly used in machine learning. It maps data from a high-dimensional space to a lower-dimensional one while retaining the main feature information.
For example (original figure omitted), suppose the data is distributed in three-dimensional space. PCA maps the three-dimensional data onto a two-dimensional plane $u$, spanned by the vectors $\langle u_{1}, u_{2}\rangle$ with $u_{1} \perp u_{2}$.
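To make the mapping concrete, here is a minimal numpy sketch (not part of the original post; the random data and variable names are illustrative assumptions) that centers 3-D points and projects them onto the plane spanned by the top two principal directions:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))                 # 100 points in 3-D space
data = data - data.mean(axis=0)                  # center the data

scatter = data.T @ data / len(data)              # (1/n) * sum of x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(scatter)       # eigenvalues in ascending order
u1, u2 = eigvecs[:, -1], eigvecs[:, -2]          # top-2 directions span the plane u

projected = data @ np.column_stack([u1, u2])     # 2-D coordinates on the plane
print(projected.shape)                           # (100, 2)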
2. Problem Transformation
Suppose we have a set of data $\{\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{n}\}$ that has been centered,
i.e. $\frac{1}{n}\sum_{i=1}^{n} \vec{x}_{i} = 0$.
What we want is that, after the mapping, the original characteristics of the data are preserved as much as possible, i.e., we minimize the distance $h$ from each data point $x$ to the plane.
Since this is a right-triangle relationship, minimizing the distance is equivalent to maximizing the projection (the inner product computes the projection), i.e., maximizing the following expression (another interpretation is maximizing the variance):
$$\max \left(\left|\vec{x}_{i} \cdot \vec{u}_{1}\right|\right)$$
The projection of the data point $x_{i}$ onto $u_{1}$ is given by the inner product $\vec{x}_{i} \cdot \vec{u}_{1}$.
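For intuition, a tiny numpy check (my own example values, not from the post) that the inner product with a unit vector is exactly the projection length:

import numpy as np

x = np.array([3.0, 4.0])
u1 = np.array([1.0, 0.0])          # unit vector
proj_len = x @ u1                  # inner product = length of the projection onto u1
print(proj_len)                    # 3.0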
3. Derivation
Therefore, we need to maximize this expression:
$\frac{1}{n} \sum_{i=1}^{n}\left|\vec{x}_{i} \cdot \vec{u}_{1}\right|$
=> $\frac{1}{n} \sum_{i=1}^{n}\left(\vec{x}_{i} \cdot \vec{u}_{1}\right)^{2}$
=> $\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}^{T} u_{1}\right)^{2}$, ($\vec{x}_{i} \cdot \vec{u}_{1}=x_{i}^{T} u_{1}$)
=> $\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}^{T} u_{1}\right)^{T}\left(x_{i}^{T} u_{1}\right)$ (since $x_{i}^{T} u_{1}$ is a scalar, its square equals its transpose times itself)
=> $\frac{1}{n} \sum_{i=1}^{n} u_{1}^{T} x_{i} x_{i}^{T} u_{1}$
=> $\frac{1}{n} u_{1}^{T}\left(\sum_{i=1}^{n} x_{i} x_{i}^{T}\right) u_{1}$
=> $\frac{1}{n} u_{1}^{T} X X^{T} u_{1}$, where $X=\left[x_{1}\ x_{2}\ \cdots\ x_{n}\right]$, $X^{T}=\left[\begin{array}{c}x_{1}^{T} \\ \vdots \\ x_{n}^{T}\end{array}\right]$, and $X X^{T}=\sum_{i=1}^{n} x_{i} x_{i}^{T}$
At this point we need the maximum of the expression above. $\frac{1}{n} u_{1}^{T} X X^{T} u_{1}$ is a standard quadratic form, and $X X^{T}$ is a positive semi-definite matrix (all of its eigenvalues are greater than or equal to 0).
Proof: $X X^{T}$ is positive semi-definite.
Suppose $\lambda$ is an eigenvalue of $X X^{T}$ with corresponding eigenvector $\xi$. Then:
=> $X X^{T} \xi=\lambda \xi$ (definition of eigenvalue and eigenvector)
=> $\left(X X^{T} \xi\right)^{T} \xi=(\lambda \xi)^{T} \xi$
=> $\xi^{T} X X^{T} \xi=\lambda \xi^{T} \xi$
=> $\xi^{T} X X^{T} \xi=\left(X^{T} \xi\right)^{T}\left(X^{T} \xi\right)=\left\|X^{T} \xi\right\|^{2}=\lambda \xi^{T} \xi=\lambda\|\xi\|^{2}$
=> $\left\|X^{T} \xi\right\|^{2}=\lambda\|\xi\|^{2} \;\Rightarrow\; \lambda \geq 0$
□
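As a quick numerical sanity check of this property, here is a small numpy sketch (my own illustration with assumed random data, not part of the original derivation):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 20))       # X = [x_1 ... x_n], here with d = 5, n = 20
S = X @ X.T                        # the matrix X X^T from the derivation

print(np.linalg.eigvalsh(S))       # all eigenvalues >= 0 (up to floating-point noise)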
Therefore a maximum exists. How do we find it?
4. Solution Methods
Method 1: Lagrange multipliers
The objective function and the constraint form the maximization problem:
$$\left\{\begin{array}{c}{\max \left\{u_{1}^{T} X X^{T} u_{1}\right\}} \\ {u_{1}^{T} u_{1}=1}\end{array}\right.$$
Construct the Lagrange function:
$$f\left(u_{1}\right)=u_{1}^{T} X X^{T} u_{1}+\lambda\left(1-u_{1}^{T} u_{1}\right)$$
Take the derivative with respect to $u_{1}$ and set it to 0:
$$\frac{\partial f}{\partial u_{1}}=2 X X^{T} u_{1}-2 \lambda u_{1}=0 \;\Rightarrow\; X X^{T} u_{1}=\lambda u_{1}$$
Clearly, $u_{1}$ is an eigenvector of $X X^{T}$ corresponding to the eigenvalue $\lambda$! Moreover, every eigenvalue/eigenvector pair of $X X^{T}$ satisfies this equation. Substituting it back into the objective function:
$$u_{1}^{T} X X^{T} u_{1}=\lambda u_{1}^{T} u_{1}=\lambda$$
Therefore, taking the largest eigenvalue gives the largest objective value.
You might wonder: why does a zero first derivative correspond to a maximum? Take the second derivative:
$$\frac{\partial^{2} f}{\partial u_{1} \partial u_{1}^{T}}=2\left(X X^{T}-\lambda I\right)$$
When $\lambda$ is the largest eigenvalue, $X X^{T}-\lambda I$ is negative semi-definite.
So the objective function attains its maximum at the eigenvector corresponding to the largest eigenvalue. The first principal axis is the direction of the eigenvector corresponding to the largest eigenvalue, the second principal axis is the direction of the eigenvector corresponding to the second largest eigenvalue, and so on.
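The result of Method 1 translates directly into code: compute the eigendecomposition of $\frac{1}{n} X X^{T}$ and sort the eigenvectors by eigenvalue. A minimal numpy sketch (illustrative only; the sample data and names are my own assumptions):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 200))             # columns are the samples x_i
X = X - X.mean(axis=1, keepdims=True)     # center

S = X @ X.T / X.shape[1]                  # (1/n) X X^T
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
principal_axes = eigvecs[:, order]        # column k is the (k+1)-th principal axis
print(eigvals[order])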
Method 2: Singular value decomposition
For a vector $x$, the square of its 2-norm (its length) is:
$$\|x\|_{2}^{2}=\langle x, x\rangle=x^{T} x$$
The objective function can therefore be rewritten as:
$$u_{1}^{T} X X^{T} u_{1}=\left(X^{T} u_{1}\right)^{T}\left(X^{T} u_{1}\right)=\langle X^{T} u_{1}, X^{T} u_{1}\rangle=\left\|X^{T} u_{1}\right\|_{2}^{2}$$
The problem becomes: for a matrix acting on a vector, how do we make the 2-norm of the transformed vector as large as possible?
Introduce a theorem:
$$\frac{\|A x\|}{\|x\|} \leq \sigma_{1}(A)=\|A\|_{2}$$
$\sigma_{1}(A)$ denotes the largest singular value of the matrix $A$; the singular values of $A$ are the square roots of the eigenvalues of $A A^{T}$ (or $A^{T} A$), which were shown above to be $\geq 0$.
Proof:
Suppose the symmetric matrix $A^{T} A \in \mathbb{C}^{n \times n}$ has $n$ eigenvalues $\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{n} \geq 0$ with corresponding eigenvectors $\xi_{1}, \xi_{2}, \cdots, \xi_{n}$.
The eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are pairwise orthogonal, so these eigenvectors form an orthonormal basis of the space.
Take any vector $x$ and express it as:
$x=\sum_{i=1}^{n} \alpha_{i} \xi_{i}$ (1)
Then:
$\|x\|_{2}^{2}=\langle x, x\rangle=\alpha_{1}^{2}+\cdots+\alpha_{n}^{2}$ (2)
Also:
$\|A x\|_{2}^{2}=\langle A x, A x\rangle=(A x)^{T} A x=x^{T} A^{T} A x=\langle x, A^{T} A x\rangle$ (3)
Substituting (1) into (3):
$\langle x, A^{T} A x\rangle=\langle\alpha_{1} \xi_{1}+\cdots+\alpha_{n} \xi_{n},\ \lambda_{1} \alpha_{1} \xi_{1}+\cdots+\lambda_{n} \alpha_{n} \xi_{n}\rangle$
Since the unit eigenvectors are pairwise orthogonal (the inner product of an eigenvector with itself is 1, with a different eigenvector is 0), this becomes:
$=\lambda_{1} \alpha_{1}^{2}+\cdots+\lambda_{n} \alpha_{n}^{2}$
Using the ordering of the eigenvalues:
$\leq \lambda_{1}\left(\alpha_{1}^{2}+\cdots+\alpha_{n}^{2}\right)=\lambda_{1}\|x\|_{2}^{2}$
Therefore:
$\frac{\|A x\|_{2}}{\|x\|_{2}} \leq \sqrt{\lambda_{1}}=\sigma_{1}$
□
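The inequality can also be checked numerically. A small sketch (my own, with an assumed random matrix): random vectors never exceed the bound, while the top right-singular vector attains it.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6))
sigma1 = np.linalg.svd(A, compute_uv=False)[0]        # largest singular value

x = rng.normal(size=(6, 1000))                        # 1000 random vectors
ratios = np.linalg.norm(A @ x, axis=0) / np.linalg.norm(x, axis=0)
print(ratios.max() <= sigma1 + 1e-12)                 # True: ratio never exceeds sigma_1

_, _, Vt = np.linalg.svd(A)
v1 = Vt[0]                                            # eigenvector of A^T A for lambda_1
print(np.isclose(np.linalg.norm(A @ v1), sigma1))     # True: the bound is attained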
-------------------
Clearly, the maximum is attained when $x=\xi_{1}$:
$\left\|A \xi_{1}\right\|_{2}^{2}=\langle\xi_{1}, A^{T} A \xi_{1}\rangle=\langle\xi_{1}, \lambda_{1} \xi_{1}\rangle=\lambda_{1}$, so $\left\|A \xi_{1}\right\|_{2}=\sqrt{\lambda_{1}}=\sigma_{1}$.
We need to maximize the objective function:
$u_{1}^{T} X X^{T} u_{1}=\left\|X^{T} u_{1}\right\|_{2}^{2}$
Replacing the matrix $A$ with $X^{T}$, the direction of $u_{1}$ is the direction of the eigenvector corresponding to the largest eigenvalue of $X X^{T}$; the second principal axis follows similarly, and so on.
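In code, this SVD route and the eigendecomposition route of Method 1 give the same axes. A short numpy sketch (illustrative, with assumed random data):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 200))                          # columns are samples
X = X - X.mean(axis=1, keepdims=True)

eigvals, eigvecs = np.linalg.eigh(X @ X.T)             # Method 1: eigenvectors of X X^T
U, s, _ = np.linalg.svd(X, full_matrices=False)        # Method 2: left singular vectors of X
                                                       # (= right singular vectors of A = X^T)

print(np.allclose(np.sort(s**2), np.sort(eigvals)))            # sigma_i^2 = lambda_i
print(np.allclose(np.abs(U[:, 0]), np.abs(eigvecs[:, -1])))    # same first axis, up to sign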
PS: On the proportion of information retained by the principal components
As discussed above, the maximum along the first principal axis is the largest singular value (the square root of the largest eigenvalue), the second principal axis corresponds to the second-largest singular value, and so on. Suppose we keep the principal axes corresponding to the first $r$ singular values (sorted in descending order) as the principal components; the proportion of data information retained is:
$$\frac{\sum_{i=1}^{r} \sigma_{i}^{2}}{\sum_{i=1}^{n} \sigma_{i}^{2}}$$
The numerator is the sum of squares of the first $r$ singular values, and the denominator is the sum of squares of all singular values.
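This ratio is easy to compute from the singular values; a short sketch with assumed random data:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 300))
X = X - X.mean(axis=1, keepdims=True)

s = np.linalg.svd(X, compute_uv=False)     # singular values, descending
ratio = np.cumsum(s**2) / np.sum(s**2)     # retained-information ratio for r = 1, 2, ...
print(ratio)                               # last entry is 1.0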
Summary
The two methods give the same result. The first treats it as a constrained optimization problem, while the second uses matrix theory; the lines of reasoning differ. A code implementation can use the matrix approach directly: computing the eigenvalues and eigenvectors is enough.
Also, since the formatting here is rather ugly, I wrote LaTeX code for the derivation. The formulas are correct, and the compiled PDF looks roughly like this: (screenshot omitted)
I originally wanted to put it on Overleaf, but my network connection was too slow, so the code is below:
\documentclass[11pt]{article}
\newcommand{\numpy}{{\tt numpy}} % tt font for numpy
\topmargin -.5in
\textheight 9in
\oddsidemargin -.25in
\evensidemargin -.25in
\textwidth 7in
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{amsthm}
\begin{document}
% ========== Edit your name here
\author{WiZ}
\title{The Principle of PCA}
\date{5th June 2019}
\maketitle
\medskip
% ========== Begin answering questions here
\section{Problem}
We want to map the data $x$ onto the hyperplane $u$, which is spanned by the vectors $u_{1}$ and $u_{2}$. The problem is:
\textbf{how to minimize the distance from $x$ to the plane $u$.}
\section{Transform}
We transform the minimum-distance problem into a maximum-projection problem (others describe it as maximizing the variance), that is:
\begin{align}
max(\left|\vec{x}_{i} \cdot \vec{u}_{1}\right|)
\end{align}
\section{Derivation}
Assume the data $\left\{\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{n}\right\}$ has been centered. The problem is to maximize:
\begin{equation}
\frac{1}{n} \sum_{i=1}^{n}\left|\vec{x}_{i} \cdot \vec{u}_{1}\right|
\end{equation}
This is equivalent to maximizing:
\begin{equation}
\frac{1}{n} \sum_{i=1}^{n}\left(\vec{x}_{i} \cdot \vec{u}_{1}\right)^{2}
\end{equation}
In this way, we have:
\begin{align}
\frac{1}{n} \sum_{i=1}^{n}\left|\vec{x}_{i} \cdot \vec{u}_{1}\right|^{2} & = \frac{1}{n} \sum_{i=1}^{n}\left(\vec{x}_{i} \cdot \vec{u}_{1}\right)^{2}\\
& = \frac{1}{n} \sum_{i=1}^{n}\left(x_{i}^{T} u_{1}\right)^{2}\\
& = \frac{1}{n} \sum_{i=1}^{n}\left(x_{i}^{T} u_{1}\right)^{T}\left(x_{i}^{T} u_{1}\right)\\
& = \frac{1}{n} \sum_{i=1}^{n} u_{1}^{T} x_{i} x_{i}^{T} u_{1}\\
& = \frac{1}{n} u_{1}^{T}\left(\sum_{i=1}^{n} x_{i} x_{i}^{T}\right) u_{1}\\
& = \frac{1}{n} u_{1}^{T} X X^{T} u_{1}
\end{align}
(4)$\rightarrow$(5):$\vec{x}_{i} \cdot \vec{u}_{1}=x_{i}^{T} u_{1}$\\
(5)$\rightarrow$(6): $x_{i}^{T} u_{1}$ is a scalar, so $\left(x_{i}^{T} u_{1}\right)^{2}=\left(x_{i}^{T} u_{1}\right)^{T}\left(x_{i}^{T} u_{1}\right)$\\
(8)$\rightarrow$(9):\(X=\left[\begin{array}{llll}{x_{1}} & {x_{2}} & {\cdots} & {x_{n}}\end{array}\right]\), \(X^{T}=\left[\begin{array}{c}{x_{1}^{T}} \\ {x_{2}^{T}} \\ {\vdots} \\ {x_{n}^{T}}\end{array}\right]\) and \(X X^{T}=\sum_{i=1}^{n} x_{i} x_{i}^{T}\)\\
\\
To maximize \(\frac{1}{n} u_{1}^{T} X X^{T} u_{1}\), we first prove that \(X X^{T}\) is a positive semi-definite matrix (equivalent to proving that all of its eigenvalues are greater than or equal to 0).\\
\begin{proof}
$X X^{T}$ is a positive semi-definite matrix.\\
Assume $X X^{T}$ has an eigenvalue \(\lambda\) with corresponding eigenvector \(\xi\). Then:\\
\(\Rightarrow\)$\quad$$\quad$\(X X^{T} \xi=\lambda \xi\)\\
\(\Rightarrow\)$\quad$$\quad$\(\left(X X^{T} \xi\right)^{T} \xi=(\lambda \xi)^{T} \xi\)\\
\(\Rightarrow\)$\quad$$\quad$\(\xi^{T} X X^{T} \xi=\lambda \xi^{T} \xi\)\\
\(\Rightarrow\)$\quad$$\quad$\(\xi^{T} X X^{T} \xi=\left(X^{T} \xi\right)^{T}\left(X^{T} \xi\right)=\left\|X^{T} \xi\right\|^{2}=\lambda \xi^{T} \xi=\lambda\|\xi\|^{2}\)\\
\(\Rightarrow\)$\quad$$\quad$\(\left\|X^{T} \xi\right\|^{2}=\lambda\|\xi\|^{2} \rightarrow \lambda \geq 0\)\\
\end{proof}
\section{Solution}
Having proved this, \(\frac{1}{n} u_{1}^{T} X X^{T} u_{1}\) has a maximum. How do we find it?
\subsection{Method 1: Lagrange multiplier}
The objective function and the constraint form the maximization problem:
\begin{equation}
\left\{\begin{array}{c}{\max \left\{u_{1}^{T} X X^{T} u_{1}\right\}} \\ {u_{1}^{T} u_{1}=1}\end{array}\right.
\end{equation}
Construct Lagrange function:
$$
f\left(u_{1}\right)=u_{1}^{T} X X^{T} u_{1}+\lambda\left(1-u_{1}^{T} u_{1}\right)
$$
Take the derivative with respect to $u_{1}$ and set it equal to 0:
$$
\frac{\partial f}{\partial u_{1}}=2 X X^{T} u_{1}-2 \lambda u_{1}=0 \rightarrow X X^{T} u_{1}=\lambda u_{1}
$$
Obviously, $u_{1}$ is an eigenvector of $X X^{T}$. Substituting this back into the objective function:
$$
u_{1}^{T} X X^{T} u_{1}=\lambda u_{1}^{T} u_{1}=\lambda
$$
So, taking the maximum eigenvalue gives the maximum objective value. You might wonder: why does a zero first derivative correspond to a maximum here? Take the second derivative:
$$
\frac{\partial^{2} f}{\partial u_{1} \partial u_{1}^{T}}=2\left(X X^{T}-\lambda I\right)
$$
When \(\lambda\) is the largest eigenvalue, \(X X^{T}-\lambda I\) is a negative semi-definite matrix, so the stationary point is a maximum.
Therefore, the objective function attains its maximum at the eigenvector corresponding to the largest eigenvalue. The first principal axis is the direction of the eigenvector corresponding to the largest eigenvalue, the second principal axis is the direction of the eigenvector corresponding to the second largest eigenvalue, and so on.
\subsection{Method 2: Singular value decomposition}
For a vector $x$, the square of its 2-norm (its length) is:
$$
\|x\|_{2}^{2}=<x, x>=x^{T} x
$$
The objective function is converted to:
$$
u_{1}^{T} X X^{T} u_{1}=\left(X^{T} u_{1}\right)^{T}\left(X^{T} u_{1}\right)=<X^{T} u_{1}, X^{T} u_{1}>=\left\|X^{T} u_{1}\right\|_{2}^{2}
$$
The problem becomes: for a matrix acting on a vector, how do we make the 2-norm of the transformed vector as large as possible?
Introduce a theorem:
$$
\frac{\|A x\|}{\|x\|} \leq \sigma_{1}(A)=\|A\|_{2}
$$
\(\sigma_{1}(A)\) denotes the maximum singular value of the matrix $A$; the singular values of $A$ are the square roots of the eigenvalues of $A A^{T}$ (or $A^{T} A$), which were shown above to be $\geq 0$.
We first prove this:
\begin{proof}
$$
\frac{\|A x\|}{\|x\|} \leq \sigma_{1}(A)=\|A\|_{2}
$$
Assume the symmetric matrix \(A^{T} A \in \mathrm{C}^{n \times n}\) has $n$ eigenvalues \(\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{n} \geq 0\) with corresponding eigenvectors \(\xi_{1}, \xi_{2}, \cdots, \xi_{n}\), which form an orthonormal basis. Take any vector $x$ and write it as:\\
$$
x=\sum_{i=1}^{n} \alpha_{i} \xi_{i}
$$
And we have:
$$
\|x\|_{2}^{2}=<x, x>=\alpha_{1}^{2}+\cdots+\alpha_{n}^{2}
$$
Also:
$$
\|A x\|_{2}^{2}=<A x, A x>=(A x)^{T} A x=x^{T} A^{T} A x=<x, A^{T} A x>
$$
Substitute \( x=\sum_{i=1}^{n} \alpha_{i} \xi_{i}\) into the equation above:
$$
\begin{aligned}<x, A^{T} A x>&=<\alpha_{1} \xi_{1}+\cdots+\alpha_{n} \xi_{n}, \alpha_{1} A^{T} A \xi_{1}+\cdots+\alpha_{n} A^{T} A \xi_{n}>\\ &=<\alpha_{1} \xi_{1}+\cdots+\alpha_{n} \xi_{n}, \lambda_{1} \alpha_{1} \xi_{1}+\cdots+\lambda_{n} \alpha_{n} \xi_{n}>\\ &=\lambda_{1} \alpha_{1}^{2}+\cdots+\lambda_{n} \alpha_{n}^{2}\\ &\leq \lambda_{1}\left(\alpha_{1}^{2}+\cdots+\alpha_{n}^{2}\right)=\lambda_{1}\|x\|_{2}^{2}
\end{aligned}
$$
Therefore:
$$
\frac{\|A x\|_{2}}{\|x\|_{2}} \leq \sqrt{\lambda_{1}}=\sigma_{1}
$$
Note: since the unit eigenvectors are pairwise orthogonal, the inner product of an eigenvector with itself is 1 and with a different eigenvector is 0.
\end{proof}
Clearly, when $x = \xi_{1}$, the bound is attained and $A$ reaches its maximum singular value:
$$
\left\|A \xi_{1}\right\|_{2}^{2}=<\xi_{1}, A^{T} A \xi_{1}>=<\xi_{1}, \lambda_{1} \xi_{1}>=\lambda_{1}
$$
$$
\left\|A \xi_{1}\right\|_{2}=\sqrt{\lambda_{1}}=\sigma_{1}
$$
Back to our problem:
$$
u_{1}^{T} X X^{T} u_{1}=\left\|X^{T} u_{1}\right\|_{2}^{2}
$$
Replacing $A$ with $X^{T}$, $u_{1}$ is the eigenvector corresponding to the largest eigenvalue; the second principal axis follows similarly, and so on.
\section{Conclusion}
First, transform the problem.\\
Next, apply some basic matrix knowledge.\\
Last, be patient and careful.
\footnote{Have fun studying.}
\end{document}
Reference: PCA推导过程 (PCA derivation process)