PCA (Principal Component Analysis)
- Pros: reduces the complexity of the data and identifies the most important features
- Cons: not always necessary, and useful information may be lost
- Applicable data type: numerical data
1. Vectors and bases
- Inner product: $(a_1,a_2,\cdots,a_n)^T\cdot(b_1,b_2,\cdots,b_n)^T=a_1b_1+a_2b_2+\cdots+a_nb_n$
- Geometric interpretation: $A\cdot B=|A||B|\cos(\theta)$
- If the norm of $B$ is 1, the inner product of $A$ and $B$ equals the signed length of the projection of $A$ onto the line through $B$
- A vector written as $(3,2)$ actually denotes the linear combination $x(1,0)^T+y(0,1)^T$; $(1,0)$ and $(0,1)$ form a basis of two-dimensional space
- This basis is orthogonal (inner product 0, i.e. the vectors are mutually perpendicular)
- Requirement: the basis vectors must be linearly independent
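The projection interpretation of the inner product can be illustrated with a small NumPy sketch (the vectors here are illustrative, not taken from the text):

```python
import numpy as np

# Two 2-D vectors; b is a unit vector, so |b| = 1
a = np.array([3.0, 2.0])
b = np.array([1.0, 0.0])

# Inner product: a . b = a1*b1 + a2*b2
dot = np.dot(a, b)

# Since |b| = 1, a . b equals |a| cos(theta), the length of
# a's projection onto the line through b
theta = np.arctan2(a[1], a[0])          # angle between a and b (b lies on the x-axis)
proj_len = np.linalg.norm(a) * np.cos(theta)

print(dot, proj_len)  # both equal 3.0
```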
2. Change of basis
- Transformation: take the inner product of the data with the first basis vector to get the first new coordinate, then with the second basis vector to get the second new coordinate, and so on
$$\begin{pmatrix}p_1 \\ p_2 \\ \vdots \\ p_R\end{pmatrix}(a_1\ a_2\ \cdots\ a_M)=\begin{pmatrix}p_1a_1 & p_1a_2 & \cdots & p_1a_M \\ p_2a_1 & p_2a_2 & \cdots & p_2a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_Ra_1 & p_Ra_2 & \cdots & p_Ra_M\end{pmatrix}$$
- The meaning of this matrix product: each column vector of the right matrix is transformed into the space whose basis is given by the rows of the left matrix
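A quick sketch of this basis-change rule, using a hypothetical orthonormal basis rotated 45° (the same one that reappears in the worked example later):

```python
import numpy as np

# Rows of P are the new basis vectors (orthonormal, rotated 45 degrees)
P = np.array([[ 1/np.sqrt(2), 1/np.sqrt(2)],
              [-1/np.sqrt(2), 1/np.sqrt(2)]])

# The point (3, 2) as a column vector
A = np.array([[3.0],
              [2.0]])

# Each entry of P @ A is the inner product of one basis row with the point,
# i.e. the coordinate of the point in the new basis
print(P @ A)  # [[5/sqrt(2)], [-1/sqrt(2)]]
```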
3. Covariance
- Direction: how should this direction (i.e. the basis) be chosen so as to retain as much of the original information as possible? Intuitively, we want the projected values to be as spread out as possible
- Variance: $Var(a)=\frac{1}{m}\sum_{i=1}^{m}(a_i-\mu)^2$
- Goal: find a one-dimensional basis such that, after transforming all data into coordinates on that basis, the variance is maximized
- Covariance (assuming zero mean): $Cov(a,b)=\frac{1}{m}\sum_{i=1}^{m}a_ib_i$
- If we simply kept choosing directions of maximum variance, each subsequent direction would nearly coincide with the first. For the two fields to express as much of the original information as possible, we do not want any (linear) correlation between them
- The covariance of two fields measures their correlation
- When the covariance is 0, the two fields are uncorrelated. To make the covariance 0, the second basis vector must be chosen orthogonal to the first, so the directions finally chosen are necessarily orthogonal.
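Both formulas can be checked directly on the two zero-mean fields used in the worked example later in these notes:

```python
import numpy as np

# Two zero-mean fields (5 samples each) -- the example data from the text
a = np.array([-1., -1., 0., 2., 0.])
b = np.array([-2.,  0., 0., 1., 1.])
m = len(a)

var_a  = np.sum(a**2) / m    # Var(a)   = (1/m) sum a_i^2   (mean is 0)
cov_ab = np.sum(a * b) / m   # Cov(a,b) = (1/m) sum a_i b_i

print(var_a, cov_ab)  # 1.2 and 0.8, i.e. 6/5 and 4/5
```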
4. The PCA optimization objective
- **Objective:** reduce a set of N-dimensional vectors to K dimensions (0 < K < N) by choosing K orthonormal basis vectors such that, after the original data is transformed onto this basis, the pairwise covariances between fields are 0 and the variance of each field is as large as possible
- Covariance matrix: $X=\begin{pmatrix}a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m\end{pmatrix},\qquad \frac{1}{m}XX^T=\begin{pmatrix}\frac{1}{m}\sum_{i=1}^{m}a_i^2 & \frac{1}{m}\sum_{i=1}^{m}a_ib_i \\ \frac{1}{m}\sum_{i=1}^{m}a_ib_i & \frac{1}{m}\sum_{i=1}^{m}b_i^2\end{pmatrix}$
- The two diagonal entries are the variances of the two fields; the off-diagonal entries are the covariance of a and b
- **Diagonalizing the covariance matrix:** make every off-diagonal entry 0 and arrange the diagonal entries from largest to smallest
- Diagonalization: $PCP^T=\Lambda=\begin{pmatrix}\lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n\end{pmatrix}$
- Sort the eigenvectors from top to bottom by decreasing eigenvalue; multiplying the matrix formed by the first K rows with the original data matrix X gives the reduced data matrix Y
- Real symmetric matrix: an n×n real symmetric matrix always has n orthonormal eigenvectors $E=(e_1\ e_2\ \cdots\ e_n)$
- A real symmetric matrix can therefore be diagonalized: $E^TCE=\Lambda=\begin{pmatrix}\lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n\end{pmatrix}$
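A minimal sketch of this diagonalization property, using the covariance matrix from the worked example below (note that `numpy.linalg.eigh` returns eigenvalues in ascending order, whereas the text sorts them descending):

```python
import numpy as np

# Covariance matrix C from the worked example: 6/5 on the diagonal, 4/5 off it
C = np.array([[1.2, 0.8],
              [0.8, 1.2]])

# For a real symmetric matrix, eigh returns orthonormal eigenvectors
# as the columns of E
eigvals, E = np.linalg.eigh(C)

# E^T C E is diagonal, with the eigenvalues on the diagonal
Lam = E.T @ C @ E
print(np.round(Lam, 10))  # diag(0.4, 2.0), ascending order
```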
5. Worked PCA example
- Data: $\begin{pmatrix}-1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1\end{pmatrix}$, five samples of two-dimensional data
- Covariance matrix: $C=\frac{1}{5}\begin{pmatrix}-1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1\end{pmatrix}\begin{pmatrix}-1 & -2 \\ -1 & 0 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1\end{pmatrix}=\begin{pmatrix}\frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5}\end{pmatrix}$; the diagonal entries are the variances, the off-diagonal entries the covariance
- Eigenvalues: $\lambda_1=2,\ \lambda_2=\frac{2}{5}$
- Eigenvectors: $c_1\begin{pmatrix}1 \\ 1\end{pmatrix},\ c_2\begin{pmatrix}-1 \\ 1\end{pmatrix}$; these still need to be normalized
- Diagonalization: $PCP^T=\begin{pmatrix}\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}\begin{pmatrix}\frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5}\end{pmatrix}\begin{pmatrix}\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}=\begin{pmatrix}2 & 0 \\ 0 & \frac{2}{5}\end{pmatrix}$
- Dimensionality reduction: $Y=\begin{pmatrix}\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}\begin{pmatrix}-1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1\end{pmatrix}=\begin{pmatrix}-\frac{3}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & \frac{3}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}$
- Summary: compute the covariance matrix, find its eigenvalues and eigenvectors, take the K largest eigenvalues, normalize the corresponding eigenvectors, and multiply the resulting matrix with the original data matrix
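The steps summarized above can be verified numerically; this sketch reproduces the worked example with NumPy (the eigenvector sign is normalized so that it matches $(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$):

```python
import numpy as np

# The 2-D, 5-sample data matrix from the example (already zero-mean)
X = np.array([[-1., -1., 0., 2., 0.],
              [-2.,  0., 0., 1., 1.]])
m = X.shape[1]

C = X @ X.T / m                 # covariance matrix: [[6/5, 4/5], [4/5, 6/5]]
eigvals, E = np.linalg.eigh(C)  # eigenvalues in ascending order: 2/5, 2

# Basis for K=1: the eigenvector belonging to the largest eigenvalue
p = E[:, -1]
if p[0] < 0:                    # normalize the sign to (1/sqrt(2), 1/sqrt(2))
    p = -p

Y = p @ X                       # 1-D coordinates: (-3, -1, 0, 3, 1) / sqrt(2)
print(np.round(Y * np.sqrt(2), 10))
```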
Code example 1: implementing the PCA algorithm
- Pseudocode:
Remove the mean
Compute the covariance matrix
Compute the eigenvalues and eigenvectors of the covariance matrix
Sort the eigenvalues from largest to smallest
Keep the top N eigenvectors
Transform the data into the new space spanned by those N eigenvectors
from numpy import *

def loadDataSet(filename, delim='\t'):
    fr = open(filename)
    stringArr = [line.strip().split(delim) for line in fr.readlines()]
    # map applies float to every element of each row and
    # returns the converted row as a list
    datArr = [list(map(float, line)) for line in stringArr]
    return mat(datArr)

def pca(dataMat, topNfeat=9999999):
    meanVals = mean(dataMat, axis=0)             # column-wise mean
    meanRemoved = dataMat - meanVals             # remove the mean
    # rowvar=0 means each row of the input is one sample;
    # a nonzero value would mean each column is one sample
    covMat = cov(meanRemoved, rowvar=0)
    eigVals, eigVects = linalg.eig(mat(covMat))  # eigenvalues and eigenvectors
    eigValInd = argsort(eigVals)                 # indices sorting eigenvalues ascending
    eigValInd = eigValInd[:-(topNfeat+1):-1]     # top-N indices, largest first
    redEigVects = eigVects[:, eigValInd]         # the corresponding eigenvectors
    lowDDataMat = meanRemoved * redEigVects      # project data into the new space
    reconMat = (lowDDataMat * redEigVects.T) + meanVals  # reconstruct for plotting
    return lowDDataMat, reconMat
dataMat = loadDataSet('testSet.txt')
lowDMat, reconMat = pca(dataMat, 1)
shape(lowDMat)
(1000, 1)
import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dataMat[:,0].flatten().A[0], dataMat[:,1].flatten().A[0], marker="^", s=90)
ax.scatter(reconMat[:,0].flatten().A[0], reconMat[:,1].flatten().A[0], marker="o", s=50, c="red")
<matplotlib.collections.PathCollection at 0x1e6c17a1518>
lowDMat, reconMat = pca(dataMat, 2)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dataMat[:,0].flatten().A[0], dataMat[:,1].flatten().A[0], marker="^", s=90)
ax.scatter(reconMat[:,0].flatten().A[0], reconMat[:,1].flatten().A[0], marker="o", s=50, c="red")
<matplotlib.collections.PathCollection at 0x1e6c180ada0>
Code example 2: using PCA to reduce semiconductor manufacturing data
- This dataset contains 590 features that we want to reduce. It can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/secom/
- The dataset contains many missing values, marked as NaN. There are several ways to handle them; below, each missing value is replaced by the mean of the non-NaN values of its feature.
# replace NaN entries with the column mean
def replaceNanWithMean():
    datMat = loadDataSet('secom.data', ' ')
    numFeat = shape(datMat)[1]
    for i in range(numFeat):
        # mean of the non-NaN values in column i
        meanVal = mean(datMat[nonzero(~isnan(datMat[:,i].A))[0], i])
        datMat[nonzero(isnan(datMat[:, i].A))[0], i] = meanVal
    return datMat
dataMat = replaceNanWithMean()
meanVals = mean(dataMat, axis=0)
meanRemoved = dataMat - meanVals
covMat = cov(meanRemoved, rowvar=0)
eigVals, eigVects = linalg.eig(mat(covMat))
eigVals
array([ 5.34151979e+07, 2.17466719e+07, 8.24837662e+06, 2.07388086e+06,
1.31540439e+06, 4.67693557e+05, 2.90863555e+05, 2.83668601e+05,
2.37155830e+05, 2.08513836e+05, 1.96098849e+05, 1.86856549e+05,
1.52422354e+05, 1.13215032e+05, 1.08493848e+05, 1.02849533e+05,
1.00166164e+05, 8.33473762e+04, 8.15850591e+04, 7.76560524e+04,
6.66060410e+04, 6.52620058e+04, 5.96776503e+04, 5.16269933e+04,
5.03324580e+04, 4.54661746e+04, 4.41914029e+04, 4.15532551e+04,
3.55294040e+04, 3.31436743e+04, 2.67385181e+04, 1.47123429e+04,
1.44089194e+04, 1.09321187e+04, 1.04841308e+04, 9.48876548e+03,
8.34665462e+03, 7.22765535e+03, 5.34196392e+03, 4.95614671e+03,
4.23060022e+03, 4.10673182e+03, 3.41199406e+03, 3.24193522e+03,
2.74523635e+03, 2.35027999e+03, 2.16835314e+03, 1.86414157e+03,
1.76741826e+03, 1.70492093e+03, 1.66199683e+03, 1.53948465e+03,
1.33096008e+03, 1.25591691e+03, 1.15509389e+03, 1.12410108e+03,
1.03213798e+03, 1.00972093e+03, 9.50542179e+02, 9.09791361e+02,
8.32001551e+02, 8.08898242e+02, 7.37343627e+02, 6.87596830e+02,
5.64452104e+02, 5.51812250e+02, 5.37209115e+02, 4.93029995e+02,
4.13720573e+02, 3.90222119e+02, 3.37288784e+02, 3.27558605e+02,
3.08869553e+02, 2.46285839e+02, 2.28893093e+02, 1.96447852e+02,
1.75559820e+02, 1.65795169e+02, 1.56428052e+02, 1.39671194e+02,
1.28662864e+02, 1.15624070e+02, 1.10318239e+02, 1.08663541e+02,
1.00695416e+02, 9.80687852e+01, 8.34968275e+01, 7.53025397e+01,
6.89260158e+01, 6.67786503e+01, 6.09412873e+01, 5.30974002e+01,
4.71797825e+01, 4.50701108e+01, 4.41349593e+01, 4.03313416e+01,
3.95741636e+01, 3.74000035e+01, 3.44211326e+01, 3.30031584e+01,
3.03317756e+01, 2.88994580e+01, 2.76478754e+01, 2.57708695e+01,
2.44506430e+01, 2.31640106e+01, 2.26956957e+01, 2.16925102e+01,
2.10114869e+01, 2.00984697e+01, 1.86489543e+01, 1.83733216e+01,
1.72517802e+01, 1.60481189e+01, 1.54406997e+01, 1.48356499e+01,
1.44273357e+01, 1.42318192e+01, 1.35592064e+01, 1.30696836e+01,
1.28193512e+01, 1.22093626e+01, 1.15228376e+01, 1.12141738e+01,
1.02585936e+01, 9.86906139e+00, 9.58794460e+00, 9.41686288e+00,
9.20276340e+00, 8.63791398e+00, 8.20622561e+00, 8.01020114e+00,
7.53391290e+00, 7.33168361e+00, 7.09960245e+00, 7.02149364e+00,
6.76557324e+00, 6.34504733e+00, 6.01919292e+00, 5.81680918e+00,
5.44653788e+00, 5.12338463e+00, 4.79593185e+00, 4.47851795e+00,
4.50369987e+00, 4.27479386e+00, 3.89124198e+00, 3.56466892e+00,
3.32248982e+00, 2.97665360e+00, 2.61425544e+00, 2.31802829e+00,
2.17171124e+00, 1.99239284e+00, 1.96616566e+00, 1.88149281e+00,
1.79228288e+00, 1.71378363e+00, 1.68028783e+00, 1.60686268e+00,
1.47158244e+00, 1.40656712e+00, 1.37808906e+00, 1.27967672e+00,
1.22803716e+00, 1.18531109e+00, 9.38857180e-01, 9.18222054e-01,
8.26265393e-01, 7.96585842e-01, 7.74597255e-01, 7.14002770e-01,
6.79457797e-01, 6.37928310e-01, 6.24646758e-01, 5.34605353e-01,
4.60658687e-01, 4.24265893e-01, 4.08634622e-01, 3.70321764e-01,
3.67016386e-01, 3.35858033e-01, 3.29780397e-01, 2.94348753e-01,
2.84154176e-01, 2.72703994e-01, 2.63265991e-01, 2.45227786e-01,
2.25805135e-01, 2.22331919e-01, 2.13514673e-01, 1.93961935e-01,
1.91647269e-01, 1.83668491e-01, 1.82518017e-01, 1.65310922e-01,
1.57447909e-01, 1.51263974e-01, 1.39427297e-01, 1.32638882e-01,
1.28000027e-01, 1.13559952e-01, 1.12576237e-01, 1.08809771e-01,
1.07136355e-01, 8.60839655e-02, 8.50467792e-02, 8.29254355e-02,
7.03701660e-02, 6.44475619e-02, 6.09866327e-02, 6.05709478e-02,
5.93963958e-02, 5.22163549e-02, 4.92729703e-02, 4.80022983e-02,
4.51487439e-02, 4.30180504e-02, 4.13368324e-02, 4.03281604e-02,
3.91576587e-02, 3.54198873e-02, 3.31199510e-02, 3.13547234e-02,
3.07226509e-02, 2.98354196e-02, 2.81949091e-02, 2.49158051e-02,
2.36374781e-02, 2.28360210e-02, 2.19602047e-02, 2.00166957e-02,
1.86597535e-02, 1.80415918e-02, 1.72261012e-02, 1.60703860e-02,
1.49566735e-02, 1.40165444e-02, 1.31296856e-02, 1.21358005e-02,
1.07166503e-02, 1.01045695e-02, 9.76055340e-03, 9.16740926e-03,
8.78108857e-03, 8.67465278e-03, 8.30918514e-03, 8.05104488e-03,
7.56152126e-03, 7.31508852e-03, 7.26347037e-03, 6.65728354e-03,
6.50769617e-03, 6.28009879e-03, 6.19160730e-03, 5.64130272e-03,
5.30195373e-03, 5.07453702e-03, 4.47372286e-03, 4.32543895e-03,
4.22006582e-03, 3.97065729e-03, 3.75292740e-03, 3.64861290e-03,
3.38915810e-03, 3.27965962e-03, 3.06633825e-03, 2.99206786e-03,
2.83586784e-03, 2.74987243e-03, 2.31066313e-03, 2.26782346e-03,
1.82206662e-03, 1.74955624e-03, 1.69305161e-03, 1.66624597e-03,
1.55346749e-03, 1.51278404e-03, 1.47296800e-03, 1.33617458e-03,
1.30517592e-03, 1.24056353e-03, 1.19823961e-03, 1.14381059e-03,
1.13027458e-03, 1.11081803e-03, 1.08359152e-03, 1.03517496e-03,
1.00164593e-03, 9.50024604e-04, 8.94981182e-04, 8.74363843e-04,
7.98497544e-04, 7.51612219e-04, 6.63964301e-04, 6.21097643e-04,
6.18098604e-04, 5.72611402e-04, 5.57509230e-04, 5.47002381e-04,
5.27195076e-04, 5.11487997e-04, 4.87787872e-04, 4.74249071e-04,
4.52367688e-04, 4.24431100e-04, 4.19119024e-04, 3.72489906e-04,
3.38125455e-04, 3.34002143e-04, 2.97951371e-04, 2.84845901e-04,
2.79038287e-04, 2.77054476e-04, 2.67962796e-04, 2.54815125e-04,
2.29230595e-04, 1.99245436e-04, 1.90381389e-04, 1.84497913e-04,
1.77415682e-04, 1.68160613e-04, 1.63992030e-04, 1.58025552e-04,
1.54226003e-04, 1.35736724e-04, 1.40079892e-04, 1.46097433e-04,
1.46890640e-04, 1.22704034e-04, 1.16752515e-04, 1.14080847e-04,
1.04252870e-04, 9.90265099e-05, 9.66039063e-05, 9.60766570e-05,
9.16166346e-05, 9.07003476e-05, 8.60212633e-05, 8.32654023e-05,
7.70526076e-05, 7.36470021e-05, 7.24998306e-05, 6.80209909e-05,
6.68682701e-05, 6.14500430e-05, 5.99843180e-05, 5.49918002e-05,
5.24646951e-05, 5.13403843e-05, 5.02336254e-05, 4.89288504e-05,
4.51104474e-05, 4.29823765e-05, 4.18869715e-05, 4.14341561e-05,
3.94822845e-05, 3.80307292e-05, 3.57776535e-05, 3.43901591e-05,
2.98089203e-05, 2.72388358e-05, 1.46846459e-05, 2.42608885e-05,
1.66549051e-05, 2.30962279e-05, 2.27807559e-05, 2.14440814e-05,
1.96208174e-05, 1.88276186e-05, 1.91217363e-05, 1.43753346e-05,
1.39779892e-05, 7.36188593e-06, 1.21760519e-05, 1.20295835e-05,
8.34248007e-06, 1.13426750e-05, 1.09258905e-05, 8.93991858e-06,
9.23630207e-06, 1.02782992e-05, 1.01021810e-05, 9.64538300e-06,
9.72678797e-06, 7.20354828e-06, 6.69282813e-06, 6.49477814e-06,
5.91044556e-06, 6.00244889e-06, 5.67034893e-06, 5.31392220e-06,
5.09342484e-06, 4.65422046e-06, 4.45482134e-06, 4.11265577e-06,
3.48065951e-06, 3.65202836e-06, 3.77558985e-06, 2.78847699e-06,
2.57492503e-06, 2.66299628e-06, 2.39210232e-06, 2.06298821e-06,
2.00824521e-06, 1.76373602e-06, 1.58273269e-06, 1.32211395e-06,
1.44003524e-06, 1.49813697e-06, 1.10002716e-06, 1.42489429e-06,
9.01008864e-07, 8.49881106e-07, 7.62521870e-07, 6.57641102e-07,
5.85636641e-07, 5.33937361e-07, 4.16077216e-07, 3.33765858e-07,
2.95575265e-07, 2.54744632e-07, 2.20144574e-07, 1.86314528e-07,
1.77370970e-07, 1.54794345e-07, 1.39738552e-07, 1.47331688e-07,
1.04110968e-07, 1.00786519e-07, 9.38635091e-08, 9.10853310e-08,
8.71546326e-08, 7.48338889e-08, 6.06817435e-08, 5.66479201e-08,
5.24576912e-08, 4.57020646e-08, 2.89942624e-08, 2.60449426e-08,
2.10987990e-08, 2.17618741e-08, 1.75542294e-08, 1.34637029e-08,
1.27167437e-08, 1.23258200e-08, 1.04987513e-08, 9.86367964e-09,
8.49422040e-09, 9.33428124e-09, 7.42189761e-09, 6.46870680e-09,
6.84633797e-09, 5.76455749e-09, 5.01138012e-09, 3.48686431e-09,
2.77880627e-09, 2.91267178e-09, 1.73093441e-09, 1.42391225e-09,
1.80003583e-10, 6.95073560e-10, 6.13337791e-10, 9.24977136e-10,
1.16455057e-09, 1.11815869e-09, 1.97062440e-10, 2.61925018e-10,
5.27517926e-10, 1.94882420e-15, -1.35801994e-15, 5.42081315e-16,
-1.07767511e-17, 1.47709396e-18, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00])
#print eigVals
print (sum(eigVals)*0.9)
print (sum(eigVals[:6]))
plt.plot(eigVals[:20])
81131452.77696146
87267225.18122165
[<matplotlib.lines.Line2D at 0x1e6c18bc518>]
The first 6 principal components cover 96.8% of the variance and the first 20 cover 99.3%. Keeping the first 6 and dropping the remaining 584 components achieves a compression ratio of roughly 100:1.
Code example 3: computing PCA with sklearn's built-in method
from sklearn import decomposition
pca_sklearn = decomposition.PCA()
pca_sklearn.fit(replaceNanWithMean())
main_var = pca_sklearn.explained_variance_
print(sum(main_var)*0.9)
print(sum(main_var[:6]))
plt.plot(main_var[:20])
plt.show()
81131452.77696137
87267225.18122156
Summary
- Dimensionality reduction makes data easier to use and often removes noise, making other machine learning tasks more accurate.
- Many techniques exist for dimensionality reduction; among them, independent component analysis, factor analysis, and principal component analysis are popular, with PCA the most widely used.
- PCA identifies the principal features of the data by rotating the coordinate axes onto the directions of maximum variance: the direction of largest variance becomes the first axis, and each subsequent axis is orthogonal to the previous ones. This sequence of orthogonal axes is obtained from an eigenvalue analysis of the covariance matrix.
- Singular value decomposition can also be used for this eigen-analysis.
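As a sketch of that last point: the singular values of the centered data matrix yield the same eigenvalues as the covariance matrix, without ever forming it explicitly (the data here is randomly generated for illustration):

```python
import numpy as np

# Hypothetical data: 100 samples x 5 features, samples in rows
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Rows of Vt are the principal directions; S**2 / (n - 1) equals the
# eigenvalues of the sample covariance matrix (np.cov divides by n - 1)
eigvals_svd = S**2 / (len(data) - 1)
eigvals_cov = np.sort(np.linalg.eigvalsh(np.cov(centered, rowvar=False)))[::-1]

print(np.allclose(eigvals_svd, eigvals_cov))  # True
```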