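PCA: Principles and a Python Implementation
Principles of PCA (Principal Component Analysis)
*Suppose we have m samples, each with n features.
*We can then build an input matrix X with m rows and n columns.
*The goal of PCA dimensionality reduction is to represent each sample, originally described by n dimensions, with only k dimensions, where k < n.
*I can no longer find my original notes, so I skip the mathematical derivation here; I will add it once I have a fresh understanding of my own.
*Recommended introductory material: the treatment of PCA in the "watermelon book" (Machine Learning) by Zhou Zhihua of Nanjing University, which covers most of the ideas in three pages.
In addition, for the maximum-variance interpretation of PCA, I recommend the article "主成分分析(Principal components analysis)-最大方差解释" (Principal components analysis: the maximum-variance interpretation).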
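For orientation only, the maximum-variance view can be stated compactly. This is the standard textbook formulation, not a reconstruction of the missing notes; it assumes the rows x_i of X have already been centered, and w and λ are symbols introduced here for the projection direction and the eigenvalue.

```latex
\begin{aligned}
\max_{w \in \mathbb{R}^n} \;& \frac{1}{m} \sum_{i=1}^{m} \left( w^\top x_i \right)^2
      = \frac{1}{m}\, w^\top X^\top X \, w \\
\text{subject to} \;& w^\top w = 1 \\
\Longrightarrow \;& X^\top X \, w = \lambda \, w ,
      \qquad \text{variance along } w = \frac{\lambda}{m}
\end{aligned}
```

The last line comes from setting the gradient of the Lagrangian to zero. The best single direction is therefore the eigenvector of X^T X with the largest eigenvalue, and the best k directions are the eigenvectors with the k largest eigenvalues.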
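PCA implementation steps
- Center X by subtracting the mean of each column, then compute covX = X.T.dot(X); the shape of covX is (n, n). Strictly speaking this is the scatter matrix, which equals the covariance matrix up to a constant factor, so both have the same eigenvectors.
- Eigendecompose covX to get the eigenvalues eigenValue and the eigenvectors eigenVector. The shape of eigenVector is (n, n), with one eigenvector per column.
- Sort eigenValue in descending order, take the k largest eigenvalues, and extract the corresponding eigenvectors from eigenVector to get selectVec, of shape (n, k).
- The reduced data matrix is X_lowdimension = X.dot(selectVec), of shape (m, k). When k is small (k = 2 or k = 3), you can plot this data to see what PCA has actually done.
- Reconstruct X from the reduced data matrix: X_reconstruct = X_lowdimension.dot(selectVec.T), of shape (m, n). Add the column means back to return to the original coordinates.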
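The first step uses X.T.dot(X) rather than a normalized covariance matrix. The sketch below, on made-up random data (the seed, sizes and variable names are assumptions for illustration), checks that the two differ only by the factor m - 1 and therefore share eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely illustrative data
# m = 200 samples, n = 3 features with clearly different spreads
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)          # center each feature first
scatter = Xc.T.dot(Xc)           # X.T.dot(X) on centered data, shape (n, n)
cov = np.cov(Xc, rowvar=False)   # sample covariance, divides by m - 1

# The two matrices differ only by the constant factor m - 1 ...
same_up_to_scale = np.allclose(scatter, (X.shape[0] - 1) * cov)

# ... so the eigenvectors agree up to sign; only the eigenvalues are rescaled.
val_s, vec_s = np.linalg.eigh(scatter)
val_c, vec_c = np.linalg.eigh(cov)
same_directions = np.allclose(np.abs(vec_s), np.abs(vec_c), atol=1e-8)

print(same_up_to_scale, same_directions)
```

Skipping the centering step is the more serious mistake: without it, the leading eigenvector tends to point toward the data mean instead of along the direction of largest variance.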
Core PCA algorithm in Python
import numpy as np


def meanX(dataX):
    """Return the per-feature mean of dataX.

    dataX: NumPy array of shape (m, n); rows are samples, columns are features.
    """
    return np.mean(dataX, axis=0)  # axis=0 averages down each column


def PCA(XMat, k):
    """Project XMat onto its first k principal components.

    Parameters:
    - XMat: NumPy array of shape (m, n); rows are samples, columns are features
    - k: number of eigenvectors (those with the largest eigenvalues) to keep
    Returns:
    - X_afterPCA: the low-dimensional data, shape (m, k)
    - recon_XMat: the data reconstructed in the original coordinates, shape (m, n)
    """
    XMat = np.asarray(XMat, dtype=float)
    m, n = XMat.shape
    if k > n:
        raise ValueError("k must not exceed the number of features")
    average = meanX(XMat)  # per-feature mean, shape (n,)
    XMat_Centralization = XMat - average  # subtract each feature's mean, shape (m, n)
    # Scatter matrix, shape (n, n). It equals (m - 1) * np.cov(XMat_Centralization.T),
    # so both matrices have the same eigenvectors.
    covX = XMat_Centralization.T.dot(XMat_Centralization)
    # covX is symmetric, so eigh returns real eigenvalues, shape (n,), and
    # orthonormal eigenvectors as the columns of eigenVec, shape (n, n).
    eigenVal, eigenVec = np.linalg.eigh(covX)
    index_EigenVal = np.argsort(-eigenVal)  # eigenvalue indices in descending order
    selectVec = eigenVec[:, index_EigenVal[:k]].T  # top-k eigenvectors as rows, shape (k, n)
    X_afterPCA = XMat_Centralization.dot(selectVec.T)  # n features become k, shape (m, k)
    recon_XMat = X_afterPCA.dot(selectVec) + average  # reconstruction, shape (m, n)
    return X_afterPCA, recon_XMat
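
One way to sanity-check an implementation like the one above is to watch the reconstruction error as k grows: it should never increase, and it should vanish at k = n. The block below is a self-contained sketch; pca_reconstruct is a hypothetical helper that mirrors the same steps, and the data is random toy data.

```python
import numpy as np


def pca_reconstruct(X, k):
    """Hypothetical helper: project centered X onto k components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    eigval, eigvec = np.linalg.eigh(Xc.T.dot(Xc))
    top = eigvec[:, np.argsort(-eigval)[:k]]  # top-k eigenvectors, shape (n, k)
    return Xc.dot(top).dot(top.T) + mean      # reconstruction, shape (m, n)


rng = np.random.default_rng(1)
# Correlated toy data: 100 samples, 5 features
X = rng.normal(size=(100, 5)).dot(rng.normal(size=(5, 5)))

# Frobenius-norm reconstruction error for k = 1, ..., 5
errors = [np.linalg.norm(X - pca_reconstruct(X, k)) for k in range(1, 6)]
for k, err in enumerate(errors, start=1):
    print(k, err)
```

If the error does not reach (numerically) zero at k = n, the usual suspects are a forgotten mean in the reconstruction or eigenvectors taken from rows instead of columns.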
About the optdigits dataset
Download: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
- Title of Database: Optical Recognition of Handwritten Digits
- Source:
  E. Alpaydin, C. Kaynak
  Department of Computer Engineering
  Bogazici University, 80815 Istanbul Turkey
  alpaydin@boun.edu.tr
  July 1998
- Past Usage:
  C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
  Applications to Handwritten Digit Recognition,
  MSc Thesis, Institute of Graduate Studies in Science and
  Engineering, Bogazici University.
  E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika,
  to appear. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z
- Relevant Information:
  We used preprocessing programs made available by NIST to extract
  normalized bitmaps of handwritten digits from a preprinted form. From
  a total of 43 people, 30 contributed to the training set and different
  13 to the test set. 32x32 bitmaps are divided into nonoverlapping
  blocks of 4x4 and the number of on pixels are counted in each block.
  This generates an input matrix of 8x8 where each element is an
  integer in the range 0…16. This reduces dimensionality and gives
  invariance to small distortions.
  For info on NIST preprocessing routines, see
  M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist,
  P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based
  Handprint Recognition System, NISTIR 5469, 1994.
- Number of Instances
  optdigits.tra Training 3823
  optdigits.tes Testing 1797
  The way we used the dataset was to use half of training for
  actual training, one-fourth for validation and one-fourth
  for writer-dependent testing. The test set was used for
  writer-independent testing and is the actual quality measure.
- Number of Attributes
  64 input+1 class attribute
- For Each Attribute:
  All input attributes are integers in the range 0…16.
  The last attribute is the class code 0…9
- Missing Attribute Values
  None
- Class Distribution
  Class: No of examples in training set
  0: 376
  1: 389
  2: 380
  3: 389
  4: 387
  5: 376
  6: 377
  7: 387
  8: 380
  9: 382
  Class: No of examples in testing set
  0: 178
  1: 182
  2: 177
  3: 183
  4: 181
  5: 182
  6: 181
  7: 179
  8: 174
  9: 180
- Accuracy on the testing set with k-nn
  using Euclidean distance as the metric
  k = 1 : 98.00
  k = 2 : 97.38
  k = 3 : 97.83
  k = 4 : 97.61
  k = 5 : 97.89
  k = 6 : 97.77
  k = 7 : 97.66
  k = 8 : 97.66
  k = 9 : 97.72
  k = 10 : 97.55
  k = 11 : 97.89
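
Given the layout described above (64 integer inputs followed by one class code per row), a loader might look like the sketch below. It assumes the file is comma-separated; load_optdigits is a hypothetical name, and the two rows fed to it are made up so that the example runs without the real file.

```python
import io

import numpy as np


def load_optdigits(source):
    """Hypothetical loader: return (features, labels) from an optdigits-style file.

    source may be a path such as "optdigits.tra" or an open text file.
    Assumes comma-separated rows of 64 inputs followed by the class code.
    """
    data = np.loadtxt(source, delimiter=",", dtype=int, ndmin=2)
    features = data[:, :64].astype(float)  # 64 block counts per digit
    labels = data[:, 64]                   # class code in the last column
    return features, labels


# Two made-up rows in the same layout, used here instead of the real file.
row_a = ",".join(["0"] * 64 + ["3"])
row_b = ",".join(["16"] * 64 + ["7"])
features, labels = load_optdigits(io.StringIO(row_a + "\n" + row_b + "\n"))

image = features[0].reshape(8, 8)  # one sample viewed as the 8x8 input matrix
print(features.shape, labels)
```

With the real file, the call would be load_optdigits("optdigits.tra"), where the path is wherever the download was saved.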
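case1: Reducing the optdigits.tra dataset with PCA
(1) Extract every row of optdigits.tra whose digit is '3'.
(2) Apply PCA to these rows, reducing the data to 2D.
(3) Draw the resulting 2D points as a scatter plot.
(4) Lay a regular 5×5 grid over the scatter plane and, for each grid node, find the nearest data point, giving 25 points in total.
(5) Draw the original data behind each of these points as a grayscale image.
(6) Compare the patterns shown by these 25 '3's.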
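
Step (4) is the least obvious one, so here is a minimal sketch of it. It assumes the grid spans the bounding box of the 2D points, which the original does not specify; nearest_to_grid is a hypothetical helper, and random points stand in for the 2D PCA output of the 389 training rows labelled '3'.

```python
import numpy as np


def nearest_to_grid(points_2d, grid_size=5):
    """Hypothetical helper: for each grid node, the index of the closest data point."""
    lo, hi = points_2d.min(axis=0), points_2d.max(axis=0)
    gx = np.linspace(lo[0], hi[0], grid_size)
    gy = np.linspace(lo[1], hi[1], grid_size)
    nodes = np.array([(x, y) for y in gy for x in gx])  # (grid_size**2, 2)
    # Distances between every grid node and every data point, shape (25, m)
    dists = np.linalg.norm(nodes[:, None, :] - points_2d[None, :, :], axis=2)
    return nodes, dists.argmin(axis=1)


rng = np.random.default_rng(2)
points_2d = rng.normal(size=(389, 2))  # stand-in for the 2D PCA output
nodes, picked = nearest_to_grid(points_2d)

print(picked.reshape(5, 5))  # row indices of the selected samples, laid out as the grid
```

Each index in picked selects a row of the original 64-dimensional data, which can be reshaped to 8×8 and drawn as a grayscale image for step (5). The same sample may be chosen for more than one node when the points are sparse near the edges of the grid.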
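case2: Reducing point cloud data with PCA
In 3D vision, a point cloud is a dataset made up of a large number of 3D points in space; each point consists of three coordinates x, y and z.
Point cloud scanned by a structured-light system:
[Figure: point cloud of a shoe scanned by a structured-light system]
We apply PCA to the shoe point cloud above, with the following result:
[Figure: the shoe point cloud after PCA]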
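
The shoe scan itself is not available here, so the following is only a sketch on a synthetic point cloud. The cloud, its rotation and all variable names are assumptions for illustration. It shows how PCA recovers the main axes of an elongated 3D shape and flattens it onto its dominant plane.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic cloud: long along one direction, thinner along the other two.
cloud = rng.normal(size=(2000, 3)) * np.array([10.0, 3.0, 0.5])

# Rotate by 45 degrees about z so the long axis is not aligned with x or y,
# then shift the cloud away from the origin.
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
cloud = cloud.dot(rot.T) + np.array([5.0, -2.0, 1.0])

centered = cloud - cloud.mean(axis=0)
eigval, eigvec = np.linalg.eigh(centered.T.dot(centered))
order = np.argsort(-eigval)
axes = eigvec[:, order]           # columns: principal axes, longest first
flat = centered.dot(axes[:, :2])  # shape (2000, 2): the cloud in its main plane

print(axes[:, 0])  # close to +/-(0.707, 0.707, 0), the direction of the long axis
```

For a real scan, the same projection gives a 2D view aligned with the object's longest and second-longest extents, regardless of how the scanner's coordinate frame was oriented.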