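PCA (Principal Component Analysis) dimensionality reduction: how it works, with a Python implementation on the optdigits and point-cloud datasets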

How PCA (Principal Component Analysis) works

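*Suppose we have m samples, each with n features.
*We can then build an input matrix X with m rows and n columns.
*The goal of PCA is to represent each sample, originally described by n dimensions, with only k dimensions, where k<n.
*I can no longer find my original notes on the mathematical derivation, so I will not go through it here; I will add it once I have a fresh understanding of my own.
*Recommended introductory material: the PCA section of Zhou Zhihua's (Nanjing University) "watermelon book", which explains most of the ideas in three pages.
In addition, for the maximum-variance interpretation of PCA, I recommend the article "主成分分析(Principal components analysis)-最大方差解释" (Principal Components Analysis: the Maximum-Variance Interpretation).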
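
For orientation only, the maximum-variance view mentioned above is usually stated as the optimization below. This is the standard textbook formulation, not the author's own derivation; the symbols X_c (the column-centered X), S, w and λ are notation introduced here for illustration.

```latex
% First principal direction: the unit vector along which the projected data has maximal variance
\max_{w}\; w^{\top} S\, w \quad \text{subject to } w^{\top} w = 1,
\qquad S = \frac{1}{m} X_c^{\top} X_c

% Setting the gradient of the Lagrangian  w^T S w - \lambda (w^T w - 1)  to zero gives
S\, w = \lambda\, w, \qquad w^{\top} S\, w = \lambda
```

So the best direction is the eigenvector of S with the largest eigenvalue, and the next directions are the eigenvectors with the next largest eigenvalues. Dropping the factor 1/m scales the eigenvalues but leaves the eigenvectors unchanged.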

PCA implementation steps

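  1. Center X by subtracting each column's mean, then compute covX = X.T.dot(X); the shape of covX is (n,n). This equals the covariance matrix up to a constant factor, which does not change the eigenvectors.
  2. Eigendecompose covX to obtain the eigenvalues eigenValue and the eigenvectors eigenVector. The shape of eigenVector is (n,n), with one eigenvector per column.
  3. Sort eigenValue in descending order, take the k largest eigenvalues, and extract their eigenvectors from eigenVector to form selectVec, of shape (n,k).
  4. The reduced data matrix is X_lowdimension=X.dot(selectVec), of shape (m,k).
    When k is small (k=2 or k=3), you can plot this data to see what PCA actually does.
  5. Reconstruct X from the reduced data matrix: X_reconstruct=X_lowdimension.dot(selectVec.T), of shape (m,n). Add the column means back to return to the original coordinates.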
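
The five steps can be traced on a small random matrix to check the shapes along the way. This is a minimal sketch, not the author's code: the function name `pca_steps` and the toy data are assumptions, and it uses `np.linalg.eigh` because covX is symmetric.

```python
import numpy as np

def pca_steps(X, k):
    """Follow the five steps on an (m, n) array X, keeping k components."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # center each feature
    covX = Xc.T @ Xc                            # step 1: (n, n)
    eig_val, eig_vec = np.linalg.eigh(covX)     # step 2: eigenvectors are columns
    order = np.argsort(eig_val)[::-1][:k]       # step 3: indices of the k largest
    select_vec = eig_vec[:, order]              # (n, k)
    X_low = Xc @ select_vec                     # step 4: (m, k)
    X_rec = X_low @ select_vec.T + mean         # step 5: (m, n)
    return X_low, X_rec

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # toy data: m=100, n=5

X_low, X_rec = pca_steps(X, 2)                  # lossy: only 2 of 5 directions kept
X_full, X_full_rec = pca_steps(X, 5)            # k = n keeps everything
print(X_low.shape, X_rec.shape)
print(np.allclose(X_full_rec, X))
```

With k equal to n, the selected eigenvectors form an orthogonal matrix, so the reconstruction recovers X up to floating-point error. With k smaller than n, the reconstruction is only an approximation.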

Python implementation of the core PCA algorithm

import numpy as np

'''
#meanX(dataX)
#Function: compute the mean of each column of the input dataX
#input parameter: numpy array "dataX"; rows are samples, columns are features
'''
def meanX(dataX):
    return np.mean(dataX, axis=0)  # axis=0 averages over the rows, giving one mean per column



'''
#PCA(XMat, k)
#Function: run PCA on the input matrix XMat, keeping the first k principal components
#input parameter:
    - XMat: numpy array "XMat"; rows are samples, columns are features
    - k: number of eigenvectors to keep (those with the k largest eigenvalues)
#return:
    - X_afterPCA: the low-dimensional matrix for the given k, shape (m,k)
    - recon_XMat: the data reconstructed from X_afterPCA, back in the original coordinates, shape (m,n)
'''
def PCA(XMat, k):
    XMat = np.asarray(XMat, dtype=float)
    m, n = np.shape(XMat)
    if k > n:
        raise ValueError("k must not exceed the number of features")
    average = meanX(XMat)  # column means of XMat, shape (n,)
    XMat_Centralization = XMat - average  # preprocessing: subtract each feature's mean (broadcast over rows), shape (m,n)
    #covX = np.cov(XMat_Centralization.T)  # covariance matrix, shape (n,n); differs from the line below only by the factor 1/(m-1)
    covX = XMat_Centralization.T.dot(XMat_Centralization)  # if X has shape (m,n), covX = X.T*X has shape (n,n)
    # covX is symmetric, so eigh returns real eigenvalues, shape (n,), and eigenvectors as columns, shape (n,n)
    eigenVal, eigenVec = np.linalg.eigh(covX)
    index_EigenVal = np.argsort(-eigenVal)  # indices that sort eigenVal in descending order

    # eigenVec.T holds one eigenvector per row; select the top k rows to build selectVec, shape (k,n)
    selectVec = eigenVec.T[index_EigenVal[:k]]
    X_afterPCA = XMat_Centralization.dot(selectVec.T)  # project onto the k components: n features become k features per sample, shape (m,k)
    recon_XMat = X_afterPCA.dot(selectVec) + average  # reconstruct XMat from X_afterPCA, shape (m,n)
    return X_afterPCA, recon_XMat
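
One way to sanity-check an eigendecomposition-based PCA like the one above is to compare it with a singular value decomposition of the centered data. The sketch below is a swapped-in technique (SVD), not the author's method; `pca_svd` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def pca_svd(X, k):
    """PCA through the SVD of the centered data (an alternative to eig on X.T X)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                        # (k, n)
    X_low = Xc @ components.T                  # (m, k)
    X_rec = X_low @ components + mean          # (m, n)
    return X_low, X_rec, S

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # correlated toy features
X_low, X_rec, S = pca_svd(X, 2)

# The eigenvalues of Xc.T @ Xc are the squared singular values of Xc.
Xc = X - X.mean(axis=0)
eig_desc = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]

# The squared reconstruction error equals the sum of the discarded eigenvalues.
err = np.sum((X - X_rec) ** 2)
discarded = np.sum(S[2:] ** 2)
print(np.allclose(eig_desc, S ** 2), np.isclose(err, discarded))
```

The second check gives a practical reading of k: the information lost by keeping k components is exactly the total of the eigenvalues that were dropped.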

About the optdigits dataset

Download: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

  1. Title of Database: Optical Recognition of Handwritten Digits

  2. Source:
    E. Alpaydin, C. Kaynak
    Department of Computer Engineering
    Bogazici University, 80815 Istanbul Turkey
    alpaydin@boun.edu.tr
    July 1998

  3. Past Usage:
    C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition,
    MSc Thesis, Institute of Graduate Studies in Science and
    Engineering, Bogazici University.

    E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika,
    to appear. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z

  4. Relevant Information:
    We used preprocessing programs made available by NIST to extract
    normalized bitmaps of handwritten digits from a preprinted form. From
    a total of 43 people, 30 contributed to the training set and different
    13 to the test set. 32x32 bitmaps are divided into nonoverlapping
    blocks of 4x4 and the number of on pixels are counted in each block.
    This generates an input matrix of 8x8 where each element is an
    integer in the range 0…16. This reduces dimensionality and gives
    invariance to small distortions.

    For info on NIST preprocessing routines, see
    M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist,
    P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based
    Handprint Recognition System, NISTIR 5469, 1994.

  5. Number of Instances
    optdigits.tra Training 3823
    optdigits.tes Testing 1797

    The way we used the dataset was to use half of training for
    actual training, one-fourth for validation and one-fourth
    for writer-dependent testing. The test set was used for
    writer-independent testing and is the actual quality measure.

  6. Number of Attributes
    64 input+1 class attribute

  7. For Each Attribute:
    All input attributes are integers in the range 0…16.
    The last attribute is the class code 0…9

  8. Missing Attribute Values
    None

  9. Class Distribution
    Class: No of examples in training set
    0: 376
    1: 389
    2: 380
    3: 389
    4: 387
    5: 376
    6: 377
    7: 387
    8: 380
    9: 382

    Class: No of examples in testing set
    0: 178
    1: 182
    2: 177
    3: 183
    4: 181
    5: 182
    6: 181
    7: 179
    8: 174
    9: 180
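
The description above implies two concrete operations: counting "on" pixels in 4x4 blocks to turn a 32x32 bitmap into an 8x8 matrix, and reading rows of 64 input attributes followed by one class code. The sketch below illustrates both under stated assumptions: the helper names `block_counts` and `load_optdigits` are hypothetical, the rows are assumed to be comma-separated, and the sample rows are synthetic stand-ins, not real optdigits data.

```python
import io
import numpy as np

def block_counts(bitmap):
    """Count the on pixels in each non-overlapping 4x4 block of a 32x32 0/1 bitmap."""
    # Axes after reshape: (block row, row in block, block column, column in block).
    return bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

def load_optdigits(source):
    """Read rows of 64 integer inputs plus 1 class code (assumed comma-separated)."""
    data = np.atleast_2d(np.loadtxt(source, delimiter=",", dtype=int))
    return data[:, :64], data[:, 64]

# An all-ones bitmap has 16 on pixels in every block, the top of the 0..16 range.
full = block_counts(np.ones((32, 32), dtype=int))

# Synthetic rows in the same layout; pass a path to optdigits.tra instead for real data.
rng = np.random.default_rng(0)
labels = np.array([[3], [1], [3], [7], [3], [0]])
fake_rows = np.hstack([rng.integers(0, 17, size=(6, 64)), labels])
text = "\n".join(",".join(str(v) for v in row) for row in fake_rows)

X, y = load_optdigits(io.StringIO(text))
threes = X[y == 3]                      # all rows whose class code is 3
first_image = threes[0].reshape(8, 8)   # back to the 8x8 layout for display
print(X.shape, threes.shape, first_image.shape)
```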

Accuracy on the testing set with k-nn
using Euclidean distance as the metric

k = 1 : 98.00
k = 2 : 97.38
k = 3 : 97.83
k = 4 : 97.61
k = 5 : 97.89
k = 6 : 97.77
k = 7 : 97.66
k = 8 : 97.66
k = 9 : 97.72
k = 10 : 97.55
k = 11 : 97.89
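
The accuracies above come from a k-nearest-neighbour classifier with Euclidean distance. As a rough illustration of what such a classifier does, here is a minimal sketch on toy 2D data; it is not the code behind the reported numbers. The name `knn_predict` and the tie-breaking rule (the smallest label wins) are assumptions.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    """Predict each test label by majority vote among the k nearest training samples."""
    # Squared Euclidean distances, shape (n_test, n_train); the ranking matches true distances.
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]     # indices of the k closest training samples
    votes = train_y[nearest]                    # (n_test, k) neighbour labels
    # bincount + argmax picks the most frequent label; ties go to the smallest label.
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy data: two well-separated clusters with labels 0 and 1.
train_X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
test_X = np.array([[0.5, 0.5], [10.5, 10.5]])
test_y = np.array([0, 1])

pred = knn_predict(train_X, train_y, test_X, k=3)
accuracy = np.mean(pred == test_y)
print(pred, accuracy)
```

On optdigits, train_X and test_X would be the 64-dimensional rows of optdigits.tra and optdigits.tes, or their PCA-reduced versions.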

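case1: reducing the optdigits.tra dataset with PCA

(1) Extract all rows for the digit '3' from optdigits.tra
(2) Apply PCA to them, reducing the data to 2D
(3) Plot the resulting 2D points on the plane
(4) Lay a 5×5 grid over the scatter plot and find the sample nearest to each of the 25 grid points
(5) Draw the original data of those samples as grayscale images
(6) Compare the patterns shown by these 25 '3's
[Figure]
[Figure]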
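
Steps (2), (4) and (5) can be sketched as follows. This is an illustration under assumptions, not the author's script: the helper names `project_2d` and `nearest_to_grid` are hypothetical, the grid is assumed to span the minimum and maximum of each projected axis, and random integers stand in for the real '3' rows.

```python
import numpy as np

def project_2d(X):
    """Center X and project it onto its two leading principal directions."""
    Xc = X - X.mean(axis=0)
    eig_val, eig_vec = np.linalg.eigh(Xc.T @ Xc)        # eigenvalues in ascending order
    top2 = eig_vec[:, np.argsort(eig_val)[::-1][:2]]    # (n, 2)
    return Xc @ top2

def nearest_to_grid(points_2d, grid_size=5):
    """Return the grid points and, for each one, the index of the nearest sample."""
    xs = np.linspace(points_2d[:, 0].min(), points_2d[:, 0].max(), grid_size)
    ys = np.linspace(points_2d[:, 1].min(), points_2d[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])    # (grid_size**2, 2)
    d2 = ((grid[:, None, :] - points_2d[None, :, :]) ** 2).sum(axis=2)
    return grid, d2.argmin(axis=1)

# Synthetic stand-in for the '3' rows: 200 samples, 64 features in 0..16.
rng = np.random.default_rng(0)
threes = rng.integers(0, 17, size=(200, 64)).astype(float)

points = project_2d(threes)                 # step (2): 2D coordinates
grid, idx = nearest_to_grid(points)         # step (4): 25 nearest samples
images = threes[idx].reshape(-1, 8, 8)      # step (5): original rows as 8x8 images
print(points.shape, grid.shape, images.shape)
```

The scatter plot and the 25 grayscale images could then be drawn with matplotlib's `scatter` and `imshow`. Note that one sample can be the nearest neighbour of more than one grid point, so the 25 images are not guaranteed to be distinct.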

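case2: reducing point-cloud data with PCA

In 3D vision, a point cloud is a dataset made up of a large number of 3D points in space; each point consists of the three coordinates x, y, z.
A point cloud scanned by a structured-light system:
[Figure]
We apply PCA to the shoe point cloud above; the result is shown below:
[Figure]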
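
For a point cloud the input matrix simply has n=3 columns, so the covariance matrix is 3×3 and its eigenvectors are the cloud's principal axes. Since the original scan and result are not available here, the sketch below uses a synthetic elongated cloud; the function name `pca_point_cloud` and all of the data are assumptions.

```python
import numpy as np

def pca_point_cloud(points, k=2):
    """Return the cloud projected onto its k main axes, plus all axes and variances."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / (len(points) - 1)     # 3x3 sample covariance
    eig_val, eig_vec = np.linalg.eigh(cov)              # ascending eigenvalues
    order = np.argsort(eig_val)[::-1]                   # reorder to descending
    eig_val, eig_vec = eig_val[order], eig_vec[:, order]
    return centered @ eig_vec[:, :k], eig_vec, eig_val

# Synthetic cloud: long in one direction, flat in another, then rotated and shifted.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1000, 3)) * np.array([5.0, 2.0, 0.5])
theta = np.pi / 6
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
cloud = cloud @ rot_z.T + np.array([10.0, -3.0, 2.0])

projected, axes, variances = pca_point_cloud(cloud)
print(projected.shape)      # 2D coordinates in the plane of the two main axes
print(variances)            # variance along each principal axis, largest first
```

The first axis follows the direction in which the cloud is most spread out, and the last axis is the direction of least spread, which for a thin or flat object is close to its surface normal.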
