Implementing PCA via SVD

After learning PCA, I implemented the algorithm myself with numpy and scipy instead of using the ready-made sklearn package. The principle is the same as sklearn's PCA: both are based on SVD. You can paste the code blocks below in order and run them yourself.
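For reference, the core identity behind the SVD approach: if X_c is the column-centered data matrix and X_c = U S Vᵀ is its (thin) SVD, then the PCA scores for the top k components are the first k columns of U·S, which equal the first k columns of X_c·V. A minimal numpy sketch of that identity on toy data (not from the post):

import numpy as np

X = np.random.rand(10, 5)
Xc = X - X.mean(axis=0)               # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(U * s, Xc @ Vt.T))  # scores U·diag(s) == Xc·V, prints True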

Import the required packages

import time  # for timing how long the algorithm runs
import gc    # transform() below calls gc.collect()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used for the variance plots later
from scipy import linalg  # scipy.linalg must be imported explicitly

# keep numpy from printing in scientific notation
np.set_printoptions(suppress=True)

The main PCA code

class PCA(object):
    """
    Principal component analysis (PCA).

    Linear dimensionality reduction using Singular Value Decomposition of
    the data to project it to a lower-dimensional space. The input data is
    centered but not scaled for each feature before applying the SVD.

    Parameters
    ----------
    n_component_ratio : float
        Keep the smallest number of components such that the cumulative
        explained variance ratio is at least n_component_ratio.

    Methods
    -------
    fit_transform(x) : fit the model by computing the full SVD of the
        training data x and return the projected training data.
    transform(y) : apply the dimensionality reduction learned during
        fit_transform to new data y (e.g. the test set).
    """

    def __init__(self, n_component_ratio):
        self.n_component_ratio = n_component_ratio

    def fit_transform(self, x):
        n_samples, n_features = x.shape
        self.__mean = np.mean(x, axis=0)                 # stored so transform() can reuse it
        x_center = np.asarray(x - self.__mean)           # center the data
        U, s, Vt = linalg.svd(x_center, full_matrices=False)
        self.__Vt = Vt
        explained_variance = (s ** 2) / (n_samples - 1)  # variance explained by each singular value
        explained_variance_ratio = explained_variance / explained_variance.sum()
        # find how many components we need to reach the requested ratio
        self.n_components = self.__choose_ratio(explained_variance_ratio)
        # projected training data: U_k * diag(s_k), which equals x_center @ V[:, :k]
        x_compressed = U[:, :self.n_components].dot(np.diag(s[:self.n_components]))
        # expose the kept ratios as an attribute so we can plot them later
        self.explained_variance_ratio = explained_variance_ratio[:self.n_components]
        del x_center, explained_variance, U, s, Vt       # drop references to free memory
        return x_compressed

    def __choose_ratio(self, explained_r):
        # smallest k whose cumulative explained variance ratio reaches the target
        # (explained_r already sums to 1, so no extra normalization is needed)
        for i in range(1, len(explained_r) + 1):
            if explained_r[:i].sum() >= self.n_component_ratio:
                return i

    def transform(self, y):
        # center new data with the *training* mean, then project it onto the
        # training components; Vt has orthonormal rows, so its inverse is Vt.T
        y_center = np.asarray(y - self.__mean)
        y_compressed = y_center.dot(self.__Vt.T[:, :self.n_components])
        del y_center  # drop the reference and release memory
        gc.collect()
        return y_compressed
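As a quick sanity check (toy data, not from the original post): because fit_transform stores the training mean and Vt, calling transform on the training data itself should reproduce fit_transform's output exactly.

rng = np.random.default_rng(0)        # hypothetical random data for the check
X_demo = rng.normal(size=(100, 20))
pca_demo = PCA(n_component_ratio=0.95)
Z_fit = pca_demo.fit_transform(X_demo)
Z_new = pca_demo.transform(X_demo)
print(np.allclose(Z_fit, Z_new))      # expect True
print(Z_fit.shape)                    # (100, k), k = components kept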

Let's find a dataset to test it on:

from sklearn.datasets import load_digits # the classic handwritten-digits classification dataset
from sklearn.model_selection import train_test_split

digits= load_digits()
digits.target=pd.DataFrame(digits.target)
digits.data=pd.DataFrame(digits.data)
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.4, random_state=1)

The dataset originally has 64 features.
Now use the hand-written PCA to reduce the dimensionality:

start_time = time.time()

pca = PCA(n_component_ratio=0.95) # only a fractional explained-variance ratio is supported here, not an integer component count
X_train_pca = pca.fit_transform(X_train)
X_test_pca=pca.transform(X_test)
pca_n=X_train_pca.shape[1]
print("keep {} features after pca \nthe shape of X_train after PCA: {} \nthe shape of X_test after PCA: {}".format(pca_n,X_train_pca.shape,X_test_pca.shape))

print("--- %s seconds ---" % (time.time() - start_time))


Reducing the dimensionality while keeping 95% of the explained variance leaves 28 features in the end.
Let's visualize it:

var=np.cumsum(np.round(pca.explained_variance_ratio, decimals=3)*100)  # calculate the cumulative explained variance ratio

plt.plot(list(range(1,len(var)+1)), var)
plt.title("PCA Variance Explained Based on Custom PCA")
plt.ylabel("% Variance Explained")
plt.xlabel("# Of Features")
plt.show()

After PCA, the first two rows of the reduced data:

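You can print them directly from the run above:

print(X_train_pca[:2])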

Now compare with sklearn's built-in PCA:

from sklearn.decomposition import PCA as SklearnPCA  # aliased so it does not shadow our PCA class
start_time = time.time()

pca2 = SklearnPCA(n_components=0.95, copy=True, whiten=False)
X_train_pca2 = pca2.fit_transform(X_train)
X_test_pca2=pca2.transform(X_test)
pca_n2=X_train_pca2.shape[1]
print("keep {} features after pca \nthe shape of X_train after PCA: {} \nthe shape of X_test after PCA: {}".format(pca_n2,X_train_pca2.shape,X_test_pca2.shape))

print("--- %s seconds ---" % (time.time() - start_time))

Visualize the cumulative explained variance curve:

var=np.cumsum(np.round(pca2.explained_variance_ratio_, decimals=3)*100)  # calculate the cumulative explained variance ratio

plt.plot(list(range(1,len(var)+1)), var)
plt.title("PCA Variance Explained Based on sklearn PCA")
plt.ylabel("% Variance Explained")
plt.xlabel("# Of Features")
plt.show()

Comparing the two, the hand-written PCA above produces the same dimensionality-reduction results as sklearn's PCA, and sklearn's PCA is slightly faster.
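One caveat when checking that the results really match: each principal component is only determined up to its sign, so compare absolute values. A quick check, assuming the variables from the two runs above:

print(np.allclose(np.abs(X_train_pca), np.abs(X_train_pca2)))  # should print True if both kept the same number of components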

The main algorithm code is up on GitHub; feel free to take a look if you are interested:
https://github.com/JuneYaooo/ml-algorithms
