Implementing PCA via SVD

After learning PCA, I implemented the algorithm myself with numpy and scipy instead of using the ready-made sklearn package. The principle is the same as sklearn's PCA: both are based on SVD. You can paste the code blocks below in order and run them yourself.
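For reference, the core identity behind the SVD approach: if X_c is the column-centered data matrix and X_c = U S Vᵀ is its (thin) SVD, then the PCA scores for the top k components are the first k columns of U·S, which equal the first k columns of X_c·V. A minimal numpy sketch of that identity on toy data (not from the post):

import numpy as np

X = np.random.rand(10, 5)
Xc = X - X.mean(axis=0)               # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(U * s, Xc @ Vt.T))  # scores U·diag(s) == Xc·V, prints True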

Import the required packages

import time  # for timing how long the algorithm runs
import gc    # transform() below calls gc.collect()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used for the variance plots later
from scipy import linalg  # scipy.linalg must be imported explicitly

# keep numpy from printing in scientific notation
np.set_printoptions(suppress=True)

The main PCA code

class PCA(object):
    """
    Principal component analysis (PCA).

    Linear dimensionality reduction using Singular Value Decomposition of
    the data to project it to a lower-dimensional space. The input data is
    centered but not scaled for each feature before applying the SVD.

    Parameters
    ----------
    n_component_ratio : float
        Keep the smallest number of components such that the cumulative
        explained variance ratio is at least n_component_ratio.

    Methods
    -------
    fit_transform(x) : fit the model by computing the full SVD of the
        training data x and return the projected training data.
    transform(y) : apply the dimensionality reduction learned during
        fit_transform to new data y (e.g. the test set).
    """

    def __init__(self, n_component_ratio):
        self.n_component_ratio = n_component_ratio

    def fit_transform(self, x):
        n_samples, n_features = x.shape
        self.__mean = np.mean(x, axis=0)                 # stored so transform() can reuse it
        x_center = np.asarray(x - self.__mean)           # center the data
        U, s, Vt = linalg.svd(x_center, full_matrices=False)
        self.__Vt = Vt
        explained_variance = (s ** 2) / (n_samples - 1)  # variance explained by each singular value
        explained_variance_ratio = explained_variance / explained_variance.sum()
        # find how many components we need to reach the requested ratio
        self.n_components = self.__choose_ratio(explained_variance_ratio)
        # projected training data: U_k * diag(s_k), which equals x_center @ V[:, :k]
        x_compressed = U[:, :self.n_components].dot(np.diag(s[:self.n_components]))
        # expose the kept ratios as an attribute so we can plot them later
        self.explained_variance_ratio = explained_variance_ratio[:self.n_components]
        del x_center, explained_variance, U, s, Vt       # drop references to free memory
        return x_compressed

    def __choose_ratio(self, explained_r):
        # smallest k whose cumulative explained variance ratio reaches the target
        # (explained_r already sums to 1, so no extra normalization is needed)
        for i in range(1, len(explained_r) + 1):
            if explained_r[:i].sum() >= self.n_component_ratio:
                return i

    def transform(self, y):
        # center new data with the *training* mean, then project it onto the
        # training components; Vt has orthonormal rows, so its inverse is Vt.T
        y_center = np.asarray(y - self.__mean)
        y_compressed = y_center.dot(self.__Vt.T[:, :self.n_components])
        del y_center  # drop the reference and release memory
        gc.collect()
        return y_compressed
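As a quick sanity check (toy data, not from the original post): because fit_transform stores the training mean and Vt, calling transform on the training data itself should reproduce fit_transform's output exactly.

rng = np.random.default_rng(0)        # hypothetical random data for the check
X_demo = rng.normal(size=(100, 20))
pca_demo = PCA(n_component_ratio=0.95)
Z_fit = pca_demo.fit_transform(X_demo)
Z_new = pca_demo.transform(X_demo)
print(np.allclose(Z_fit, Z_new))      # expect True
print(Z_fit.shape)                    # (100, k), k = components kept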

Let's find a dataset to test it on:

from sklearn.datasets import load_digits # the classic handwritten-digits classification dataset
from sklearn.model_selection import train_test_split

digits= load_digits()
digits.target=pd.DataFrame(digits.target)
digits.data=pd.DataFrame(digits.data)
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.4, random_state=1)

The dataset originally has 64 features.
Now use the hand-written PCA to reduce the dimensionality:

start_time = time.time()

pca = PCA(n_component_ratio=0.95) # only a fractional explained-variance ratio is supported here, not an integer component count
X_train_pca = pca.fit_transform(X_train)
X_test_pca=pca.transform(X_test)
pca_n=X_train_pca.shape[1]
print("keep {} features after pca \nthe shape of X_train after PCA: {} \nthe shape of X_test after PCA: {}".format(pca_n,X_train_pca.shape,X_test_pca.shape))

print("--- %s seconds ---" % (time.time() - start_time))


Reducing the dimensionality while keeping 95% of the explained variance leaves 28 features in the end.
Let's visualize it:

var=np.cumsum(np.round(pca.explained_variance_ratio, decimals=3)*100)  # calculate the cumulative explained variance ratio

plt.plot(list(range(1,len(var)+1)), var)
plt.title("PCA Variance Explained Based on Custom PCA")
plt.ylabel("% Variance Explained")
plt.xlabel("# Of Features")
plt.show()

After PCA, the first two rows of the reduced data:

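You can print them directly from the run above:

print(X_train_pca[:2])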

Now compare with sklearn's built-in PCA:

from sklearn.decomposition import PCA as SklearnPCA  # aliased so it does not shadow our PCA class
start_time = time.time()

pca2 = SklearnPCA(n_components=0.95, copy=True, whiten=False)
X_train_pca2 = pca2.fit_transform(X_train)
X_test_pca2=pca2.transform(X_test)
pca_n2=X_train_pca2.shape[1]
print("keep {} features after pca \nthe shape of X_train after PCA: {} \nthe shape of X_test after PCA: {}".format(pca_n2,X_train_pca2.shape,X_test_pca2.shape))

print("--- %s seconds ---" % (time.time() - start_time))

Visualize the cumulative explained variance curve:

var=np.cumsum(np.round(pca2.explained_variance_ratio_, decimals=3)*100)  # calculate the cumulative explained variance ratio

plt.plot(list(range(1,len(var)+1)), var)
plt.title("PCA Variance Explained Based on sklearn PCA")
plt.ylabel("% Variance Explained")
plt.xlabel("# Of Features")
plt.show()

Comparing the two, the hand-written PCA above produces the same dimensionality-reduction results as sklearn's PCA, and sklearn's PCA is slightly faster.
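One caveat when checking that the results really match: each principal component is only determined up to its sign, so compare absolute values. A quick check, assuming the variables from the two runs above:

print(np.allclose(np.abs(X_train_pca), np.abs(X_train_pca2)))  # should print True if both kept the same number of components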

The main algorithm code is up on GitHub; feel free to take a look if you are interested:
https://github.com/JuneYaooo/ml-algorithms
