PCA: Algorithm Principles and Practice

Latest recommended article published on 2021-09-12 10:22:54

Johnson0722

Category: Machine Learning. Tags: machine learning, PCA, principal component analysis, dimensionality reduction

Article link: https://blog.csdn.net/John_xyz/article/details/87007145

The Concept of PCA

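PCA (principal component analysis) is a common technique for feature dimensionality reduction. It is a linear transformation that maps the data into a new coordinate system such that the projected data has its largest variance along the first coordinate (called the first principal component), its second largest variance along the second coordinate (the second principal component), and so on. PCA is often used to reduce the dimensionality of a dataset while retaining the features that contribute most to its variance. This is done by keeping the low-order principal components and discarding the high-order ones, since the low-order components usually capture the most important aspects of the data.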
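The "largest variance first" property can be written as an optimization problem. The following is a sketch, assuming the data have been centered and that $\Sigma$ denotes their sample covariance matrix; the symbols $w_1$, $\Sigma$ and $x$ are introduced here for illustration and do not appear in the original text.

```latex
w_1 = \arg\max_{\|w\|_2 = 1} w^{\top} \Sigma\, w,
\qquad
\Sigma\, w_1 = \lambda_1 w_1,
\qquad
\operatorname{Var}\!\left(w_1^{\top} x\right) = \lambda_1
```

That is, the first principal direction is the unit eigenvector of $\Sigma$ with the largest eigenvalue $\lambda_1$, and $\lambda_1$ is the variance of the data projected onto it. Later components solve the same problem restricted to directions orthogonal to the earlier ones.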

The PCA Algorithm

Input: $m$ samples, each in an $n$-dimensional space, $X_i=(x_{i1}, x_{i2}, ..., x_{in})$, and the target dimension $k$
Output: the reduced sample set $Y$

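1. Standardize the features to balance their scales:
$x_{ij} = \frac{x_{ij} - \mu_j}{s_j}$, where $\mu_j$ is the mean of feature $j$ and $s_j$ is the standard deviation of feature $j$
2. Compute the sample covariance matrix
3. Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors
4. Arrange the eigenvectors as the rows of a matrix, ordered from top to bottom by decreasing eigenvalue, and take the first $k$ rows to form the $k \times n$ matrix $P$
5. $Y = P X$ is the reduced $k$-dimensional data, where $X$ here is the $n \times m$ matrix whose columns are the standardized samples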
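The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `pca_eig` and its `scale` flag are hypothetical, samples are stored as rows (so the projection is written `Xc @ P.T`, the transpose of $Y = PX$), and scaling is switched off in the example so that only centering is applied.

```python
import numpy as np

def pca_eig(X, k, scale=False):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix.

    X has shape (m, n), one sample per row. Returns the projected data
    Y (m, k), the projection matrix P (k, n), and all eigenvalues (descending).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    if scale:
        # Optional standardization; assumes no feature is constant (std > 0).
        Xc = Xc / X.std(axis=0, ddof=1)
    C = np.cov(Xc, rowvar=False)            # (n, n) sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]       # indices sorted by decreasing eigenvalue
    eigvals = eigvals[order]
    P = eigvecs[:, order[:k]].T             # top-k eigenvectors as rows, shape (k, n)
    Y = Xc @ P.T                            # row-wise form of Y = P X, shape (m, k)
    return Y, P, eigvals

X = np.array([[-1, -1, 1], [-2, -1, 2], [-3, -2, 3],
              [1, 1, 1], [2, 1, 2], [3, 2, 3]])
Y, P, eigvals = pca_eig(X, k=2)
print(eigvals)                # roughly [7.94, 0.8, 0.06]
print(Y.var(axis=0, ddof=1))  # variance of each projected coordinate
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by the variance they capture. The sign of each eigenvector is arbitrary, so rows of `P` may differ in sign from what other libraries return.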

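Notes:
Sometimes we cannot specify the reduced dimension $k$ in advance. In that case we can instead specify a threshold $t$ in $(0, 1]$ for the proportion of variance that the retained principal components must account for. If the $n$ eigenvalues are $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge ... \ge \lambda_n$, then $k$ is the smallest value that satisfies:
$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{n}\lambda_i} \ge t$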
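As a rough sketch of this threshold rule, the smallest number of components whose cumulative eigenvalue share reaches $t$ can be found with a cumulative sum. The helper name `choose_k` is hypothetical, and the eigenvalues below are simply the covariance eigenvalues of the small dataset used in the practice section.

```python
import numpy as np

def choose_k(eigvals, t):
    """Hypothetical helper: smallest k whose cumulative variance ratio is >= t."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    ratio = np.cumsum(lam) / lam.sum()                     # cumulative share of variance
    k = int(np.searchsorted(ratio, t, side="left")) + 1    # first index with ratio >= t
    return min(k, lam.size)                                # guard against rounding at t = 1

eigvals = [7.93954312, 0.8, 0.06045688]
print(choose_k(eigvals, 0.8))   # 1: the first component alone holds about 90%
print(choose_k(eigvals, 0.95))  # 2: two components hold about 99%
```

The final `min` only matters when $t$ is at or very near 1, where floating-point rounding could otherwise push the index past the last eigenvalue.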

PCA in Practice

# coding:utf-8
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1, 1], [-2, -1, 2], [-3, -2, 3], [1, 1, 1], [2, 1, 2], [3, 2, 3]])
# keep the top two principal components
pca_model_1 = PCA(n_components=2)

# choose the smallest number of components whose cumulative explained
# variance ratio exceeds the fraction given by n_components (here 0.8)
pca_model_2 = PCA(n_components=0.8, svd_solver='full')

pca_model_1.fit(X)

# Principal axes in feature space,
# representing the directions of maximum variance in the data
print(pca_model_1.components_)
"""
>>> [[-0.83849224 -0.54491354 -0.        ]
     [ 0.          0.          1.        ]]
"""

# The amount of variance explained by each of the selected components.
print(pca_model_1.explained_variance_)
"""
>>> [7.93954312 0.8       ]
"""

# Percentage of variance explained by each of the selected components
print(pca_model_1.explained_variance_ratio_)
"""
>>> [0.90222081 0.09090909]
"""

# The singular values corresponding to each of the selected components
print(pca_model_1.singular_values_)
"""
>>> [6.30061232 2.        ]
"""

pca_model_2.fit(X)
print(pca_model_2.components_)
print(pca_model_2.explained_variance_)
print(pca_model_2.explained_variance_ratio_)
print(pca_model_2.singular_values_)
"""
[[-0.83849224 -0.54491354 -0.        ]]
[7.93954312]
[0.90222081]
[6.30061232]

"""
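The listing above only fits the models. As a hedged follow-up sketch, the fitted model can also project the data with `fit_transform` and map it back with `inverse_transform`; the variable names `Y`, `X_rec` and `err` are illustrative choices. The snippet also checks the relation between the reported singular values and explained variances, which for $m$ samples is `explained_variance_ = singular_values_ ** 2 / (m - 1)`.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1, 1], [-2, -1, 2], [-3, -2, 3],
              [1, 1, 1], [2, 1, 2], [3, 2, 3]])

pca = PCA(n_components=2)
Y = pca.fit_transform(X)            # projected data, shape (6, 2)
X_rec = pca.inverse_transform(Y)    # back-projection into the original 3-D space
err = np.mean((X - X_rec) ** 2)     # mean squared reconstruction error

m = X.shape[0]
# Singular values and explained variances describe the same quantity at different scales.
print(np.allclose(pca.singular_values_ ** 2 / (m - 1), pca.explained_variance_))
print(Y.shape, err)
```

Because the discarded third component carries very little variance here, the reconstruction error is small but not zero.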