Understanding PCA (Principal Component Analysis)

Latest recommended article published on 2024-07-12 16:16:27

weixin_45974177


Category: Python-算法 (Python algorithms). Tags: python, machine learning, programming languages

Article link: https://blog.csdn.net/weixin_45974177/article/details/130447215


Implementing PCA (Principal Component Analysis)

1. Understanding PCA Through an Example
2. Python Implementation
- 2.1 Manual implementation in Python
- 2.2 Implementation with a library in Python
Summary

1. Understanding PCA Through an Example

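Q1: Suppose we have collected a dataset in which every sample has many features (that is, the dimensionality is high). How can we present it in fewer dimensions? (The example in this article reduces three dimensions to two purely for convenience; real datasets usually have many more dimensions.)
A1: We can use PCA (principal component analysis).
Q2: How does PCA work?
A2: PCA asks us to set aside the meaning of each feature for a moment and treat every feature simply as a component, with all components on an equal footing. From that starting point we can analyze the relationships in the data using the mean, the covariance matrix, and its eigenvalues and eigenvectors; we will not dig into the details here. What matters is that these tools let us drop some components while losing as little information as possible, folding what they carried into the components that remain. Concretely, the directions we keep are the eigenvectors of the covariance matrix with the largest eigenvalues. For example, a [40, 3] matrix becomes a [40, 2] matrix that preserves as much of the original data as possible, but it is now two-dimensional and therefore easier to look at.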
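The steps named in A2 (mean, covariance, eigenvalues, eigenvectors) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the helper name `pca_project` and the random [40, 3] input are made up for this example and are not part of the article's own code.

```python
import numpy as np


def pca_project(data, n_components):
    """Project `data` (rows = samples, columns = features) onto its top principal components.

    Hypothetical helper for illustration only.
    """
    # Step 1: center every feature on its mean.
    centered = data - data.mean(axis=0)
    # Step 2: covariance matrix of the features (3 x 3 for 3 features).
    cov = np.cov(centered, rowvar=False)
    # Step 3: eigen-decomposition; eigh is meant for symmetric matrices and
    # returns eigenvalues in ascending order, eigenvectors as columns.
    eig_val, eig_vec = np.linalg.eigh(cov)
    # Step 4: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eig_val)[::-1][:n_components]
    components = eig_vec[:, order]
    # Step 5: project the centered data onto those directions.
    return centered @ components, eig_val[order]


rng = np.random.default_rng(0)
data = rng.normal(size=(40, 3))          # assumed toy data: 40 samples, 3 features
projected, variances = pca_project(data, n_components=2)
print(projected.shape)                   # (40, 2)
```

The eigenvalues returned alongside the projection are the variances of the data along each kept direction, which is why "largest eigenvalue" is read as "most information".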

2. Python Implementation

2.1 Manual implementation in Python

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Project : 算法-PCA.py
@File    : PCA.py
@IDE     : PyCharm
@Author  : Pan  Youxuan
@Date    : 2023/4/30 11:16
@Desc    : Manual implementation of PCA (principal component analysis)
'''
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting support

np.random.seed(0)
mu_vec1 = np.array([0, 0, 0])
mu_vec2 = np.array([1, 1, 1])
cov1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
cov2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov1, size=20).T
class2_sample = np.random.multivariate_normal(mu_vec2, cov2, size=20).T
print('Shape of the generated data: class1={}, class2={}'.format(class1_sample.shape, class2_sample.shape))

# Draw a 3D scatter plot of both classes.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(class1_sample[0, :], class1_sample[1, :], class1_sample[2, :], c='r', marker='o')
ax.scatter(class2_sample[0, :], class2_sample[1, :], class2_sample[2, :], c='b', marker='^')
plt.show()

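# PCA dimensionality reduction
all_samples = np.concatenate((class1_sample, class2_sample), axis=1)
print('Shape of all samples: {}'.format(all_samples.shape))
# Mean of each dimension: axis=1 averages along each row; the result is reshaped into a 3x1 column vector.
mean_vector = all_samples.mean(axis=1).reshape(3, 1)
print('Mean of each dimension:\n{}'.format(mean_vector))
# Scatter matrix: the sum of (x - mean)(x - mean)^T over all samples.
# It equals the covariance matrix times (n - 1), so both have the same eigenvectors.
scatter_mat = np.zeros((3, 3))
for i in range(all_samples.shape[1]):
    diff = all_samples[:, i].reshape(3, 1) - mean_vector
    scatter_mat += diff.dot(diff.T)
print('Scatter matrix:\n{}'.format(scatter_mat))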

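# Compute eigenvalues and eigenvectors with np.linalg. eigh is intended for symmetric
# matrices and returns real eigenvalues in ascending order; each column of eig_vec is one eigenvector.
eig_val, eig_vec = np.linalg.eigh(scatter_mat)
print('Eigenvalues:\n{}\nEigenvectors (one per column):\n{}'.format(eig_val, eig_vec))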

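# Keep the two eigenvectors with the largest eigenvalues, since they carry the most information.
order = np.argsort(eig_val)[::-1]
W = eig_vec[:, order[:2]].T  # shape (2, 3): one principal direction per row
y = W.dot(all_samples - mean_vector)
# Plot the 2D projection: the first 20 columns are class 1, the remaining 20 are class 2.
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.scatter(y[0, 0:20], y[1, 0:20], c='r', marker='o')
ax1.scatter(y[0, 20:], y[1, 20:], c='b', marker='^')
plt.show()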
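One way to gain confidence in a hand-written PCA is to compare it with scikit-learn on the same data. The sketch below regenerates the two Gaussian classes with the same seed, stores samples as rows (an assumption made here for brevity; the script above stores them as columns), and compares absolute values, because the sign of an eigenvector is arbitrary and the two methods may flip it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the toy data: two 3D Gaussian classes with 20 samples each.
np.random.seed(0)
class1 = np.random.multivariate_normal([0, 0, 0], np.eye(3), size=20)
class2 = np.random.multivariate_normal([1, 1, 1], np.eye(3), size=20)
samples = np.vstack((class1, class2))    # shape (40, 3), rows are samples

# Manual PCA: center, build the scatter matrix, keep the top two eigenvectors.
centered = samples - samples.mean(axis=0)
scatter = centered.T @ centered
eig_val, eig_vec = np.linalg.eigh(scatter)
order = np.argsort(eig_val)[::-1][:2]
manual = centered @ eig_vec[:, order]
manual_ratio = eig_val[order] / eig_val.sum()

# Library PCA on the same samples.
pca = PCA(n_components=2)
library = pca.fit_transform(samples)

# Compare up to the sign of each component.
same_projection = np.allclose(np.abs(manual), np.abs(library))
same_ratio = np.allclose(manual_ratio, pca.explained_variance_ratio_)
print(same_projection, same_ratio)
```

The share of each eigenvalue in the eigenvalue sum is the same quantity that scikit-learn reports as `explained_variance_ratio_`, so the manual script can report how much information the two kept components retain.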

2.2 Implementation with a library in Python

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Project : 算法-PCA.py
@File    : PCA.py
@IDE     : PyCharm
@Author  : Pan  Youxuan
@Date    : 2023/4/30 11:16
@Desc    : PCA (principal component analysis) with scikit-learn
'''
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition

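iris = datasets.load_iris()
X = iris.data
Y = iris.target
print('Shape of X: {}, shape of Y: {}'.format(X.shape, Y.shape))
# Fit with all components first to see how much variance each one explains.
pca = decomposition.PCA(n_components=None)
pca.fit(X)
print('Explained variance ratio of each component: {}'.format(pca.explained_variance_ratio_))
# Keep only the first two principal components.
pca = decomposition.PCA(n_components=2)
X_r = pca.fit_transform(X)
print('Shape of X after reduction: {}'.format(X_r.shape))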

fig = plt.figure()
ax2 = fig.add_subplot(111)
colors = ['r', 'g', 'b']
for label, color in zip(np.unique(Y), colors):
    position = Y == label
    ax2.scatter(X_r[position, 0], X_r[position, 1], label='target=%d' % label, color=color)

ax2.set_xlabel('PC1')
ax2.set_ylabel('PC2')
ax2.legend(loc='best')
ax2.set_title('PCA')
plt.show()
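The script above fixes `n_components=2` by hand. A hedged alternative is to pick the number of components from the cumulative explained variance. The 0.95 threshold below is an assumption chosen only for illustration; scikit-learn also accepts a float between 0 and 1 as `n_components` and then selects the count itself.

```python
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data

# Fit with every component and accumulate the explained variance ratios.
full = decomposition.PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95                          # assumed value, tune per dataset
k = int(np.searchsorted(cumulative, threshold) + 1)

# Equivalent shortcut: pass the threshold directly as n_components.
auto = decomposition.PCA(n_components=threshold).fit(X)
print(k, auto.n_components_)
```

Both routes should agree on the number of components; the manual route is useful when the cumulative curve itself is of interest.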

Summary

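My personal understanding is as follows:
PCA is, at heart, an approximation. Faced with high-dimensional data, we often do not know where to start, because there are too many dimensions to analyze directly. So we temporarily set aside what each dimension means, look only at the data itself, and keep the part that carries the most information, which is how the dimensionality reduction is achieved.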
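The "approximation" view can be made concrete by mapping the reduced data back to the original space and measuring what was lost. The sketch below is one possible illustration on the iris data; the use of mean squared error as the measure is an assumption of this example, not something the article prescribes.

```python
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data

errors = []
for k in range(1, X.shape[1] + 1):
    pca = decomposition.PCA(n_components=k).fit(X)
    # Reduce to k dimensions, then map back to the original 4 dimensions.
    X_approx = pca.inverse_transform(pca.transform(X))
    # Mean squared difference between the original and the approximation.
    mse = float(np.mean((X - X_approx) ** 2))
    errors.append(mse)
    print('components={}, reconstruction MSE={:.6f}'.format(k, mse))
```

The error shrinks as more components are kept and is essentially zero once all four are used, which matches the idea that PCA trades a small loss of information for fewer dimensions.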
