[Python] Machine Learning Notes 07: Principal Component Analysis

Latest recommended article published on 2022-10-05 07:13:56

RM -RF /星

Category: Data Science and Artificial Intelligence. Tags: machine learning, python, data analysis, data mining

Article link: https://blog.csdn.net/weixin_41429999/article/details/108047739

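Reference for this post: Python Data Science Handbook (《Python数据科学手册》).
The source code for this post has been uploaded to Gitee.

Packages used in this post: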

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits, fetch_lfw_people
from sklearn.decomposition import PCA

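sns.set()
# SimHei provides CJK glyphs; this only matters if plot labels contain Chinese text.
plt.rc('font', family='SimHei')
# Render the minus sign correctly when a non-default font is active.
plt.rc('axes', unicode_minus=False)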

Principal Component Analysis (PCA)

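Principal component analysis is a fundamental dimensionality-reduction algorithm, useful for data visualization, noise filtering, and feature extraction.

One weakness of PCA is that it is easily affected by outliers in the data. Partly for this reason, many variants of PCA have been developed; scikit-learn, for example, provides SparsePCA, which adds a sparsity constraint on the components.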
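The sensitivity to outliers can be seen in a small experiment. The sketch below is an illustration only: the data, the helper `first_axis`, and the outlier position are made up for this purpose, and the axis is computed with a plain NumPy SVD rather than with sklearn.

```python
import numpy as np

def first_axis(data):
    """Unit vector of the first principal axis, from an SVD of the centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
# 200 points stretched along the x-axis with little spread in y.
clean = np.column_stack([rng.normal(0, 3.0, 200), rng.normal(0, 0.3, 200)])
# The same data plus one extreme point far away along the y-axis.
with_outlier = np.vstack([clean, [0.0, 100.0]])

axis_clean = first_axis(clean)
axis_outlier = first_axis(with_outlier)
print("first axis, clean data:  ", np.round(axis_clean, 3))
print("first axis, with outlier:", np.round(axis_outlier, 3))
```

On the clean data the first axis points almost exactly along x. A single extreme point adds enough variance along y to swing the first axis to that direction, even though the other 200 points are unchanged.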

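PCA also performs poorly on data with non-linear structure; for such data, other methods are worth trying.
In current versions of sklearn, principal component analysis is implemented by the PCA class.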

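Principal axes

Fitting PCA to a set of two-dimensional data gives a model with two attributes, components_ and explained_variance_. Drawing each component as the direction of a vector and the corresponding explained variance as its squared length gives the following result:

These two vectors form the principal axes of the data, and the projection of each data point onto the principal axes gives its principal components.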

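# Two-dimensional sample data: y follows x with added uniform noise.
x = np.random.rand(100) * 5
y = x + 1.5 * np.random.rand(100)
x, y = x[:, np.newaxis], y[:, np.newaxis]

plt.figure(figsize=(10, 10))
plt.plot(x, y, 'o', alpha=0.5)

pca = PCA(n_components=2)
pca.fit(np.hstack((x, y)))

# The '=' specifier inside f-strings requires Python 3.8 or later.
print(f'PCA attribute, components: {pca.components_=}')
print(f'PCA attribute, explained variance: {pca.explained_variance_=}')

# Draw each axis as an arrow from the data mean, with length sqrt(variance).
for v, var in zip(pca.components_, pca.explained_variance_):
    plt.annotate('', pca.mean_ + v * np.sqrt(var), pca.mean_, arrowprops=dict(arrowstyle='->', linewidth=2, color='k'))
plt.axis('equal')
plt.title('PCA principal axes')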

[Figure: PCA principal axes]
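Where the two attributes come from can be checked by hand. The following is a minimal sketch, assuming data built like the example above (the seed and variable names are illustrative): the principal axes are the eigenvectors of the covariance matrix, and the variances along them are its eigenvalues. sklearn's PCA obtains the same quantities from an SVD of the centered data, so the sketch cross-checks against NumPy's SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, built like the example above: y follows x plus noise.
x = rng.random(100) * 5
y = x + 1.5 * rng.random(100)
data = np.column_stack([x, y])

# Covariance matrix of the data (np.cov uses the n - 1 denominator).
cov = np.cov(data, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]      # variance along each principal axis
axes = eigvecs[:, order].T      # one unit-length axis per row

# Cross-check via SVD of the centered data: the singular values s satisfy
# s**2 / (n - 1) == variances, and the rows of vt match the axes up to sign.
centered = data - data.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
print("variances:", variances)
print("axes:\n", axes)
```

The sign of each axis is arbitrary, which is why the comparison with the SVD result is only meaningful up to sign.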

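Dimensionality reduction

Reducing dimensionality with PCA means discarding one or more of the smallest principal components, which yields a lower-dimensional projection of the data that preserves as much of the variance as possible.

The first example projects two-dimensional data onto one dimension, and the second projects the 64-dimensional handwritten digits data onto two dimensions. The first example shows that although the data becomes one-dimensional, its overall distribution is preserved. The second shows that although the dimensionality is reduced by a factor of 32, some of the digits are still separated in the two-dimensional space.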

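x = np.dot(np.random.randn(2, 2), np.random.randn(2, 200)).T
pca = PCA(n_components=1)
pca.fit(x)
# Project to one dimension, then map back to 2-D to visualize what was kept.
res = pca.transform(x)
res = pca.inverse_transform(res)

plt.figure(figsize=(10, 10))
plt.plot(x[:, 0], x[:, 1], 'o', alpha=0.5)
plt.plot(res[:, 0], res[:, 1], 'o', alpha=0.5)
plt.title('Effect of PCA dimensionality reduction')

digits = load_digits()
# Use a fresh estimator instead of mutating n_components on a fitted one.
pca = PCA(n_components=2)
pca.fit(digits.data)
res = pca.transform(digits.data)

plt.figure(figsize=(10, 10))
plt.scatter(
    x=res[:, 0],
    y=res[:, 1],
    c=digits.target,
    # plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap accepts the same arguments.
    cmap=plt.get_cmap('viridis', 10),
    alpha=0.5,
    linewidths=2,
)
plt.colorbar()
plt.title('Handwritten digits reduced with PCA')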

Results:
[Figure: effect of PCA dimensionality reduction]

[Figure: handwritten digits reduced to two dimensions with PCA]
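What `transform` and `inverse_transform` do can be written out in two lines of linear algebra. The sketch below assumes the default `whiten=False` and uses made-up data shaped like the first example; it reproduces both calls by hand and relates the reconstruction error to the variance of the discarded component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data, built like the first example above.
x = (rng.standard_normal((2, 2)) @ rng.standard_normal((2, 200))).T

pca = PCA(n_components=1).fit(x)

# Projection: center the data, then take dot products with the kept axes.
manual_scores = (x - pca.mean_) @ pca.components_.T
# Reconstruction: map the scores back along the axes and add the mean again.
manual_recon = manual_scores @ pca.components_ + pca.mean_

scores = pca.transform(x)
recon = pca.inverse_transform(scores)
print(np.allclose(manual_scores, scores), np.allclose(manual_recon, recon))

# Summed squared reconstruction error divided by n - 1: this is the
# variance along the component that was thrown away.
lost = np.sum((x - recon) ** 2) / (len(x) - 1)
full = PCA(n_components=2).fit(x)
print(lost, full.explained_variance_[1])
```

This is why the reconstructed points in the first figure lie on a straight line: they are the original points with their coordinate along the second axis set to zero.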

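Choosing the number of components

When using PCA for dimensionality reduction, a suitable number of components can be chosen by plotting the cumulative explained variance ratio against the number of components. The cumulative ratio can be computed from the explained_variance_ratio_ attribute of the PCA class. Taking the handwritten digits dataset as an example:

As the plot shows, more than 20 components must be kept to retain 90% of the variance.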

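# With no n_components given, PCA keeps all 64 components.
model = PCA()
model.fit(digits.data)

plt.figure(figsize=(10, 10))
plt.plot(np.arange(1, 64 + 1), np.cumsum(model.explained_variance_ratio_) * 100)
plt.xlim(1, 64)
plt.ylim(0, 110)
# Reference line at 90% cumulative variance.
plt.hlines(90, 0, 64, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Number of components kept')
plt.ylabel('Cumulative explained variance ratio (%)')
plt.title('Components kept vs. cumulative explained variance ratio')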

[Figure: cumulative explained variance ratio versus number of components kept]
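Instead of reading the threshold off the plot, the number of components can be computed directly. The sketch below is one way to do it, with `target` as an illustrative name: it finds the first position where the cumulative ratio exceeds the target, and compares the result with what PCA chooses when `n_components` is given as a float between 0 and 1.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
target = 0.90  # desired fraction of variance to keep

# Fit with all 64 components and accumulate the variance ratios.
full = PCA().fit(digits.data)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose cumulative ratio exceeds the target
# (argmax returns the index of the first True).
k = int(np.argmax(cumulative > target)) + 1
print(f"components needed for more than {target:.0%} of the variance: {k}")

# Passing a float in (0, 1) asks PCA to pick the number of components itself.
auto = PCA(n_components=target, svd_solver="full").fit(digits.data)
print(f"chosen by PCA(n_components={target}): {auto.n_components_}")
```

The two numbers should agree, so in practice the float form is the shorter route when only the reduced data is needed; the explicit cumulative sum is useful when the whole curve is of interest.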

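Noise filtering with PCA

In a typical dataset, the variance of the signal itself should be much larger than the variance of the noise, so keeping only the leading principal components should, in theory, remove much of the noise.
Here noise is added to the handwritten digits from sklearn, and PCA is then used to reduce the dimensionality and reconstruct the images to check the denoising effect.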

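def plot_digits(d, t, s):
    # Show the first 30 images of d in a 3x10 grid, labeled with t, under the title s.
    fig, axs = plt.subplots(3, 10, figsize=(12, 5))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    fig.suptitle(s)
    for i, ax in enumerate(axs.flatten()):
        ax.imshow(d[i], cmap='binary')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_xlabel(f'{t[i]}')

plot_digits(digits.images, digits.target, 'Original handwritten digit samples')
# Despite the name, this holds the noisy images: each pixel is drawn from a
# normal distribution centered on the original value with standard deviation 2.33.
noise = np.random.normal(digits.images, 2.33)
plot_digits(noise, digits.target, 'Handwritten digits with added noise')

model = PCA(n_components=0.5)  # keep 50% of the variance
model.fit(noise.reshape(1797, 64))
print(f'Dimensions needed to keep 50% of the variance: {model.n_components_}')
res = model.transform(noise.reshape(1797, 64))
res = model.inverse_transform(res)
res = res.reshape(1797, 8, 8)
plot_digits(res, digits.target, 'Handwritten digits after PCA denoising')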

[Figure: original, noisy, and PCA-denoised handwritten digits]
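Judging the result by eye is subjective, so it can help to measure it. The sketch below is an illustration under assumed settings (a fixed seed, the same noise level as above, and three arbitrary variance thresholds); the helper `mse` is hypothetical. It reports the per-pixel mean squared error between each reconstruction and the clean images, next to the error of the noisy input itself.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
clean = digits.data                      # shape (1797, 64)
rng = np.random.default_rng(0)
sigma = 2.33                             # same noise level as above
noisy = clean + rng.normal(0.0, sigma, clean.shape)

def mse(a, b):
    """Mean squared error per pixel between two arrays of images."""
    return float(np.mean((a - b) ** 2))

results = {"noisy": mse(noisy, clean)}
for keep in (0.5, 0.7, 0.9):
    model = PCA(n_components=keep, svd_solver="full").fit(noisy)
    denoised = model.inverse_transform(model.transform(noisy))
    results[keep] = mse(denoised, clean)
    print(f"keep={keep:.0%}: {model.n_components_} components, "
          f"MSE vs clean = {results[keep]:.2f}")
print(f"noisy input: MSE vs clean = {results['noisy']:.2f}")
```

Which threshold works best depends on the noise level. Discarding components removes noise but also removes some signal, so with mild noise a low threshold may cost more detail than it gains, and comparing several thresholds is a reasonable check.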

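PCA example: eigenfaces

When introducing support vector machines, we used PCA to reduce the dimensionality of face data.

Visualizing the eigenvectors of the PCA model as images (such images are called eigenfaces) shows directly which information from the original data the model keeps.

Computing the cumulative variance shows that after reducing the original 2914-dimensional ($62 \times 47$) data to 150 dimensions, more than 90% of the variance is still retained.

Finally, we reconstruct the images from the reduced data and display them; the main features of the data are largely preserved.

A PCA that uses a randomized method is used here, which speeds up processing of high-dimensional data.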

# The first call downloads the Labeled Faces in the Wild data and caches it locally.
faces = fetch_lfw_people(min_faces_per_person=60)
model = PCA(n_components=150, random_state=42, svd_solver='randomized', whiten=False, iterated_power=3)

# Show the first 24 original faces with the person's name.
fig, axs = plt.subplots(3, 8, figsize=(18, 6))
fig.subplots_adjust(hspace=0.1, wspace=1)
fig.suptitle('sklearn face data')
for i, ax in enumerate(axs.flatten()):  # type: int, plt.Axes
    ax.imshow(faces.images[i], cmap='bone')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(faces.target_names[faces.target[i]])

# Each row of components_ has 2914 values and can be viewed as a 62x47 image.
model.fit(faces.data)
fig, axs = plt.subplots(3, 8, figsize=(18, 5))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
fig.suptitle('First 24 PCA eigenvectors (eigenfaces)')
for i, ax in enumerate(axs.flatten()):  # type: int, plt.Axes
    ax.imshow(model.components_[i].reshape(62, 47), cmap='bone')
    ax.set_xticks([])
    ax.set_yticks([])

plt.figure(figsize=(5, 5))
plt.plot(np.cumsum(model.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')

# Reduce to 150 dimensions, then map back to pixel space.
components = model.transform(faces.data)
projected = model.inverse_transform(components)

fig, axs = plt.subplots(3, 8, figsize=(18, 7))
fig.subplots_adjust(hspace=0.1, wspace=1)
fig.suptitle('Data reconstructed from the PCA-reduced result')
for i, ax in enumerate(axs.flatten()):  # type: int, plt.Axes
    ax.imshow(projected[i].reshape(62, 47), cmap='bone')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(faces.target_names[faces.target[i]])

[Figure: face samples, eigenfaces, cumulative explained variance ratio, and reconstructed faces]
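The randomized solver only approximates the leading components, so it is reasonable to ask how much accuracy is given up. The sketch below avoids the face download and uses synthetic data as a stand-in (300 samples, 500 features, 10 strong latent directions; all of these numbers are assumptions for illustration). It fits the same data with the exact solver and with the randomized solver configured as above, then prints the explained variance ratios side by side.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for image data: 300 samples, 500 features,
# dominated by 10 strong latent directions plus weak noise.
latent = rng.standard_normal((300, 10)) * np.linspace(10, 1, 10)
mixing = rng.standard_normal((10, 500))
data = latent @ mixing + 0.1 * rng.standard_normal((300, 500))

exact = PCA(n_components=10, svd_solver="full").fit(data)
approx = PCA(n_components=10, svd_solver="randomized",
             iterated_power=3, random_state=42).fit(data)

print("exact: ", np.round(exact.explained_variance_ratio_, 4))
print("approx:", np.round(approx.explained_variance_ratio_, 4))
```

When a few components dominate, as in this synthetic case, the two solvers should agree closely. On data with a slowly decaying spectrum the gap can be larger, and raising `iterated_power` is the usual knob to trade speed for accuracy.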

