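The theory behind PCA is covered in most textbooks, so I'll skip the derivation and just summarize the steps:
- Standardize the variables;
- Compute the covariance matrix;
- Compute the eigenvalues and eigenvectors of the covariance matrix;
- Sort the eigenvalues from largest to smallest and keep the eigenvectors corresponding to the k largest eigenvalues;
- Multiply the standardized data by the matrix formed from these k eigenvectors, projecting it into the corresponding k-dimensional space.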
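The steps above can be sketched directly in NumPy. This is only a minimal illustration, not the implementation used by scikit-learn: the function name `pca_sketch` and the random input matrix are assumptions made for this example, and it assumes no column has zero variance.

```python
import numpy as np


def pca_sketch(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components.

    Hypothetical helper written for illustration only.
    """
    # Step 1: standardize each variable to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized variables (columns = variables).
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigen-decomposition; eigh is meant for symmetric matrices
    # and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues in descending order and keep the first k.
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]  # shape (n_features, k)
    # Step 5: project the standardized data into the k-dimensional space.
    return X_std @ components, eigvals[order]


# Synthetic data with the same shape as the Iris feature matrix (not real measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
Z, top_eigvals = pca_sketch(X, 2)
print(Z.shape)
```

Because the eigenvectors are orthonormal, the projected columns are uncorrelated and their variances equal the selected eigenvalues, which is a handy sanity check on any hand-written version.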
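Example 1: PCA dimensionality reduction on the Iris dataset.
Once loaded, the Iris dataset is a (150, 5) matrix: 150 samples (rows) and 5 columns.
The columns are 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', and 'species'. The first four are numeric features and 'species' is the class label.
To make visualization easier, we use PCA to reduce the four feature columns to 2 dimensions, giving a (150, 2) matrix, and draw a scatter plot.
The code is as follows: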
# Import libraries
import pandas as pd
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'