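Recommended reading: "PCA的数学原理" (The Mathematical Principles of PCA)
This article implements PCA in Python to help build a better understanding of it.
1. Getting the data
The data set contains 150 iris flowers, each described by 4 features.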
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
df.head()
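# .ix was removed from pandas; .iloc does the same position-based selection
X = df.iloc[:, 0:4].values  # the four feature columns
y = df.iloc[:, 4].values    # the class label column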
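After this processing, X is a 150 × 4 matrix in which each row is one sample, and y is a vector of length 150 in which each entry is the class label of the corresponding sample.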
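As a quick sanity check on the feature/label split, here is a minimal sketch on a three-row stand-in table. The names `toy`, `X_toy`, and `y_toy` are hypothetical and not part of the original code; the table only mimics the column layout of the iris file.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris table (illustration only).
toy = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
     [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'],
     [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']],
    columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class'])

# Position-based slicing: the first four columns are features,
# the fifth column is the class label.
X_toy = toy.iloc[:, 0:4].values
y_toy = toy.iloc[:, 4].values

print(X_toy.shape)  # (number of samples, 4)
print(y_toy.shape)  # (number of samples,)
```

Applied to the full data set, the same slicing should give shapes of (150, 4) and (150,).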
The next step is to look at how the three iris classes are distributed over the four features, which we can show with histograms.
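The histograms rely on selecting, for each class and feature, the rows that belong to that class. A minimal sketch of that selection with plain NumPy follows, assuming a small made-up data set. `X_demo`, `y_demo`, `bins`, and `counts` are hypothetical names, and the values are not real iris measurements.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features, 2 classes.
X_demo = np.array([[5.0, 1.4], [5.2, 1.5], [4.8, 1.3],
                   [6.5, 4.6], [6.7, 4.4], [6.3, 4.7]])
y_demo = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

# Three bins of equal width shared by every class and feature.
bins = np.linspace(X_demo.min(), X_demo.max(), 4)

counts = {}
for col in range(X_demo.shape[1]):
    for label in np.unique(y_demo):
        # (y_demo == label) is a boolean mask over the rows;
        # col picks a single feature column.
        values = X_demo[y_demo == label, col]
        counts[(label, col)], _ = np.histogram(values, bins=bins)

for key, value in counts.items():
    print(key, value)
```

Each entry of `counts` holds the bar heights that one class contributes to the histogram of one feature, which is the information a plotting library draws.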
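# Note: since plotly 4, the online-plotting module plotly.plotly is shipped
# separately in the chart-studio package as chart_studio.plotly.
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls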
# plotting histograms
tls.set_credentials_file(username='zhuanxuhit', api_key='30dCVmghG2CqKQqfSzsu')
traces = []
legend = {0:False, 1:False, 2:False, 3:True}
colors = {'Iris-setosa': 'rgb(31, 119, 180)',
          'Iris-versicolor': 'rgb(255, 127, 14)',
          'Iris-virginica': 'rgb(44, 160, 44)'}
# one histogram trace per (feature, class) pair
for col in range(4):
    for key in colors:
        traces.append(Histogram(x=X[y==key, col],
                                opacity=0.75,
                                xaxis='x%s' %(col+1),
                                marker=Marker(color=colors[key]),
                                name=key,
                                showlegend=legend[col]))
data = Data(traces)
layout = Layout(barmode='overlay',
                xaxis=XAxis(domain=[0, 0.25], title='sepal length (cm)'),
                xaxis2=XAxis(domain=[0.3, 0.5], title='sepal width (cm)'),
                xaxis3=XAxis(domain=[0.55, 0.75], title='petal length (cm)'),
                xaxis4=XAxis(domain=[0.8, 1], title='petal width (cm)'),
                yaxis=YAxis(title='count'),
                title='Distribution of the differ