Machine Learning: Compressing Data via Dimensionality Reduction

Latest recommended article published on 2022-08-23 17:21:31

热爱学习的小鲁同学

Category: Python machine learning notes
Tags: machine learning, python

Article link: https://blog.csdn.net/m0_45055763/article/details/124477836

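Feature extraction: compress the original data into a lower-dimensional feature space.

5.1 Unsupervised dimensionality reduction with principal component analysis

Section 5.1 works through the following steps:

Standardize the data
Construct the covariance matrix
Obtain the eigenvalues and eigenvectors of the covariance matrix
Sort the eigenvalues in descending order to rank the corresponding eigenvectors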

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

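# Load the Wine dataset; the file has no header row, so column names are supplied here
df=pd.read_csv('wine-Copy1.data',names=['Class label','Alcohol','Malic acid',
                                  'Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids',
                                  'Nonflavanoid phenols','Proanthocyanins','Color intensity',
                                  'Hue','OD280/OD315 of diluted wines','Proline'])
df.head()

	Class label	Alcohol	Malic acid	Ash	Alcalinity of ash	Magnesium	Total phenols	Flavanoids	Nonflavanoid phenols	Proanthocyanins	Color intensity	Hue	OD280/OD315 of diluted wines	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735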

from sklearn.model_selection import train_test_split
X=df.iloc[:,1:].values
y=df.iloc[:,0].values

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,stratify=y,
                                              random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(124, 13)
(54, 13)
(124,)
(54,)

# Standardize the features: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train_std=sc.fit_transform(X_train)
X_test_std=sc.transform(X_test)

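Signature of np.cov: numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)

m: a 1-D or 2-D array. By default (rowvar=True) each row represents a variable (feature) and each column represents an observation. The standardized training matrix stores observations in rows, so it is transposed before being passed to np.cov.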
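The row/column convention of np.cov is easy to get wrong, so here is a minimal sketch on made-up data (the array X and its shape are assumptions for illustration, not the Wine data). It checks that transposing the input, passing rowvar=False, and computing the covariance by hand with the default n-1 normalization all agree.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# np.cov treats rows as variables by default, so an
# (observations x features) matrix is transposed first ...
cov_t = np.cov(X.T)
# ... or passed unchanged with rowvar=False.
cov_rv = np.cov(X, rowvar=False)

# Manual computation with the default normalization (divide by n - 1).
Xc = X - X.mean(axis=0)
cov_manual = Xc.T @ Xc / (X.shape[0] - 1)

print(cov_t.shape)
print(np.allclose(cov_t, cov_rv), np.allclose(cov_t, cov_manual))
```

Either form yields a features × features matrix, which is why the 13-feature training set gives a 13×13 covariance matrix.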

# Build the covariance matrix of the standardized training data (transposed so that features are rows)
cov_mat=np.cov(X_train_std.T)
cov_mat.shape

(13, 13)

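w,v = numpy.linalg.eig(a) computes the eigenvalues and right eigenvectors of a square matrix a.

Parameters and return values:

a: the square matrix whose eigenvalues and eigenvectors are computed.

w: a vector holding the eigenvalues. Note: the eigenvalues are not returned in any particular order, and they may be complex.

v: a matrix holding the eigenvectors. Each eigenvector is normalized to unit length, and column i, v[:,i], is the eigenvector for eigenvalue w[i].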
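To make the column convention concrete, the sketch below uses a small made-up symmetric matrix A (an assumption standing in for a covariance matrix). It checks that each column of v pairs with the matching entry of w, and also shows np.linalg.eigh, a NumPy routine for symmetric matrices that returns real eigenvalues in ascending order. The article itself uses np.linalg.eig; eigh is mentioned here only as an alternative that may suit covariance matrices.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 5.0]])

# General solver: eigenvalues come back in no guaranteed order.
w, v = np.linalg.eig(A)
# Column v[:, i] belongs to w[i], i.e. A @ v[:, i] == w[i] * v[:, i].
pairs_ok = np.allclose(A @ v, v * w)
unit_norm = np.allclose(np.linalg.norm(v, axis=0), 1.0)

# Symmetric solver: real eigenvalues, sorted ascending.
w_h, v_h = np.linalg.eigh(A)
order = np.argsort(w_h)[::-1]          # indices for descending order
w_sorted, v_sorted = w_h[order], v_h[:, order]

print(pairs_ok, unit_norm)
print(w_sorted)
```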

eigen_vals,eigen_vecs=np.linalg.eig(cov_mat)

eigen_vals

array([4.84274532, 2.41602459, 1.54845825, 0.96120438, 0.84166161,
       0.6620634 , 0.51828472, 0.34650377, 0.3131368 , 0.10754642,
       0.21357215, 0.15362835, 0.1808613 ])

eigen_vecs.shape

(13, 13)

Total and explained variance

tot=sum(eigen_vals)

var_exp=[(i/tot) for i in sorted(eigen_vals,reverse=True)]
var_exp

[0.36951468599607645,
 0.18434927059884165,
 0.11815159094596986,
 0.07334251763785471,
 0.06422107821731672,
 0.05051724484907654,
 0.03954653891241449,
 0.026439183169220035,
 0.02389319259185293,
 0.016296137737251016,
 0.013800211221948418,
 0.01172226244308596,
 0.008206085679091375]

# Cumulative explained variance
cum_var_exp=np.cumsum(var_exp)
cum_var_exp

array([0.36951469, 0.55386396, 0.67201555, 0.74535807, 0.80957914,
       0.86009639, 0.89964293, 0.92608211, 0.9499753 , 0.96627144,
       0.98007165, 0.99179391, 1.        ])

import matplotlib.pyplot as plt

# Bars: explained variance ratio of each component; step line: cumulative explained variance
plt.bar(range(1,14),var_exp,alpha=0.5,align='center',
       label='Explained variance')
plt.step(range(1,14),cum_var_exp,label='Cumulative explained variance',color='k')
plt.xlabel('Principal component index')
plt.ylabel('Explained variance ratio')
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x24cdba6d760>

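[Figure: bar chart of the explained variance ratio per principal component, with a step line for the cumulative explained variance]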

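Feature transformation

Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k≤d)
Construct the projection matrix W from these k eigenvectors
Transform the d-dimensional input dataset X with the projection matrix W to obtain the new k-dimensional feature subspace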
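The three steps above can be sketched as one small function. This is an illustrative sketch, not the article's code: the helper name pca_project and the random input are assumptions, and it uses np.linalg.eigh (suitable for a symmetric covariance matrix) in place of np.linalg.eig.

```python
import numpy as np

def pca_project(X_std, k):
    """Project standardized data (n x d) onto its top-k principal components."""
    cov_mat = np.cov(X_std, rowvar=False)        # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov_mat)         # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]             # indices of the k largest eigenvalues
    W = vecs[:, top]                             # d x k projection matrix
    return X_std @ W, W                          # n x k projected data

# Hypothetical correlated data: 50 samples, 4 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

X_proj, W = pca_project(X_std, k=2)
print(X_proj.shape, W.shape)
```

Because the columns of W are orthonormal eigenvectors, the first projected column carries at least as much variance as the second.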

# Build a list of (eigenvalue, eigenvector) tuples
eigen_pairs=[(np.abs(eigen_vals[i]),eigen_vecs[:,i]) for i in range(len(eigen_vals))]

eigen_pairs[0]

(4.842745315655895,
 array([-0.13724218,  0.24724326, -0.02545159,  0.20694508, -0.15436582,
        -0.39376952, -0.41735106,  0.30572896, -0.30668347,  0.07554066,
        -0.32613263, -0.36861022, -0.29669651]))

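# Sort the (eigenvalue, eigenvector) pairs by eigenvalue, largest first
eigen_pairs.sort(key=lambda k:k[0],reverse=True)

# Stack the eigenvectors of the two largest eigenvalues into the projection matrix
w=np.hstack((eigen_pairs[0][1][:,np.newaxis],
           eigen_pairs[1][1][:,np.newaxis]))

w  # the 13×2 projection matrix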

array([[-0.13724218,  0.50303478],
       [ 0.24724326,  0.16487119],
       [-0.02545159,  0.24456476],
       [ 0.20694508, -0.11352904],
       [-0.15436582,  0.28974518],
       [-0.39376952,  0.05080104],
       [-0.41735106, -0.02287338],
       [ 0.30572896,  0.09048885],
       [-0.30668347,  0.00835233],
       [ 0.07554066,  0.54977581],
       [-0.32613263, -0.20716433],
       [-0.36861022, -0.24902536],
       [-0.29669651,  0.38022942]])

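Only two eigenvectors are selected here. In practice, the number of principal components has to be chosen by balancing computational efficiency against classifier performance.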
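One common heuristic, which is not the criterion the article describes but a simpler stand-in, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below applies it to the cumulative values printed earlier in this article; the helper name components_for and the thresholds are assumptions for illustration.

```python
import numpy as np

# Cumulative explained variance as printed earlier in this article.
cum_var_exp = np.array([0.36951469, 0.55386396, 0.67201555, 0.74535807, 0.80957914,
                        0.86009639, 0.89964293, 0.92608211, 0.9499753 , 0.96627144,
                        0.98007165, 0.99179391, 1.        ])

def components_for(threshold):
    """Smallest k whose cumulative explained variance is >= threshold."""
    # searchsorted returns the first index whose value is >= threshold.
    return int(np.searchsorted(cum_var_exp, threshold) + 1)

print(components_for(0.80))  # → 5
print(components_for(0.95))  # → 10
```

A threshold only measures retained variance; whether that many components actually helps a classifier still has to be checked on held-out data.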

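Projecting the samples onto the two new features gives the transformed data
$ X' = XW $
where X is the 124×13 standardized training matrix and W is the 13×2 projection matrix (w in the code).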

# Project the standardized training data onto the 2-dimensional subspace
X_train_pca=X_train_std.dot(w)
X_train_pca.shape

(124, 2)

# Visualize the projected training data, one color and marker per class
colors=['r','b','g']
markers=['s','x','o']
for l,c,m in zip(np.unique(y_train),colors,markers):
    plt.scatter(X_train_pca[y_train==l,0],X_train_pca[y_train==l,1],
               c=c,label=l,marker=m)
    
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='best')
plt.show()

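[Figure: scatter plot of the training samples projected onto PC 1 and PC 2, one color and marker per class label]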

PCA is an unsupervised technique that does not use any class labels; the labels above are used only to color the plot.

Implementation with scikit-learn
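As a rough sketch of how the manual steps map onto scikit-learn, the block below fits sklearn.decomposition.PCA on made-up data (the array X is an assumption, not the Wine data) and compares the result with a projection built by hand from the covariance eigenvectors. The two agree up to the sign of each component, since an eigenvector's direction is only defined up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical correlated data: 60 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 5))
X_std = StandardScaler().fit_transform(X)

# scikit-learn PCA: note that no labels are passed to fit_transform.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Manual projection from the covariance eigenvectors.
vals, vecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
W = vecs[:, np.argsort(vals)[::-1][:2]]
X_manual = X_std @ W

# Each component may be flipped in sign, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(X_pca), np.abs(X_manual))
print(same_up_to_sign)
print(pca.explained_variance_ratio_)
```

With real data, the same fitted pca object would be reused to transform the test set via pca.transform, mirroring how the scaler is fitted on the training set only.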

# Visualize the decision regions

from matplotlib.colors import ListedColormap

def plot_decision_regions(X,y,classifier,test_idx=None,resolution=0.02):

    # Set up the marker generator and the color map
    markers=('s','x','o','^','v')
    colors=('red','blue','lightgreen','gray','cyan')
    cmap=ListedColormap(colors[:len(np.unique(y))])

    # Plot the decision surface: predict the class at every point of a fine grid
    x1_min,x1_max=X[:,0].min()-1,X[:,0].max()+2
    x2_min,x2_max=X[:,1].min()-1,X[:,1].max()+2
    xx1,xx2=np.meshgrid(np.arange(x1_min,x1_max,resolution),
                       np.arange(x2_min,x2_max,resolution))
    z=classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    z=z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,z,alpha=0.2,cmap=cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())

    # Plot the samples of each class
    for idx,c1 in enumerate(np.unique(y)):
        plt.scatter(x=X[y==c1,0],y=X[y==c1,1],
                   alpha=0.8,color=cmap(idx),
                   # the source text is cut off inside this call; the remaining
                   # arguments are an assumed completion
                   marker=markers[idx],label=c1)
