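(Scikit-Learn) Data representation and the basic workflow of the machine learning API (linear regression, supervised learning, unsupervised learning, dimensionality reduction, clustering)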

STILLxjy

Article link: https://blog.csdn.net/STILLxjy/article/details/95904532

I. Data representation in Scikit-Learn

1. Download the dataset and load it into a Pandas DataFrame

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

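2. Visualize the data with Seaborn
Use a pair plot to draw the relationship between every pair of variables.

%matplotlib inline
import seaborn as sns
sns.set()
# sns.pairplot(data, column used for coloring, height of each facet in inches)
# note: the `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(iris, hue='species', height=1.5)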

3. Extract the feature matrix and the target array from the DataFrame

X_iris = iris.drop('species',axis=1)
X_iris.shape
#(150, 4)
y_iris = iris['species']
y_iris.shape
#(150,)

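II. The Scikit-Learn estimator API
The usual steps for using the Scikit-Learn estimator API are:
(1) Choose a model class by importing the appropriate estimator class from Scikit-Learn.
(2) Instantiate the class with suitable values to set the model hyperparameters.
(3) Arrange the data into a feature matrix and a target array, as described above.
(4) Call the fit() method of the model instance to fit the model to the data.
(5) Apply the model to new data:
• for supervised learning, we usually predict labels for new data with the predict() method;
• for unsupervised learning, we usually transform the data or infer its properties with the transform() or predict() method.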
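
The five steps can be sketched in one place on a toy problem. This is only an illustration under stated assumptions: the tiny arrays `X_toy` and `y_toy` are made up, and `KNeighborsClassifier` and `StandardScaler` are stand-ins chosen for brevity, not estimators used elsewhere in this article.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # (1) choose a model class
from sklearn.preprocessing import StandardScaler

# (3) hypothetical toy data: feature matrix [n_samples, n_features] and target array
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)             # (2) set the hyperparameters
clf.fit(X_toy, y_toy)                                 # (4) fit the model to the data
labels = clf.predict(np.array([[1.5], [10.5]]))       # (5) supervised: predict labels
print(labels)

# an estimator that uses no labels follows the same pattern, ending in transform()
scaler = StandardScaler()
scaler.fit(X_toy)
X_scaled = scaler.transform(X_toy)                    # (5) unsupervised: transform the data
print(X_scaled.ravel())
```

The point of the sketch is that only the imported class changes between tasks; the fit / predict / transform calls keep the same shape.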

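Simple linear regression

import matplotlib.pyplot as plt
import numpy as np
# fix the random seed at 42 so that the data is reproducible
rng = np.random.RandomState(42)
# 50 values drawn uniformly from [0, 1), scaled by 10
x = 10 * rng.rand(50)
# y = 2 * x - 1 plus standard normal noise (mean 0, standard deviation 1)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y)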

(1) Choose a model class: import the linear regression class

from sklearn.linear_model import LinearRegression

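(2) Choose the model hyperparameters: create a model instance

model = LinearRegression(fit_intercept=True)  # also fit the intercept of the line

(3) Arrange the data into a feature matrix and a target array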

X = x[:,np.newaxis]
X.shape
#(50, 1)

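(4) Fit the model to the data
The fit() call runs a number of model-dependent computations internally and stores the results in attributes of the model, where the user can inspect them.
In Scikit-Learn, every model parameter learned by fit() has a name that ends with an underscore.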

model.fit(X,y)

# slope
model.coef_

# intercept
model.intercept_
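
As a rough sanity check, the learned attributes can be compared with the values used to generate the data (slope 2, intercept -1). The sketch below regenerates the same data so that it runs on its own; the variable names `has_coef_before_fit`, `slope` and `intercept` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# same synthetic data as above: y = 2x - 1 plus standard normal noise
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
# attributes ending in "_" do not exist until fit() has been called
has_coef_before_fit = hasattr(model, "coef_")

model.fit(x[:, np.newaxis], y)
slope = model.coef_[0]         # one coefficient per feature
intercept = model.intercept_
print(has_coef_before_fit, slope, intercept)
```

With only 50 noisy points the estimates should land close to, but not exactly on, the generating values.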

(5) Predict labels for new data with the predict() method

x_fit = np.linspace(-1,11)
x_fit

# reshape the x values into a [n_samples, n_features] feature matrix
Xfit = x_fit[:, np.newaxis]
Xfit

yfit = model.predict(Xfit)

Plot the raw data together with the fitted line

plt.scatter(x, y)
plt.plot(x_fit, yfit, color='red')

Supervised learning example: classifying the iris data
First split the data into a training set and a testing set.

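# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

from sklearn.naive_bayes import GaussianNB  # choose the model class
model = GaussianNB()                        # instantiate the model
model.fit(Xtrain, ytrain)                   # fit the model to the training data
y_model = model.predict(Xtest)              # predict labels for the test data

Use accuracy_score to measure the accuracy of the predictions (the number of correct predictions divided by the total number of predictions):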

from sklearn.metrics import accuracy_score
accuracy_score(ytest,y_model)
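
To make the definition concrete, here is a small sketch that computes the same ratio by hand and with accuracy_score. The four labels in `y_true` and `y_pred` are made up for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical true and predicted labels (3 of the 4 predictions are correct)
y_true = np.array(["setosa", "versicolor", "virginica", "virginica"])
y_pred = np.array(["setosa", "versicolor", "versicolor", "virginica"])

manual = np.mean(y_true == y_pred)        # fraction of matching labels
library = accuracy_score(y_true, y_pred)  # same quantity from Scikit-Learn
print(manual, library)  # 0.75 0.75
```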

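Unsupervised learning example: reducing the dimensionality of the iris data

from sklearn.decomposition import PCA  # choose the model class
model = PCA(n_components=2)            # instantiate the model
model.fit(X_iris)                      # fit to the data; no labels are passed
X_2D = model.transform(X_iris)         # project the data onto two dimensions

# insert the 2D projection into the iris DataFrame, then plot it with Seaborn's lmplot
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
# recent seaborn versions require x and y to be passed as keywords
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False)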
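
One way to judge how much information the two-dimensional projection keeps is the explained_variance_ratio_ attribute of the fitted PCA model. The sketch below is an addition, not part of the original walkthrough; it assumes the copy of the iris data bundled with Scikit-Learn (load_iris) in place of the Seaborn DataFrame so that it runs without a download.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# assumption: Scikit-Learn's bundled iris measurements (150 samples, 4 features)
X = load_iris().data

pca = PCA(n_components=2)
X_2D = pca.fit_transform(X)            # fit and transform in one call
ratio = pca.explained_variance_ratio_  # share of variance kept by each component
print(X_2D.shape, ratio, ratio.sum())
```

A sum close to 1 suggests that the two components retain most of the variance of the four original measurements.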

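Unsupervised learning example: clustering the iris data
We will use a powerful clustering method, the Gaussian mixture model (GMM). A GMM attempts to model the data as a mixture of several Gaussian probability density components.

# the GMM class was removed in scikit-learn 0.20; GaussianMixture is its replacement
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, covariance_type='full')
model.fit(X_iris)
y_gmm = model.predict(X_iris)

iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species', col='cluster', fit_reg=False)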
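
Because a Gaussian mixture is a probabilistic model, each sample also has a membership probability for every component, not only a hard cluster label. The following sketch is an addition under stated assumptions: it uses Scikit-Learn's bundled iris data (load_iris) and fixes `random_state=0`, which the original code does not set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data  # assumption: bundled copy of the iris measurements

# random_state is fixed here only to make the sketch repeatable
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

hard = gmm.predict(X)        # one cluster index per sample
soft = gmm.predict_proba(X)  # one probability per sample and component
print(soft[:3].round(3))
print(np.bincount(hard))     # number of samples assigned to each cluster
```

The hard label of a sample is the component with the largest membership probability, and each row of `soft` sums to 1.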

Application: exploring handwritten digits
Load the data

from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
#(1797, 8, 8)

Visualize the first 100 images

import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    # write the true label in the lower-left corner of each image
    ax.text(0.05, 0.05, str(digits.target[i]), transform=ax.transAxes, color='green')

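To use this data we need a two-dimensional feature matrix of shape [n_samples, n_features]. We can treat every pixel of an image as a feature, that is, flatten each 8 × 8 pixel digit into a one-dimensional array of length 64. We also need a target array that holds the true value (label) of each digit. Both are already available in the digits dataset as the data and target attributes.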

X = digits.data
X.shape
#(1797, 64)
y = digits.target
y.shape
#(1797,)
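
The relationship between the images and data attributes can be checked directly: flattening each 8 × 8 image row by row should reproduce the feature matrix. A minimal sketch, where the names `flattened` and `same` are introduced only for this check:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# flatten each 8x8 image into a row of 64 pixel values
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(flattened.shape, same)
```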

Dimensionality reduction: use Isomap, a manifold learning algorithm, to reduce the data to two dimensions

from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
#(1797, 2)

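Plot the projected data

# the 'spectral' colormap is now named 'nipy_spectral', and plt.get_cmap replaces plt.cm.get_cmap
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('nipy_spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)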

Classifying the digits
First split the data into a training set and a testing set, then fit a Gaussian naive Bayes model

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,random_state=0)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain,ytrain)
y_model = model.predict(Xtest)

Compute the accuracy of the model

from sklearn.metrics import accuracy_score
accuracy_score(ytest,y_model)

A single accuracy number does not tell us where the model goes wrong; a confusion matrix addresses this.
Compute the confusion matrix with Scikit-Learn, then plot it with Seaborn

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest,y_model)
sns.heatmap(mat,square=True,annot=True,cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
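
To show how such a matrix is laid out, here is a toy sketch with made-up labels (not the digits results): row i counts the samples whose true label is the i-th class, and column j counts those predicted as the j-th class.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: three true 2s are predicted as 1, 2 and 8
y_true = [1, 2, 2, 2, 8, 8]
y_pred = [1, 1, 2, 8, 8, 8]

# labels= fixes the row and column order to 1, 2, 8
mat = confusion_matrix(y_true, y_pred, labels=[1, 2, 8])
print(mat)
# [[1 0 0]
#  [1 1 1]
#  [0 0 2]]
```

Correct predictions sit on the diagonal, so the off-diagonal entries in the middle row are the 2s that were confused with 1 and 8.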

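The figure shows that most of the errors come from digit 2 being misclassified as 1 or 8.
Another intuitive way to inspect the model is to plot the test samples with the predicted label in the lower-left corner, in green when the prediction is correct and in red when it is wrong.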

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
            subplot_kw={'xticks':[], 'yticks':[]},
            gridspec_kw=dict(hspace=0.1, wspace=0.1))
test_images=Xtest.reshape(-1,8,8)
for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')

To push the classification accuracy above 80%, we may need a more sophisticated algorithm such as a support vector machine, a random forest, or another classifier.
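
As one possible next step, the same five-step workflow can be repeated with a random forest. This is a sketch, not a tuned model: `n_estimators=100` and `random_state=0` are assumptions picked for the example, and the accuracy obtained may vary with the Scikit-Learn version.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)

# assumed hyperparameters, not tuned
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(Xtrain, ytrain)
forest_accuracy = accuracy_score(ytest, forest.predict(Xtest))
print(forest_accuracy)
```

Only the import and the instantiation line differ from the Gaussian naive Bayes version, which is the main convenience of the uniform estimator API.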