Learning sklearn

Latest recommended article published on 2024-07-26 15:23:07

Fitz-E-T

Category: sklearn  Tags: machine learning

Article link: https://blog.csdn.net/weixin_43916812/article/details/113727186

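Preface

These are brief notes from my own study of sklearn, updated from time to time. If anything here infringes on your rights, contact me and I will remove it.

Contents

Preface
I. sklearn structure tree
II. Function overview

I. sklearn structure tree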

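[Figure: sklearn structure tree]

As the figure shows, the library's algorithms fall into four main categories: classification, regression, clustering, and dimensionality reduction.

Category	Algorithms
Common regression	Linear, decision tree, SVM, KNN; ensemble regression: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
Common classification	Linear, decision tree, SVM, KNN, naive Bayes; ensemble classification: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
Common clustering	K-means, hierarchical clustering, DBSCAN
Common dimensionality reduction	LinearDiscriminantAnalysis, PCA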
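To connect the table to actual code, here is a minimal sketch of where these estimators live in scikit-learn. The choice of one representative per family (for example `LogisticRegression` as the "linear" classifier) is my own assumption for illustration, not something the table specifies.

```python
# Import locations for the estimators named in the table above.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, BaggingClassifier,
                              ExtraTreesClassifier)
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# One representative per family (an illustrative choice, not the only option).
families = {
    "regression": LinearRegression(),
    "classification": LogisticRegression(),
    "clustering": KMeans(n_clusters=3, n_init=10, random_state=0),
    "dimensionality reduction": PCA(n_components=2),
}
for name, est in families.items():
    print(name, "->", type(est).__name__)
```

All four families share the same estimator interface: every object above exposes a `fit` method, which is why the later sections can swap models with very little code change.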

II. Function overview

1. Data loading and preprocessing

1.1 Data loading

1.1.1 Loading directly via datasets

from sklearn import datasets

sklearn ships with a number of standard datasets, so we do not need to hunt for training data on other websites.
For example:

# Load the handwritten digits dataset (use datasets.load_iris() for the iris data)
digits = datasets.load_digits()
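A quick way to see what a bundled dataset contains is to inspect its array shapes. The sketch below loads the digits dataset on its own, so it does not depend on any variable defined elsewhere in these notes.

```python
from sklearn import datasets

digits = datasets.load_digits()

# data holds one flattened 8x8 image per row; images keeps the 2-D layout
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)
print(digits.target[:10])   # the first ten labels
```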

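1.1.2 Loading from local storage

# fetch_mldata was removed in scikit-learn 0.22; fetch_openml replaces it.
# The data is downloaded once and then read from data_home on later calls.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, data_home='./datasets')
mnist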

Parameters:

Parameter	Description
data_home	Directory where the data is stored

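1.2 Data visualization

import matplotlib.pyplot as plt
# Assumes digits = datasets.load_digits() from section 1.1.1
# Draw the first 100 digit images in a 10x10 grid without axis ticks
fig, axes = plt.subplots(10,10, figsize=(8, 8),subplot_kw={'xticks':[], 'yticks':[]},
                        gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    # Print the true label in the lower-left corner of each image
    ax.text(0.05, 0.05, str(digits.target[i]),transform=ax.transAxes, color='green')
plt.show()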

1.3 Displaying the dataset keys

mnist.keys()

2. Models

2.1 Stochastic gradient descent classifier (SGDClassifier)

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state = 42)
sgd_clf.fit(X_train, y_train_6)

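SGDClassifier relies on randomness during training (hence "stochastic"), so two runs can give different models. To get reproducible results, set the random_state parameter.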
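The effect of `random_state` can be checked directly. This is a minimal sketch that uses the bundled digits dataset and a binary "is it a 6?" target as stand-ins for the `X_train` and `y_train_6` used in these notes; the helper `fit_coef` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
y_6 = (y == 6)  # binary target: True for the digit 6, False otherwise

def fit_coef(seed):
    """Train an SGDClassifier with the given seed and return its weights."""
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, y_6)
    return clf.coef_

# Two runs with the same seed produce exactly the same weights
same = np.array_equal(fit_coef(42), fit_coef(42))
print("same seed, identical weights:", same)
```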

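2.2 Random forest classifier (RandomForestClassifier)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
forest_clf = RandomForestClassifier(random_state = 42)
# method='predict_proba' returns class probabilities instead of class labels
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_6, cv=3, method='predict_proba')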
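To show what `method='predict_proba'` returns, here is a small self-contained sketch. It again uses the digits dataset with an "is it a 6?" target as an assumed stand-in for `X_train` and `y_train_6`, and `n_estimators=50` is chosen only to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_6 = (y == 6)

forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)
y_probas = cross_val_predict(forest_clf, X, y_6, cv=3, method="predict_proba")

# One row per instance, one column per class (False, True)
print(y_probas.shape)
# Column 1 is the probability of the positive class; it can serve as a score
y_scores_forest = y_probas[:, 1]
print(y_scores_forest[:5])
```

The positive-class column plays the same role for a random forest that `decision_function()` scores play for SGDClassifier.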

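3. Performance evaluation

3.1 Cross-validation¹

Three cross-validation methods

Simple cross-validation
S-fold cross-validation
Leave-one-out cross-validation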

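The first is simple cross-validation, "simple" relative to the other methods. We randomly split the samples into two parts (for example, 70% training set and 30% test set), train the model on the training set, and validate the model and its parameters on the test set. We then shuffle the samples, pick a new training set and test set, and train and validate again. Finally we choose the model and parameters that score best on the loss function.
The second is S-fold cross-validation. Unlike the first method, it randomly splits the samples into S parts, then uses S-1 parts as the training set and the remaining part as the test set. When a round finishes, a different part is held out and the other S-1 parts are used for training. In the standard form each part is held out exactly once, giving S rounds, after which we choose the model and parameters that score best on the loss function.
The third is leave-one-out cross-validation, a special case of the second where S equals the number of samples N. For N samples, each round trains on N-1 samples and keeps one sample to check the prediction. This method is mainly used when there are very few samples; for problems of ordinary difficulty, I generally use leave-one-out when N is below 50. Its drawback is the cost when N is large: the model has to be trained N times, each time on a training set of size N-1.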
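The three schemes map onto splitter classes in `sklearn.model_selection`. The sketch below is one possible mapping, assuming `ShuffleSplit` for the repeated random 70/30 split, `KFold` for S-fold, and `LeaveOneOut` for leave-one-out; the ten-sample array is toy data.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each

simple = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)  # repeated random split
s_fold = KFold(n_splits=5, shuffle=True, random_state=0)          # S = 5
loo = LeaveOneOut()                                               # S = N

for name, cv in [("simple", simple), ("S-fold", s_fold), ("leave-one-out", loo)]:
    # Record (train size, test size) for every round of this scheme
    sizes = [(len(train), len(test)) for train, test in cv.split(X)]
    print(name, "rounds:", cv.get_n_splits(X), "first split:", sizes[0])
```

Any of these objects can be passed as the `cv` argument of `cross_val_score` or `cross_val_predict`.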

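3.1.1 Common functions

Note: the sklearn.cross_validation and sklearn.grid_search modules used in older tutorials were removed in scikit-learn 0.20. These functions now live in sklearn.model_selection. Only the most commonly used parameters are listed below.

cross_val_score

sklearn.model_selection.cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')

GridSearchCV (grid search over a parameter space to find the best parameters)

sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, ...)

train_test_split (splits the data into a training set and a test set)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)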
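As a rough illustration of how `train_test_split` and `GridSearchCV` fit together, the sketch below holds out 30% of the iris data and grid-searches a decision tree's `max_depth` on the rest. The dataset, the parameter grid, and `stratify=y` are my own choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Try every value in the grid with 5-fold cross-validation on the training set
param_grid = {"max_depth": [1, 2, 3, 4, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# With refit=True (the default) the best model is retrained on all of X_train
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

The test set is touched only once, at the end, so the reported accuracy is not biased by the parameter search.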

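cross_val_score can be used for model selection and parameter selection, and its scores can be plotted to show how a single parameter affects the model. For example, with a decision tree on the iris dataset you can study the effect of max_depth and pick the best value. It returns the evaluation scores, one per fold.
cross_val_predict is similar to cross_val_score, but instead of scores it returns the estimator's predictions (class labels or regression values). This matters for improving the model later: comparing these predictions with the true targets pinpoints exactly where the model goes wrong, which helps with parameter tuning and troubleshooting.
cross_validate is also similar to cross_val_score, but it additionally reports fit time, score time, and (optionally) training scores. It returns a dict that converts cleanly to a pandas DataFrame for easy reading.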
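The max_depth study described above can be sketched as follows. The depth range 1 to 5 and `cv=5` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate depth
mean_scores = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    mean_scores[depth] = cross_val_score(tree, X, y, cv=5).mean()
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best:", best_depth)

# cross_validate returns a dict with timings and scores for every fold
results = cross_validate(
    DecisionTreeClassifier(max_depth=best_depth, random_state=42),
    X, y, cv=5, return_train_score=True)
print(sorted(results))
```

Passing `results` to `pandas.DataFrame` gives one row per fold, which is the tabular view mentioned above.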

3.2 Confusion matrix

First run k-fold cross-validation with cross_val_predict to get a prediction for every training instance:

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3)

Then build the confusion matrix with confusion_matrix:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_6, y_train_pred)
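The layout of the returned matrix is easiest to see on a tiny hand-made example. The labels below are invented for illustration only.

```python
from sklearn.metrics import confusion_matrix

# 4 actual positives and 6 actual negatives (toy data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```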

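Recall
$\frac{TP}{TP+FN}$
Precision
$\frac{TP}{TP+FP}$
Here TP, FP, and FN are the numbers of true positives, false positives, and false negatives.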

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_6, y_train_pred)
recall_score(y_train_6, y_train_pred)

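Precision: of the instances I predicted as positive, how many really are positive.
Recall: of the instances that really are positive, how many I predicted as positive.
Precision and recall can be combined into a single metric, the $F_1$ score, which is their harmonic mean.

$F_1$ score
$F_1=\frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}$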

from sklearn.metrics import f1_score
f1_score(y_train_6, y_train_pred)
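As a worked check of the three formulas, the sketch below computes precision, recall, and $F_1$ by hand from the confusion-matrix counts of a small invented example and compares them with the sklearn functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)               # 2 / 3
recall = tp / (tp + fn)                  # 2 / 4
f1 = 2 / (1 / precision + 1 / recall)    # harmonic mean of the two

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Because the harmonic mean is pulled toward the smaller value, $F_1$ here (about 0.57) sits closer to the recall of 0.5 than a plain average would.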

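Precision/recall trade-off
How SGDClassifier makes its classification decision: for each instance it computes a score from a decision function. If the score is greater than a threshold, the instance is assigned to the positive class; otherwise it is assigned to the negative class. Scikit-Learn does not let you set this threshold directly, but you can call decision_function() to get each instance's score and then predict with any threshold you like. (decision_function() is what SGDClassifier provides; with RandomForestClassifier you would use predict_proba() instead.)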

y_scores = sgd_clf.decision_function([small_x])
threshold = 10
y_small_x_pred = (y_scores > threshold)

Raising the threshold lowers recall (and usually raises precision).
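The trade-off can be seen on a handful of made-up scores. In this sketch the labels and decision scores are invented, and the three thresholds are arbitrary values picked to show the trend.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_scores = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])

recalls_by_threshold = []
for threshold in (-1.5, 0.5, 2.5):
    y_pred = (y_scores > threshold)  # positive when the score exceeds the threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    recalls_by_threshold.append(r)
    print(f"threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}")
```

Each step up in the threshold discards some true positives, so recall falls from 1.00 to 0.75 to 0.50, while precision rises because fewer negatives slip through.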

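Choosing the threshold
Call cross_val_predict() with method="decision_function" to get the scores of every instance in the training set, then use precision_recall_curve to compute precision and recall for all possible thresholds.

from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_6, y_scores)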

Plot precision and recall as functions of the threshold

import numpy as np
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, hence [:-1]
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])  # fixed x range; adjust to your score range

# First point where precision reaches 90%, and the recall at that point
recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
# Red dotted guide lines and dots marking the 90%-precision operating point
plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "r:")
plt.plot([threshold_90_precision], [0.9], "ro")
plt.plot([threshold_90_precision], [recall_90_precision], "ro")
# save_fig() is not defined in this article, so plt.savefig is used instead
plt.savefig("precision_recall_vs_threshold_plot.png")
plt.show()

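np.argmax() returns the first index where the condition is True, so it finds the lowest threshold that reaches a given precision (here 90%):

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

Label the instances using this threshold: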

y_train_pred_90 = (y_scores >= threshold_90_precision)

Compute the recall and precision:

recall_score(y_train_6, y_train_pred_90)
precision_score(y_train_6, y_train_pred_90)
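The whole threshold-selection procedure can be traced on a toy example. The labels and scores below are invented so that the result is easy to verify by hand; they stand in for `y_train_6` and the cross-validated `y_scores`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# First index where precision is at least 90% gives the lowest such threshold
idx = np.argmax(precisions >= 0.90)
threshold_90_precision = thresholds[idx]

# Label instances with this threshold and re-check the metrics
y_pred_90 = (y_scores >= threshold_90_precision)
precision_at_threshold = precision_score(y_true, y_pred_90)
recall_at_threshold = recall_score(y_true, y_pred_90)
print(threshold_90_precision, precision_at_threshold, recall_at_threshold)
```

One caveat: if no threshold reaches the target precision, `np.argmax` on an all-False array returns 0, so it is worth checking `precisions[idx]` before trusting the result.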

ROC curve

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_6, y_scores)
# Plot the curve: true positive rate against false positive rate
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2)
plt.axis([0, 1, 0, 1])
plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.grid(True)
plt.show()

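Choosing between the ROC curve and the precision/recall (PR) curve: when the positive class is rare, or when you care more about false positives than false negatives, choose the PR curve; otherwise choose the ROC curve.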
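To make the two ROC axes concrete, the sketch below computes the false positive rate and true positive rate by hand at one threshold and compares them with the matching point from `roc_curve`. The data is invented, and `drop_intermediate=False` is set so that every threshold is kept in the output.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_scores, drop_intermediate=False)

# Hand computation at threshold 0.6
y_pred = (y_scores >= 0.6)
tp = np.sum(y_pred & (y_true == 1))
fp = np.sum(y_pred & (y_true == 0))
tpr_manual = tp / np.sum(y_true == 1)  # recall: share of positives found
fpr_manual = fp / np.sum(y_true == 0)  # share of negatives flagged by mistake

# Locate the same threshold in the roc_curve output
i = np.argmin(np.abs(thresholds - 0.6))
print(fpr[i], tpr[i], fpr_manual, tpr_manual)
```

The y-axis of the ROC curve is the same recall used in the PR curve; only the x-axis differs, which is why the two curves can tell different stories when positives are rare.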