Data Processing Tools Notes (Part 3): Classification

Article link: https://blog.csdn.net/weixin_41932046/article/details/100638423

Selecting and Training Models

SGD Model

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])

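SVM Model (Support Vector Machine)

Linear SVM

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scale the features first: SVMs are sensitive to feature scales.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42))
])

svm_clf.fit(X, y)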

Nonlinear SVM

Polynomial Kernel

from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

Gaussian RBF Kernel

The idea is to add similarity features.

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
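
The similarity-feature idea behind the RBF kernel can be worked out by hand. The sketch below is only an illustration, assuming a single one-dimensional feature and two hypothetical landmarks; the helper name `gaussian_rbf`, the landmarks, and the value `gamma = 0.3` are chosen here and do not come from the original notes.

```python
import math

def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark: exp(-gamma * (x - landmark)^2)."""
    return math.exp(-gamma * (x - landmark) ** 2)

gamma = 0.3               # assumed value, for illustration only
landmarks = [-2.0, 1.0]   # hypothetical landmarks
x = -1.0

# Each landmark contributes one new feature: how similar x is to that landmark.
new_features = [gaussian_rbf(x, lm, gamma) for lm in landmarks]
print(new_features)
```

A point sitting exactly on a landmark gets similarity 1, and the value decays toward 0 as the distance grows; a larger `gamma` makes that decay faster.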

Decision Tree Model

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

Visualization

from sklearn.tree import export_graphviz
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "decision_trees"

def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id)

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Then convert the .dot file to a PNG image from the command line:

dot -Tpng iris_tree.dot -o iris_tree.png

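Random Forest Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

This returns an array with one row per instance and one column per class. Each entry is the probability that the given instance belongs to the given class.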
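
To make the shape of that probability array concrete, here is a small self-contained sketch. It swaps in the iris dataset as a stand-in (an assumption; the notes use `X_train` and `y_train_5`), and the variable names `X_demo`, `y_demo`, `forest_demo` and `probas_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Stand-in dataset (assumption): 150 instances, 3 classes.
X_demo, y_demo = load_iris(return_X_y=True)

forest_demo = RandomForestClassifier(n_estimators=10, random_state=42)
probas_demo = cross_val_predict(forest_demo, X_demo, y_demo, cv=3,
                                method="predict_proba")

print(probas_demo.shape)   # one row per instance, one column per class
print(probas_demo[:3])     # class probabilities for the first three instances

# Each row is a probability distribution over the classes.
row_sums = probas_demo.sum(axis=1)
```

For a binary task such as the 5-versus-rest problem, the array has two columns, and the second column (the positive class) can serve as the score for a ROC curve.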

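Multiclass Classifiers

OvA (one-versus-all): train one binary classifier per class, n in total; an instance is assigned to the class whose classifier gives the highest decision score.
OvO (one-versus-one): train one binary classifier for every pair of classes, which gives n(n-1)/2 classifiers for n classes.
Scikit-Learn automatically runs OvA when a binary classifier is used on a multiclass task, except for SVM classifiers (SVC), which use OvO.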
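
The classifier counts for the two strategies can be checked with a few lines of plain Python. This is a sketch that assumes ten classes (the digits 0 to 9); the variable names are hypothetical.

```python
from itertools import combinations

classes = list(range(10))   # assumed: the ten digit classes 0-9

# OvA: one binary classifier per class ("this class or not?").
n_ova = len(classes)

# OvO: one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
n_ovo = len(ovo_pairs)

print(n_ova, n_ovo)
```

OvO trains more classifiers, but each one only sees the training instances of its two classes, which helps algorithms that scale poorly with training set size.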

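Forcing OvO

Using SGD as an example:

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

len(ovo_clf.estimators_)  # 45: one classifier per pair of the 10 classes
Note: a random forest can classify instances into multiple classes directly.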

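Multilabel Classification

KNN Model

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)    # label 1: the digit is 7, 8 or 9
y_train_odd = (y_train % 2 == 1)  # label 2: the digit is odd
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])  # can take a long time, possibly hours

# Evaluation: the average of the F1 scores computed for each label
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")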
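
The macro average used above is the unweighted mean of the per-label F1 scores. The sketch below reproduces that arithmetic in plain Python on five made-up instances; the helper `f1_for_label` and the data are hypothetical, and returning 0 when there are no true positives is a convention assumed here.

```python
def f1_for_label(y_true, y_pred):
    """F1 score of one binary label, computed from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0   # assumed convention when F1 is undefined or zero
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical labels for five instances: columns are (large, odd).
y_true = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]
y_pred = [(1, 1), (0, 0), (1, 0), (0, 1), (0, 1)]

# Score each label column separately, then take the plain mean.
per_label_f1 = [
    f1_for_label([row[j] for row in y_true], [row[j] for row in y_pred])
    for j in range(2)
]
macro_f1 = sum(per_label_f1) / len(per_label_f1)
print(per_label_f1, macro_f1)
```

Macro averaging treats every label as equally important. If some labels have far more positive instances than others, `average="weighted"` weights each label by its support instead.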

Multioutput Classification

A KNN model can handle this task as well.

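Performance Evaluation

Cross-Validation

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Cross-validated accuracy is generally not a good way to judge a classifier, especially on skewed datasets where one class is much more frequent than the others.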
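
One reason accuracy can mislead for classifiers is class imbalance. The sketch below uses a made-up dataset in which 10% of the instances are positive, and a "classifier" that never predicts the positive class; all names and numbers are hypothetical.

```python
# Hypothetical skewed dataset: 10% positives ("is a 5"), 90% negatives.
y_true = [True] * 100 + [False] * 900

# A "classifier" that never predicts the positive class.
y_pred = [False] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(t and p for t, p in zip(y_true, y_pred))

print(accuracy)          # high, although no positive is ever detected
print(positives_found)
```

The accuracy looks respectable even though the model is useless for finding the positive class, which is why the confusion matrix, precision and recall are more informative here.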

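Confusion Matrix

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)  # assumed: out-of-fold predictions
confusion_matrix(y_train_5, y_train_pred)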
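
To show how the matrix is laid out, here is a plain-Python sketch that counts the four cells by hand. The helper `binary_confusion_matrix` and the label lists are hypothetical; the layout (rows are actual classes, columns are predicted classes) follows the one `confusion_matrix` uses.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows = actual class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[int(actual)][int(predicted)] += 1
    return matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]   # hypothetical labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]   # hypothetical predictions

cm = binary_confusion_matrix(y_true, y_pred)
print(cm)
```

A perfect classifier would have nonzero values only on the main diagonal (true negatives and true positives).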

Precision and Recall

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
recall_score(y_train_5, y_train_pred)

# The harmonic mean of precision and recall: the F1 score
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
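
The three scores follow directly from the confusion-matrix counts. The sketch below works through the formulas with made-up counts (`tp`, `fp` and `fn` are hypothetical values, not results from the notes).

```python
# Hypothetical confusion-matrix counts for the positive class.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
```

Because the harmonic mean gives more weight to the lower value, the F1 score is only high when precision and recall are both high.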

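Decision Threshold

from sklearn.model_selection import cross_val_predict
# Get decision scores instead of predictions, so any threshold can be applied
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)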
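
The precision/recall trade-off behind the threshold can be seen on a handful of scores. This is a plain-Python sketch with made-up scores and labels; the helper `precision_recall_at` is hypothetical, and it treats a score greater than or equal to the threshold as a positive prediction.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = [s >= threshold for s in scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical true labels and decision scores.
y_true = [False, False, True, False, True, True]
scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.5]

low = precision_recall_at(-2.0, y_true, scores)    # permissive threshold
high = precision_recall_at(0.8, y_true, scores)    # strict threshold
print(low, high)
```

Raising the threshold here increases precision and lowers recall, which is the trade-off that `precision_recall_curve` traces out over all thresholds.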

ROC Curve

from sklearn.metrics import roc_curve

fpr, tpr, threshold = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1], [0,1], 'k--')
    plt.axis([0,1,0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    
plot_roc_curve(fpr, tpr)
plt.show()

ROC AUC

The area under the ROC curve.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
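
To show what "area under the ROC curve" means numerically, the sketch below builds the (FPR, TPR) points by sweeping a threshold over made-up scores and integrates them with the trapezoid rule. The helpers `roc_points` and `trapezoid_auc` are hypothetical and are not how Scikit-Learn implements the metric internally.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by sweeping the threshold over each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if not t and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 0, 1, 1]                  # hypothetical labels
scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.5]   # hypothetical decision scores

auc = trapezoid_auc(roc_points(y_true, scores))
print(auc)
```

A perfect classifier has a ROC AUC of 1, while a purely random one scores about 0.5, which corresponds to the dashed diagonal in the ROC plot.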
