Methods for Evaluating Scikit-Learn Models

追忆无义

Last modified 2022-08-13 00:06:44

Category: machine learning. Tags: python, machine learning, sklearn

First published 2022-08-04 22:33:26

Article link: https://blog.csdn.net/m0_66235114/article/details/126167685

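This article introduces three ways to evaluate Scikit-Learn models: the built-in `score()` method, the `scoring` parameter, and problem-specific metric functions. For classification, accuracy is used as the evaluation criterion; for regression, model performance is measured with the coefficient of determination (R²). It also discusses regression metrics such as the coefficient of determination, mean absolute error, and mean squared error, and classification metrics such as accuracy, AUC/ROC, and the confusion matrix. Finally, it shows how to create a confusion matrix with Scikit-Learn.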

Evaluating a machine learning model

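There are three ways to evaluate a Scikit-Learn model/estimator:

The estimator's built-in `score()` method
The `scoring` parameter
Problem-specific metric functions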
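
As a quick orientation, the three approaches listed above can be sketched side by side. This is only an illustrative sketch: it uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart-disease data used later), and it picks `cross_val_score` as one example of a tool that accepts the `scoring` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; the article uses heart-disease.csv)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# 1. The estimator's built-in score() method (accuracy for classifiers)
builtin_score = clf.score(X_test, y_test)

# 2. The scoring parameter, here passed to cross_val_score
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# 3. A problem-specific metric function from sklearn.metrics
metric_score = accuracy_score(y_test, clf.predict(X_test))

print(builtin_score, cv_scores.mean(), metric_score)
```

Approaches 1 and 3 score the same predictions on the same test set, so they should agree; the cross-validated mean from approach 2 is computed on different splits and will usually differ slightly.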

Evaluating a model with the `score` method

Let's use `score()` on a classification problem

import pandas as pd
import numpy as np

# 1. Load the data
heart_disease = pd.read_csv(r'\Users\Administrator\Desktop\data\heart-disease.csv')
heart_disease.head()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	2	1
2	41	0	1	130	204	0	0	172	0	1.4	2	2	1
3	56	1	1	120	236	0	1	178	0	0.8	2	2	1
4	57	0	0	120	354	0	1	163	1	0.6	2	2	1

heart_disease.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier()

clf.fit(X_train, y_train)

clf.score(X_test, y_test)

0.8524590163934426
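
The code above makes the split repeatable by seeding NumPy's global generator with `np.random.seed(42)`. A possible alternative, sketched here on synthetic stand-in data (an assumption, not the heart-disease data), is to pass `random_state` directly to `train_test_split`, which ties the seed to the call itself rather than to global state.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Two independent calls with the same random_state
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Index 3 of the returned list is y_test; index 1 is X_test
same_test_labels = bool((split_a[3] == split_b[3]).all())
test_size = len(split_a[1])

print(same_test_labels)  # True: both calls produce the same split
print(test_size)         # 20 of the 100 samples are held out
```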

y_preds = clf.predict(X_test)
y_preds

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

np.sum(y_test == y_preds) / len(y_test)

0.8524590163934426

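This shows that the built-in `score()` of the `RandomForestClassifier` classifier computes accuracy, that is, the proportion of correct predictions out of all predictions. The same value can also be obtained with a function from `sklearn.metrics`: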

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_preds)

0.8524590163934426
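
To make the definition of accuracy concrete, here is a small worked sketch in plain Python. The function `accuracy` and the labels are made up for illustration; this is not how sklearn implements it internally, only the same arithmetic (correct predictions divided by total predictions).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 3 correct out of 5 -> 0.6
```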

Let's use `score()` on a regression problem

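We'll use the California Housing dataset, which sklearn provides through `fetch_california_housing` (the data is downloaded on first use rather than bundled with the package).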
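
For a regressor, `score()` returns the coefficient of determination (R²) instead of accuracy. Before loading the data, the following plain-Python sketch shows that formula, R² = 1 - SS_res / SS_tot, on made-up numbers. The helper `r2` is hypothetical and assumes the true values are not all identical (otherwise SS_tot would be zero).

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]     # predicts every value exactly
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of y_true

print(r2(y_true, perfect))    # 1.0
print(r2(y_true, mean_only))  # 0.0
```

A perfect model scores 1.0, a model that always predicts the mean scores 0.0, and a model that does worse than the mean gets a negative score.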

# Fetch the California housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.