Model evaluation: common evaluation metrics for classification models
1) Basic metric: error rate
Definition: the proportion of misclassified samples among all samples
2) Basic metric: accuracy
Definition: the proportion of correctly classified samples among all samples
Interpretation: the closer the accuracy is to 1, the more accurate the model
3) Confusion matrix (binary classification)
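For reference, a minimal sketch with made-up labels (not the housing data used later) showing how sklearn's confusion_matrix lays out the four cells TN, FP, FN, TP, and how the error rate and accuracy above can be read off the matrix:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions
# sklearn convention: rows are true classes, columns are predicted classes,
# so for 0/1 labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                      # 4 1 1 4
print((tp + tn) / len(y_true))             # accuracy = 8/10 = 0.8
print((fp + fn) / len(y_true))             # error rate = 2/10 = 0.2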
4) Derived metric: precision
Definition: the proportion of true positives among all samples predicted as positive, TP/(TP+FP)
Example: in product recommendation, we care about how many of the items recommended to a user (predicted positive) the user actually likes (true positives)
5) Derived metric: recall
Definition: the proportion of true positives among all samples that are actually positive, TP/(TP+FN)
Example: in bank risk screening, we care about how many of all the risky customers our model manages to identify
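Continuing the toy example above, precision and recall can be computed either directly from the confusion-matrix cells or with sklearn's helpers (a sketch, same hypothetical labels):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions
tp, fp, fn = 4, 1, 1                      # counted from the matrix above
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision = 0.8
print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall = 0.8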
6) Other metrics: ROC curve and AUC
ROC curve: plot the true positive rate on the vertical axis against the false positive rate on the horizontal axis, sweeping over different cut-off points to draw the curve
AUC: the area enclosed by the ROC curve and the coordinate axes (the area under the curve)
Interpretation: the closer the AUC is to 1, the more accurate the model; an AUC of 0.5 means the model is no better than random guessing, and an AUC below 0.5 means it performs worse than random guessing
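A sketch of how the curve is built in sklearn (hypothetical scores): the ROC curve needs a continuous score such as a predicted probability rather than a hard 0/1 label, and each cut-off on that score contributes one (false positive rate, true positive rate) point:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                       # hypothetical labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.1, 0.6, 0.8, 0.2]   # hypothetical scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per cut-off
print(roc_auc_score(y_true, y_score))               # area under that curve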
Using sklearn to compute these metrics for a classification model
Since this is now a classification problem, the metrics are no longer the numeric regression metrics used earlier but classification metrics. The median of the average price is used to split house prices into high and low; everything else is handled the same way as in the previous regression model, including removing collinear features first.
1) Load and preprocess the data
import pandas as pd
import matplotlib.pyplot as plt
import os
os.chdir(r'C:\Users\86177\Desktop')
# Read the sample data
df = pd.read_excel('realestate_sample_preprocessed.xlsx')
# Based on the collinearity matrix, keep daytime population (the feature most correlated
# with price) and turn night population and night population aged 20-39 into a ratio
def age_percent(row):
    if row['nightpop'] == 0:
        return 0
    else:
        return row['night20-39']/row['nightpop']
df['per_a20_39'] = df.apply(age_percent,axis=1)
df = df.drop(columns=['nightpop','night20-39'])
# Create the label variable
price_median = df['average_price'].median()
print(price_median)
df['is_high'] = df['average_price'].map(lambda x: True if x>= price_median else False)
print(df['is_high'].value_counts())
# Basic checks on the dataset
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
–> Output:
30273.0
True 449
False 449
Name: is_high, dtype: int64
(898, 10)
id int64
complete_year int64
average_price float64
area float64
daypop float64
sub_kde float64
bus_kde float64
kind_kde float64
per_a20_39 float64
is_high bool
dtype: object
id 0
complete_year 0
average_price 0
area 0
daypop 0
sub_kde 0
bus_kde 0
kind_kde 0
per_a20_39 0
is_high 0
dtype: int64
2) Split the dataset
x = df[['complete_year','area', 'daypop', 'sub_kde',
'bus_kde', 'kind_kde','per_a20_39']]
y = df['is_high']
print(x.shape)
print(y.shape)
–> Output (the only difference from before is the y variable here):
(898, 7)
(898,)
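The split here only separates the features x from the label y, so the metrics in step 4 are computed on the training data itself. If a held-out evaluation set is wanted, a sketch using sklearn's train_test_split (not part of the original workflow) could look like this:

from sklearn.model_selection import train_test_split

# hold out 30% of the samples; stratify keeps the True/False classes balanced
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y)
print(x_train.shape, x_test.shape)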
3) Build the classification model
Use a pipeline to integrate data preprocessing, feature selection and the model
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
# Build the model workflow
pipe_clf = Pipeline([
('sc',StandardScaler()),
('power_trans',PowerTransformer()),
('polynom_trans',PolynomialFeatures(degree=3)),
('lgostic_clf', LogisticRegression(penalty='l1', fit_intercept=True, solver='liblinear'))
])
print(pipe_clf)
–> Output (the classifier is logistic regression with an L1 penalty, which makes it essentially the logistic-regression counterpart of a lasso model):
Pipeline(memory=None,
steps=[('sc',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('power_trans',
PowerTransformer(copy=True, method='yeo-johnson',
standardize=True)),
('polynom_trans',
PolynomialFeatures(degree=3, include_bias=True,
interaction_only=False, order='C')),
('lgostic_clf',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False)
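The import list above includes KFold even though this snippet never uses it; a sketch of how the pipeline could be scored under k-fold cross-validation (using cross_val_score, an addition that is not part of the original code):

from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe_clf, x, y, scoring='roc_auc', cv=cv)   # out-of-fold AUC per fold
print(scores.mean())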
4) Check the model's performance
import warnings
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
warnings.filterwarnings('ignore')
pipe_clf.fit(x,y)
y_predict = pipe_clf.predict(x)
print(f'Accuracy score is: {accuracy_score(y,y_predict)}')
print(f'Precision score is: {precision_score(y,y_predict)}')
print(f'Recall score is: {recall_score(y,y_predict)}')
print(f'AUC: {roc_auc_score(y,y_predict)}')
–> Output (all of these metrics are available from sklearn's metrics module):
Accuracy score is: 0.8741648106904232
Precision score is: 0.8783783783783784
Recall score is: 0.8685968819599109
AUC: 0.8741648106904232
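Note that roc_auc_score above is given the hard True/False predictions, which reduces the AUC to the average of the true positive and true negative rates; with the balanced classes here that coincides with the accuracy, which is why the two numbers match. A common alternative (sketched below, not part of the original code) is to score with the predicted probability of the positive class:

# probability of the positive class from the fitted pipeline
y_proba = pipe_clf.predict_proba(x)[:, 1]
print(f'AUC (from probabilities): {roc_auc_score(y, y_proba)}')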