Decision Thresholds in Classification: Precision vs. Recall Explained

Sany 何灿

Article link: https://blog.csdn.net/SanyHo/article/details/115387550

What is a decision threshold?

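sklearn does not let us set the decision threshold directly (`predict` takes no threshold argument), but it gives us access to the decision scores (the output of the decision function) that predictions are based on. We can pick a suitable score from that output and use it as the decision threshold: every decision score below the threshold is treated as the negative class (0), and every score above it as the positive class (1).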
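To make the rule concrete, here is a minimal sketch in plain Python. The helper name `apply_threshold` and the sample scores are illustrative assumptions, not part of sklearn. Scores exactly equal to the threshold are treated as positive here, which is one possible convention.

```python
def apply_threshold(scores, threshold):
    """Map raw decision scores to labels: 1 if score >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical decision scores, e.g. as returned by a decision_function call.
scores = [-1.2, 0.3, 0.6, 2.0]

default_labels = apply_threshold(scores, 0.0)  # cut at 0
strict_labels = apply_threshold(scores, 0.5)   # stricter cut at 0.5
print(default_labels, strict_labels)
```

Raising the threshold from 0.0 to 0.5 turns the borderline score 0.3 from a positive into a negative prediction; everything else stays the same.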

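By plotting precision and recall against a range of decision thresholds, we can choose the threshold that best suits the project: high precision without hurting recall much for a precision-oriented project, or high recall without hurting precision much for a recall-oriented one.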

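The main purpose is to obtain either a high-precision ML model or a high-recall ML model, depending on whether the project is precision-oriented or recall-oriented.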
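The trade-off can be seen on a handful of numbers. The sketch below uses plain Python and made-up labels and scores (both are assumptions for illustration); `precision_recall` is a hypothetical helper that counts true positives, false positives and false negatives by hand and returns 0.0 when a denominator is zero.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up ground truth and decision scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -1.0, -0.5, 0.2, 0.8, 1.5]

for threshold in (-0.5, 0.8):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p, r = precision_recall(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With the lower threshold every positive is found but one negative slips in (precision 0.75, recall 1.00); with the higher threshold every prediction is correct but one positive is missed (precision 1.00, recall 0.67).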

Code: Building a high-precision ML model

# Import required modules.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, recall_score, precision_score, accuracy_score

# Get the data.
data_set = datasets.load_breast_cancer()

# Get the data into an array form.
x = data_set.data	 # Input feature x.
y = data_set.target	 # Input target variable y.

# Get the names of the features.
feature_list = data_set.feature_names

# Convert the data into pandas data frame.
data_frame = pd.DataFrame(x, columns = feature_list)

# Add the target as an 'Outcome' column (appended as the last column).
# Plain assignment is safe to re-run, unlike insert(), which fails if the column already exists.
data_frame['Outcome'] = y

# Data Frame.
data_frame.head(7)

Code: Training the model

# Train Test Split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

# Create Classifier Object.
clf = SVC()
clf.fit(x_train, y_train)

# Use decision_function method.
decision_function = clf.decision_function(x_test)

Code: Getting the baseline predictions

# Actual obtained results without any manual setting of Decision Threshold.
predict_actual = clf.predict(x_test)	 # Predict using classifier.
accuracy_actual = clf.score(x_test, y_test)
classification_report_actual = classification_report(y_test, predict_actual)
print(predict_actual, accuracy_actual, classification_report_actual, sep ='\n')

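In the classification report above, the precision for class (1) is 0.92 and the recall for class (1) is 1.00. Since the goal of this article is to build a model that predicts (1) with the highest possible precision without hurting recall much, we need to manually pick a good decision threshold from the precision-recall curve below in order to raise the model's precision.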
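The baseline numbers above come from `clf.predict`, which for a binary SVC amounts to cutting the decision scores at 0. The sketch below checks this on the same dataset and split; it assumes a binary problem with labels 0 and 1 (as in `load_breast_cancer`) and repeats the setup so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data, split and classifier settings as in the article.
x, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
clf = SVC().fit(x_train, y_train)

# Positive scores map to clf.classes_[1], which is label 1 here.
scores = clf.decision_function(x_test)
manual_predict = (scores > 0).astype(int)

# Compare the manual cut at 0 with the classifier's own predictions.
same_as_predict = np.array_equal(manual_predict, clf.predict(x_test))
print(same_as_predict)
```

Any threshold above 0 can therefore only remove positive predictions relative to this baseline, which is why precision may go up while recall can only stay the same or go down.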

Code

# Plot Precision-Recall curve using sklearn.
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test, decision_function)

# Plot the output.
# precision and recall have one more entry than threshold, so drop the last point.
plt.plot(threshold, precision[:-1], c ='r', label ='PRECISION')
plt.plot(threshold, recall[:-1], c ='b', label ='RECALL')
plt.xlabel('Decision threshold')
plt.grid()
plt.legend()
plt.title('Precision-Recall Curve')
plt.show()

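The plot above shows that to get higher precision we need to increase the decision threshold (x-axis), but doing so lowers recall, which is undesirable. We therefore need a threshold that improves precision without reducing recall too much. One such value in the plot above is a decision threshold of about 0.6.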
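Reading the threshold off the plot works, but the same choice can also be made programmatically. The sketch below is one possible approach, not the article's method: the hypothetical helper `pick_threshold` keeps only thresholds whose precision meets a target and returns the one with the highest recall. The toy labels and scores are made up so that the result is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is >= min_precision, or None if there is none."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall carry one extra trailing point (1.0, 0.0)
    # that has no threshold, so drop it to align the three arrays.
    precision, recall = precision[:-1], recall[:-1]
    candidates = np.flatnonzero(precision >= min_precision)
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(recall[candidates])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])

# Made-up ground truth and decision scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -1.0, -0.5, 0.2, 0.8, 1.5]

result = pick_threshold(y_true, scores, min_precision=0.9)
print(result)
```

On the real data, `pick_threshold(y_test, decision_function, min_precision=...)` could replace the visual inspection; the target precision is a project decision and is not given in the article.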

# Implementing main logic.

# Based on analysis of the Precision-Recall curve.
# Let Decision Threshold value be around 0.6 to get high Precision without affecting Recall much.

# Decision Function output for x_test.
decision_scores = clf.decision_function(x_test)

# Set the value of decision threshold.
decision_threshold = 0.5914643767268305

# Desired prediction to increase precision value.
# For each decision score, predict (1) if the score is >= the
# decision threshold, else predict (0).
desired_predict = [1 if score >= decision_threshold else 0 for score in decision_scores]

Code: Comparing the old and new precision

# Comparison

# Old Precision Value
print("old precision value:", precision_score(y_test, predict_actual))
# New precision Value
print("new precision value:", precision_score(y_test, desired_predict))

old precision value: 0.922077922077922
new precision value: 0.9714285714285714

Precision increased from 0.92 to 0.97.
Recall has decreased as a result of the precision-recall trade-off.
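The comparison above prints only precision; the recall side of the trade-off can be checked the same way with `recall_score`. The sketch below repeats the setup so that it runs on its own and reuses the threshold chosen earlier. No output values are quoted because the original article does not report them.

```python
from sklearn import datasets
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data, split and classifier settings as in the article.
x, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
clf = SVC().fit(x_train, y_train)

# Baseline predictions and predictions with the manually chosen threshold.
decision_threshold = 0.5914643767268305
predict_actual = clf.predict(x_test)
desired_predict = (clf.decision_function(x_test) >= decision_threshold).astype(int)

old_recall = recall_score(y_test, predict_actual)
new_recall = recall_score(y_test, desired_predict)
print("old recall value:", old_recall)
print("new recall value:", new_recall)
```

Because the new threshold is above the default cut at 0, the new recall cannot exceed the old one; how much it drops is what decides whether the precision gain is worth it.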

*Translated from Decision Threshold In Machine Learning
