Hand-Rolling AUC, an Evaluation Metric for Binary Classification

EWilsen

Article link: https://blog.csdn.net/juanmengmu2595/article/details/79549476

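For binary classification, the commonly used evaluation metrics are precision and recall. The class of interest is usually treated as the positive class and everything else as the negative class. Each prediction the classifier makes on the test set is either correct or incorrect, and the counts of the four possible outcomes are denoted: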

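TP: the number of positive samples predicted as positive

FN: the number of positive samples predicted as negative

FP: the number of negative samples predicted as positive

TN: the number of negative samples predicted as negative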

From these counts:

Precision is defined as: P = TP / (TP + FP)

Recall is defined as: R = TP / (TP + FN)

The F1 score is defined as: F1 = 2 P R / (P + R)

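Precision, recall, and F1 all take values between 0 and 1. When precision and recall are both high, F1 is high as well. For all three, a value closer to 1 is better, not a value closer to 0.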

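An intuitive way to understand this:
Suppose there are 10 articles in total and 4 of them are the ones you are looking for. Your algorithm flags 5 articles as matches, but only 3 of those 5 are actually what you want. The precision of the algorithm is then 3/5 = 60%: of the 5 articles it returned, 3 are correct. Its recall is 3/4 = 75%: of the 4 relevant articles, it found 3.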
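
The same example can be worked through in code. This is a minimal sketch; the variable names `tp`, `fp`, `fn`, and `tn` are chosen here for illustration, and the counts are derived from the 10-article example above (5 flagged, 3 of them correct, 4 relevant in total).

```python
# Counts for the article example: 10 articles, 4 relevant, 5 flagged, 3 of those correct.
tp = 3  # relevant and flagged
fp = 2  # flagged but not relevant
fn = 1  # relevant but missed
tn = 4  # neither relevant nor flagged (not needed for P, R, or F1)

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("precision=%.2f recall=%.2f f1=%.3f" % (precision, recall, f1))
# prints: precision=0.60 recall=0.75 f1=0.667
```

Note that TN does not appear in any of the three formulas, which is why these metrics focus on the positive class.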

Computing AUC: an exact method and an approximate method

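# coding=utf-8
# The AUC value can be interpreted as follows: draw one positive sample and one
# negative sample at random; AUC is the probability that the positive sample's
# predicted value is larger than the negative sample's.
# Based on this definition, we can implement the AUC computation ourselves.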

import random
import time

def timeit(func):
    """
    Decorator that prints the execution time of the wrapped function.
    """
    def wrapper(*args, **kwargs):
        time_start = time.time()
        result = func(*args, **kwargs)
        time_end = time.time()
        exec_time = time_end - time_start
        print("{function} exec time: {time}s".format(function=func.__name__, time=exec_time))
        return result
    return wrapper

def gen_label_pred(n_sample):
    """
    Randomly generate labels and predicted values for n_sample samples.
    """
    labels = [random.randint(0,1) for _ in range(n_sample)]
    preds = [random.random() for _ in range(n_sample)]
    return labels,preds

@timeit
def naive_auc(labels,preds):
    """
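    The simplest, most direct method:
    sort by predicted value, count the (positive, negative) pairs in which the
    positive's prediction is greater than the negative's, then divide by the
    total number of (positive, negative) pairs.
    Complexity: O(N log N), where N is the number of samples.
    Tied predictions are not handled specially (no half credit).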
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total_pair = n_pos * n_neg

    labels_preds = zip(labels,preds)
    labels_preds = sorted(labels_preds,key=lambda x:x[1])
    accumulated_neg = 0
    satisfied_pair = 0
    for i in range(len(labels_preds)):
        if labels_preds[i][0] == 1:
            satisfied_pair += accumulated_neg
        else:
            accumulated_neg += 1

    return satisfied_pair / float(total_pair)

@timeit
def approximate_auc(labels,preds,n_bins=100):
    """
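    Approximate method: split the predicted values into n_bins buckets, build
    separate histograms for the positive and negative samples, then count the
    (positive, negative) pairs that satisfy the condition.
    Complexity: O(N).
    What are the drawbacks of this method? How should the bins be chosen?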
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total_pair = n_pos * n_neg
    
    pos_histogram = [0 for _ in range(n_bins)]
    neg_histogram = [0 for _ in range(n_bins)]
    bin_width = 1.0 / n_bins
    for i in range(len(labels)):
        # Clamp the index so that a prediction of exactly 1.0 stays in the last bin.
        nth_bin = min(int(preds[i]/bin_width), n_bins - 1)
        if labels[i]==1:
            pos_histogram[nth_bin] += 1
        else:
            neg_histogram[nth_bin] += 1
    
    accumulated_neg = 0
    satisfied_pair = 0
    for i in range(n_bins):
        satisfied_pair += (pos_histogram[i]*accumulated_neg + pos_histogram[i]*neg_histogram[i]*0.5)
        accumulated_neg += neg_histogram[i]
    
    return satisfied_pair / float(total_pair)
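

The docstring of `approximate_auc` asks what the drawbacks of bucketing are. One possible answer is sketched below under stated assumptions: `pairwise_auc` and `binned_auc` are hypothetical helper names introduced here, not part of the original listing. `pairwise_auc` applies the pairwise definition directly in O(P·N) time, giving ties half credit, and `binned_auc` re-implements the histogram idea compactly, assuming predictions lie in [0, 1]. The toy data is made up so that all scores fall inside one coarse bin.

```python
def pairwise_auc(labels, preds):
    """Exact AUC by checking every (positive, negative) pair; ties count 0.5."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def binned_auc(labels, preds, n_bins):
    """Histogram approximation of AUC, assuming every prediction is in [0, 1]."""
    pos_hist = [0] * n_bins
    neg_hist = [0] * n_bins
    for y, p in zip(labels, preds):
        b = min(int(p * n_bins), n_bins - 1)  # clamp so p == 1.0 stays in range
        if y == 1:
            pos_hist[b] += 1
        else:
            neg_hist[b] += 1
    seen_neg = 0
    wins = 0.0
    for b in range(n_bins):
        # Negatives in lower bins count fully; same-bin pairs count as ties.
        wins += pos_hist[b] * seen_neg + 0.5 * pos_hist[b] * neg_hist[b]
        seen_neg += neg_hist[b]
    return wins / (sum(pos_hist) * sum(neg_hist))


# Hypothetical toy data: every positive scores above every negative,
# but all scores sit in the narrow range [0.50, 0.56).
labels = [0, 0, 0, 1, 1, 1]
preds = [0.50, 0.51, 0.52, 0.53, 0.54, 0.55]

exact = pairwise_auc(labels, preds)            # perfect ranking
coarse = binned_auc(labels, preds, n_bins=10)  # all scores share one bin
fine = binned_auc(labels, preds, n_bins=1000)  # each score gets its own bin

print(exact, coarse, fine)
```

With 10 bins every sample lands in the same bucket, so all pairs are treated as ties and the estimate collapses to 0.5 even though the ranking is perfect. With 1000 bins the scores separate and the estimate matches the exact value. This suggests that equal-width bins work poorly when predictions are concentrated in a narrow range, and that the bin count (or a quantile-based binning, which is not shown here) has to be matched to the score distribution.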