Problem Description
As a student of the Scholomance Academy, you are studying a course called Machine Learning. You are currently working on your course project: training a binary classifier.
A binary classifier is an algorithm that predicts the classes of instances, which may be positive ($+$) or negative ($-$). A typical binary classifier consists of a scoring function $S$ that gives a score for every instance and a threshold $\theta$ that determines the category. Specifically, if the score of an instance satisfies $S(x) \geq \theta$, then the instance $x$ is classified as positive; otherwise, it is classified as negative. Clearly, choosing different thresholds may yield different classifiers.
Of course, a binary classifier may have misclassification: it could either classify a positive instance as negative (false negative) or classify a negative instance as positive (false positive).
Given a dataset and a classifier, we may define the true positive rate ($TPR$) and the false positive rate ($FPR$) as follows:
$$TPR = \frac{\#TP}{\#TP + \#FN}, \quad FPR = \frac{\#FP}{\#TN + \#FP}$$
where $\#TP$ is the number of true positives in the dataset; $\#FP$, $\#TN$, $\#FN$ are defined likewise.
Now you have trained a scoring function, and you want to evaluate the performance of your classifier. The classifier may exhibit different TPR and FPR if we change the threshold $\theta$. Let $TPR(\theta)$ and $FPR(\theta)$ be the TPR and FPR when the threshold is $\theta$, and define the area under curve ($AUC$) as
$$AUC = \int_{0}^{1} \max_{\theta \in \mathbb{R}} \{TPR(\theta) \mid FPR(\theta) \leq r\} \, dr$$
where the integrand, called the receiver operating characteristic (ROC), is the maximum possible $TPR$ given that $FPR \leq r$.
Given the actual classes and predicted scores of the instances in a dataset, can you compute the $AUC$ of your classifier?
For example, consider the third test data. If we set threshold $\theta = 30$, there are 3 true positives, 2 false positives, 2 true negatives, and 1 false negative; hence, $TPR(30) = 0.75$ and $FPR(30) = 0.5$. Also, as $\theta$ varies, we may plot the ROC curve and compute the AUC accordingly, as shown in Figure 1.
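To make the arithmetic in this example concrete, here is a minimal sketch that counts the four outcomes for a given threshold and returns the two rates. The names `Instance`, `ratesAt`, and `sample3` are made up for illustration and are not part of the statement.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One labeled instance: `positive` is the actual class, `score` is S(x).
// (Hypothetical helper type, not from the statement.)
struct Instance {
    bool positive;
    long long score;
};

// Returns {TPR, FPR} for the classifier "predict + iff score >= theta".
// Assumes the dataset contains at least one instance of each class.
std::pair<double, double> ratesAt(const std::vector<Instance>& data, long long theta) {
    long long tp = 0, fn = 0, fp = 0, tn = 0;
    for (const Instance& inst : data) {
        const bool predictedPositive = inst.score >= theta;
        if (inst.positive) {
            if (predictedPositive) ++tp; else ++fn;
        } else {
            if (predictedPositive) ++fp; else ++tn;
        }
    }
    assert(tp + fn > 0 && fp + tn > 0);
    return {static_cast<double>(tp) / static_cast<double>(tp + fn),
            static_cast<double>(fp) / static_cast<double>(fp + tn)};
}

// The third sample from the statement.
std::vector<Instance> sample3() {
    return {{true, 34}, {true, 33}, {true, 26}, {false, 34},
            {false, 38}, {true, 39}, {false, 7}, {false, 27}};
}
```

Calling `ratesAt(sample3(), 30)` counts 3 true positives, 1 false negative, 2 false positives, and 2 true negatives, which gives the pair (0.75, 0.5) quoted above.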
Input:
The first line contains a single integer $n$ $(2 \leq n \leq 10^6)$, the number of instances in the dataset. Then follow $n$ lines, each line containing a character $c \in \{+, -\}$ and an integer $s$ $(1 \leq s \leq 10^9)$, denoting the actual class and the predicted score of an instance. It is guaranteed that there is at least one instance of each class.
Output:
Print the $AUC$ of your classifier within an absolute error of no more than $10^{-9}$.
Example 1
Input
3
+ 2
- 3
- 1
Output
0.5
Example 2
Input
6
+ 7
- 2
- 5
+ 4
- 2
+ 6
Output
0.888888888888889
Example 3
Input
8
+ 34
+ 33
+ 26
- 34
- 38
+ 39
- 7
- 27
Output
0.5625
Note
Figure 1: ROC and AUC of the third sample data.
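Problem summary: The statement is extremely long and really tests one's patience... There is a classifier that, given a threshold θ, labels each instance as + or -: an instance whose score is at least θ is labeled +, and one whose score is below θ is labeled -. We are given the scores of n instances together with their true classes. Let FPR be the number of truly negative instances that the classifier labels + divided by the number of truly negative instances, and let TPR be the number of truly positive instances that the classifier labels + divided by the number of truly positive instances. Clearly FPR and TPR are both functions of θ. Letting θ range over all real numbers yields a set of (FPR(θ), TPR(θ)) pairs, that is, a set of points in the plane with FPR and TPR as axes. Define f(r) as the maximum TPR among the points whose FPR is at most r, and compute the integral of f over [0, 1].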
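The definition above can be checked directly on small inputs. The following is a brute-force sketch, quadratic in n and therefore only meant for verification, not for n up to 10^6. It assumes that it is enough to try each occurring score as θ, since any other θ either classifies exactly like the next score above it or (above the maximum score) gives TP = FP = 0. The function name `aucBruteForce` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Brute-force AUC straight from the definition.
// pos / neg hold the scores of the truly positive / truly negative instances.
double aucBruteForce(const std::vector<long long>& pos, const std::vector<long long>& neg) {
    assert(!pos.empty() && !neg.empty());
    const std::size_t P = pos.size(), N = neg.size();

    // Candidate thresholds: every score that occurs in the dataset.
    std::vector<long long> thetas(pos);
    thetas.insert(thetas.end(), neg.begin(), neg.end());

    // best[k] = maximum TP count over thresholds with FP count <= k.
    // It starts at 0, which covers the "reject everything" point (0, 0).
    std::vector<long long> best(N + 1, 0);
    for (long long theta : thetas) {
        const long long tp = std::count_if(pos.begin(), pos.end(),
                                           [theta](long long s) { return s >= theta; });
        const long long fp = std::count_if(neg.begin(), neg.end(),
                                           [theta](long long s) { return s >= theta; });
        best[fp] = std::max(best[fp], tp);
    }
    for (std::size_t k = 1; k <= N; ++k) best[k] = std::max(best[k], best[k - 1]);

    // For r in [k/N, (k+1)/N) the integrand equals best[k] / P,
    // so strip k contributes best[k] / (P * N) to the integral.
    long long area = 0;
    for (std::size_t k = 0; k < N; ++k) area += best[k];
    return static_cast<double>(area) / (static_cast<double>(P) * static_cast<double>(N));
}
```

On the third sample the strips have heights 1, 1, 3, 4 (in units of 1/4) and width 1/4 each, so the result is 9/16 = 0.5625, matching the expected output.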
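Analysis: Every point that matters is obtained by setting θ to one of the instance scores, so enumerating the scores gives all the points on the plot (the only other point, (0, 0) for θ above every score, never raises the maximum). By its definition f is a step function: it is piecewise constant, each piece is a horizontal segment, and f is non-decreasing. Computing the integral therefore reduces to summing rectangle areas: loop over the breakpoints with a for loop and accumulate.
The code is as follows: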
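As a cross-check on the rectangle sum, the same area can also be viewed as a pair count. This is a reformulation I derived from the definition, not something stated in the problem: each negative instance with score q owns a strip of width 1/N, and the height of that strip is the fraction of positives with score strictly greater than q, so the AUC is the number of (positive, negative) pairs with the positive score strictly larger, divided by P·N. Ties contribute nothing under this max-based definition. The names `countOrderedPairs` and `aucByPairs` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Counts pairs (p, q) with p a positive score, q a negative score, and p > q.
// `neg` is taken by value because it is sorted locally.
long long countOrderedPairs(const std::vector<long long>& pos, std::vector<long long> neg) {
    std::sort(neg.begin(), neg.end());
    long long pairs = 0;
    for (long long s : pos) {
        // lower_bound points at the first negative score >= s, so its offset
        // is the number of negatives strictly below s (ties are not counted).
        pairs += std::lower_bound(neg.begin(), neg.end(), s) - neg.begin();
    }
    return pairs;
}

// AUC under the assumption described in the lead-in: ordered pairs / (P * N).
double aucByPairs(const std::vector<long long>& pos, const std::vector<long long>& neg) {
    assert(!pos.empty() && !neg.empty());
    return static_cast<double>(countOrderedPairs(pos, neg)) /
           (static_cast<double>(pos.size()) * static_cast<double>(neg.size()));
}
```

On the three samples this gives 1/2, 8/9, and 9/16, which agree with the expected outputs. The sort plus one binary search per positive keeps the cost at O(n log n).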
#include <cstdio>
#include <algorithm>
#include <map>
using namespace std;
typedef long long ll;
const int N = 1e6 + 10;
ll p[N], ne[N];   // scores of the positive / negative instances
int cnt1, cnt2;   // number of positive / negative instances
map<ll, ll> mp;   // mp[x] = max TP count over thresholds with exactly x false positives
int main()
{
    int n;
    scanf("%d", &n);
    for (int i = 1; i <= n; i++)
    {
        char c;
        ll s;
        scanf(" %c%lld", &c, &s);
        if (c == '+')
            p[cnt1++] = s;
        else
            ne[cnt2++] = s;
    }
    sort(p, p + cnt1);
    sort(ne, ne + cnt2);
    // Try every score as the threshold. With theta = s, all instances with
    // score >= s are predicted +, giving x false positives and t true positives.
    for (int i = 0; i < cnt2; i++)
    {
        ll x = cnt2 - (lower_bound(ne, ne + cnt2, ne[i]) - ne);
        ll t = cnt1 - (lower_bound(p, p + cnt1, ne[i]) - p);
        mp[x] = max(mp[x], t);
    }
    for (int i = 0; i < cnt1; i++)
    {
        ll x = cnt2 - (lower_bound(ne, ne + cnt2, p[i]) - ne);
        ll t = cnt1 - (lower_bound(p, p + cnt1, p[i]) - p);
        mp[x] = max(mp[x], t);
    }
    // Sum the rectangles of the step function from left to right.
    // mp[0] defaults to 0 when no positive scores above every negative.
    ll xl = 0, y = mp[0], ans = 0;
    for (map<ll, ll>::iterator it = mp.begin(); it != mp.end(); it++)
    {
        ll xr = it->first;
        ans += (xr - xl) * y;   // width (xr - xl), height = best TP count so far
        y = it->second;
        xl = xr;
    }
    // ans is in units of 1 / (cnt1 * cnt2); the statement guarantees both counts are positive.
    printf("%.9Lf\n", (long double)ans / cnt1 / cnt2);
    return 0;
}