Over-optimizing: A story about Kaggle

I recently took a stab at a Kaggle competition.  The premise was simple: given some information about insurance quotes, predict whether or not the customer who requested the quote will follow through and buy the insurance.  A straightforward classification problem, with the data already clean and in one place, and a clear scoring metric (area under the ROC curve).

I took a starter script to do bare minimum formatting and trained a few big random forests on it with slightly different parameters, nothing too serious.  Total human time invested, probably less than 30 minutes, runtime was a few hours.

I came in 1113th place.

Out of 1762.

Not so hot.

You can check the results here.

But let’s dig into what these results mean.  The area under the ROC curve represents the overall quality of a binary classifier (assuming that it has roughly even class distributions).  The curve plots the false positive rate on the x-axis and the true positive rate on the y-axis, with one point per valid threshold.  If the class distribution is exactly 50/50 (an even number of true and false samples), then a totally random model would result in an ROC score of 0.5 (a straight diagonal line).  A perfect model would have a score of 1 (a single point at 100% TPR and 0% FPR).
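Those two endpoints are easy to verify directly with `roc_auc_score`, the same scikit-learn function the script below uses. This is just a sanity-check sketch, not part of the original experiment:

```python
# Sanity check of the two AUC endpoints: a perfect classifier scores
# exactly 1.0, while a classifier that ignores the truth scores ~0.5.
import random
from sklearn.metrics import roc_auc_score

random.seed(0)
truth = [random.randint(0, 1) for _ in range(10000)]

perfect_preds = [float(t) for t in truth]        # predictions match truth exactly
random_preds = [random.random() for _ in truth]  # predictions ignore truth entirely

perfect_auc = roc_auc_score(truth, perfect_preds)  # exactly 1.0
random_auc = roc_auc_score(truth, random_preds)    # close to 0.5
```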

In the Kaggle competition linked, my illustrious contribution had an ROC AUC of 0.96290, while the winner had 0.97024.  This particular competition had a prize of $20,000, and it’s not uncommon for teams to spend man-weeks or even months on a given contest.  So while my 30 minutes wasn’t nearly enough to win, how close was it in practical terms?

By simple percentage, it would have taken a 0.7% improvement in my score to win. But that’s not really all that informative.  Recall that a totally random model will achieve an ROC AUC score of 0.5, and let’s do an experiment.
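That 0.7% figure is just the relative gap between the two leaderboard scores, which can be checked in one line:

```python
# The raw gap between the winning score and mine, as a simple percentage.
my_score = 0.96290
winning_score = 0.97024

gap = winning_score - my_score          # 0.00734 absolute
pct_improvement = 100 * gap / my_score  # roughly 0.76% relative
```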

Take a random set of true values: [0, 1, 1, 0, …, 0, 1, 0]

Then take some random variable and use it to produce noisy predictions based on that truth variable: [0.03, 0.98, 0.97, 0.02, …, 0.10, 0.99, 0.01]

As this random variable increases in standard deviation, the predictions get noisier, the model gets worse and the ROC AUC drops.  More generically:

pred = abs(truth - random() * entropy)

where entropy ranges between 0 and 1. Any time entropy is less than 0.5, the model still perfectly classifies the problem, but between 0.5 and 1, it becomes increasingly inaccurate.  Using a simple Python script, we can plot this behavior:

import random
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
plt.style.use('ggplot')

__author__ = 'willmcginnis'


def create_preds(entropy=0.2, n_samples=10000):
    """Generate random binary truth labels and noisy predictions of them."""
    truth = [random.randint(0, 1) for _ in range(n_samples)]
    preds = [abs(t - (random.random() * entropy)) for t in truth]
    return truth, preds


def score(truth, preds):
    """Score the noisy predictions against the truth labels."""
    return roc_auc_score(truth, preds)


def score_n(n=10):
    """Sweep entropy from near 0 to near 1, scoring one prediction set each."""
    data = []
    for entropy in np.linspace(0.01, 0.99, n):
        t, p = create_preds(entropy)
        data.append([entropy, score(t, p)])
    return pd.DataFrame(data, columns=['entropy', 'ROC AUC'])


if __name__ == '__main__':
    df = score_n(1000)
    df.plot(kind='line', x='entropy', y='ROC AUC')
    plt.xlabel('Entropy of Predictions')
    plt.ylabel('ROC AUC Score')
    plt.title('ROC Scores of Random Models')
    plt.show()

Which gives the plot:

[Plot: “ROC Scores of Random Models” — ROC AUC score on the y-axis against entropy of predictions on the x-axis]

In a perfect simulation you would see the line slope smoothly from (0.5, 1) to (1, 0).  So the slope of this curve represents the differential change in ROC AUC score given random noise in the predictions.  Using this we can see that the difference between my ROC AUC score and the winner’s amounts to:


entropy = 0.00734 / (2 / 1) = 0.00367
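The same conversion can be written out numerically, using the idealized straight line from (0.5, 1) to (1, 0) described above (so a slope of 2 in ROC AUC per unit of entropy):

```python
# Converting the leaderboard gap into an equivalent amount of
# "prediction entropy", using the idealized slope of the curve above.
roc_gap = 0.97024 - 0.96290   # 0.00734: winner's AUC minus mine
slope = (1 - 0) / (1 - 0.5)   # 2: ROC AUC lost per unit of entropy
entropy_gap = roc_gap / slope  # 0.00367
```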

Translated from: https://www.pybloggers.com/2016/02/over-optimizing-a-story-about-kaggle/
