Other ML Jargons: Label Leakage

羊牮

Original article: https://towardsdatascience.com/other-ml-jargons-label-leakage-9e85b22c6fd0

Other ML Jargons

Have you ever been thrilled by perfect or near-perfect model performance, only to have that joy crushed by the one feature that sold you out?

In short, label leakage (also called target leakage, or simply leakage) happens when the information you want to predict is directly or indirectly present in your training dataset. It causes the model's generalization error to be underestimated: measured performance is inflated dramatically, yet the model is useless for any real-world application.

How Data Leakage Occurs

The simplest example would be training a model on the labels themselves. In practice, an indirect representation of the target variable is introduced inadvertently during data collection and preparation. Features that trigger the outcome, and features that are direct consequences of the target variable, get gathered during data mining, so they should be identified manually during exploratory data analysis.

The primary indicator of data leakage is a "too-good-to-be-true" model. Such a model will most likely perform poorly at prediction time, because the leaked information it leaned on during training is not available then.

Data leakage is not limited to training features that indirectly represent the label. It can also occur when information from the validation or test data ends up in the training data, or when information from the future ends up in historical records.
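One way to keep future information out of the training data is to split by time rather than at random. The following is a minimal sketch, assuming each record carries a date; the `chronological_split` helper, the `day` and `value` field names, and the cutoff date are all hypothetical and not taken from the original article.

```python
from datetime import date, timedelta


def chronological_split(records, cutoff):
    """Put records strictly before `cutoff` in train, the rest in test."""
    train = [r for r in records if r["day"] < cutoff]
    test = [r for r in records if r["day"] >= cutoff]
    return train, test


# Hypothetical daily records, one per day starting 2021-01-01.
start = date(2021, 1, 1)
records = [{"day": start + timedelta(days=i), "value": i} for i in range(10)]

train, test = chronological_split(records, cutoff=date(2021, 1, 8))

# Every training record predates every test record, so nothing
# from the "future" is visible while the model is being fitted.
print(len(train), len(test))
```

A random shuffle over the same records would mix later days into the training set, which is exactly the future-into-history leak described above.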

Examples of the Label Leakage Problem:

1. Predicting the probability that a person will open a bank account, using a feature that indicates whether the person has an associated bank account number.
2. In a churn prediction problem, a feature called "Interviewer" turns out to be the best indicator of whether the customer has churned, yet the models performed poorly. The reason is that this "interviewer" is a salesperson who is assigned only after the customer has confirmed that they intend to churn.
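A feature like the "Interviewer" one can often be spotted by checking how well each feature predicts the label on its own. The sketch below uses synthetic data and only the standard library; the feature names `interviewer_assigned` and `has_complaint`, the probabilities used to generate them, and the 0.99 threshold are illustrative assumptions, not values from the original article.

```python
import random


def single_feature_accuracy(values, labels):
    """Best accuracy reachable by predicting the label from one feature alone.

    For each distinct feature value, predict the majority label seen with it.
    """
    correct = 0
    for v in set(values):
        group = [y for x, y in zip(values, labels) if x == v]
        correct += max(group.count(0), group.count(1))
    return correct / len(labels)


random.seed(0)
n = 1000
churned = [1 if random.random() < 0.3 else 0 for _ in range(n)]

# Hypothetical leaky feature: an interviewer is assigned only after
# the customer has confirmed churn, so it mirrors the label exactly.
interviewer_assigned = list(churned)

# Hypothetical legitimate feature: only loosely related to churn.
has_complaint = [1 if random.random() < (0.5 if y else 0.2) else 0 for y in churned]

scores = {
    "interviewer_assigned": single_feature_accuracy(interviewer_assigned, churned),
    "has_complaint": single_feature_accuracy(has_complaint, churned),
}

# Assumed threshold: anything this close to perfect deserves a manual look.
suspicious = [name for name, acc in scores.items() if acc > 0.99]
print(scores)
print("suspicious features:", suspicious)
```

A flagged feature is not proof of leakage, only a prompt to ask when and how that column gets populated relative to the outcome.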

How to Combat Label Leakage

1. Remove the leaky features, or add noise to them to introduce randomness that can be smoothed out.
2. Use cross-validation, or make sure to use a validation set to test your model on unseen instances.

3. Use a pipeline instead of scaling or transforming the whole dataset. When a feature is scaled based on the whole dataset, for example with a min-max scaler, and the train/test split is applied only afterwards, the scaled training features carry information from the test set (and vice versa), because the minimum and maximum of the entire dataset were used. Pipelines are therefore advisable as a way to combat this kind of leakage.

4. Test your model on hold-out data and assess the performance. This is the most expensive approach in terms of infrastructure, time, and resources, since if leakage shows up at this stage the entire process has to be carried out again using the correct method.
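Points 2 to 4 above can be combined in a short sketch, assuming scikit-learn and NumPy are available. The synthetic data, the choice of `LogisticRegression`, and the split sizes are illustrative assumptions; the original article does not name a specific library or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split first, so the hold-out set is never seen by any fitted step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler lives inside the pipeline, so it is re-fitted on the
# training portion of each cross-validation fold only.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the training split, then a single hold-out evaluation.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

# The fitted scaler's minimum comes from the training split alone.
scaler = model.named_steps["minmaxscaler"]
print(cv_scores.mean(), test_score)
```

Because the scaler is a pipeline step, there is no point at which it could be fitted on the full dataset by accident, which is the practical benefit of the pipeline approach.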

Conclusion:

Data leakage is one of the most common mistakes. It can occur during feature engineering, when working with time series, during dataset labeling, and when validation set information is subtly passed to the training set. It is important that the machine learning model is exposed only to information that is available at the time of prediction. Therefore, it is advisable to pick features carefully, split data before applying transformations, avoid fitting transformations on the validation set, and use pipelines.
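The advice to split before transforming can be checked with a few lines of arithmetic. This is a minimal sketch using only the standard library; the numbers and the `min_max_scale` helper are made up for illustration.

```python
def min_max_scale(values, lo, hi):
    """Scale values to the range defined by lo and hi."""
    return [(v - lo) / (hi - lo) for v in values]


train = [2.0, 4.0, 6.0]
test = [10.0]  # hypothetical held-out value, larger than anything in training

# Leaky: the minimum and maximum are computed on train and test together.
all_values = train + test
leaky_train = min_max_scale(train, min(all_values), max(all_values))

# Clean: the statistics come from the training split only and are
# then reused, unchanged, to transform the test split.
clean_train = min_max_scale(train, min(train), max(train))
clean_test = min_max_scale(test, min(train), max(train))

print(leaky_train)  # training values squeezed by the test maximum
print(clean_train)
print(clean_test)   # may fall outside [0, 1], which is expected
```

In the leaky version the training features change as soon as the test value changes, which means the test set has already influenced what the model sees during training.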
