Dealing with Imbalanced Data in Machine Learning

As an ML engineer or data scientist, sometimes you inevitably find yourself in a situation where you have hundreds of records for one class label and thousands of records for another class label.

Upon training your model you obtain an accuracy above 90%. You then realize that the model is predicting everything as if it’s in the class with the majority of records. Excellent examples of this are fraud detection problems and churn prediction problems, where the majority of the records are in the negative class. What do you do in such a scenario? That will be the focus of this post.

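To see this failure mode concretely, here is a minimal sketch (using a hypothetical dataset with roughly 5% positives) showing that a baseline which always predicts the majority class already scores about 95% accuracy while never detecting a single positive:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: ~95% negatives, ~5% positives
X = np.random.randn(10_000, 5)
y = (np.random.rand(10_000) < 0.05).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# A 'model' that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # ~0.95 accuracy, yet recall is 0
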
Collect More Data

The most straightforward and obvious thing to do is to collect more data, especially data points on the minority class. This will obviously improve the performance of the model. However, this is not always possible. Apart from the cost one would have to incur, sometimes it's not feasible to collect more data. For example, in the case of churn prediction and fraud detection, you can’t just wait for more incidences to occur so that you can collect more data.

Consider Metrics Other than Accuracy

Accuracy is not a good way to measure the performance of a model where the class labels are imbalanced. In this case, it's prudent to consider other metrics such as precision, recall, Area Under the Curve (AUC) — just to mention a few.

Precision measures the ratio of true positives to all the samples that were predicted as positive (true positives plus false positives). For example, out of all the people our model predicted would churn, how many actually churned?

Precision = TP / (TP + FP)

Recall measures the ratio of true positives to the sum of the true positives and the false negatives. For example, out of all the people who actually churned, what percentage did our model predict would churn?

Recall = TP / (TP + FN)

The AUC is obtained from the Receiver Operating Characteristic (ROC) curve. The curve is obtained by plotting the true positive rate against the false positive rate. The false positive rate is obtained by dividing the false positives by the sum of the false positives and the true negatives.

An AUC closer to one is better, since it indicates that the model is good at distinguishing the positive class from the negative one.

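As a quick sketch of how these metrics can be computed with scikit-learn (assuming a fitted model with a predict_proba method and a held-out X_test, y_test):

from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_pred = model.predict(X_test)
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(precision_score(y_test, y_pred))   # TP / (TP + FP)
print(recall_score(y_test, y_pred))      # TP / (TP + FN)
print(roc_auc_score(y_test, y_scores))   # area under the ROC curve
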
Machine learning is rapidly moving closer to where data is collected — edge devices. Subscribe to the Fritz AI Newsletter to learn more about this transition and how it can help scale your business.

Emphasize the Minority Class

Another way to deal with imbalanced data is to have your model focus on the minority class. This can be done by computing the class weights. The model will focus on the class with a higher weight. Eventually, the model will be able to learn equally from both classes. The weights can be computed with the help of scikit-learn.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# In recent scikit-learn versions, classes and y are keyword-only arguments
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
# array([ 0.51722354, 15.01501502])

You can then pass these weights when training the model. For example, in the case of logistic regression:

from sklearn.linear_model import LogisticRegression

class_weights = {
    0: 0.51722354,
    1: 15.01501502,
}
lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight=class_weights)

Alternatively, you can pass class_weight='balanced' and the weights will be adjusted automatically.

lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight='balanced')

Here’s the ROC curve before the weights are adjusted.

[Figure: ROC curve before the class weights are adjusted]

And here’s the ROC curve after the weights have been adjusted. Note the AUC moved from 0.69 to 0.87.

[Figure: ROC curve after the class weights are adjusted]

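To reproduce this kind of comparison yourself, scikit-learn can plot the ROC curve of a fitted classifier directly. A minimal sketch, assuming a fitted estimator lr and a held-out X_test, y_test:

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Draws the ROC curve and reports the AUC in the legend
RocCurveDisplay.from_estimator(lr, X_test, y_test)
plt.show()
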
Try Different Algorithms

As you focus on the right metrics for imbalanced data, you can also try out different algorithms. Generally, tree-based algorithms perform better on imbalanced data. Furthermore, some algorithms such as LightGBM have hyperparameters that can be tuned to indicate that the data is not balanced.

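For instance, LightGBM accepts is_unbalance and scale_pos_weight for exactly this purpose. A minimal sketch, with an illustrative weight (in practice you would set it to roughly n_negative / n_positive for your data):

import lightgbm as lgb

# scale_pos_weight up-weights the positive (minority) class;
# is_unbalance=True would instead let LightGBM infer the weights.
# Use one or the other, not both.
clf = lgb.LGBMClassifier(n_estimators=200, scale_pos_weight=29.0)
clf.fit(X_train, y_train)
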
Generate Synthetic Data

You can also generate synthetic data to increase the number of records in the minority class, a technique usually known as oversampling. This should be done on the training set only, after the train/test split, so that synthetic samples don't leak into the test set. In Python, this can be done using the imblearn package. One of the strategies implemented in the package is the Synthetic Minority Over-sampling Technique (SMOTE). The technique is based on k-nearest neighbors.

When using SMOTE:

  • The first parameter (sampling_strategy in recent versions of imblearn) is a float that indicates the ratio of the number of samples in the minority class to the number of samples in the majority class once resampling has been done.

  • The number of neighbors to be used to generate the synthetic samples can be specified via the k_neighbors parameter.

import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority class up to 80% of the majority class
# (apply this to the training split only)
smote = SMOTE(sampling_strategy=0.8)
X_resampled, y_resampled = smote.fit_resample(X.values, y.values)

pd.Series(y_resampled).value_counts()
# 0    9667
# 1    7733
# dtype: int64

You can then fit your model to the resampled data.

# Fit on the resampled training data, evaluate on the untouched test set
model = LogisticRegression()
model.fit(X_resampled, y_resampled)
predictions = model.predict(X_test)

Undersample the Majority Class

You can also experiment with reducing the number of samples in the majority class. One such strategy is the NearMiss method. Just like in SMOTE, you can specify the sampling ratio, as well as the number of neighbors via n_neighbors.

from imblearn.under_sampling import NearMiss

# Undersample the majority class down to a 0.3 minority/majority ratio
undersample = NearMiss(sampling_strategy=0.3)
X_resampled, y_resampled = undersample.fit_resample(X.values, y.values)

pd.Series(y_resampled).value_counts()
# 0    1110
# 1     333
# dtype: int64

Final Thoughts

Other techniques that can be used include building an ensemble of weak learners to create a strong classifier. Metrics such as the precision-recall curve and the area under it (PR AUC) are also worth trying when the positive class is the most important one.

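As a final sketch (assuming the same fitted model and test split as above), the area under the precision-recall curve can be summarized with scikit-learn's average precision:

from sklearn.metrics import average_precision_score

y_scores = model.predict_proba(X_test)[:, 1]
print(average_precision_score(y_test, y_scores))  # PR AUC (average precision)
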
As always, you should experiment with different techniques and settle on the ones that give you the best results for your specific problems. Hopefully, this piece has given some insights on how to get started.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to exploring the emerging intersection of mobile app development and machine learning. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Fritz AI, the machine learning platform that helps developers teach devices to see, hear, sense, and think. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Fritz AI Newsletter), join us on Slack, and follow Fritz AI on Twitter for all the latest in mobile machine learning.

Translated from: https://heartbeat.fritz.ai/dealing-with-imbalanced-data-in-machine-learning-18e45fea7bb5
