4 Unique Approaches to Manage Imbalanced Classification Scenarios

Out of billions of financial transactions, only a few involve cheating and fraud. Out of millions of cars on the road, only a few break down in the middle of the highway while the rest drive fine. If we pay close attention to our daily activities, we can likewise identify only a few exceptional incidents. The same skew is present in many datasets, where one or a few classes cover the majority of the cases.

When we provide these imbalanced data points to machine learning algorithms, the majority class heavily influences the model at the cost of ignoring the minority class. Most business cases revolve around predicting minority class events, such as fraud or the detection of ailments. A machine learning model trained on imbalanced data predicts these rare events abysmally badly.

In this article, I will discuss four unique ways to handle an imbalanced dataset and improve the prediction accuracy of the minority class. We will also learn why relying only on classification metric scores like the F1 score or accuracy can be misleading about model prediction performance in rare event forecasting.

We will use the make_classification method in Scikit-Learn to generate the imbalanced dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt

We will learn four unique approaches to handle an imbalanced dataset of 5000 samples in which one class comprises 98% of the cases.

X, y = make_classification(n_samples=5000, weights=[0.02, 0.98],
                           random_state=0, n_clusters_per_class=1)
ycount = pd.DataFrame(y)
print(ycount[0].value_counts())

Figure: The proportion of the majority and minority classes in the sample dataset (output of the above code)

Out of 5000 sample records, we have 4871 records for class 1 and 129 records for class 0. Let us consider class 1 to represent normal transactions and class 0 fraudulent transactions.

The sample dataset is split into two parts: a training set and a test set. The training set is used to train the machine learning model, and the test set is used to check the model's predictions.

We will train the model with 80% of the sample dataset, and the remaining 20% of the records, which the model has not seen before, are reserved for the test set.

# stratify=y keeps the class proportions identical in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

To understand the effect of an imbalanced dataset even on a sophisticated algorithm like the Random Forest Classifier, let us first train a standard Random Forest Classifier directly on the imbalanced training set, without any weight parameters.

clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print("F1 Score is ", f1_score(y_test, clf.predict(X_test)))
print("Accuracy Score is ", accuracy_score(y_test, clf.predict(X_test)))

The F1 score and accuracy score of the trained Random Forest Classifier model on the test dataset are high. But judging the prediction performance of the model only on these metrics can be very misleading in the case of an imbalanced dataset.

Figure: F1 and accuracy score of the Random Forest Classifier trained on the imbalanced dataset (output of the above code)

Deploying such a model based only on these two metrics, without understanding where the classification model makes errors, can be quite costly.

A visual metric like the confusion matrix outshines other metrics in several ways. We get an instant view of the model's performance: the classification areas where the model excels and the areas that require fine-tuning. Based on the business use case, we can judge quickly from the false positive, false negative, true positive and true negative counts whether the model is ready for deployment.
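
Alongside the plot, a per-class text summary can expose the same problem. A minimal sketch, assuming the clf model trained above is still in scope: Scikit-Learn's classification_report prints precision, recall and F1 for each class separately.

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 reveal the minority class failure
# that the aggregate accuracy and F1 scores hide.
print(classification_report(y_test, clf.predict(X_test)))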

You can learn about the confusion matrix in depth in the article Accuracy Visualisation: Supervised Machine Learning Classification Algorithms.

fig = plot_confusion_matrix(clf, X_test, y_test)
plt.show()

As expected, the majority class has completely dominated the model, and the trained model has predicted every record in the test dataset as the majority class. Such misclassification is very detrimental in cases like rare fraud detection or uncommon malignant disease prediction.
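
A quick sanity check confirms the collapse. This is a sketch, assuming the clf model from above; a model dominated by the majority class emits only a single label:

import numpy as np

# If the model collapses to the majority class, the prediction array
# contains only one unique label.
print(np.unique(clf.predict(X_test)))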

Figure: Confusion matrix of test dataset predictions by the Random Forest Classifier trained on the imbalanced dataset (output of the above code)

Fortunately, the Random Forest Classifier has a parameter, class_weight, to specify the weight of each class in case of an imbalanced dataset.

In the sample dataset, class 1 is approximately 38 times more prevalent than class 0. Hence, we will set the class weights in that proportion so that the algorithm compensates during training.
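
Rather than hard-coding 38, the ratio can be derived from the training labels. A minimal sketch, reusing the pandas import from earlier:

# Count each class in the training labels and compute the imbalance ratio.
train_counts = pd.Series(y_train).value_counts()
print("class 1 to class 0 ratio:", round(train_counts[1] / train_counts[0]))

Alternatively, Scikit-Learn also accepts class_weight="balanced", which computes weights inversely proportional to the class frequencies automatically.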

weighted_clf = RandomForestClassifier(max_depth=2, random_state=0,
                                      class_weight={0: 38, 1: 1}).fit(X_train, y_train)
print("F1 Score for RandomForestClassifier with class_weight parameter is ",
      f1_score(y_test, weighted_clf.predict(X_test)))
print("Accuracy Score for RandomForestClassifier with class_weight parameter is ",
      accuracy_score(y_test, weighted_clf.predict(X_test)))

The F1 score and accuracy score for the Random Forest Classifier model with class weights compensated are also high, but we can ascertain the real performance by checking the confusion matrix.

Figure: F1 and accuracy score of the Random Forest Classifier trained with the class_weight parameter (output of the above code)

We can see that the majority class has not completely overtaken the weight-adjusted Random Forest Classifier model.

fig = plot_confusion_matrix(weighted_clf, X_test, y_test)
plt.show()

Out of the total 1000 records in the test dataset, it misclassified only 14 records. Also, it correctly classified 20 of the 26 minority class records in the test dataset.

Figure: Confusion matrix of test dataset predictions by the Random Forest Classifier trained with optimised class weights (output of the above code)

We have learned how to handle an imbalanced dataset with the class_weight parameter of the Random Forest Classifier and improve the prediction accuracy of the minority class.

Next, we will learn a different approach: managing the imbalanced input training dataset with the BalancedRandomForestClassifier from the imbalanced-learn library.

Install the imbalanced-learn library with pip:

pip install imbalanced-learn

In the code below, we train the BalancedRandomForestClassifier on the training dataset and then check the metric scores on the test dataset.

from imblearn.ensemble import BalancedRandomForestClassifier

brfc = BalancedRandomForestClassifier(n_estimators=500,
                                      random_state=0).fit(X_train, y_train)
print("F1 Score for Balanced Random Forest Classifier is ",
      f1_score(y_test, brfc.predict(X_test)))
print("Accuracy Score for Balanced Random Forest Classifier is ",
      accuracy_score(y_test, brfc.predict(X_test)))

Like the previous two examples, it also shows high F1 and accuracy scores.

Figure: F1 and accuracy score of the trained Balanced Random Forest Classifier (output of the above code)

We can see in the confusion matrix that the BalancedRandomForestClassifier handles the class weights internally quite well compared to the RandomForestClassifier without the class_weight parameter.

fig = plot_confusion_matrix(brfc, X_test, y_test)
plt.show()

Out of 1000 test records, it predicted the classification of 968 records correctly. It also performed slightly better than the Random Forest Classifier with class_weight, correctly classifying 21 of the 26 minority class records.

Figure: Confusion matrix of test dataset predictions by the Balanced Random Forest Classifier (output of the above code)

Next, we will use a completely different approach, oversampling, to manage the minority class in the training dataset. The basic idea is to randomly duplicate examples of the minority class to obtain a more balanced dataset.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print("Number of records for X_train is ", X_train.shape)
print("Number of records for X_resampled oversampling is ", X_resampled.shape)

Earlier, we divided the sample dataset of 5000 records into a training dataset of 4000 records and a test dataset of 1000 records.

Fitting the RandomOverSampler on the training dataset duplicates minority class records at random; the resampled, balanced training data has 7794 records.

Figure: Training dataset count with the oversampling strategy; minority class records generated at random to balance the training dataset (output of the above code)
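
We can verify that the two classes are now balanced. A short sketch, reusing the pandas import from earlier:

# After random oversampling, both classes should show the same count.
print(pd.Series(y_resampled).value_counts())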

Once the training dataset is artificially balanced, we can train a standard Random Forest Classifier without the class_weight parameter.

oclf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_resampled, y_resampled)

We see that a standard Random Forest Classifier trained on the artificially balanced, oversampled training dataset predicts quite well.

Oversampling helped the Random Forest Classifier overcome the influence of the majority class and predict the test record classes with high accuracy.
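
A sketch of the plotting step, following the same plot_confusion_matrix pattern used for the earlier models:

# Plot the confusion matrix for the oversampling-trained classifier.
fig = plot_confusion_matrix(oclf, X_test, y_test)
plt.show()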

Figure: Confusion matrix of test dataset predictions by the Random Forest Classifier trained on the oversampled dataset (output of the above code)

Out of 1000 test records, it predicted the classification of 985 records correctly. It also performed nearly on par with the BalancedRandomForestClassifier, classifying 20 of the 26 minority class records correctly.

Finally, we will learn about the undersampling strategy for handling an imbalanced dataset. It is a completely different approach from the oversampling we learnt earlier: the key idea is to randomly delete majority class records to obtain a more balanced dataset.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print("Number of records for X_train is ", X_train.shape)
print("Number of records for X_resampled undersampling is ", X_resampled.shape)

Majority class records are deleted at random to balance the training dataset, shrinking it from 4000 records to 206.

Figure: Training dataset count with the undersampling strategy; majority class records deleted at random to balance the training dataset (output of the above code)

Once the training dataset is balanced, we can use it directly to train the model, just like the oversampling strategy discussed earlier.

uclf = RandomForestClassifier(max_depth=2,
                              random_state=0).fit(X_resampled, y_resampled)

It seems the undersampling strategy can predict the rare minority class event with accuracy similar to the other strategies discussed in this article, but in comparison it performed quite badly at predicting the majority class: it predicted 82 majority class records from the test dataset incorrectly.

fig = plot_confusion_matrix(uclf, X_test, y_test)
plt.show()

Figure: Confusion matrix of test dataset predictions by the Random Forest Classifier trained on the undersampled dataset (output of the above code)

Key takeaways and my approach

Most machine learning classification algorithms expect a balanced training dataset. It is vital to check whether the training dataset is imbalanced and take appropriate pre-processing actions before training the machine learning model on the data.
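
A minimal sketch of such a check, using only the standard library, is to inspect the class distribution of the training labels before fitting any model:

from collections import Counter

# Print the count and share of each class in the training labels.
counts = Counter(y_train)
total = sum(counts.values())
for label, count in sorted(counts.items()):
    print(f"class {label}: {count} records ({count / total:.1%})")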

GIGO (Garbage In, Garbage Out): if we train a model with imbalanced data, there is a high probability that the model will fail to predict the minority classes in production.

Data is very valuable. I do not prefer the undersampling strategy, as it forces us to prune data from the majority class. We saw that, because of this, even though the model could predict the minority class records with nearly the same accuracy as the other strategies discussed in this article, it performed miserably at predicting the majority class records.

I prefer the Random Forest Classifier with the class_weight parameter and the BalancedRandomForestClassifier from the imbalanced-learn library.

I suggest you check the training sample performance with all the strategies discussed in this article before selecting any one of them for your project.
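
As a sketch of such a comparison, assuming the five classifiers trained earlier (clf, weighted_clf, brfc, oclf and uclf) are still in scope, a simple loop scores each one on the same held-out test set:

# Compare every strategy on the identical test split.
models = {"plain RandomForest": clf,
          "RandomForest + class_weight": weighted_clf,
          "BalancedRandomForest": brfc,
          "RandomForest + oversampling": oclf,
          "RandomForest + undersampling": uclf}
for name, model in models.items():
    pred = model.predict(X_test)
    print(name, "| F1:", round(f1_score(y_test, pred), 3),
          "| Accuracy:", round(accuracy_score(y_test, pred), 3))

The confusion matrices plotted throughout the article remain the deciding view; the aggregate scores alone can still hide a weak minority class.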

You can gain insight into an imbalanced dataset with exploratory data analysis. To know more, read the article 5 Advanced Visualisations for Exploratory Data Analysis (EDA).

Translated from: https://towardsdatascience.com/4-unique-approaches-to-manage-imbalance-classification-scenario-7c5b92637b9c
