Undersampling and Oversampling
Introduction
The imbalanced classification problem arises when the class distribution of our training data is severely skewed. The skew does not have to be extreme (its degree varies), but we treat imbalanced classification as a problem because it can degrade the performance of our Machine Learning algorithms.
One way the imbalance may affect a Machine Learning algorithm is that the algorithm ends up ignoring the minority class completely. This is an issue because the minority class is often the class we are most interested in. For instance, when building a classifier to separate fraudulent from non-fraudulent transactions, the data is likely to contain far more non-fraudulent transactions than fraudulent ones. Think about it: it would be very worrying if we had as many fraudulent transactions as non-fraudulent ones.
![Image for post](https://miro.medium.com/max/9999/1*emgamRvmZiswj9AYoycEFQ.png)
One approach to combat this challenge is random resampling. There are two main ways to perform random resampling, both of which have their pros and cons:
Oversampling: duplicating samples from the minority class.
Undersampling: deleting samples from the majority class.
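To make the two strategies concrete, here is a minimal sketch in plain Python. The function names `random_oversample` and `random_undersample`, the toy data, and the choice to balance every class to exactly the largest (or smallest) class size are illustrative assumptions, not something taken from a particular library.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement, so existing rows are duplicated
        extra = rng.choices(indices, k=target - count)
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        indices = [i for i, lab in enumerate(y) if lab == label]
        # Draw without replacement, so no kept row appears twice
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 8 non-fraud (0) and 2 fraud (1) transactions
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))
print(Counter(y_under))
```

On this toy data, oversampling ends with 8 samples per class (16 rows, some fraud rows repeated), while undersampling ends with 2 samples per class (4 rows, most non-fraud rows discarded). That difference is the core trade-off: oversampling keeps all the information but repeats minority rows, whereas undersampling throws majority rows away.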
In other words, both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data or likely to develop if a purely random sample were taken (Source: