The Imbalance Between Positive and Negative Examples in Machine Learning

This is a good question, and one that seems to get raised time and time again.

A colleague (Sven Crone from Lancaster University in the UK) and I published a paper on this issue last year in the International Journal of Forecasting: "Instance sampling in credit scoring: An empirical study of sample size and balancing." A summary of our findings can also be found in the book "Credit Scoring, Response Modeling and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior."

There are also some very good papers by G. Weiss from 2004/5 which are highly cited and referenced in our paper/book.

What we found was that for some methods of model construction, sample imbalance was not an issue at all, not even a tiny amount. For logistic regression in particular, there was absolutely no benefit to creating a balanced sample. What was far more important was using all the data you had available. For example, for a marketing campaign with 1,000 responses and 50,000 non-responses, you got better models by using all 51,000 cases than by sampling down the non-responses to 1,000 or by weighting up the 1,000 responses.
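As a side note (this is a standard property of logistic regression, not an argument taken from the paper): if the model is correctly specified, keeping every response and only a fraction of the non-responses multiplies the odds by a constant. That shifts the intercept and leaves the slopes unchanged in expectation, so balancing mostly discards data. The sketch below uses a hypothetical helper, `correct_downsampled_probability`, to map a score from a down-sampled model back to the full-population scale. The name and the `keep_rate` parameter are assumptions made for illustration.

```python
def correct_downsampled_probability(p_sampled: float, keep_rate: float) -> float:
    """Rescale a probability from a model trained with down-sampled negatives.

    keep_rate is the fraction of majority-class (non-response) cases that
    were kept in the training sample; all minority cases are assumed kept.
    """
    if not 0.0 < p_sampled < 1.0:
        raise ValueError("p_sampled must lie strictly between 0 and 1")
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must lie in (0, 1]")
    # Down-sampling negatives by keep_rate inflates the odds by 1 / keep_rate,
    # so multiplying by keep_rate undoes the inflation.
    odds_sampled = p_sampled / (1.0 - p_sampled)
    odds_full = odds_sampled * keep_rate
    return odds_full / (1.0 + odds_full)


# Marketing example from the text: 50,000 non-responses sampled down to 1,000.
keep_rate = 1_000 / 50_000
corrected = correct_downsampled_probability(0.5, keep_rate)
print(corrected)
```

A score of 0.5 from the balanced model maps back to the base response rate of 1,000 in 51,000, which is what one would expect for an "average" case.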

We also looked at neural networks, discriminant analysis and decision trees. Discriminant analysis was somewhat sensitive to class imbalance (balanced samples did better than imbalanced ones), but the method that was the most sensitive, by far, was the decision tree approach (CART 4.5). We saw differences in model performance of more than 10%, with the balanced sample performing much better than the imbalanced one.

We also considered the two different ways of creating a balanced sample. The first was “under-sampling”, where you “throw away” some of the larger class (non-responders in the above example) to create a sample with the same number of cases in each class. The second was “over-sampling”, where you weight up the minority class, so in the above example you would treat each response as if it appeared 50 times in your sample (50,000 / 1,000). Over-sampling was generally better than under-sampling, particularly when small samples were involved, which makes intuitive sense given that with under-sampling you are not making full use of the data available to you. Weiss comes to the same conclusion in his paper, if memory serves me right.
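The two balancing schemes can be sketched in a few lines of plain Python. This is a minimal illustration using the marketing numbers from the example above, not the procedure from the paper. The function names `undersample` and `oversample_weights` are hypothetical, and the tuples stand in for real cases.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly discard majority cases until both classes are the same size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))
    return kept + list(minority)


def oversample_weights(n_majority, n_minority):
    """Per-case weights that make the weighted class totals equal."""
    return {"majority": 1.0, "minority": n_majority / n_minority}


# Placeholder cases: 50,000 non-responses and 1,000 responses.
non_responses = [("non_response", i) for i in range(50_000)]
responses = [("response", i) for i in range(1_000)]

balanced = undersample(non_responses, responses)
weights = oversample_weights(len(non_responses), len(responses))

print(len(balanced))        # 2000: 49,000 non-responses are thrown away
print(weights["minority"])  # 50.0: every case is kept, responses count 50x
```

Under-sampling leaves only 2,000 of the 51,000 cases for training, whereas weighting keeps all of them, which is the intuition behind over-sampling doing better on small samples.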
