样本不均衡问题

医疗数据经常highly biased (比如很少一部分人得心脏病,大部分人不得心脏病) 。即样本在不同类别上的不均衡分布问题( class distribution imbalance problem)

采用什么策略处理数据不均衡问题?当数据不均衡时,采用什么指标来衡量模型的优劣?


[b]1. 当数据样本过少时[/b],Leave One Out Cross Validation or 10-fold Cross Validation

[b]2. 当数据样本很多时,Assuming you have a large data set[/b]
[i]假设样本集中25%正例,75%负例。[/i] 运行算法10次,每次都从负例中随机挑选,使得新样本集中正负例 1:1 ( run your algorithm 10 times, where I would select randomly from those not readmitted to make sure the total sample is equal (1:1).)
[i]在每一次运行中[/i] for each of the 10 runs


[list]
[*]case 1:If your algorithm has several competing models. use the validation set to find the best model, and then you test on your test set. divide the sample size into 50/25/25 where you have 50% training, 25% validation and 25% test data.
[*]case 2: If your algorithm does not have several competing models, then you just have a train and test set (no validation set), in this case divide it into 70/30.

[*]within each of the cases, case 1 and case 2 you can run 10-fold CV, or leave one out cross validation. But that is only necessary if you have a smaller data set.

[/list]

[*] average across the results of 10 runs.


[b]当数据不均衡时,采用什么指标来衡量模型的优劣?[/b]AUC:Area Under roc Curve,处于ROC curve下方的那部分面积的大小,较大的AUC代表了较好的performance.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值