The Class Imbalance Problem

zygzdf

Categories: machine learning, healthcare

Article link: https://blog.csdn.net/zygzdf/article/details/84711820

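Medical data is often highly biased: for example, only a small fraction of people develop heart disease, while most do not. This is the class distribution imbalance problem, where samples are unevenly distributed across classes.

What strategy should be used to handle imbalanced data? And when the data is imbalanced, which metric should be used to judge how good a model is?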

[b]1. When there are too few samples[/b], use leave-one-out cross-validation or 10-fold cross-validation.
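As a rough illustration of the two schemes, here is a minimal sketch that only generates the train/test index splits, using nothing beyond the Python standard library. The function names `k_fold_indices` and `leave_one_out_indices` and the sample count of 25 are assumptions made for this example, not part of the original post.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)

# Hypothetical small data set of 25 samples.
ten_fold = list(k_fold_indices(25, 10))
loo = list(leave_one_out_indices(25))
```

Each sample lands in exactly one test fold, so every observation is used for both training and testing, which is the reason these schemes suit small data sets.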

[b]2. When you have a large data set[/b]
[i]Suppose the sample set is 25% positive and 75% negative.[/i] Run your algorithm 10 times. Each time, randomly select from the negative examples (in the original example, the patients who were not readmitted) so that the new sample has positives and negatives in a 1:1 ratio.
[i]For each of the 10 runs:[/i]

[list]
[*]Case 1: If your algorithm has several competing models, use the validation set to find the best model, and then evaluate it on the test set. Split the sample 50/25/25: 50% training, 25% validation, and 25% test data.
[*]Case 2: If your algorithm does not have several competing models, you only need a training set and a test set (no validation set). In this case, split the sample 70/30.

[*]Within either case you can also run 10-fold CV or leave-one-out cross-validation, but that is only necessary if you have a smaller data set.

[/list]

Finally, average the results across the 10 runs.
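The repeated undersampling procedure above could be sketched as follows, using only the Python standard library. All names here (`balanced_undersample`, `split`, `repeated_balanced_runs`, `threshold_accuracy`) are hypothetical, the toy data is synthetic, and the threshold "model" is just a stand-in for whatever algorithm is actually being evaluated. The usage shows case 2 (70/30); passing `(0.5, 0.25, 0.25)` yields the three parts needed for case 1.

```python
import random

def balanced_undersample(positives, negatives, rng):
    """Keep every positive and draw an equal number of negatives at random."""
    sampled_negatives = rng.sample(negatives, len(positives))
    return [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]

def split(data, fractions, rng):
    """Shuffle and cut data into consecutive parts, e.g. (0.5, 0.25, 0.25)."""
    data = data[:]
    rng.shuffle(data)
    parts, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(round(frac * len(data)))
        parts.append(data[start:end])
        start = end
    parts.append(data[start:])  # the last part takes the remainder
    return parts

def repeated_balanced_runs(positives, negatives, evaluate, fractions,
                           n_runs=10, seed=0):
    """Average the score of evaluate(*parts) over n_runs balanced resamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        balanced = balanced_undersample(positives, negatives, rng)
        scores.append(evaluate(*split(balanced, fractions, rng)))
    return sum(scores) / len(scores)

def threshold_accuracy(train, test):
    """Stand-in model (an assumption): threshold halfway between class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)

# Synthetic one-feature data: 25 positives and 75 negatives (25% / 75%).
data_rng = random.Random(1)
positives = [data_rng.gauss(2.0, 1.0) for _ in range(25)]
negatives = [data_rng.gauss(0.0, 1.0) for _ in range(75)]

mean_score = repeated_balanced_runs(positives, negatives,
                                    threshold_accuracy, (0.7, 0.3))
```

Because each run draws a different random subset of negatives, averaging over the runs uses more of the majority class than a single 1:1 sample would.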

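[b]When the data is imbalanced, which metric should be used to judge how good a model is?[/b] AUC (Area Under the ROC Curve): the size of the area lying beneath the ROC curve. A larger AUC indicates better performance.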
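To make the metric concrete, here is a minimal sketch that computes AUC through its pairwise interpretation: the area under the ROC curve equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name `roc_auc` and the example labels and scores are made up for illustration, and the pairwise loop is quadratic, so it is only meant for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
auc_partial = roc_auc([1, 1, 0, 0], [0.8, 0.4, 0.5, 0.1])   # 3 of 4 pairs ranked correctly
auc_constant = roc_auc([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```

Here `auc_partial` is 0.75 and `auc_constant` is 0.5. The second case shows why AUC suits imbalanced data: a model that scores everyone the same would reach 75% accuracy on this sample by always predicting the negative class, yet its AUC stays at the chance level of 0.5.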