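Medical data is often highly imbalanced (for example, only a small fraction of people develop heart disease, while most do not). This is the class distribution imbalance problem: samples are spread unevenly across the classes.
What strategies can handle imbalanced data? And when the data is imbalanced, which metric should be used to judge a model?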
[b]1. When the data set is small[/b], use leave-one-out cross-validation or 10-fold cross-validation.
[b]2. When the data set is large[/b]
[i]Suppose the sample set is 25% positive and 75% negative.[/i] Run your algorithm 10 times. For each run, randomly select from the negatives (for example, the patients who were not readmitted) so that the new sample has positives and negatives at 1:1.
[i]For each of the 10 runs:[/i]
[list]
[*]Case 1: if you are comparing several competing models, split the sample 50/25/25 into training, validation, and test sets. Use the validation set to find the best model, then evaluate that model on the test set.
[*]Case 2: if you do not have several competing models, you only need a training set and a test set (no validation set). In this case split the sample 70/30.
[*]In either case you can also run 10-fold or leave-one-out cross-validation, but that is only necessary if you have a smaller data set.
[/list]
Finally, average the results across the 10 runs.
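The repeated undersampling procedure above can be sketched roughly as follows. This is only an illustration using the Python standard library: the function names ([i]balanced_indices[/i], [i]split[/i], [i]repeated_undersampling[/i]) and the [i]evaluate[/i] callback are hypothetical, labels are assumed to be 1 for positives and 0 for negatives, and the negatives are assumed to be the majority class.

```python
import random
from statistics import mean


def balanced_indices(labels, rng):
    """Keep every positive plus an equally sized random draw of negatives (1:1)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(neg) < len(pos):
        raise ValueError("this sketch assumes negatives are the majority class")
    chosen = pos + rng.sample(neg, len(pos))
    rng.shuffle(chosen)
    return chosen


def split(indices, fractions):
    """Cut an already shuffled index list into consecutive parts, e.g. (0.7, 0.3)."""
    parts, start = [], 0
    for k, frac in enumerate(fractions):
        # The last part takes whatever is left, so rounding never drops an index.
        if k == len(fractions) - 1:
            end = len(indices)
        else:
            end = start + round(frac * len(indices))
        parts.append(indices[start:end])
        start = end
    return parts


def repeated_undersampling(labels, evaluate, fractions=(0.7, 0.3), n_runs=10, seed=0):
    """Average evaluate(*parts) over n_runs independently balanced samples.

    evaluate is a hypothetical user-supplied callback: it receives one index
    list per split (train/test, or train/validation/test) and returns a score.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        parts = split(balanced_indices(labels, rng), fractions)
        scores.append(evaluate(*parts))
    return mean(scores)
```

For case 1, pass fractions=(0.5, 0.25, 0.25) and an evaluate callback that takes three index lists. For case 2, keep the default (0.7, 0.3). Each run draws a different random subset of negatives, so averaging over the runs makes use of more of the majority class than a single balanced sample would.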
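[b]Which metric should be used to judge a model when the data is imbalanced?[/b] AUC (area under the ROC curve): the size of the area that lies beneath the ROC curve. A larger AUC indicates better performance.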
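As a rough illustration of what AUC measures, it can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The function name [i]roc_auc[/i] below is hypothetical, and the quadratic loop is meant for small examples only, not large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for positive, 0 for negative.
    scores: model outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A model that gives every sample the same score gets an AUC of 0.5 under this definition, no matter how skewed the classes are. Plain accuracy, by contrast, would reward it for always predicting the majority class.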