weka up-sampling & down-sampling

zygzdf

Category: machine learning. Tags: sampling

Article link: https://blog.csdn.net/zygzdf/article/details/84725794

[b]up-sampling:[/b]

SMOTE algorithm: the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement.

[b]Weka supervised SMOTE filter[/b]
Two parameters:
[list]
[*]nearestNeighbors: how many nearest-neighbor instances (surrounding the instance currently being considered) are used to build an in-between synthetic instance. The default is 5.
[*]percentage: how many synthetic instances are created, as a percentage of the number of instances in the minority class. The default is 100: if the minority class has 25 samples, 25 new samples are synthesized from the nearest neighbors, and the minority class then has 50 samples.
[/list]
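
To make the two parameters concrete, here is a minimal plain-Java sketch of the idea, not Weka's implementation. The class name SmoteSketch and all method names are hypothetical, instances are assumed to be purely numeric feature vectors, and base instances are visited round-robin, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; all names are hypothetical and this is not Weka code.
class SmoteSketch {

    // Number of synthetic instances for a minority class of the given size,
    // e.g. size 25 with percentage 100 gives 25 new instances.
    static int syntheticCount(int minoritySize, double percentage) {
        return (int) Math.round(minoritySize * percentage / 100.0);
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Indices of the k nearest minority instances to minority[index], excluding itself.
    static int[] nearestNeighbors(double[][] minority, int index, int k) {
        List<Integer> others = new ArrayList<>();
        for (int i = 0; i < minority.length; i++) {
            if (i != index) others.add(i);
        }
        others.sort(Comparator.comparingDouble(i -> squaredDistance(minority[index], minority[i])));
        int n = Math.min(k, others.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = others.get(i);
        return result;
    }

    // A point on the segment between base and neighbor; gap is in [0, 1).
    static double[] interpolate(double[] base, double[] neighbor, double gap) {
        double[] out = new double[base.length];
        for (int i = 0; i < base.length; i++) {
            out[i] = base[i] + gap * (neighbor[i] - base[i]);
        }
        return out;
    }

    // Creates synthetic minority instances; base instances are taken round-robin.
    static List<double[]> synthesize(double[][] minority, int k, double percentage, long seed) {
        if (minority.length < 2 || k < 1) {
            throw new IllegalArgumentException("need at least 2 minority instances and k >= 1");
        }
        Random rnd = new Random(seed);
        int total = syntheticCount(minority.length, percentage);
        List<double[]> synthetic = new ArrayList<>();
        for (int s = 0; s < total; s++) {
            int baseIdx = s % minority.length;
            int[] nn = nearestNeighbors(minority, baseIdx, k);
            int pick = nn[rnd.nextInt(nn.length)];
            synthetic.add(interpolate(minority[baseIdx], minority[pick], rnd.nextDouble()));
        }
        return synthetic;
    }
}
```

Calling SmoteSketch.synthesize(minority, 5, 100, seed) on 25 minority vectors would return 25 synthetic vectors, matching the 25-to-50 example above.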

[b]down-sampling:[/b]
The majority class is under-sampled by randomly removing samples from the majority class population until the minority class becomes some specified percentage of the majority class.
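
As a rough illustration of this definition, the sketch below computes how many majority samples to keep so that the minority class is a given percentage of the majority class, then keeps a random subset of that size. It is plain Java, independent of Weka; the class RandomUnderSampler and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names are hypothetical and this is not Weka code.
class RandomUnderSampler {

    // Majority size to keep so that the minority class is `percent` percent
    // of the majority class. Never larger than the current majority size.
    static int majorityToKeep(int minoritySize, int majoritySize, double percent) {
        int target = (int) Math.floor(minoritySize * 100.0 / percent);
        return Math.min(majoritySize, target);
    }

    // Randomly removes majority samples until the target size is reached.
    static <T> List<T> underSample(List<T> majority, int minoritySize, double percent, long seed) {
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed));
        int keep = majorityToKeep(minoritySize, majority.size(), percent);
        return new ArrayList<>(shuffled.subList(0, keep));
    }
}
```

With 25 minority samples and 1000 majority samples, a target of 50 percent keeps 50 majority samples, and a target of 100 percent keeps 25.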

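[b]Weka supervised SpreadSubsample filter[/b]
maxCount: the maximum number of instances kept per class. It can be set to n, the number of minority class instances.
If maxCount < n: both the positive and the negative class are reduced to maxCount instances.
If maxCount = n: the minority class is unchanged and the majority class is reduced to n instances, so the two classes are balanced.
If maxCount > n: the minority class keeps all n instances, and the majority class is reduced to maxCount instances (when it has more than that).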


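		// Required imports:
		// import weka.core.Instances;
		// import weka.core.converters.ConverterUtils.DataSource;
		// import weka.filters.Filter;
		// import weka.filters.supervised.instance.SpreadSubsample;
		Instances train = DataSource.read(path);
		train.setClassIndex(train.numAttributes() - 1); // the class attribute is the last column
		SpreadSubsample sps = new SpreadSubsample();
		sps.setMaxCount(n); // n = number of minority class instances
		sps.setInputFormat(train); // call after all options have been set
		Instances ins = Filter.useFilter(train, sps); // useFilter is a static method of Filter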
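
The maxCount rule for SpreadSubsample can be checked with a few lines of plain Java. This sketch only models the per-class cap described in this post; the class MaxCountCap is a hypothetical name, and the filter's other options (such as distributionSpread) are deliberately ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; models the maxCount cap, not the full Weka filter.
class MaxCountCap {

    // Returns the class counts after capping every class at maxCount.
    static Map<String, Integer> cap(Map<String, Integer> classCounts, int maxCount) {
        Map<String, Integer> capped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            capped.put(e.getKey(), Math.min(e.getValue(), maxCount));
        }
        return capped;
    }
}
```

For a minority class of n = 25 and a majority class of 100, a maxCount of 10 leaves 10 instances in each class, while a maxCount of 40 leaves 25 and 40.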