I have a dataset with 12 columns + 1 target (binary) and about 4000 rows. I need to split it into train (70%), validation (20%) and test (10%) sets.
The dataset is quite imbalanced (95% class 0 to 5% class 1), so I need to keep the target ratio the same in each subset.
I am able to split the dataset somehow, but I have no idea how to keep the ratio.
I am working with a subset of the Wine Quality data.
Solution
If you have access to MATLAB's Statistics and Machine Learning Toolbox, you can use the cvpartition function.
From the MATLAB help for cvpartition:
c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training set and a test set with stratification, using the class information in group; that is, both training and test sets have roughly the same class proportions as in group.
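You can apply the function twice to get three partitions: first hold out 10% of the data as the test set, then hold out 2/9 of the remaining 90% as the validation set (2/9 of 90% is 20% of the original), which leaves 70% for training. Because each call is stratified, all three sets keep roughly the original class distribution.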
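A minimal sketch of the two-step split, assuming the features are in a matrix X and the binary target in a column vector y. The synthetic data and all variable names (idxTest, idxVal, idxTrain, and so on) are placeholders for illustration, not part of the original question:

```matlab
% Synthetic stand-in for the real data: 4000 rows, 12 features, 5% positives.
rng(0);                                   % fix the seed so the split is reproducible
X = randn(4000, 12);
y = [ones(200, 1); zeros(3800, 1)];

% Step 1: stratified hold-out of 10% of all rows as the test set.
c1 = cvpartition(y, 'HoldOut', 0.10);
idxTest = test(c1);                       % logical mask over all rows
restRows = find(training(c1));            % row numbers of the remaining 90%

% Step 2: stratified hold-out of 2/9 of the remainder (20% of the original).
c2 = cvpartition(y(restRows), 'HoldOut', 2/9);
idxVal = false(size(y));
idxVal(restRows(test(c2))) = true;
idxTrain = false(size(y));
idxTrain(restRows(training(c2))) = true;

% Slice the data with the three masks.
XTrain = X(idxTrain, :);  yTrain = y(idxTrain);
XVal   = X(idxVal, :);    yVal   = y(idxVal);
XTest  = X(idxTest, :);   yTest  = y(idxTest);

% Report the size and the class-1 share of each subset.
fprintf('train: %d rows, %.3f class 1\n', numel(yTrain), mean(yTrain));
fprintf('val:   %d rows, %.3f class 1\n', numel(yVal),   mean(yVal));
fprintf('test:  %d rows, %.3f class 1\n', numel(yTest),  mean(yTest));
```

The second call uses 2/9 rather than 0.2 because it only sees the 90% left after the test set is removed. The class shares should come out close to 5% in all three subsets, though not necessarily identical, since the class counts have to be rounded to whole rows.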