RUSBoost is especially effective at classifying imbalanced data, that is, data in which some class in the training set has many fewer members than another. RUS stands for Random Under Sampling: by default, the algorithm samples N observations from each class at every boosting iteration, where N is the number of members of the rarest class.
%http://cn.mathworks.com/help/stats/ensemble-methods.html#bsx62vu
% Classification with Imbalanced Data
% Step 1. Obtain the data.
% Step 2. Import the data and prepare it for classification.
% Step 3. Examine the response data.
% Step 4. Partition the data for quality assessment.
% Step 5. Create the ensemble.
% Step 6. Inspect the classification error.
% Step 7. Compact the ensemble.
% Step 1. Obtain the data.
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');
% Then, extract the data from the forestcover.gz file. The data is in the covtype.data file.
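% One way to script the extraction (a sketch: gunzip writes a file named
% 'forestcover' next to the archive, which you can then rename to the
% 'covtype.data' file name that the next step expects):
gunzip('forestcover.gz');
movefile('forestcover','covtype.data');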
% Step 2. Import the data and prepare it for classification.
% Import the data into your workspace. Extract the last data column into a variable named Y.
load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];
% Step 3. Examine the response data.
tabulate(Y)
% Value Count Percent
% 1 211840 36.46%
% 2 283301 48.76%
% 3 35754 6.15%
% 4 2747 0.47%
% 5 9493 1.63%
% 6 17367 2.99%
% 7 20510 3.53%
% There are hundreds of thousands of data points. Observations of class 4 make up less than 0.5% of the total. This severe imbalance indicates that RUSBoost is an appropriate algorithm.
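% As a quick numeric check of the imbalance, you can compute the ratio of
% the largest to the smallest class count (a sketch; the variable names
% here are illustrative):
tab = tabulate(Y);
imbalanceRatio = max(tab(:,2))/min(tab(:,2))
% Roughly 100:1 (class 2 vs. class 4).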
% Step 4. Partition the data for quality assessment.
% Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.
part = cvpartition(Y,'holdout',0.5);
istrain = training(part); % data for fitting
istest = test(part); % data for quality assessment
tabulate(Y(istrain))
% Value Count Percent
% 1 105920 36.46%
% 2 141651 48.76%
% 3 17877 6.15%
% 4 1374 0.47%
% 5 4746 1.63%
% 6 8683 2.99%
% 7 10255 3.53%
% Step 5. Create the ensemble.
% Use deep trees for higher ensemble accuracy. To do so, set the trees to have a minimum leaf size of 5. Also set LearnRate to 0.1 to achieve higher accuracy. The data set is large and, with deep trees, creating the ensemble is time consuming.
t = templateTree('MinLeafSize',5);
tic
rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,...
'LearnRate',0.1,'nprint',100);
toc
% Training RUSBoost...
% Grown weak learners: 100
% Grown weak learners: 200
% Grown weak learners: 300
% Grown weak learners: 400
% Grown weak learners: 500
% Grown weak learners: 600
% Grown weak learners: 700
% Grown weak learners: 800
% Grown weak learners: 900
% Grown weak learners: 1000
% Elapsed time is 918.258401 seconds.
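% If 1000 weak learners prove too few, you do not have to retrain from
% scratch: the resume method grows additional learners onto an existing
% ensemble, reusing its stored training options. For example (not run
% here, to keep the ensemble at 1000 trees):
% rusTree = resume(rusTree,500); % grow 500 more weak learners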
% Step 6. Inspect the classification error.
%
% Plot the classification error against the number of members in the ensemble.
figure;
tic
plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative'));
toc
grid on;
xlabel('Number of trees');
ylabel('Test classification error');
% Elapsed time is 775.646935 seconds.
% The ensemble achieves a classification error of under 24%
% using 150 or more trees. It achieves the lowest error for 400 or more trees.
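% To read the optimal ensemble size off programmatically rather than from
% the plot, capture the cumulative loss curve in a variable (a sketch;
% note that this recomputes the curve, which is slow on this data set):
cumLoss = loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative');
[minLoss,bestNumTrees] = min(cumLoss)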
% Examine the confusion matrix for each class as a percentage of the true class.
tic
Yfit = predict(rusTree,covtype(istest,:));
toc
tab = tabulate(Y(istest));
bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100
% All classes except class 2 have over 80% classification accuracy,
% and classes 3 through 7 have over 90% accuracy. But class 2 makes up
% close to half the data, so the overall accuracy is not that high.
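% The per-class accuracies quoted above are the diagonal of the normalized
% confusion matrix; to extract them directly (a sketch; confMat and
% perClassAccuracy are illustrative names):
confMat = bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100;
perClassAccuracy = diag(confMat)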
% Step 7. Compact the ensemble.
% The ensemble is large. Remove the stored training data by using the compact method.
cmpctRus = compact(rusTree);
sz(1) = whos('rusTree');
sz(2) = whos('cmpctRus');
[sz(1).bytes sz(2).bytes]
% The compacted ensemble is about half the size of the original.
% Remove half the trees from cmpctRus. This action is likely to
% have minimal effect on the predictive performance, based on
% the observation that 400 out of 1000 trees give nearly optimal accuracy.
cmpctRus = removeLearners(cmpctRus,500:1000);
sz(3) = whos('cmpctRus');
sz(3).bytes
% The reduced compact ensemble takes about a quarter
% the memory of the full ensemble. Its overall loss rate is under 24%:
L = loss(cmpctRus,covtype(istest,:),Y(istest))