Classification with Imbalanced Data

RUSBoost is especially effective at classifying imbalanced data, that is, data in which some class has many fewer training observations than another. RUS stands for Random Under Sampling: at each boosting iteration, the algorithm randomly undersamples the majority classes so each weak learner trains on a balanced subset.
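The undersampling idea is easy to sketch outside MATLAB. A minimal Python illustration (a hypothetical helper, not the toolbox implementation): every class is randomly downsampled to the size of the rarest class, which is roughly what RUSBoost does to the training set before fitting each weak learner.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    # Downsample every class to the size of the rarest class.
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    kept = []
    for cls in counts:
        idx = [i for i, label in enumerate(y) if label == cls]
        kept.extend(rng.sample(idx, n_min))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[i] for i in range(12)]
y = [1] * 9 + [2] * 3   # class 1 outnumbers class 2 three to one
Xs, ys = random_undersample(X, y)
print(Counter(ys))      # both classes are reduced to 3 members
```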

%http://cn.mathworks.com/help/stats/ensemble-methods.html#bsx62vu

% Classification with Imbalanced Data

% Step 1. Obtain the data.
% Step 2. Import the data and prepare it for classification.
% Step 3. Examine the response data.
% Step 4. Partition the data for quality assessment.
% Step 5. Create the ensemble.
% Step 6. Inspect the classification error.
% Step 7. Compact the ensemble.

% Step 1. Obtain the data.

urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');
% Then, extract the data from the forestcover.gz file. The data is in the covtype.data file.

% Step 2. Import the data and prepare it for classification.

% Import the data into your workspace. Extract the last data column into a variable named Y.

load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];

% Step 3. Examine the response data.

tabulate(Y)
%   Value    Count   Percent
%       1    211840     36.46%
%       2    283301     48.76%
%       3     35754      6.15%
%       4      2747      0.47%
%       5      9493      1.63%
%       6     17367      2.99%
%       7     20510      3.53%
% There are hundreds of thousands of data points. Class 4 accounts for less than 0.5% of the total. This severe imbalance indicates that RUSBoost is an appropriate algorithm.

% Step 4. Partition the data for quality assessment.

% Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.

part = cvpartition(Y,'holdout',0.5);
istrain = training(part); % data for fitting
istest = test(part); % data for quality assessment
tabulate(Y(istrain))
%   Value    Count   Percent
%       1    105920     36.46%
%       2    141651     48.76%
%       3     17877      6.15%
%       4      1374      0.47%
%       5      4746      1.63%
%       6      8683      2.99%
%       7     10255      3.53%
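The training-half percentages match those of the full data set because cvpartition, when given the class labels Y, creates a stratified partition. A rough Python sketch of stratified holdout (illustrative only; the helper name is made up, and this is not the toolbox code):

```python
import random
from collections import Counter, defaultdict

def stratified_holdout(y, holdout=0.5, seed=0):
    # Split indices so each class keeps the same proportion in both halves.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * (1 - holdout))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

y = [1] * 100 + [2] * 10            # 10:1 imbalance
tr, te = stratified_holdout(y)
print(Counter(y[i] for i in tr))    # Counter({1: 50, 2: 5})
print(Counter(y[i] for i in te))    # same class proportions
```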

% Step 5. Create the ensemble.

% Use deep trees for higher ensemble accuracy. To do so, set the trees to have a minimal leaf size of 5. Also set LearnRate to 0.1 to achieve higher accuracy. The data set is large and, with deep trees, creating the ensemble is time consuming.

t = templateTree('MinLeafSize',5);
tic
rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,...
    'LearnRate',0.1,'nprint',100);
toc
% Training RUSBoost...
% Grown weak learners: 100
% Grown weak learners: 200
% Grown weak learners: 300
% Grown weak learners: 400
% Grown weak learners: 500
% Grown weak learners: 600
% Grown weak learners: 700
% Grown weak learners: 800
% Grown weak learners: 900
% Grown weak learners: 1000
% Elapsed time is 918.258401 seconds.

% Step 6. Inspect the classification error.

% Plot the classification error against the number of members in the ensemble.

figure;
tic
plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative'));
toc
grid on;
xlabel('Number of trees');
ylabel('Test classification error');
% Elapsed time is 775.646935 seconds.

% The ensemble achieves a classification error of under 24% 
% using 150 or more trees. It achieves the lowest error for 400 or more trees.

% Examine the confusion matrix for each class as a percentage of the true class.

tic
Yfit = predict(rusTree,covtype(istest,:));
toc
tab = tabulate(Y(istest));
bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100

% All classes except class 2 have over 80% classification accuracy,
% and classes 3 through 7 have over 90% accuracy. But class 2 makes up
% close to half the data, so the overall accuracy is not that high.
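The bsxfun call above divides each row of the raw confusion matrix by the true-class counts from tabulate, so entry (i,j) becomes the percentage of class-i points that were predicted as class j. The same normalization in plain Python, with toy numbers rather than the covertype results:

```python
def confusion_as_percent(cm):
    # Divide each row by its sum so rows read as percentages of the true class.
    return [[100.0 * v / sum(row) for v in row] for row in cm]

# Toy 2-class matrix: 90 of 100 class-1 points correct, 8 of 10 class-2.
cm = [[90, 10],
      [2, 8]]
print(confusion_as_percent(cm))     # [[90.0, 10.0], [20.0, 80.0]]
```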

% Step 7. Compact the ensemble.

% The ensemble is large. Remove the stored training data from it by using the compact method.

cmpctRus = compact(rusTree);

sz(1) = whos('rusTree');
sz(2) = whos('cmpctRus');
[sz(1).bytes sz(2).bytes]

% The compacted ensemble is about half the size of the original.
% Remove half the trees from cmpctRus. This action is likely to
% have minimal effect on the predictive performance, based on
% the observation that 400 out of 1000 trees give nearly optimal accuracy.

cmpctRus = removeLearners(cmpctRus,501:1000);

sz(3) = whos('cmpctRus');
sz(3).bytes

% The reduced compact ensemble takes about a quarter of the memory
% of the full ensemble. Its overall loss rate on the test data is under 24%:

L = loss(cmpctRus,covtype(istest,:),Y(istest))
