RUSBoost is especially effective at classifying imbalanced data, that is, data in which some class in the training set has many fewer members than another. RUS stands for Random Under Sampling: by default, the algorithm samples N observations from each class at every boosting iteration, where N is the number of members of the rarest class.
%http://cn.mathworks.com/help/stats/ensemble-methods.html#bsx62vu
% Classification with Imbalanced Data
% Step 1. Obtain the data.
% Step 2. Import the data and prepare it for classification.
% Step 3. Examine the response data.
% Step 4. Partition the data for quality assessment.
% Step 5. Create the ensemble.
% Step 6. Inspect the classification error.
% Step 7. Compact the ensemble.
% Step 1. Obtain the data.
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');
% Then, extract the data from the forestcover.gz file. The data is in the covtype.data file.
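% One way to script the extraction (a sketch: gunzip writes a file named
% 'forestcover' next to the archive, which you can then rename to the
% 'covtype.data' file name that the next step expects):
gunzip('forestcover.gz');
movefile('forestcover','covtype.data');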
% Step 2. Import the data and prepare it for classification.
% Import the data into your workspace. Extract the last data column into a variable named Y.
load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];
% Step 3. Examine the response data.
tabulate(Y)
% Value Count Percent
% 1 211840 36.46%
% 2 283301 48.76%
% 3 35754 6.15%
% 4 2747 0.47%
% 5 9493 1.63%
% 6 17367 2.99%
% 7 20510 3.53%
% There are hundreds of thousands of data points. Observations of class 4 make up less than 0.5% of the total. This severe imbalance indicates that RUSBoost is an appropriate algorithm.
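% As a quick numeric check of the imbalance, you can compute the ratio of
% the largest to the smallest class count (a sketch; the variable names
% here are illustrative):
tab = tabulate(Y);
imbalanceRatio = max(tab(:,2))/min(tab(:,2))
% Roughly 100:1 (class 2 vs. class 4).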
% Step 4. Partition the data for quality assessment.
% Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.
part = cvpartition(Y,'holdout',0.5);
istrain = training(part); % data for fitting
istest = test(part); % data for quality assessment
tabulate(Y(istrain))
% Value Count Percent
% 1 105920 36.46%
% 2 141651 48.76%
% 3 17877 6.15%
% 4 1374 0.47%
% 5 4746 1.63%
% 6 8683 2.99%
% 7 10255 3.53%
% Step 5. Create the ensemble.
% Use deep trees for higher ensemble accuracy. To do so, set the trees to have a minimum leaf size of 5. Also set LearnRate to 0.1 to achieve higher accuracy. The data set is large and, with deep trees, creating the ensemble is time consuming.
t = templateTree('MinLeafSize',5);
tic
rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,...
'LearnRate',0.1,'nprint',100);
toc
% Training RUSBoost...
% Grown weak learners: 100
% Grown weak learners: 200
% Grown weak learners: 300
% Grown weak learners: 400
% Grown weak learners: 500
% Grown weak learners: 600
% Grown weak learners: 700
% Grown weak learners: 800
% Grown weak learners: 900
% Grown weak learners: 1000
% Elapsed time is 918.258401 seconds.
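% If 1000 weak learners prove too few, you do not have to retrain from
% scratch: the resume method grows additional learners onto an existing
% ensemble, reusing its stored training options. For example (not run
% here, to keep the ensemble at 1000 trees):
% rusTree = resume(rusTree,500); % grow 500 more weak learners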
% Step 6. Inspect the classification error.
%
% Plot the classification error against the number of members in the ensemble.
figure;
tic
plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative'));
toc
grid on;
xlabel('Number of trees');
ylabel('Test classification error');
% Elapsed time is 775.646935 seconds.
% The ensemble achieves a classification error of under 24%
% using 150 or more trees. It achieves the lowest error for 400 or more trees.
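% To read the optimal ensemble size off programmatically rather than from
% the plot, capture the cumulative loss curve in a variable (a sketch;
% note that this recomputes the curve, which is slow on this data set):
cumLoss = loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative');
[minLoss,bestNumTrees] = min(cumLoss)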
% Examine the confusion matrix for each class as a percentage of the true class.
tic
Yfit = predict(rusTree,covtype(istest,:));
toc
tab = tabulate(Y(istest));
bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100
% All classes except class 2 have over 80% classification accuracy,
% and classes 3 through 7 have over 90% accuracy. But class 2 makes up
% close to half the data, so the overall accuracy is not that high.
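% The per-class accuracies quoted above are the diagonal of the normalized
% confusion matrix; to extract them directly (a sketch; confMat and
% perClassAccuracy are illustrative names):
confMat = bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100;
perClassAccuracy = diag(confMat)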
% Step 7. Compact the ensemble.
% The ensemble is large. Remove the stored training data by using the compact method.
cmpctRus = compact(rusTree);
sz(1) = whos('rusTree');
sz(2) = whos('cmpctRus');
[sz(1).bytes sz(2).bytes]
% The compacted ensemble is about half the size of the original.
% Remove half the trees from cmpctRus. This action is likely to
% have minimal effect on the predictive performance, based on
% the observation that 400 out of 1000 trees give nearly optimal accuracy.
cmpctRus = removeLearners(cmpctRus,500:1000);
sz(3) = whos('cmpctRus');
sz(3).bytes
% The reduced compact ensemble takes about a quarter
% the memory of the full ensemble. Its overall loss rate is under 24%:
L = loss(cmpctRus,covtype(istest,:),Y(istest))