The original paper: "Pattern Classification with Corrupted Labeling via Robust Broad Learning System"
The idea behind GRBLS:
The aforementioned BLS models are based on the mean square error (MSE) criterion to fit the approximation errors [23]. In fact, MSE aims to measure the sum of quadratic loss of data, and the approximation results would skew to the data with large errors. ... The purpose of the current paper is to alleviate the negative impact of data with corrupted labels on BLS. By rewriting the objective function of BLS from the matrix form to an error vector form, we conduct a maximum likelihood estimation (MLE) on the approximation errors. Then an MLE-like estimator can be obtained to model the residuals in BLS. An interesting point is that if the probability density function of errors is predefined as the Gaussian distribution, the MLE-like estimator can degenerate to the MSE criterion. Obviously, the presence of label outliers in the data causes the error distribution to depart from Gaussianity, which is the probabilistic interpretation of lack of robustness in standard BLS. ...
The paper aims to solve the negative impact of data with corrupted labels on BLS. The residual assumption in basic BLS can be viewed as Gaussian, but a Gaussian distribution obviously cannot fit every dataset. In "Regularized robust Broad Learning System for uncertain data modeling", the residuals were instead assumed to follow a Laplacian distribution, which also achieved good results under some experimental conditions; but like basic BLS its range of applicability is rather narrow, and ENRBLS also felt odd to me. This paper introduces MLE, and a BLS equipped with this estimator can degenerate back to the original BLS (under the Gaussian assumption), so I think GRBLS is theoretically better. The G part of GRBLS uses basic manifold learning. Manifold-based optimization was already used in "Discriminative graph regularized broad learning system for image recognition", which also demonstrated the effectiveness of manifold learning.
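The degeneration claim is easy to verify on paper. A sketch in my own notation (e_i is the i-th approximation residual, ρ the loss induced by the assumed density; none of this is copied verbatim from the paper):

```latex
% MLE on i.i.d. residuals: maximizing the likelihood of e is
% minimizing the summed negative log-density
\max_W \prod_i p(e_i)
\;\Longleftrightarrow\;
\min_W \sum_i \rho(e_i), \qquad \rho(e_i) = -\ln p(e_i).

% With a zero-mean Gaussian density
% p(e) = \tfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\bigl(-e^2/2\sigma^2\bigr),
% the loss becomes
\rho(e_i) = \frac{e_i^2}{2\sigma^2} + \ln\bigl(\sqrt{2\pi}\,\sigma\bigr),

% so, up to constants, the MLE-like estimator reduces to
\min_W \sum_i e_i^2 \quad\text{(the MSE criterion of standard BLS).}
```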
BLS:
(omitted)
Manifold learning:
Manifold learning aims to preserve the intrinsic structure of the data while performing dimensionality reduction and similar operations. An important component of manifold-related operators is the adjacency graph, which reflects the neighborhood relations between data points.
H is the mapping result, Tr(·) is the matrix trace, and L is the graph Laplacian, L = D − W, where D is the diagonal matrix (the "diagonal entries" part) with D_ii = Σ_j W_ij. Note that "Discriminative graph regularized broad learning system for image recognition" uses the normalized graph Laplacian instead.
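As a concrete sketch of the adjacency graph and Laplacian, here is a minimal NumPy version I wrote for illustration; the repo's constraintW.m is more general. This assumes the binary k-NN weight mode used in the main script:

```python
import numpy as np

def knn_binary_adjacency(X, k):
    """Binary k-NN adjacency graph: W[i, j] = 1 if j is among the k nearest
    neighbors of i (or vice versa, after symmetrization), else 0."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(dist, np.inf)        # exclude self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist[i])[:k]      # indices of the k nearest neighbors
        W[i, nn] = 1
    return np.maximum(W, W.T)             # symmetrize (OR of both directions)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, D diagonal with row sums of W."""
    return np.diag(W.sum(axis=1)) - W

# Usage: the manifold term Tr(H' L H) is small when neighboring points
# stay close after the mapping.
X = np.vstack([np.random.randn(10, 3), np.random.randn(10, 3) + 5])
L = graph_laplacian(knn_binary_adjacency(X, k=3))
H = X @ np.random.randn(3, 2)             # a stand-in "mapping result"
smoothness = np.trace(H.T @ L @ H)
```

A useful sanity check is the identity Tr(HᵀLH) = ½ Σ_ij W_ij ‖h_i − h_j‖², which is exactly why this trace penalizes mappings that tear neighbors apart.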
The Proposed Method section:
It defines the error vector e, whose entries e_i = y_i − a_iW are the per-sample approximation residuals, and a probability density function p(e_i) for the i.i.d. residuals. The equivalent likelihood function is the product over samples, L(e) = ∏_i p(e_i); taking the negative logarithm turns it into a sum of losses, where ρ(e_i) = −ln p(e_i).
The objective thus changes from the original MSE problem to: min_W Σ_i ρ(e_i) + λ2 ‖W‖².
Solving this problem relies on a few basic assumptions about the loss ρ:
① Symmetry: ρ(e) = ρ(−e);
② Monotonicity: for |e_i| ≥ |e_j|, ρ(e_i) ≥ ρ(e_j).
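To make the two assumptions concrete, here is a quick numerical check on one candidate robust loss. The Welsch loss is my own illustrative pick, not necessarily the density the paper derives ρ from:

```python
import numpy as np

# Welsch loss as one concrete rho satisfying both assumptions
# (illustrative choice; the paper's rho comes from its assumed density).
def rho_welsch(e, sigma=1.0):
    return (sigma**2 / 2.0) * (1.0 - np.exp(-(e / sigma) ** 2))

e = np.linspace(-5.0, 5.0, 1001)

# 1) Symmetry: rho(e) == rho(-e)
assert np.allclose(rho_welsch(e), rho_welsch(-e))

# 2) Monotonicity: rho is non-decreasing in |e|
half = rho_welsch(np.linspace(0.0, 5.0, 500))
assert np.all(np.diff(half) >= 0)

# Bonus: unlike the squared loss, rho saturates, so a grossly corrupted
# label contributes at most sigma^2/2 to the objective.
assert rho_welsch(np.array([1e3]))[0] <= 0.5
```

The saturation property in the last line is what distinguishes a robust ρ from the MSE criterion, whose quadratic loss lets a single corrupted label dominate.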
Solving the problem:
Taylor-expand the objective f(e) = Σ_i ρ(e_i) above to first order in the error vector (estimating the remainder directly with the final, second-order term), which gives approximately f(e) ≈ f(0) + ∇f(0)ᵀe + ½ eᵀDe.
D denotes the Hessian matrix; the original paper says:
As the error residuals are i.i.d., the mixed derivatives must be 0 for i ≠ j, so matrix D should be diagonal.
Combining this with the earlier assumptions:
By the symmetry assumption, ρ has a minimum at 0, so the gradient term vanishes: ∇f(0) = 0.
Hessian matrix: D = diag(d_11, ..., d_NN), with each d_ii = ∂²ρ(e_i)/∂e_i² determined by the current residual e_i.
The original problem then becomes a weighted least-squares problem: min_W ½ Tr((Y − AW)ᵀ D (Y − AW)) + λ2 ‖W‖².
Adding the G (graph-regularization) part from before gives: min_W ½ Tr((Y − AW)ᵀ D (Y − AW)) + λ1 Tr((AW)ᵀ L (AW)) + λ2 ‖W‖².
Setting the derivative with respect to W to zero gives the solution: W = (AᵀDA + 2λ1 AᵀLA + 2λ2 I)⁻¹ AᵀDY.
Recurrence: D depends on the current residuals e = Y − AW, so the solution is computed iteratively: fix W to update D, then fix D and re-solve for W, repeating until convergence.
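Read this way, the recurrence is essentially iteratively reweighted least squares. A minimal NumPy sketch of the alternation (the Welsch weight function, the warm start, and the toy data are my own illustrative choices, not the paper's exact settings):

```python
import numpy as np

def irls_robust_ridge(A, Y, L, lam1=0.0, lam2=0.01, sigma=1.0, n_iter=20):
    """Alternately update the residual weight matrix D and the output weights W.

    Approximately solves
        min_W 1/2 tr((Y-AW)' D (Y-AW)) + lam1 tr((AW)' L (AW)) + lam2 ||W||^2
    where D = diag(d_i) is refreshed from the current residuals each round.
    """
    m = A.shape[1]
    I = np.eye(m)
    # warm start: plain ridge solution (all residual weights equal to 1)
    W = np.linalg.solve(A.T @ A + 2 * lam2 * I, A.T @ Y)
    for _ in range(n_iter):
        e = np.linalg.norm(Y - A @ W, axis=1)    # per-sample residual magnitude
        d = np.exp(-(e / sigma) ** 2)            # Welsch weights: ~0 for outliers
        AtD = A.T * d                            # A' D without forming D explicitly
        W = np.linalg.solve(AtD @ A + 2 * lam1 * A.T @ L @ A + 2 * lam2 * I,
                            AtD @ Y)
    return W

# Usage: a toy problem where 10% of the targets are grossly corrupted.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
W_true = rng.standard_normal((5, 1))
Y = A @ W_true
Y[:10] += 10.0                                   # corrupted "labels"
W_hat = irls_robust_ridge(A, Y, L=np.zeros((100, 100)))
```

With the Welsch weights, the ten corrupted rows get weights near zero after the first reweighting, so W_hat should stay close to W_true, whereas a plain least-squares fit is pulled toward the outliers.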
All the files have been uploaded to GitHub: GRBLS.
The broad learning (BLS) part:
I generally just reuse the code saved from the official site; there is no way I would have written it this neatly myself.
The main script:
clear;
warning off all;
format compact;
if ~exist('num.mat','file')
experiment_num=0;
else
load('num.mat'); % record the experiment count so newly generated data does not overwrite earlier runs
end
prop = 0.4 ;
train_num = 430;
test_num = 253;
load('E:\image-about\dataBase\breast_cancer\breast_cancer.mat')
[train_x,train_y,test_x,test_y,NN] = shuffle_index(x,y,train_num,test_num);
[contaminated_train_y, C_id, contamination_num] = contaminate_label(train_y,prop,NN.train);
save('C_id.mat','C_id','contamination_num');
clear x y C_id
lambda1 = 2^(0); %------manifold learning criterion
lambda2 = 2^(-5); %------the regularization parameter
best_test = 0 ;
result = [];
k = 10; %-------k-NN
options = [];
options.NeighborMode = 'KNN';
options.k = k;
options.WeightMode = 'Binary';
options.t = 1;
file_name_1 = ['test_result/test_result ',num2str(experiment_num),'/contamination_proportion ', num2str(prop)];
for NumFea= 1:7 %searching range for feature nodes per window in feature layer
for NumWin=1:8 %searching range for number of windows in feature layer
file_name = [file_name_1 ,'/NumFea ',num2str(NumFea),'/NumWin ', num2str(NumWin)];
if ~isfolder(file_name)
mkdir(file_name);
end
for NumEnhan=2:50 %searching range for enhancement nodes
clc;
rng('shuffle');
for i=1:NumWin
WeightFea=2*rand(size(train_x,2)+1,NumFea)-1;
% b1=rand(size(train_x,2)+1,NumFea); % sometimes using this may lead to better results, but not for sure!
WF{i}=WeightFea;
end %generating weight and bias matrix for each window in feature layer
WeightEnhan=2*rand(NumWin*NumFea+1,NumEnhan)-1;
fprintf(1, 'Fea. No.= %d, Win. No. =%d, Enhan. No. = %d\n', NumFea, NumWin, NumEnhan);
[train_rate,test_rate,C_train_rate,NetoutTrain,NetoutTest] = GRBLS_train(train_x,train_y,contaminated_train_y,test_x,test_y,lambda1,lambda2,WF,WeightEnhan,NumFea,NumWin,NN,options);
result = [result;NumEnhan, train_rate, test_rate, C_train_rate];
if test_rate > best_test
best_test = test_rate;
load('C_id.mat');
save(fullfile(file_name_1,['contamination_proportion ', num2str(prop), ' best_result.mat']),'best_test','train_rate','C_train_rate','NumFea','NumWin','NumEnhan','lambda1','lambda2','k',...
'train_x','train_y','test_x','test_y','contaminated_train_y','NetoutTrain','NetoutTest','C_id','prop');
end
clearvars -except train_x train_y test_x test_y lambda1 lambda2 WF WeightEnhan NumFea NumWin NumEnhan NN best_test experiment_num ...
k result file_name file_name_1 contaminated_train_y prop options
end
result_plot(result,file_name);
clear result
result = [];
end
end
experiment_num=experiment_num+1;
save('num.mat','experiment_num');
EuDIst.m computes Euclidean distances.
constraintW.m generates W. This part uses someone else's code entirely; I have written a simple version myself, but the gap in quality is still large.
shuffle_index.m (this part just randomizes the train/test split):
rng('shuffle');
x = x';
gross = train_num + test_num ;
category_box = unique(y);
category_box = sort(category_box);
category = size(category_box,1);
category_rule = zeros(category, category);
for i=1:category
category_rule(i,i)=1;
end
save('category_map.mat','category','category_box','category_rule')
len = size(y);
rand_id = randperm(len(1));
train_x = x(:, rand_id(1:train_num));
train_y = y(rand_id(1:train_num), :);
test_x = x(:, rand_id(train_num+1:gross));
test_y = y(rand_id(train_num+1:gross), :);
[train_x, PS] = mapminmax(train_x);
test_x = mapminmax('apply', test_x, PS);
train_x = train_x';
test_x = test_x';
train_y1 = zeros(size(train_y, 1), category);
test_y1 = zeros(size(test_y, 1), category);
NN.train = zeros(1,category); % per-class sample counts (training set)
NN.test = zeros(1,category);  % per-class sample counts (test set)
for i=1:size(train_y, 1)
for j=1:category
if train_y(i, 1) == category_box(j, 1)
train_y1(i, j) = 1;
NN.train(1,j) = NN.train(1,j)+1;
end
end
end
for i=1:size(test_y, 1)
for j=1:category
if test_y(i, 1) == category_box(j, 1)
test_y1(i, j) = 1;
NN.test(1,j) = NN.test(1,j)+1;
end
end
end
train_y = train_y1;
test_y = test_y1;
contaminate_label.m (label contamination):
total = sum(NN);
contamination_num = ceil(proportion * total);
C_id = randperm(total);
new_y = zeros(size(y));
new_y(C_id(contamination_num+1:total),:) = y(C_id(contamination_num+1:total),:);
load('category_map.mat');
for i = 1:contamination_num
j = find(y(C_id(i), :) == max(y(C_id(i), :))); % index of the true class
pol_label = randperm(category); % random permutation of the class indices
if pol_label(1) ~= j
new_y(C_id(i),:) = category_rule(pol_label(1),:);
else
new_y(C_id(i),:) = category_rule(pol_label(2),:);
end
end
contaminated_y = new_y;
Part of the plotting code:
fig1=figure;
set(fig1,'visible','off');
set(0, 'currentFigure', fig1);
plot(result(:,1),result(:,2),'-vr');
hold on;
plot(result(:,1),result(:,3),'-^b');
legend('training\_sample', 'testing\_sample'); % escape underscores, or the TeX interpreter renders them as subscripts
xlabel('\itenhancement nodes','FontSize',12);ylabel('\itrate','FontSize',12);
frame = getframe(fig1);
im = frame2im(frame);
pic_name=fullfile(file_name,['rate_comparison','.png']);
imwrite(im,pic_name);
close all;
Data downloaded from UCI often can't be used directly (missing values and the like), so I usually filter it once with Python and then save it directly with MATLAB:
python:
import re

f = open(".txt", encoding='utf-8')
f_new = open('new.txt', 'w')
line = f.readline()
Nan_num = 0   # lines containing '?' (the usual UCI missing-value marker)
num = 0       # clean lines written out
i = 0
while line:
    c = re.search(r'\?', line)
    if bool(c):
        Nan_num += 1
    else:
        num += 1
        f_new.write(line)
    line = f.readline()
    i = i + 1
    if i > 1000:   # safety cap on the number of lines read
        line = ''
f_new.close()
f.close()
matlab:
sample=importdata('.txt');
x=sample(:,1:);
y=sample(:,);
save('.mat','x','y')