Reproducing graph regularized robust BLS (GRBLS)

The original paper is 《Pattern Classification with Corrupted Labeling via Robust Broad Learning System》.

The idea behind GRBLS:

The aforementioned BLS models are based on the mean square error (MSE) criterion to fit the approximation errors [23]. In fact, MSE aims to measure the sum of quadratic loss of data, and the approximation results would skew to the data with large errors. ... The purpose of the current paper is to alleviate the negative impact of data with corrupted labels on BLS. By rewriting the objective function of BLS from the matrix form to an error vector form, we conduct a maximum likelihood estimation (MLE) on the approximation errors. Then an MLE-like estimator can be obtained to model the residuals in BLS. An interesting point is that if the probability density function of errors is predefined as the Gaussian distribution, the MLE-like estimator can degenerate to the MSE criterion. Obviously, the presence of label outliers in the data causes the error distribution to depart from Gaussianity, which is the probabilistic interpretation of the lack of robustness in standard BLS. ...

The paper sets out to solve the negative impact of data with corrupted labels on BLS. The residual assumption of standard BLS can be viewed as a Gaussian distribution, but Gaussianity clearly does not fit every dataset. 《Regularized robust Broad Learning System for uncertain data modeling》 instead assumes a Laplacian distribution, which also works well under some experimental conditions, but like basic BLS its applicability is fairly narrow, and the ENRBLS formulation also feels odd to me. This paper instead introduces an MLE-like estimator, and the resulting BLS can degenerate to the original BLS when the errors are Gaussian, so I think GRBLS stands on better theoretical ground. The G part of GRBLS uses basic manifold learning. Manifold-based regularization was already used in 《Discriminative graph regularized broad learning system for image recognition》, which also demonstrated the effectiveness of manifold learning.

BLS:

Manifold learning:

Manifold learning aims to preserve the intrinsic structure of the data while performing dimensionality reduction and similar operations. A key ingredient of manifold-related operators is the adjacency graph, which encodes the neighborhood relations between samples (V_{i,j} reflects the relation between x_{i} and x_{j}).

\tfrac{1}{2}\sum_{i,j}^{}V_{i,j}||h(x_{i})-h(x_{j})||_{2}^{2} = Tr(\hat{H}^{T}L\hat{H})

\hat{H} is the mapping result and Tr(·) is the matrix trace. L = \hat{V}-V is the graph Laplacian, where \hat{V} is the diagonal degree matrix with entries \hat{V}_{i,i}=\sum_{j}^{}V_{i,j}. Note that 《Discriminative graph regularized broad learning system for image recognition》 uses the normalized graph Laplacian instead.
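Since this identity is easy to get off by a constant factor, here is a quick numerical check in MATLAB on random toy data (all variable names here are made up; Dg plays the role of \hat{V} and Lap the role of L):

n = 6; m = 3;
H = randn(n, m);                 % mapped features, one row per sample
V = rand(n); V = (V + V')/2;     % symmetric adjacency weights V_{i,j}
Dg = diag(sum(V, 2));            % diagonal degree matrix \hat{V}
Lap = Dg - V;                    % unnormalized graph Laplacian L
lhs = 0;
for i = 1:n
    for j = 1:n
        lhs = lhs + V(i,j) * norm(H(i,:) - H(j,:))^2;
    end
end
fprintf('lhs/2 = %.6f, trace = %.6f\n', lhs/2, trace(H' * Lap * H));   % the two should match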

The Proposed Method section:

The paper defines the error matrix E = Y - AW and stacks it into an error vector e, with a probability density function p_{\theta}(e) on the errors. The likelihood is \mathcal{L}_{\theta} = \prod_{n=1}^{N}p_{\theta}(e_{n}), so the negative log-likelihood is -ln\mathcal{L}_{\theta} = \sum_{n=1}^{N}v_{\theta}(e_{n}), where v_{\theta}(e_{n}) = -ln\,p_{\theta}(e_{n}).

The objective then changes from the original problem to:

\min_{W}\sum_{n=1}^{N} v_{\theta}(e_{n}) + \lambda||W||_{F}^{2}

Solving this problem relies on two basic assumptions about the error density:

① Symmetry: p_{\theta}(e) = p_{\theta}(-e)

② Monotonicity: for |e_{1}| > |e_{2}|, p_{\theta}(e_{1}) \leq p_{\theta}(e_{2}) (equivalently, v_{\theta}(e_{1}) \geq v_{\theta}(e_{2}))

To solve it, write the total loss as:

P_{\theta}(e) = \sum_{n=1}^{N}v_{\theta}(e_{n})

Expanding the above around a point e_{0} to first order, and estimating the remainder with a quadratic term, gives

\tilde{P}_{\theta}(e)=P_{\theta}(e_{0})+(e-e_{0})^{T}P'_{\theta}(e_{0}) + \tfrac{1}{2}(e-e_{0})^{T}D(e-e_{0})

where D stands for the Hessian matrix. The original paper notes:

As the error residuals e_{n} are i.i.d., the mixed derivatives must be 0 for i\neq j, so the matrix D should be diagonal.

Differentiating this approximation gives:

\tilde{P}'_{\theta}(e) = P'_{\theta}(e_{0}) +D(e-e_{0}) 

By the assumptions (symmetry and monotonicity place the minimum of the loss at e = 0), we require

\tilde{P}'_{\theta}(0) = 0

Substituting e = 0 into the gradient and solving yields the diagonal entries of the Hessian matrix:

D_{n,n}=v'_{\theta}(e_{0,n})/e_{0,n}

\tilde{P}_{\theta}(e)=\tfrac{1}{2}||D^{1/2}e||_{2}^{2}+b_{e_{0}}

where b_{e_{0}} = P_{\theta}(e_{0})-\tfrac{1}{2}e_{0}^{T}De_{0} is a constant independent of e.
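As a quick consistency check of the paper's remark that the MLE-like estimator degenerates to MSE for Gaussian errors: taking p_{\theta}(e) \propto exp(-e^{2}/(2\sigma^{2})) gives

v_{\theta}(e) = \tfrac{e^{2}}{2\sigma^{2}} + const, \quad v'_{\theta}(e) = \tfrac{e}{\sigma^{2}}, \quad D_{n,n} = v'_{\theta}(e_{0,n})/e_{0,n} = \tfrac{1}{\sigma^{2}}

so D is a constant multiple of the identity and the weighted objective reduces to the ordinary MSE criterion.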

Substituting this surrogate back, the original problem becomes:

\min_{W}||D^{1/2}(Y-AW)||_{F}^{2}+\lambda||W||_{F}^{2}

Adding the graph regularization (G) term from earlier:

\min_{W}||D^{1/2}(Y-AW)||_{F}^{2}+\lambda_{1}Tr((AW)^{T}L(AW))+\lambda_{2}||W||_{F}^{2}

Setting the derivative with respect to W to zero gives the closed-form solution:

W=(A^{T}DA+\lambda_{1}A^{T}LA+\lambda_{2}I)^{-1}A^{T}DY
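Note that D depends on the residuals at the previous estimate e_{0}, so in practice W is found by alternating: fix D, solve the closed form above, recompute the residuals, and update D. Below is a minimal sketch of that loop, assuming a Gaussian-kernel (Welsch-type) weight w(e) = exp(-e^{2}/(2\sigma^{2})); the actual weight follows from the paper's chosen p_{\theta}, and the names grbls_solve, sigma, and max_iter are made up for illustration:

function W = grbls_solve(A, Y, L, lambda1, lambda2, sigma, max_iter)
% A: N x d expanded node matrix, Y: N x c one-hot labels, L: N x N graph Laplacian
[N, d] = size(A);
W = (A'*A + lambda2*eye(d)) \ (A'*Y);       % initialize with the ridge solution (D = I)
G = A'*L*A;                                 % graph term, fixed across iterations
for it = 1:max_iter
    E = Y - A*W;                            % residual matrix at the current estimate
    e = sqrt(sum(E.^2, 2));                 % per-sample residual magnitude
    w = exp(-e.^2 / (2*sigma^2));           % assumed Welsch-type weights
    D = spdiags(w, 0, N, N);                % diagonal weighting matrix
    W = (A'*D*A + lambda1*G + lambda2*eye(d)) \ (A'*(D*Y));   % reweighted solve
end
end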

Reproduction:

All the files have been uploaded to GitHub: GRBLS.

The broad learning system (BLS) part:

I generally just adapt the code I saved from the official site earlier; there is no way I could write it this neatly myself.

The main script:

clear;
warning off all;
format compact;

if ~exist('num.mat','file')
    experiment_num = 0;
else
    load('num.mat');   % track the experiment count so newly generated data does not overwrite earlier runs
end

prop = 0.4;          % proportion of training labels to corrupt
train_num = 430;
test_num = 253;

load('E:\image-about\dataBase\breast_cancer\breast_cancer.mat')
[train_x,train_y,test_x,test_y,NN] = shuffle_index(x,y,train_num,test_num);
[contaminated_train_y, C_id, contamination_num] = contaminate_label(train_y,prop,NN.train);
save('C_id.mat','C_id','contamination_num');
clear x y C_id

lambda1 = 2^(0);  %------manifold learning criterion
lambda2 = 2^(-5);   %------the regularization parameter
best_test = 0 ;
result = [];
k = 10;             %-------k-NN
options = [];
options.NeighborMode = 'KNN';
options.k = k;
options.WeightMode = 'Binary';
options.t = 1;
file_name_1 = ['test_result/test_result ',num2str(experiment_num),'/contamination_proportion ', num2str(prop)];

for NumFea= 1:7              %searching range for feature nodes  per window in feature layer
    for NumWin=1:8           %searching range for number of windows in feature layer
        file_name = [file_name_1 ,'/NumFea ',num2str(NumFea),'/NumWin ', num2str(NumWin)];
        if ~isfolder(file_name)
            mkdir(file_name);
        end
            
        for NumEnhan=2:50     %searching range for enhancement nodes
            
            clc;
            rng('shuffle');
            for i=1:NumWin
                WeightFea=2*rand(size(train_x,2)+1,NumFea)-1;
                %   WeightFea=rand(size(train_x,2)+1,NumFea);  % sometimes using uniform [0,1] weights leads to better results, but not for sure!
                WF{i}=WeightFea;
            end                                                          %generating weight and bias matrix for each window in feature layer
             WeightEnhan=2*rand(NumWin*NumFea+1,NumEnhan)-1;
             fprintf(1, 'Fea. No.= %d, Win. No. =%d, Enhan. No. = %d\n', NumFea, NumWin, NumEnhan);
             [train_rate,test_rate,C_train_rate,NetoutTrain,NetoutTest] = GRBLS_train(train_x,train_y,contaminated_train_y,test_x,test_y,lambda1,lambda2,WF,WeightEnhan,NumFea,NumWin,NN,options);
             result = [result;NumEnhan, train_rate, test_rate, C_train_rate];
             if test_rate > best_test
                 best_test = test_rate;
                 load('C_id.mat');
                 save(fullfile(file_name_1,['contamination_proportion ', num2str(prop), ' best_result.mat']),'best_test','train_rate','C_train_rate','NumFea','NumWin','NumEnhan','lambda1','lambda2','k',...
                     'train_x','train_y','test_x','test_y','contaminated_train_y','NetoutTrain','NetoutTest','C_id','prop');
             end
             clearvars -except train_x train_y test_x test_y lambda1 lambda2 WF WeightEnhan NumFea NumWin NumEnhan NN best_test experiment_num ...
             k result file_name file_name_1 contaminated_train_y prop options
        end
        result_plot(result,file_name);
        clear result
        result = [];
    end
end

experiment_num=experiment_num+1;
save('num.mat','experiment_num');

EuDIst.m computes Euclidean distances.

constraintW.m generates the adjacency matrix W. This part reuses someone else's code entirely; I have also written a simple version myself (a rough stand-in is sketched below), but it still falls far short of the real thing.
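A minimal sketch of such a simple version, assuming the options used in the main script (NeighborMode = 'KNN', WeightMode = 'Binary'). The function name simple_knn_graph is made up for illustration, and the real constraintW.m supports many more weighting modes:

function V = simple_knn_graph(X, k)
% X: n x d samples (one row per sample), k: number of nearest neighbors
n = size(X, 1);
sq = sum(X.^2, 2);
D2 = sq + sq' - 2*(X*X');        % pairwise squared Euclidean distances
D2(1:n+1:end) = inf;             % exclude each sample from its own neighbor list
[~, idx] = sort(D2, 2);          % neighbors sorted by increasing distance
V = zeros(n);
for i = 1:n
    V(i, idx(i, 1:k)) = 1;       % binary weight for the k nearest neighbors
end
V = max(V, V');                  % symmetrize the adjacency graph
end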

shuffle_index.m (this part just performs the random shuffle and split):

function [train_x, train_y, test_x, test_y, NN] = shuffle_index(x, y, train_num, test_num)
rng('shuffle');
x = x';
gross = train_num + test_num ;

category_box = unique(y);
category_box = sort(category_box);
category = size(category_box,1);

category_rule = zeros(category, category);
for i=1:category
    category_rule(i,i)=1;
end
save('category_map.mat','category','category_box','category_rule')

len = size(y);
rand_id = randperm(len(1));

train_x = x(:, rand_id(1:train_num));
train_y = y(rand_id(1:train_num), :);

test_x = x(:, rand_id(train_num+1:gross));
test_y = y(rand_id(train_num+1:gross), :);

[train_x, PS] = mapminmax(train_x);
test_x = mapminmax('apply', test_x, PS);

train_x = train_x';
test_x = test_x';

train_y1 = zeros(size(train_y, 1), category);
test_y1 = zeros(size(test_y, 1), category);

NN.train = zeros(1,category);   % per-class sample counts in the training set
NN.test = zeros(1,category);


for i=1:size(train_y, 1)
    for j=1:category
        if train_y(i, 1) == category_box(j, 1)
           train_y1(i, j) = 1; 
           NN.train(1,j) = NN.train(1,j)+1;
        end
    end
end

for i=1:size(test_y, 1)
    for j=1:category
        if test_y(i, 1) == category_box(j, 1)
           test_y1(i, j) = 1; 
           NN.test(1,j) = NN.test(1,j)+1;
        end
    end
end

train_y = train_y1;
test_y = test_y1;

contaminate_label.m (corrupts a proportion of the training labels):

function [contaminated_y, C_id, contamination_num] = contaminate_label(y, proportion, NN)
total = sum(NN);
contamination_num = ceil(proportion * total);

C_id = randperm(total);

new_y = zeros(size(y));
new_y(C_id(contamination_num+1:total),:) = y(C_id(contamination_num+1:total),:);

load('category_map.mat');

for i = 1:contamination_num
    j = find(y(C_id(i), :) == max(y(C_id(i), :)));   % index of the true class
    pol_label = randperm(category);                  % random permutation of class indices
    if pol_label(1) ~= j
        new_y(C_id(i),:) = category_rule(pol_label(1),:);   % assign a different class
    else
        new_y(C_id(i),:) = category_rule(pol_label(2),:);
    end
end

contaminated_y = new_y;

Part of the plotting code (result_plot.m):

function result_plot(result, file_name)
fig1=figure;
set(fig1,'visible','off');
set(0, 'currentFigure', fig1);

plot(result(:,1),result(:,2),'-vr');
hold on;
plot(result(:,1),result(:,3),'-^b');
legend('training sample', 'testing sample');   % avoid underscores, which the TeX interpreter renders as subscripts
xlabel('\itenhancement nodes','FontSize',12);ylabel('\itrate','FontSize',12);
frame = getframe(fig1);
im = frame2im(frame);
pic_name=fullfile(file_name,['rate_comparison','.png']);
imwrite(im,pic_name);
close all;

Data downloaded from UCI apparently cannot be used directly, e.g. it may contain missing values, so I usually filter it once with Python and then save it straight to a .mat file with MATLAB:

python:

import re

f = open(".txt", encoding='utf-8')    # raw UCI file (filename elided)
f_new = open('new.txt', 'w')
Nan_num = 0    # lines containing missing values ('?')
num = 0        # clean lines written out
for i, line in enumerate(f):
    if re.search(r'\?', line):
        Nan_num += 1
    else:
        num += 1
        f_new.write(line)
    if i >= 1000:    # stop after roughly 1000 lines, as the original loop did
        break
f_new.close()
f.close()

matlab:

sample=importdata('.txt');    % the filtered file saved above (filename elided)
x=sample(:,1:);               % feature columns (range elided in the original)
y=sample(:,);                 % label column (index elided in the original)
save('.mat','x','y')

 
