Feature selection algorithm based on bare bones particleswarm optimization

0、论文背景

本论文主要是通过改进的BPSO算法来进行特征选择,特征选择的效果通过1-NN分类算法来检验。通过比较得出提出的算法在特征选择上具有一定的优越性。

Zhang Y, Gong D, Hu Y, et al. Feature selection algorithm based on bare bones particle swarm optimization[J]. Neurocomputing, 2015, 148: 150-157.

是一篇二区的论文,个人觉得针对特征选择的实际问题而提出的改进的BPSO算法实际上创新性不是太大,但是可以让我们就相关问题有一个清晰的认识。

1、论文的核心思想

本文提出了一种利用BPSO求解最优特征子集的新方法,称为 binary BPSO。该算法(binary BPSO)设计了(a reinforced memory strategy)一种增强记忆策略来更新粒子的局部最优值,以避免粒子中突出基因的退化,并提出了一种统一的组合算法( uniform combination (UC))来平衡局部开发和全局探索。此外,利用1-NN方法作为分类器来评价粒子的分类精度。本文选取了一些国际标准数据集,对该算法进行了评价。

有关Binary PSO算法请参见:Binary PSO;有关BPSO算法参见:BPSO

2、The Binary BPSO

The Binary BPSO的算法流程如下:

 其中1.1中涉及到一个算法样本的初始化

\large X_{i}=\left(x_{i, 1}, x_{i, 2}, \ldots, x_{i, N}\right), \quad i=1,2, \ldots,|S|

N代表特征(变量的个数),|S|代表初始化样本的个数,x_{i, j} \in[0,1]表示第i个样本中第j个特征被选择的概率。

那如何表示第i个样本中第j个特征被选中的情况呢?

z_{i, j}=\left\{\begin{array}{ll} 1, & x_{i, j} \geq \text { rand } \\ 0 & \text { otherwise } \end{array}\right.

zij如果等于1表示第i个样本中第j个特征被选中。否则表示未选中。

1.2中怎么计算适应度值呢?

根据样本特征是否选中,从原来数据集中选择特征选中的数据和标签,得到新的数据集来进行分类准确率的计算,采用的分类算法是1-NN,从原始数据集中选取单个数据作为测试样本,其余数据构成训练样本。计算准确率的方法是:

F\left(X_{i}\right)=\frac{\text { No. of correctly predicted samples }}{\text { Size of the original data set }} \times 100 \%

下面就涉及到关键的两个算法创新点:a reinforced memory strategyuniform combination (UC)

2.1 a reinforced memory strategy

传统的更新Pbest的方法在之前的PSO算法中提到过,但是这种更新方式有个缺陷:当x_{i, j}为0.1时,恰好随机生成的数小于0.1,该特征被选择;虽然看似表现不好的特征被选择,但是总体预测正确率却变高了,说明该特征的重要性不止0.1;之前的更新Pbest的只是单纯地将xi值更新到Pbest,而忽略了应该提高xij这些特征在下次被选中的概率;因为如果不提高,下次这个特征可能就选不上了。因此提出了a reinforced memory strategy:

\text { Pbest }_{i}(t+1)=\left\{\begin{array}{ll} 0.5\left(X_{i}(t+1)+Z_{i}(t+1)\right) & \text { if } F\left(\text { Pbest }_{i}(t)\right)<F\left(X_{i}(t+1)\right) \\ 0.5\left(X_{i}(t+1)+Z_{i}(t+1)\right) & \text { if } \left.F\left(\text { Pbest }_{i}(t)\right)=F\left(X_{i}(t+1)\right) \& \mid Z\left(\text { Pbest }_{i}(t)\right)|\geq| Z_{i}(t+1)\right) \mid \\ \text { Pbest }_{i}(t) & \text { otherwise } \end{array}\right.

2.2 uniform combination

由于The Binary BPSO有可能会陷入到局部最优的情况,那么如何摆脱局部最优,达到全局搜索的目的呢?这就uniform combination。它的思路是:当Gbest不更新时候,使Pbest基因突变的概率加大。具体策略如下:

p_{c}=\frac{0.2}{1+e^{5-n u m}}

Gbest不更新时候,num++;Gbest更新时候,num重置为0。pc表示Pbest基因突变的概率。

Pbest基因突变的过程如下:

3、论文中的实验结果 

实验条件是:初始化种群20,迭代次数如下图:

不同算法对应数据集的分类准确率:

算法迭代过程中,准确率和特征选择个数的变化:

迭代过程中特征选择的具体展现:

当不采用 uniform combination或者a reinforced memory strategy与两个策略都采用的Binary BPSO算法实际运行结果比较。

 

 4. 个人算法的复现以及简单的实验验证

数据集采用了4个,复现了Binary PSO和Binary BPSO两个算法:

function test_y = knn(x, y, test_x, k, num)
% x:nxm,y:nx1,test_x:n0xm,test_y:n0x1
n0 = size(test_x, 1);
test_y = zeros(n0, 1);
for i = 1 : n0
    ind = [];
    max_label = 0;
    Dist = sum((x - test_x(i, :)).^2,2);
    % 从小大大排序,取前k个
    [D,I] = sort(Dist);
    D = D(1:k, :);
    D_y = y(I(1:k), :);
    for j = 1 : num
        % 计算k个样本中各类样本的数量,记录下最大数量的样本的类别
        label1 = sum(D_y==j);
        if label1 > max_label
            max_label = label1;
            ind = [];
            ind = [ind; j];
        elseif label1 == max_label
            ind = [ind; j];
        end
    end
    num_ind = size(ind, 1);
    min_dist = inf;
    % 如果存在数量相等的情况,就将距离之和最小的类标签作为test_y的类别
    for j1 = 1 : num_ind
        dist = sum(D(D_y == ind(j1)));
        if dist <= min_dist
            test_y(i, :) = ind(j1);
        end
    end
end
end
function F = evaluate(Z, data_x, data_y, num_lable)
num_simple = size(Z, 1);
F = zeros(num_simple, 1);
% 采用1-NN
k = 1;
for i = 1 : num_simple
    data_x1 = data_x(:, Z(i, :) == 1);
    n = size(data_x1, 1);
    test_y = zeros(n ,1);
    for i1 = 1 : n
        % 在样本集中选择一个样本作为测试集,其他样本作为训练集
        test_x = data_x1(i1, :);
        if i1 == 1
            ind = i1 + 1 : n;
            test_y(i1, :) = knn(data_x1(ind, :), data_y(ind, :), test_x, k, num_lable);
        else
            ind = [1 : i1 - 1, i1 + 1 : n];
            test_y(i1, :) = knn(data_x1(ind, :), data_y(ind, :), test_x, k, num_lable);
        end
    end
    % 正确率与正确个数
    acc_num = sum(test_y == data_y);
    acc_rate =1.0 * acc_num ./ n;
    % fprintf("data total: %d feature select: Z(%d,:) acc_num: %d acc_rate: %.4f\n",n,i,acc_num,acc_rate); 
    F(i, :) = acc_rate;
end
end
function [best_Zi, best_acc] = Bi_PSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable)
% num_simple:初始化粒子群样本数量;num_iter:搜索迭代次数;num_feature:每个样本包含的特征个数
% data_x代表聚类样本数据,data_y代表聚类样本的标签;num_lable代表聚类样本的标签的种类数
% best_Zi:准确率最高的样本的特征选择情况

% 初始化粒子群样本sample_x:num_simple*num_feature
X = rand(num_simple, num_feature);
% 将X转换成Z
r1 = ones(num_simple, num_feature) * 0.5;
Z = zeros(num_simple, num_feature);
Z(r1 < X) = 1;
Z(r1 >= X) = 0;
pbestz = Z;
% 计算分类准确率F:num_simple*1
F = evaluate(Z, data_x, data_y, num_lable);
pbestf = F;
% 计算gbest,准确率最大的对应的索引
[fmin, gbest] = max(pbestf);
fprintf("iter 0 feature: pbestz(%d,:) fmin: %.4f\n",gbest, fmin);
% 当前位置信息presentx
presentx = rand([num_simple, num_feature]);
% 将presentx转换成presentz
presentz = zeros(num_simple, num_feature);
r2 = ones(num_simple, num_feature) * 0.5;
presentz(r2 < presentx) = 1;
presentz(r2 >= presentx) = 0;
vz = Z;
w = 1;
for i = 1 : num_iter
    r3 = rand(num_simple, num_feature);
    % pso更新下一步的位置
    vz = w.*vz + 2 * r3 .* (pbestz - presentz) + 2 * r3 .* (pbestz(gbest, :) - presentz);
    % vz设置边界目的是为了使s在0~1之间
    vz(vz > 10) = 10;
    vz(vz < -10) = -10;
    s = 1 ./ (1 + exp(-vz));
    r4 = rand(num_simple, num_feature);
    %r4 = ones(num_simple, num_feature) * 0.5;
    presentz(r4 < s) = 1;
    presentz(r4 >= s) = 0;
    presentf = evaluate(presentz, data_x, data_y, num_lable);
    % 更新每个单独个体最佳位置
    pbestz(presentf > pbestf, :) = presentz(presentf > pbestf, :);
    pbestf(presentf > pbestf, :) = presentf(presentf > pbestf, :);
    % 更新所有个体最佳位置
    [fmin, gbest] = max(pbestf);
    fprintf("iter %d feature: pbestz(%d,:) fmin: %.4f\n",i, gbest, fmin);
%     best_feature = find(pbestz(gbest, :) == 1);
%     disp(sum(pbestz(gbest, :)));
%     disp(best_feature);
end
best_Zi = pbestz(gbest, :);
best_acc = fmin;
end
clc;clear;clearvars;
% load('lonosphere_X.mat', 'ionosphere');
% data_x = ionosphere;
% load('lonosphere_Y.mat', 'ionosphere1');
% data_y = ionosphere1;
% load('Vehicle_X.mat')
% load('Vehicle_Y.mat')
% data_x = xaa;
% data_y = xaa1;
load('sonar_X.mat')
data_x = sonar1;
load('sonar_Y.mat')
data_y = sonar1;
% load('WDBC_X.mat')
% data_x = UCIBeastcancerwdbc;
% load('WDBC_Y.mat')
% data_y = UCIBeastcancerwdbc;
num_lable = 2;
% num_lable = 4;
num_simple = 20;
num_iter = 100;
num_feature = size(data_x, 2);
best_Z1 = zeros(1, num_feature);
best_acc1 = 0;
%for i = 1 : 10
for i = 1
    [best_Zi, best_acc] = Bi_PSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable);
    if best_acc > best_acc1
        best_acc1 = best_acc;
        best_Z1 = best_Zi;
    end
end
best_feature = find(best_Z1 == 1);
disp(sum(best_Z1));
disp(best_feature);
function [best_Zi, best_acc] = Bi_BPSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable)
% num_simple:初始化粒子群样本数量;num_iter:搜索迭代次数;num_feature:每个样本包含的特征个数
% data_x代表聚类样本数据,data_y代表聚类样本的标签;num_lable代表聚类样本的标签的种类数
% best_Zi:准确率最高的样本的特征选择情况

% 初始化粒子群样本sample_x:num_simple*num_feature
X = rand(num_simple, num_feature);
% 将X转换成Z
r1 = ones(num_simple, num_feature) * 0.5;
Z = zeros(num_simple, num_feature);
Z(r1 < X) = 1;
Z(r1 >= X) = 0;
pbestx = X;
% 计算分类准确率F:num_simple*1
F = evaluate(Z, data_x, data_y, num_lable);
pbestf = F;
% 计算gbest,准确率最大的对应的索引
[fmin, gbest] = max(pbestf);
fprintf("iter 0 feature: pbestz(%d,:) fmin: %.4f\n",gbest, fmin);
num1 = 0;
for i = 1 : num_iter
    s = zeros(num_simple, num_feature);
    % pso更新下一步的位置,这里可以设置一下超过搜索范围的就设置为边界
    r = rand;
    if r > 0.5
        x = pbestx;
    else
        x = normrnd((pbestx + pbestx(gbest, :)) ./ 2,abs(pbestx - pbestx(gbest, :)));
    end
    % uniform combination (UC)
    pc = 0.2 / (1 + exp(5 - num1));
    for i1 = 1 : num_simple
        r3 = rand;
        if r3 < pc
            k = randi(num_simple, 1, 1);
            u_n = ceil(pc * num_feature);
            for i2 = 1 : u_n
                l = randi(num_feature, 1, 1);
                x(i1, l) = x(k, l) + rand;
            end
        end
    end
    r2 = rand(num_simple, num_feature);
    s(r2 < x) = 1;
    s(r2 >= x) = 0;
    presentf = evaluate(s, data_x, data_y, num_lable);
    % 更新每个单独个体最佳位置
    % pbestx(presentf > pbestf, :) = x(presentf > pbestf, :);
    % a reinforced memory strategy
    pbestx(presentf > pbestf, :) = 0.5 * (x(presentf > pbestf, :) + s(presentf > pbestf, :));
    pbestx((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)),:) = 0.5 * (x((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)), :) + ...
        s((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)), :));
    pbestf(presentf > pbestf, :) = presentf(presentf > pbestf, :);
    pbestf((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)),:) = presentf((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)), :);
    Z(presentf > pbestf, :) = s(presentf > pbestf, :);
    Z((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)),:) = s((presentf == pbestf)&(sum(Z, 2) > sum(s, 2)), :);
    % 更新所有个体最佳位置
    [fmin1, gbest] = max(pbestf);
    if fmin1 > fmin
        num1 = 0;
    else
        num1 = num1 + 1;
    end
    fmin = fmin1;
    fprintf("iter %d feature: pbestz(%d,:) fmin: %.4f\n",i, gbest, fmin);
end
best_Zi = Z(gbest, :);
best_acc = fmin;
end
clc;clear;clearvars;
% load('lonosphere_X.mat', 'ionosphere');
% data_x = ionosphere;
% load('lonosphere_Y.mat', 'ionosphere1');
% data_y = ionosphere1;
% load('Vehicle_X.mat')
% load('Vehicle_Y.mat')
% data_x = xaa;
% data_y = xaa1;
load('sonar_X.mat')
data_x = sonar1;
load('sonar_Y.mat')
data_y = sonar1;
% load('Beast-cancer_X.mat')
% data_x = UCIBeastcancerwdbc;
% load('Beast-cancer_Y.mat')
% data_y = UCIBeastcancerwdbc;
num_lable = 2;
num_simple = 20;
num_iter = 100;
num_feature = size(data_x, 2);
best_Z1 = zeros(1, num_feature);
best_acc1 = 0;
%for i = 1 : 10
for i = 1
    [best_Zi, best_acc] = Bi_BPSO(num_simple, num_iter, num_feature, data_x, data_y, num_lable);
    if best_acc > best_acc1
        best_acc1 = best_acc;
        best_Z1 = best_Zi;
    end
end
best_feature = find(best_Z1 == 1);
disp(sum(best_Z1));
disp(best_feature);
disp(best_acc1);

目前只展现sonar和lonosphere两个数据集的测试结果:

Binary PSO sonar数据集测试的准确率:

 Binary PSO sonar数据集测试的特征数,最后定格在31个(总共60):

 Binary PSO lonosphere数据集测试的准确率:

 Binary PSO lonosphere数据集测试的特征数,最后定格在9个(总共34):

Binary BPSO sonar数据集测试的准确率:

 Binary BPSO sonar数据集测试的特征数,最后定格在25个(总共60):

Binary BPSO lonosphere数据集测试的准确率:

Binary PSO lonosphere数据集测试的特征数,最后定格在9个(总共34):

如有错误,恳请私下交流和改正。

  • 1
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

身影王座

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值