1. Project Introduction
For modern radar detection systems, UAVs and birds are both typical low-altitude targets with small, weak signatures, and reliably telling them apart in complex environments is an urgent and important problem. The conventional approach distinguishes them through differences in micro-motion features, but because both targets return weak echoes, it is difficult to extract such features with time-frequency analysis. This project therefore focuses on track-level recognition: UAV and bird trajectories have markedly different dynamics, and distinguishing them accurately matters for intelligent surveillance, border security, aviation safety, and environmental and ecological monitoring.
Exploiting the differences between UAV and bird trajectories, we designed a multi-dimensional feature extraction method covering physical quantities such as trajectory turning angle, heading-angle oscillation, and speed distribution, giving the classification model ample information. On top of a rigorous preprocessing and feature-extraction pipeline, we combined several mainstream machine learning algorithms through Stacking ensemble learning to improve predictive power and generalization. Concretely, we first compared the motion trajectories of the two classes, proposed time-dependent descriptors such as heading-oscillation frequency and speed distribution, and extracted these features from radar-recorded track data; we then trained an XGBoost model and, after tuning its parameters, evaluated it on held-out test samples. The test accuracy reached 93%, which confirms the validity of the method as well as its stability and reliability, giving it substantial practical value.
2. Data Preprocessing
The raw data contain radar measurements of UAVs and birds, labelled 1 and 0, although it is not known in advance which label corresponds to which class. We first plotted the trajectories of both classes; because the spatial extents of UAV and bird tracks differ considerably, we normalized the trajectory ranges before plotting, to get a first visual impression of how UAV and bird tracks differ.
Because the raw data contain many duplicate and anomalous rows, we first clean them. Samples are nominally spaced 2 s apart, but within a single UAV/bird group the timestamps can be discontinuous, so the preprocessing code marks rows where the time gap jumps abruptly with a 3 in the label column; on later reads, a group containing a 3 is no longer treated as one track. The code below removes duplicate rows from the data.
% This script detects duplicate rows in the data file and removes them,
% keeping the first occurrence; NaN rows are appended per track so the
% absolute row count is unchanged.
clear;
clc;
% Read the Excel data (columns A:G, no header row)
data = readtable("data.xlsx", 'ReadVariableNames', false, 'Range', 'A:G');
% Extract column 7 (labels)
column7_data = data{:, 7};
if iscell(column7_data)
    column7_data = str2double(column7_data); % convert to numeric
end
column7_data(end+1,:) = 0; % sentinel so the last track is also processed
% Extract data columns
azimuth = data{:, 1};         % azimuth
range = data{:, 2};           % slant range
relative_height = data{:, 3}; % relative height
labels = data{:, 7};          % label column
RCS = 1000 * data{:, 6};      % RCS
vel_jingxiang = data{:, 4};   % radial velocity
time = data{:, 5};            % time
% If the label column is a cell array, convert it to numeric
if iscell(labels)
    labels = cellfun(@str2double, labels); % convert via cellfun
end
% Rows with a non-NaN label mark the start of a track
start_indices = find(~isnan(column7_data));
% Table collecting all de-duplicated segments
all_data_segs = table();
for i = 1:length(start_indices) - 1
    current_start = start_indices(i);
    current_end = start_indices(i + 1) - 1;
    % Extract this track's rows
    data_seg = data(current_start:current_end, :);
    data_without_labels = data_seg(:, 1:6);
    % Find duplicate rows (first occurrence kept)
    [~, unique_idx] = unique(data_without_labels, 'rows', 'stable');
    duplicate_rows = setdiff(1:size(data_without_labels, 1), unique_idx);
    num_duplicates = length(duplicate_rows);
    % Remove the duplicate rows
    data_seg(duplicate_rows, :) = [];
    if num_duplicates > 0
        % Pad with NaN rows so the segment's row count is preserved
        empty_rows = array2table(NaN(num_duplicates, width(data_seg)), ...
            'VariableNames', data_seg.Properties.VariableNames);
        data_seg = [data_seg; empty_rows];
    end
    % Append this segment to the output table
    all_data_segs = [all_data_segs; data_seg];
end
% Save all segments to an Excel file
writetable(all_data_segs, 'data_test_chongfu.xlsx');
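The per-track de-duplication done by the MATLAB script above can be sketched compactly in Python/pandas; the column names and the convention that a non-NaN label marks the start of a track are taken from this document, the toy values are ours:

```python
import numpy as np
import pandas as pd

# Toy data: "label" is non-NaN (0/1) only on the first row of each track
df = pd.DataFrame({
    "azimuth": [10.0, 10.0, 12.0, 50.0, 50.0],
    "range":   [100.0, 100.0, 110.0, 900.0, 900.0],
    "label":   [0, np.nan, np.nan, 1, np.nan],
})

# Assign a track id by counting the labelled start rows cumulatively
df["track"] = df["label"].notna().cumsum()

# Within each track, drop rows that duplicate an earlier row (keep the first);
# including "track" in the subset restricts matching to the same track
feature_cols = ["azimuth", "range"]
deduped = df.drop_duplicates(subset=feature_cols + ["track"], keep="first")
print(len(deduped))  # 3 unique rows remain
```

Unlike the MATLAB script, this sketch does not pad with NaN rows, since pandas keeps the original index and row alignment can be restored from it.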
After de-duplication, rows whose time gap exceeds the threshold are marked with label 3:
clear;
% Read the Excel data
data = readtable('data_test_chongfu.xlsx');
% Extract data columns
azimuth = data{:, 1};         % azimuth
range = data{:, 2};           % slant range
relative_height = data{:, 3}; % relative height
v_h = data{:, 4};             % radial velocity
time = data{:, 5};            % time column
RCS = data{:, 6};             % RCS
labels = data{:, 7};          % label column
time_threshold = 20; % time-gap threshold; a larger threshold segments tracks more accurately
tolerance = 0.2;     % tolerance used when counting gaps of about 4 s, 6 s, etc.
count4 = 0;  % number of gaps of about 4 s
count6 = 0;  % number of gaps of about 6 s
count8 = 0;  % number of gaps of about 8 s
count10 = 0; % number of gaps of about 10 s
count_outrange = 0; % gaps above the threshold
count_1_8 = 0;      % gaps below 1.8 s
count_dt2 = 0;      % gaps between 3 s and 20 s
data.DeltaTime2 = NaN(height(data), 1);
% Store matching time gaps together with their row numbers
matching_deltas = {}; % initialize as an empty cell array
% If the label column is a cell array, convert it to numeric
if iscell(labels)
    labels = str2double(labels); % convert to numeric
end
labels(end+1,:) = 0; % sentinel so the last track is also processed
% If the time column is a datetime, convert it to seconds
if isdatetime(time)
    time = seconds(time - time(1)); % seconds from the first time point
end
% Rows with a non-NaN label mark the start of a track
start_indices = find(~isnan(labels));
% Initialize the array of time gaps
delta_ts = zeros(height(data), 1); % time gaps, initialized to 0
% Loop over every track
for i = 1:length(start_indices) - 1
    % Take one track (from the current labelled row to the next)
    current_start = start_indices(i);
    current_end = start_indices(i + 1) - 1;
    % Extract this segment (column-wise)
    group_time = time(current_start:current_end);
    group_labels = labels(current_start:current_end);
    % Flags to avoid counting the same track more than once
    angle_exceeded = false;
    angle_exceeded_4 = false;
    angle_exceeded_6 = false;
    angle_exceeded_8 = false;
    angle_exceeded_10 = false;
    % Compute the time gaps
    for j = 2:length(group_time)
        delta_t = group_time(j) - group_time(j - 1);
        delta_ts(current_start + j - 1) = delta_t; % record the gap
        % Check whether the gap exceeds the threshold
        if ~angle_exceeded && abs(delta_t) > time_threshold
            % If so, set the label to 3
            labels(current_start + j - 1) = 3;
            count_outrange = count_outrange + 1;
            angle_exceeded = true; % flag set to avoid double counting
        end
        if ~angle_exceeded && abs(delta_t) < 1.8
            % Count gaps below 1.8 s
            count_1_8 = count_1_8 + 1;
            angle_exceeded = true; % flag set to avoid double counting
        end
        % Handle DeltaTime2: gaps between 3 s and 20 s
        if ~angle_exceeded && abs(delta_t - 2) < time_threshold
            if abs(delta_t - 2) > 3
                data.DeltaTime2(current_start + j - 1) = 4; % write into the table column initialized above
                count_dt2 = count_dt2 + 1;
                angle_exceeded = true; % flag set to avoid double counting
            end
        end
        % Check whether the gap is within tolerance of 4 s and not yet counted
        if ~angle_exceeded_4 && abs(delta_t - 4) < tolerance
            count4 = count4 + 1;
            angle_exceeded_4 = true; % flag set to avoid double counting
            % Store the gap and its row number
            matching_deltas{end + 1, 1} = current_start + j - 1; % row number
            matching_deltas{end, 2} = delta_t; % time gap
        end
        % Likewise for the other gap sizes
        if ~angle_exceeded_6 && abs(delta_t - 6) < tolerance
            count6 = count6 + 1;
            angle_exceeded_6 = true; % flag set
        end
        if ~angle_exceeded_8 && abs(delta_t - 8) < tolerance
            count8 = count8 + 1;
            angle_exceeded_8 = true; % flag set
        end
        if ~angle_exceeded_10 && abs(delta_t - 10) < tolerance
            count10 = count10 + 1;
            angle_exceeded_10 = true; % flag set
        end
    end
end
disp("Tracks with a gap above the threshold:");
disp(count_outrange);
disp("Tracks with a gap below 1.8 s:");
disp(count_1_8);
disp("Tracks with a gap of about 4 s:");
disp(count4);
disp("Tracks with a gap of about 6 s:");
disp(count6);
disp("Tracks with a gap of about 8 s:");
disp(count8);
disp("Tracks with a gap of about 10 s:");
disp(count10);
disp("Tracks with a gap between 3 s and 20 s:");
disp(count_dt2);
% Write the updated labels back into the table
labels(end,:) = []; % drop the sentinel row appended earlier
labels = num2cell(labels);
data{:, 7} = labels; % update the label column
% Save the modified table to a new Excel file
output_filename = 'data_test_chongfu_biao3.xlsx';
writetable(data, output_filename);
% Confirm that the file was saved
disp(['Data saved to ', output_filename]);
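The core of the gap-marking rule above (samples nominally 2 s apart; a gap beyond the threshold starts a new track) can be sketched in Python; the 20 s threshold comes from the script, the function name is ours:

```python
import numpy as np

def mark_gaps(times, threshold=20.0):
    """Return a boolean array that is True where a track should be split,
    i.e. where the gap to the previous sample exceeds `threshold` seconds."""
    times = np.asarray(times, dtype=float)
    gaps = np.diff(times, prepend=times[0])  # first gap is 0 by construction
    return gaps > threshold

# Samples every 2 s, with one 30 s dropout between the 3rd and 4th points
t = [0, 2, 4, 34, 36]
print(mark_gaps(t).tolist())  # [False, False, False, True, False]
```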
These processed rows are then numbered: a label of 3 is taken to mark the start of a new UAV/bird track, so both the normal labels and label 3 are treated as track starts and the tracks are numbered sequentially.
% This script reads the UAV and bird data between labels (0/1 and 3) and
% records the label, a sequential track number, and the start and end rows.
% It reads label-column runs 0-to-3 and 1-to-3 directly; runs shorter than
% 4 rows are skipped.
clear;
% Read the Excel data
data = readtable('data_test_chongfu_biao3.xlsx');
% Extract data columns
azimuth = data{:, 1};         % azimuth
range = data{:, 2};           % slant range
relative_height = data{:, 3}; % relative height
v_h = data{:, 4};             % radial velocity
time = data{:, 5};            % time column
RCS = data{:, 6};             % RCS
labels = data{:, 7};          % label column
data.DroneNum = NaN(height(data), 1);
% Add startline and endline columns, initialized to NaN
data.startline = NaN(height(data), 1);
data.endline = NaN(height(data), 1);
time_threshold = 2.5; % time-gap threshold
tolerance = 0.2;      % tolerance used when counting gaps of about 4 s
count4 = 0;  % number of gaps of about 4 s
count6 = 0;  % number of gaps of about 6 s
count8 = 0;  % number of gaps of about 8 s
count10 = 0; % number of gaps of about 10 s
count = 0;
droneNum = 1;
combinedData = [];
% Store matching time gaps together with their row numbers
matching_deltas = {}; % initialize as an empty cell array
count_31 = 0;
% If the label column is a cell array, convert it to numeric
if iscell(labels)
    labels = str2double(labels); % convert to numeric
end
labels(end+1,:) = 0; % sentinel so the last track is also processed
% If the time column is a datetime, convert it to seconds
if isdatetime(time)
    time = seconds(time - time(1)); % seconds from the first time point
end
% Rows with a non-NaN label mark the start of a track
start_indices = find(~isnan(labels));
start3_indices = find(labels == 3);
% Initialize the array of time gaps
delta_ts = zeros(height(data), 1); % time gaps, initialized to 0
data.begin30 = NaN(height(data), 1);
% Mark the rows where a label-3 segment begins
for i = 1:length(start3_indices) - 1
    current3_start = start3_indices(i);
    data.begin30(current3_start) = 1;
    count_31 = count_31 + 1;
end
% Loop over every track
for i = 1:length(start_indices) - 1
    % Take one track (from the current labelled row to the next)
    current_start = start_indices(i);
    current_end = start_indices(i + 1) - 1;
    % Find empty (NaN) rows in the azimuth column
    empty_rows = find(isnan(azimuth(current_start:current_end)));
    if ~isempty(empty_rows)
        % If there are empty rows, end the track just before the first one
        current_end = current_start + empty_rows(1) - 2;
    end
    azi = azimuth(current_start:current_end);
    data.DroneNum(current_start) = droneNum;
    droneNum = droneNum + 1; % next track number
    data.startline(current_start) = current_start;
    data.endline(current_start) = current_end;
    % Count the empty rows in this segment
    nan_count = sum(isnan(azimuth(current_start:current_end)));
    % Record the end row, minus the number of empty rows
    data.endline(current_start) = current_end - nan_count;
    % Append this track's rows to the combined table combinedData
datatemp = data(current_start:current_end, :);
combinedData = [combinedData; datatemp];
end
disp("Total number of tracks:");
disp(length(start_indices))
disp("Tracks running from a label 3 to the next label:");
disp(count_31);
disp("Fraction of tracks with an unmodified label:");
disp((length(start_indices)-count_31)/length(start_indices));
save combinedData.mat;
3. Feature Selection
Whether we use deep learning or classical machine learning, feature extraction is required, so we must derive features from the known measurements: target azimuth, slant range, relative height, radial velocity, timestamp, and RCS. We first inspected the UAV and bird tracks in three-dimensional space to pick out several promising physical quantities, and then plotted their two-dimensional histograms.
These quantities alone are not sufficient, so we selected additional features; to do so more systematically, we adopted several physical quantities from the literature as feature values.
The remaining features are not listed here one by one.
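As one concrete example of the kind of quantity described above, turning-angle and heading-oscillation features can be computed from successive track points. The sketch below assumes 2-D positions already derived from azimuth and slant range; the function and feature names are ours, not from the original code:

```python
import numpy as np

def track_features(x, y, t):
    """Turning-angle and heading-oscillation features for one 2-D track."""
    x, y, t = map(np.asarray, (x, y, t))
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    heading = np.arctan2(dy, dx)                 # heading of each segment
    turn = np.diff(heading)                      # turning angle between segments
    turn = (turn + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    speed = np.hypot(dx, dy) / dt
    sign_changes = np.sum(np.diff(np.sign(turn)) != 0)  # heading oscillations
    return {
        "mean_abs_turn": float(np.mean(np.abs(turn))),
        "osc_rate": float(sign_changes / (t[-1] - t[0])),  # oscillations per second
        "speed_mean": float(np.mean(speed)),
        "speed_std": float(np.std(speed)),
    }

# Straight, constant-speed track: no turning, no oscillation
f = track_features([0, 1, 2, 3], [0, 0, 0, 0], [0, 2, 4, 6])
print(f["mean_abs_turn"], f["speed_mean"])  # 0.0 0.5
```

The intuition is that bird tracks tend to show more frequent heading oscillation and a broader speed distribution than UAV tracks, so these statistics separate the two classes.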
4. Model Training
Once the features were selected, we moved on to model training. We first experimented with classic models such as decision trees, random forests, and SVM, none of which performed particularly well; we then tried AdaBoost before turning to XGBoost.
XGBoost is an efficient gradient-boosted decision tree framework: it works in an ensemble-learning fashion, fitting residuals by accumulating many weak learners, and it uses regularization, parallel computation, and related techniques to optimize training, achieving fast and accurate classification and regression.
4.1 XGBoost Principle
Compared with GBDT, XGBoost offers the following improvements:
First, improvements to the algorithm itself. For the choice of weak learner, GBDT supports only decision trees, whereas XGBoost can also use many other kinds of weak learners, greatly widening the range of models available. For the loss function, besides the basic loss term XGBoost adds a regularization term, which strengthens model stability. As for the optimization, GBDT expands the error term of the loss only to the negative gradient (a first-order Taylor term), while XGBoost applies a second-order Taylor expansion to the loss, which is more precise. These algorithmic improvements are the core of what follows.
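The first- versus second-order distinction can be made precise. At boosting iteration $t$, XGBoost minimizes a regularized objective and approximates it with a second-order Taylor expansion (standard notation; $T$ is the number of leaves and $w$ the leaf weights):

```latex
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)
\;\approx\; \sum_{i=1}^{n} \left[ g_i\, f_t(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(\mathbf{x}_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right),\quad
h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right),\quad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}.
```

GBDT keeps only the first-order term $g_i$; the extra curvature information $h_i$ is what makes XGBoost's optimization step more precise.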
Second, improvements to running efficiency. For each weak learner, taking decision-tree construction as an example, the search for the best split feature and split value is parallelized. Before the parallel search begins, all feature values are sorted and grouped into blocks, preparing the ground for the parallel pass. For the grouped features, a suitable block size is chosen so that the CPU cache speeds up reads, and the blocks are stored across several disks to raise IO throughput, all of which contributes to efficient training.
Third, improvements to robustness. When a feature has missing values, the algorithm decides how to route them by enumerating, at the current node, the cases where the missing values go to the left or the right subtree. In addition, built-in L1 and L2 regularization terms effectively guard against overfitting and noticeably improve generalization.
4.2 Prediction Process
function model = xgboost_train(Xtrain,ytrain,params,max_num_iters,eval_metric,model_filename)
%%% eg:
% load carsmall; Xtrain = [Acceleration Cylinders Displacement Horsepower MPG]; ytrain = cellstr(Origin); ytrain = double(ismember(ytrain,'USA'));
% X = Xtrain(1:70,:); y = ytrain(1:70); Xtest = Xtrain(size(X,1)+1:end,:); ytest = ytrain(size(X,1)+1:end);
% model_filename = []; model = xgboost_train(X,y,[],999,'AUC',model_filename); %%% model_filename = 'xgboost_model.xgb'
% loadmodel = 0; Yhat = xgboost_test(Xtest,model,loadmodel);
% [XX,YY,~,AUC] = perfcurve(ytest,Yhat,1);
% figure; plot(XX,YY,'LineWidth',2); xlabel('False positive rate'); ylabel('True positive rate'); title('ROC for Classification by Logistic Regression'); grid on
% figure; scatter(Yhat,ytest + 0.1*rand(length(ytest),1)); grid on
% unloadlibrary('xgboost')
%
% eg:
skipme = 1;
if skipme == 0
load carsmall; Xtrain = [Acceleration Cylinders Displacement Horsepower MPG]; ytrain = cellstr(Origin); ytrain = double(ismember(ytrain,'USA'));
% Xtrain = [25,93,43;5,90,98;66,73,49;69,36,40;51,5,99]; ytrain = [0,1,1,0,0]';
params = struct;
params.booster = 'gbtree';
params.objective = 'binary:logistic';
params.eta = 0.1;
params.min_child_weight = 1;
params.subsample = 1; % 0.9
params.colsample_bytree = 1;
params.num_parallel_tree = 1;
params.max_depth = 5;
num_iters = 3;
model = xgboost_train(Xtrain,ytrain,params,num_iters,'None',[]);
Yhat = xgboost_test(Xtrain,model,0);
figure; plot(Yhat)
figure; scatter(Yhat,ytrain + 0.1*rand(length(ytrain),1)); grid on
end
%%% Function inputs:
% Xtrain: matrix of inputs for the training set
% ytrain: vector of labels/values for the training set
% params: structure of learning parameters
% max_num_iters: max number of iterations for learning
% eval_metric: evaluation metric for cross-validation performance regarding tuning the optimal number of learning iterations.
% Supported are 'AUC', 'Accuracy', 'None'.
% In case eval_metric=='None', learning of the final model will be performed with max_num_iters (without internal cross validation).
% For other evaluation metrics, up to max_num_iters learning iterations will be performed in a cross-validation procedure.
% In each cross-validation fold, learning will be stopped if eval_metric has not improved over the last early_stopping_rounds iterations.
% Then, learning of the final model will be performed using the average (over CV folds) number of resulting learning iterations.
% model_filename : a string. If nonempty or ~= '', the final model will be saved to a file specified by this string
%%% Function output:
% model: a structure containing:
% iters_optimal; % number of iterations performed by xgboost (final model)
% h_booster_ptr; % pointer to the final model
% params; % model parameters (just for info)
% missing; % value considered "missing"
% train an xgboost model See "D:\cpcardio\physionet_2020\call_xgboost.m", "D:\r\xgboost\py\ex_xgboost.py"
% see also: xgboost_test.m
% See https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html for info on the xgboost library functions
% See https://xgboost.readthedocs.io/en/latest/parameter.html for info on xgboost inputs parameters
%%% Steps to compile xgboost library and use it in Matlab:
% Step 1: create xgboost.dll (on windows)
% Follow these instructions: https://xgboost.readthedocs.io/en/latest/build.html#build-the-shared-library
% - make folder D:\r\xgboost (e.g.)
% - create an empty git repository
% - pull from https://github.com/dmlc/xgboost
% - Git bash here (D:\r\xgboost) - open a git bash. In it type:
% - git submodule init
% - git submodule update
% - install cmake and add path to the env (automatically, just select the option)
% = https://cgold.readthedocs.io/en/latest/first-step/installation.html
% = download and install: https://github.com/Kitware/CMake/releases/download/v3.17.2/cmake-3.17.2-win64-x64.msi
% - In dos, go to folder D:\r\xgboost. In it execute:
% mkdir build
% cd build
% cmake .. -G"Visual Studio 14 2015 Win64"
% # for VS15: cmake .. -G"Visual Studio 15 2017" -A x64
% # for VS16: cmake .. -G"Visual Studio 16 2019" -A x64
% cmake --build . --config Release
%
% Result:
% xgboost.dll is created here: "D:\r\xgboost\lib\xgboost.dll"
%
% Step 2: get a header file:
% - download header file: https://raw.githubusercontent.com/dmlc/xgboost/master/include/xgboost/c_api.h
% - save it to "D:\r\xgboost\lib"
% - rename c_api.h to xgboost.h
%
% Result:
% xgboost.h is created here: "D:\r\xgboost\lib\xgboost.h"
%
% Step 3: run this file (xgboost_train.m) to produce a model and xgboost_test.m to make predictions using the model created in xgboost_train.m
% - The script utilizes an explanation on "Using XGBOOST in c++" provided here: https://stackoverflow.com/questions/36071672/using-xgboost-in-c
% Example code to perform cross-validation for tuning (some) parameters:
skipme = 1;
if skipme == 0
%%% load example data:
% load carsmall; Xtrain = [Acceleration Cylinders Displacement Horsepower MPG]; ytrain = cellstr(Origin); ytrain = double(ismember(ytrain,'USA'));
save_model_to_disk = 0; % 0 or 1 - whether to save model to disk (==1) or not (==0)
eval_metric = 'Accuracy'; % Out-of-sample evaluation metric. One of: 'Accuracy', 'AUC'
if save_model_to_disk == 1
model_filename = 'xgb_model.xgb';
else
model_filename = [];
end
% create CV fold indices
folds = 5;
cvind = ceil(folds*[1:size(Xtrain,1)]/(size(Xtrain,1)))'; % column containing the fold indices for cross validation
rand('state', 0); u1 = rand(size(Xtrain,1),1); cvind = sortrows([u1 , cvind],1); cvind = cvind(:,2:end); clear u1
% create params structure
params = struct;
params.booster = 'gbtree';
params.objective = 'binary:logistic';
params.eta = 0.1;
params.min_child_weight = 1;
params.subsample = 1; % 0.9
params.colsample_bytree = 1;
params.num_parallel_tree = 1;
% set range of possible values for (some) entries of the params structure
params.max_depth_all = [1:5];
num_iters_all = 2.^(1:10); % n_estimators
% perform search over the range of parameters
CVACC = []; % cross validation accuracy criterion
CVAUC = []; % cross validation AUC criterion
for i=1:length(params.max_depth_all)
params.max_depth = params.max_depth_all(i);
for j=1:length(num_iters_all)
disp(['i=' num2str(i) '/' num2str(length(params.max_depth_all)) ', j=' num2str(j) '/' num2str(length(num_iters_all))])
YhatCV_all = zeros(size(Xtrain,1),1);
for kk=1:folds % perform a cross-validation step for params.max_depth_all(i) and num_iters_all(j)
num_iters = num_iters_all(j);
model = xgboost_train(Xtrain(cvind~=kk,:),ytrain(cvind~=kk),params,num_iters,'None',[]);
YhatCV_all(cvind==kk) = xgboost_test(Xtrain(cvind==kk,:),model,0);
end
if strcmp(eval_metric,'Accuracy')
CVACC = [CVACC; [sum(ytrain == round(YhatCV_all))/length(ytrain) params.max_depth num_iters]]; % cross-validation accuracy
elseif strcmp(eval_metric,'AUC')
[~,~,~,AUC] = perfcurve(ytrain,YhatCV_all,0);
CVAUC = [CVAUC; [AUC params.max_depth num_iters]];
end
end
end
if strcmp(eval_metric,'Accuracy')
CV_metric_optimal = CVACC(CVACC(:,1) == max(CVACC(:,1)),:); % [CV_accuracy max_depth num_iters]
elseif strcmp(eval_metric,'AUC')
CV_metric_optimal = CVAUC(CVAUC(:,1) == max(CVAUC(:,1)),:); % [CV_AUC max_depth num_iters]
end
params.max_depth = CV_metric_optimal(1,2);
num_iters = CV_metric_optimal(1,3);
% train the whole model using optimal parameters
model = xgboost_train(Xtrain,ytrain,params,num_iters,'None',model_filename);
YhatTrain = xgboost_test(Xtrain,model,save_model_to_disk); % test on the training set (just for info)
if isfield(model,'h_booster_ptr')
calllib('xgboost', 'XGBoosterFree',model.h_booster_ptr); % clear model data from memory. Do this only ONCE, otherwise Matlab will hang.
model = rmfield(model,'h_booster_ptr');
end
% plot original labels vs predicted probabilities
figure; scatter(YhatTrain, ytrain + 0.1*rand(length(ytrain),1)); grid on
% Plot CV results:
if strcmp(eval_metric,'Accuracy')
CVperf = CVACC;
CV_performance = 'CV Accuracy';
elseif strcmp(eval_metric,'AUC')
CVperf = CVAUC;
CV_performance = 'CV AUC';
end
[Z1,Z2] = meshgrid(params.max_depth_all,num_iters_all);
Z = reshape(CVperf(:,1),size(Z1,1),size(Z1,2));
figure; surfc(Z1,Z2,Z); alpha(0.2); xlabel('max depth'); ylabel('num iters'), zlabel(CV_performance)
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% set some parameters manually:
early_stopping_rounds = 10; % use CV with early_stopping_rounds== [10]
folds = 5; % number of cross validation folds
missing = single(NaN); % set a value to be treated as "missing"
% load example data, if not supplied:
if isempty(Xtrain)
load carsmall
Xtrain = [Acceleration Cylinders Displacement Horsepower MPG]; % contains input data WITHOUT labels
ytrain = cellstr(Origin); % set the labels (target variable)
ytrain = double(ismember(ytrain,'USA'));
end
% set max number of iterations for learning
if isempty(max_num_iters)
max_num_iters = 999; % == num_boost_round
end
% parse xgboost parameters:
%%% default in https://stackoverflow.com/questions/36071672/using-xgboost-in-c:
if isempty(params)
params.booster = 'gbtree';
params.objective = 'binary:logistic';
params.max_depth = 5;
params.eta = 0.1;
params.min_child_weight = 1;
params.subsample = 0.9;
params.colsample_bytree = 1;
params.num_parallel_tree = 1;
end
%%% default in "D:\g\py\hot\postprob\postprob.py", "D:\r\xgboost\py\ex_xgboost.py":
% if isempty(params)
% % params.n_estimators = num2str(1000); % option only available in python
% % params.nthred = num2str(3); % option only available in python
% params.booster = 'gbtree';
% params.objective = 'binary:logistic';
% params.scale_pos_weight = num2str(1);
% params.subsample = num2str(0.9);
% params.gamma = num2str(0);
% params.reg_alpha = num2str(0);
% params.max_depth = num2str(5);
% end
param_fields = fields(params);
for i=1:length(param_fields)
eval(['params.' param_fields{i} ' = num2str(params.' param_fields{i} ');'])
end
% load the xgboost library
if not(libisloaded('xgboost'))
cwd = pwd; cd D:\r\xgboost\lib
loadlibrary('xgboost')
cd(cwd)
end
if ~strcmp(eval_metric,'None')
cvind = ceil(folds*[1:size(Xtrain,1)]/(size(Xtrain,1)))'; % column containing the fold indices for cross validation
rand('state', 0); u1 = rand(size(Xtrain,1),1); cvind = sortrows([u1 , cvind],1); cvind = cvind(:,2:end); clear u1
iters_reached_per_fold = zeros(folds,1);
for kk = 1:folds
% post-process input data
rows = uint64(sum(cvind~=kk)); % use uint64(size(Xtrain,1)) in case of no CV
cols = uint64(size(Xtrain,2));
% create relevant pointers
train_ptr = libpointer('singlePtr',single(Xtrain(cvind~=kk,:)')); % the transposed (cv)training set is supplied to the pointer
train_labels_ptr = libpointer('singlePtr',single(ytrain(cvind~=kk)));
h_train_ptr = libpointer;
h_train_ptr_ptr = libpointer('voidPtrPtr', h_train_ptr);
% convert input matrix to DMatrix
calllib('xgboost', 'XGDMatrixCreateFromMat', train_ptr, rows, cols, missing, h_train_ptr_ptr);
% handle the labels (target variable)
labelStr = 'label';
calllib('xgboost', 'XGDMatrixSetFloatInfo', h_train_ptr, labelStr, train_labels_ptr, rows);
% create the booster and set some parameters
h_booster_ptr = libpointer;
h_booster_ptr_ptr = libpointer('voidPtrPtr', h_booster_ptr);
len = uint64(1);
calllib('xgboost', 'XGBoosterCreate', h_train_ptr_ptr, len, h_booster_ptr_ptr);
for i=1:length(param_fields)
eval(['calllib(''xgboost'', ''XGBoosterSetParam'', h_booster_ptr, ''' param_fields{i} ''', ''' eval(['params.' param_fields{i}]) ''');'])
end
%%% for example:
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'booster', 'gbtree');
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'objective', 'binary:logistic'); % 'reg:linear' , 'binary:logistic'
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'max_depth', '5');
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'eta', '0.1');
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'min_child_weight', '1');
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'subsample', '1'); % '1', '0.5'
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'colsample_bytree', '1');
% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'num_parallel_tree', '1');
%%% Make a (cv) model
% see https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f
% see https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/
%%% calllib('xgboost', 'XGBoosterSetParam', h_booster_ptr, 'eval_metric', 'logloss'); % eg: 'logloss' , 'auc' , 'mae'. NOTE: there is no way to provide early_stopping_rounds inside params
% initialize
AUC_ = []; % AUC - for CV performance evaluation
Acc_ = []; % Accuracy - for CV performance evaluation
% create a test set
h_test_ptr = libpointer;
h_test_ptr_ptr = libpointer('voidPtrPtr', h_test_ptr);
test_ptr = libpointer('singlePtr',single(Xtrain(cvind==kk,:)')); % the transposed (cv)training set is supplied to the pointer
yCV = ytrain(cvind==kk); % not supplied to xgboost.dll
rows = uint64(sum(cvind==kk)); % use uint64(size(Xtrain,1)) in case of no CV
cols = uint64(size(Xtrain,2));
calllib('xgboost', 'XGDMatrixCreateFromMat', test_ptr, rows, cols, missing, h_test_ptr_ptr);
% perform up to max_num_iters learning iterations. Stop learning if eval_metric is not improved over last early_stopping_rounds (number of) iterations
for iter = 0:max_num_iters
% disp(['iter (cv ' num2str(kk) ') = ' num2str(iter)])
calllib('xgboost', 'XGBoosterUpdateOneIter', h_booster_ptr, int32(iter), h_train_ptr);
%%% Make predictions on a CV test set
% predict
out_len = uint64(0);
out_len_ptr = libpointer('uint64Ptr', out_len);
f = libpointer('singlePtr');
f_ptr = libpointer('singlePtrPtr', f);
option_mask = int32(0);
ntree_limit = uint32(0);
training = int32(0);
calllib('xgboost', 'XGBoosterPredict', h_booster_ptr, h_test_ptr, option_mask, ntree_limit, training, out_len_ptr, f_ptr);
% extract predictions
n_outputs = out_len_ptr.Value;
setdatatype(f,'singlePtr',n_outputs);
YhatCV = double(f.Value); % display the predictions (in case objective == 'binary:logistic' : display the predicted probabilities)
% YhatCV = round(YhatCV); % so that we get the label
switch eval_metric
case 'AUC'
% use AUC as evaluation metric
[~,~,~,AUC] = perfcurve(yCV,YhatCV,1);
AUC_ = [AUC_; AUC];
if length(AUC_) > early_stopping_rounds && AUC_(iter-early_stopping_rounds+2) == max(AUC_(iter-early_stopping_rounds+2:end))
iters_reached_per_fold(kk) = iter-early_stopping_rounds+2;
break
end
case 'Accuracy'
% use Accuracy as evaluation metric
Acc = [sum(yCV == round(YhatCV)) / length(yCV)];
Acc_ = [Acc_; Acc];
if length(Acc_) > early_stopping_rounds && Acc_(iter-early_stopping_rounds+2) == max(Acc_(iter-early_stopping_rounds+2:end))
iters_reached_per_fold(kk) = iter-early_stopping_rounds+2;
break
end
otherwise
% free xgboost internal structures
if exist('h_train_ptr','var')
calllib('xgboost', 'XGDMatrixFree',h_train_ptr); clear h_train_ptr
end
if exist('h_test_ptr','var')
calllib('xgboost', 'XGDMatrixFree',h_test_ptr); clear h_test_ptr
end
if exist('h_booster_ptr','var')
calllib('xgboost', 'XGBoosterFree',h_booster_ptr); clear h_booster_ptr
end
disp('Evaluation metric not supported')
return
end
end
% free xgboost internal structures
if exist('h_train_ptr','var')
calllib('xgboost', 'XGDMatrixFree',h_train_ptr); clear h_train_ptr
end
if exist('h_test_ptr','var')
calllib('xgboost', 'XGDMatrixFree',h_test_ptr); clear h_test_ptr
end
if exist('h_booster_ptr','var')
calllib('xgboost', 'XGBoosterFree',h_booster_ptr); clear h_booster_ptr
end
end
iters_optimal = round(mean(iters_reached_per_fold)); % estimated optimal number of learning iterations for the whole training set
disp('optimal iterations per cv fold:')
disp(iters_reached_per_fold)
else
iters_optimal = max_num_iters;
end
%%% Train the final model using the whole training set and number of iterations == iters_optimal:
% post-process input data
rows = uint64(size(Xtrain,1)); % use uint64(size(Xtrain,1)) in case of no CV
cols = uint64(size(Xtrain,2));
Xtrain = Xtrain'; % DLL is row-based, and matlab is column-based
% create relevant pointers
train_ptr = libpointer('singlePtr',single(Xtrain));
train_labels_ptr = libpointer('singlePtr',single(ytrain));
h_train_ptr = libpointer;
h_train_ptr_ptr = libpointer('voidPtrPtr', h_train_ptr);
% convert input matrix to DMatrix
calllib('xgboost', 'XGDMatrixCreateFromMat', train_ptr, rows, cols, missing, h_train_ptr_ptr);
% handle the labels (target variable)
labelStr = 'label';
calllib('xgboost', 'XGDMatrixSetFloatInfo', h_train_ptr, labelStr, train_labels_ptr, rows);
% create the booster and set some parameters
h_booster_ptr = libpointer;
h_booster_ptr_ptr = libpointer('voidPtrPtr', h_booster_ptr);
len = uint64(1);
calllib('xgboost', 'XGBoosterCreate', h_train_ptr_ptr, len, h_booster_ptr_ptr);
for i=1:length(param_fields)
eval(['calllib(''xgboost'', ''XGBoosterSetParam'', h_booster_ptr, ''' param_fields{i} ''', ''' eval(['params.' param_fields{i}]) ''');'])
end
% perform iters_optimal learning iterations to produce a final model (pointer to it)
for iter = 0:iters_optimal
% disp(['iter (final model) = ' num2str(iter)])
calllib('xgboost', 'XGBoosterUpdateOneIter', h_booster_ptr, int32(iter), h_train_ptr);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
model = struct;
model.iters_optimal = iters_optimal; % number of iterations performed by xgboost
model.h_booster_ptr = h_booster_ptr; % pointer to the final model
model.params = params; % just for info
model.missing = missing; % value considered "missing"
model.model_filename = ''; % initialize: filename for model (to be saved)
if ~(isempty(model_filename) || strcmp(model_filename,''))
calllib('xgboost', 'XGBoosterSaveModel', h_booster_ptr_ptr, model_filename);
model.model_filename = model_filename; % 'xgboost_model.xgb'
end
% free xgboost internal structures
if exist('h_train_ptr','var')
calllib('xgboost', 'XGDMatrixFree',h_train_ptr); clear h_train_ptr
end
if exist('h_booster_ptr','var')
% calllib('xgboost', 'XGBoosterFree',h_booster_ptr); clear h_booster_ptr
end
% % unload the xgboost library
% if libisloaded('xgboost')
% unloadlibrary('xgboost')
% end
Running the above produced preliminary results. In subsequent tests, however, accuracy stayed around 70%, so we tried XGBoost, which ultimately performed well, at around 95%.
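The Stacking combination mentioned in the introduction can be sketched with scikit-learn; the base learners and hyperparameters below are illustrative choices of ours, on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First-level learners produce out-of-fold predictions; a logistic-regression
# meta-learner combines them (5-fold internal CV by default)
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))
```

Because the meta-learner is trained on out-of-fold predictions, the stack generalizes better than naively averaging the base learners' in-sample outputs.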
References
[1]刘佳,徐群玉,陈唯实.无人机雷达航迹运动特征提取及组合分类方法[J].系统工程与电子技术,2023,45(10):3122-3131.
[2]王储君,万显荣,张伟民.基于不平衡数据集的无人机和飞鸟目标分类[C]//中国电子学会电波传播分会.第十七届全国电波传播年会会议论文集.武汉大学电子信息学院,2022:4.
[3]管康萍,冯正康,马小艳,等.一种基于航迹特征的无人机与飞鸟目标雷达识别方法[J].上海航天(中英文),2024,41(01):130-136.
[4]汪浩,窦贤豪,田开严,等.基于CNN的雷达航迹分类方法[J].舰船电子对抗,2023,46(05):70-74.
[5]M. Rosamilia, A. Balleri, A. De Maio, A. Aubry and V. Carotenuto, “Radar Detection Performance Prediction Using Measured UAVs RCS Data,” in IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 4, pp. 3550-3565, Aug. 2023.