KDD99 Dataset Preprocessing

不懂开发的程序猿

Article link: https://blog.csdn.net/qq_44807756/article/details/138341814

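Notice: reproduction in any form without permission is prohibited. If you quote this article, please cite its link. The article is 3,077 words long and takes about 3 minutes to read.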
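Overview

KDD99, short for the KDD Cup 1999 dataset, is a large public dataset widely used in network intrusion detection research.

1. Source: the dataset comes from the KDD Cup 1999 competition, held in conjunction with the KDD-99 conference. Its records were derived from the 1998 DARPA Intrusion Detection Evaluation Program, which was prepared and managed by MIT Lincoln Labs.

2. Contents: the dataset consists of network connection records, each labeled either as normal or as an attack. The attacks fall into four categories: DoS (denial of service), U2R (user to root), R2L (remote to local), and probing.

3. Characteristics:

Large scale: the full training set contains about 4.9 million connection records.
Diversity: it covers many attack patterns as well as normal network behavior.
Imbalance: the number of records differs greatly between the attack types and normal traffic.

4. Format: the data is distributed as comma-separated text (CSV). Each record has 41 feature attributes and 1 class label.

5. Feature types: the features are continuous or discrete and describe different aspects of a connection, such as duration, protocol type, and the amount of traffic.

6. Class label: every connection record carries a class label that says whether the connection is normal or which attack type it belongs to.

7. Preprocessing: before KDD99 is used, it usually goes through data cleaning, feature engineering, and dataset splitting.

8. Research use: KDD99 is widely used in machine learning, data mining, and pattern recognition research, especially for anomaly detection and intrusion detection systems.

9. Limitations: KDD99 has known problems, such as class imbalance and features that reflect outdated network traffic, which limit how well it applies to modern network environments.

10. Improved versions: to overcome some of these limitations, researchers have created revised versions such as NSL-KDD, which removes redundant records.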
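The class imbalance mentioned in item 3 can be checked by counting how often each label occurs. The following is only a minimal sketch on a made-up label list: the `labels` values are hypothetical and are not taken from the dataset. With real data, `labels` would hold the class-label column of the records.

```matlab
% Toy label column (hypothetical values, for illustration only)
labels = {'normal'; 'smurf'; 'smurf'; 'neptune'; 'normal'; 'smurf'};

% unique returns the distinct labels (sorted) and, for every row,
% the index of its label within that sorted list
[classes, ~, idx] = unique(labels);

% accumarray adds a 1 for every occurrence of each class index
counts = accumarray(idx, 1);
shares = counts / numel(labels);

for k = 1:numel(classes)
    fprintf('%-10s %d records (%.1f%%)\n', classes{k}, counts(k), 100*shares(k));
end
```

A strongly skewed `shares` vector is the signal that resampling or class weighting may be worth considering later on.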

Dataset Download

KDD Cup 1999 Data - University of California, Irvine

https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Data Format and Description

KDD99 is a well-known network intrusion detection dataset, first released for the KDD Cup 1999 competition, the Third International Knowledge Discovery and Data Mining Tools Competition. Its source, contents, characteristics, and limitations are as listed in the overview. The points specific to the data format are:

1. Record layout: the data is distributed as comma-separated text (CSV). Each line is one connection record with 41 feature values followed by 1 class label.

2. Feature types: the features are continuous or discrete and cover aspects such as duration, protocol type, and the amount of traffic. Columns 2, 3, and 4 (protocol type, service, flag) hold text values; the other features are stored as numbers.

3. Class label: the class label says whether the connection is normal or names the specific attack type.
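To make the record layout concrete, the sketch below splits one record into its 41 features and its class label. The `record` string is an illustrative line written in the KDD99 layout, and the variable names are assumptions chosen for this example.

```matlab
% One illustrative record in the KDD99 layout: 41 features + 1 class label
record = ['0,tcp,http,SF,181,5450,' ...                          % fields 1-6
          '0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,' ...                 % fields 7-22
          '8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,' ...          % fields 23-31
          '9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,' ...     % fields 32-41
          'normal.'];                                            % field 42: label

fields   = strsplit(record, ',');   % 1x42 cell array of character vectors
features = fields(1:41);
label    = fields{42};

nominal_idx    = [2 3 4];                           % protocol type, service, flag
numeric_idx    = setdiff(1:41, nominal_idx);
nominal_values = features(nominal_idx);
numeric_values = str2double(features(numeric_idx)); % 1x38 numeric vector

fprintf('label: %s, protocol: %s, service: %s, flag: %s\n', ...
        label, nominal_values{1}, nominal_values{2}, nominal_values{3});
fprintf('number of numeric features: %d\n', numel(numeric_values));
```

Only the three nominal columns need an encoding step before the record can be stored in a purely numeric matrix.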

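Preprocessing

Steps

Preprocessing is an important step before a machine learning model is trained on KDD99. The key steps are:

1. Data cleaning: check the dataset for missing values and outliers. KDD99 was already cleaned to some extent before release, but further cleaning may be needed depending on the goal of the analysis.

2. Format conversion: KDD99 is usually provided as CSV. If another tool is used for the analysis, the data may have to be converted to that tool's format, for example ARFF for Weka.

3. Feature encoding: apply one-hot encoding or label encoding to the categorical features (protocol type, service, flag) and to the attack-type label so that machine learning models can process them.

4. Feature scaling: different features can differ by orders of magnitude, so scale them (standardization or normalization) to keep model training stable and efficient.

5. Data splitting: divide the data into a training set and a test set. KDD99 already ships with separate training and test sets; the training set is used to fit the model and the test set to evaluate it.

6. Handling class imbalance: some attack types in KDD99 are far more common than others. Oversampling the minority classes, undersampling the majority classes, or generating synthetic samples can reduce the imbalance.

7. Understanding the dataset: know what each of the 41 fixed feature attributes means and how it may affect the model.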
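Steps 3 and 4 can be sketched on a toy example as follows. This is a minimal illustration under stated assumptions, not the script used later in this article: the `protocol` and `src_bytes` values are made up, and min-max scaling is just one of the scaling options named in step 4.

```matlab
% Toy data (hypothetical values): one nominal and one numeric feature
protocol  = {'tcp'; 'udp'; 'tcp'; 'icmp'};
src_bytes = [181; 0; 5450; 1032];

% Label encoding: map each category to an integer code
% (unique sorts the categories alphabetically)
[cats, ~, codes] = unique(protocol);

% One-hot encoding: one column per category, a single 1 in every row
one_hot = zeros(numel(protocol), numel(cats));
one_hot(sub2ind(size(one_hot), (1:numel(protocol))', codes)) = 1;

% Min-max scaling to [0, 1], with a guard for a constant column
span = max(src_bytes) - min(src_bytes);
if span > 0
    scaled = (src_bytes - min(src_bytes)) / span;
else
    scaled = zeros(size(src_bytes));
end

disp(cats');
disp([codes one_hot scaled]);
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one column per category. With about 70 service names in KDD99, that cost is worth weighing.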

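Raw dataset

[Image not preserved: raw dataset]

Preprocessed dataset

[Image not preserved: preprocessed dataset]

Preprocessing result

[Image not preserved: preprocessing result]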

Preprocessing Code

clear all;
clc;
profile on;
ticID = tic;
t = cputime;
matrix_normalized = zeros(); % store the normalized values

%% Input and output files are listed manually
excel_file = {'KDDTrain+.csv'}; % input file(s), one per row of the cell array
NF = size(excel_file,1); % number of input files

% Intermediate output: the data with nominal values replaced by numbers.
% "PDF" stands for the empirical probability of each nominal value in its column.
output_pdf = {'PDF_sample.csv'};

% Final output: the normalized data
output_csv = {'NORM_sample.csv'};

% define all protocols to be normalized
protocol_type = {'tcp';'udp';'icmp'}; % you can define more
Nprotocol = size(protocol_type,1); % how many protocols
M = zeros(Nprotocol,1); % save the amount of each protocol
pdf_p = zeros(Nprotocol,1); % save the probability of each protocol

% The Flag in KDD has the following values:
flag = {'OTH';'REJ';'RSTO';'RSTOS0';'RSTR';'S0';'S1';'S2';'S3';'SF';'SH'};
Nflag = size(flag,1);
F = zeros(Nflag,1);
pdf_f = zeros(Nflag,1);

service = {'aol'; 'http_443'; 'http_8001'; 'http_2784';...
           'domain_u'; 'ftp_data'; 'auth'; 'bgp'; 'courier';...
           'tftp_u'; 'uucp_path'; 'csnet_ns'; 'ctf';...
           'daytime';'time'; 'discard'; 'domain'; 'echo';...
           'eco_i'; 'ecr_i'; 'efs'; 'exec'; 'finger'; 'gopher';...
           'harvest'; 'hostnames'; 'http'; 'imap4'; 'IRC';...
           'iso_tsap'; 'klogin'; 'kshell'; 'ldap'; 'link';...
           'login'; 'smtp'; 'mtp'; 'name';...
           'netbios_dgm'; 'netbios_ns'; 'netbios_ssn'; 'netstat';...
           'nnsp'; 'nntp'; 'ntp_u'; 'other'; 'pm_dump'; 'pop_2';...
           'pop_3'; 'printer'; 'private'; 'red_i'; 'remote_job'; ...
           'rje'; 'shell'; 'sql_net'; 'ssh'; 'sunrpc';...
           'supdup'; 'systat'; 'telnet'; 'tim_i';...
           'urh_i'; 'urp_i'; 'uucp';'ftp'; 'vmnet';...
           'whois'; 'X11'; 'Z39_50'};
Nservice = size(service,1); % return number of services
N = zeros(Nservice,1); % count of each service (single-column array of zeros)
pdf_s = zeros(Nservice,1);

%===================== Load the Dataset from xls File ===================%

for f=1:NF
% read everything into one cell array
fprintf('Start processing the File : %s', excel_file{f});fprintf('\n');
[~,~,raw] = xlsread(excel_file{f});
% find numbers
containsNumbers = cellfun(@isnumeric,raw);
% convert to string
raw(containsNumbers) = cellfun(@num2str,raw(containsNumbers),'UniformOutput',false);
row_count = size(raw,1);   % how many rows
col_count = size(raw,2);   % how many columns

% unfortunately each nominal column has to be specified manually
proto_col = raw(:,2); % protocol type column (nominal, set by hand)
flag_col = raw(:,4); % flag column (nominal, set by hand)
service_col = raw(:,3); % service column (nominal, set by hand)

% count each protocol type and turn the count into a probability
% (assumed: probability = count / number of rows; the source never fills pdf_p)
for i=1:Nprotocol
        M(i) = sum(strcmp(protocol_type(i),proto_col));
        pdf_p(i) = M(i)/row_count;
end

%% replace protocol type by its probability: nominal --> numeric
% exact matching keeps a short name from being replaced inside a longer one
for p = 1:length(protocol_type)
proto_col(strcmp(protocol_type{p},proto_col)) = {num2str(pdf_p(p))};
end

% calculate the probabilities of flag (same assumption as for pdf_p)
for i=1:Nflag
        F(i) = sum(strcmp(flag(i),flag_col));
        pdf_f(i) = F(i)/row_count;
end

% replace flag by its probability: nominal --> numeric
% (substring replacement would corrupt 'RSTOS0', which contains 'RSTO' and 'S0')
for fg = 1:length(flag)
flag_col(strcmp(flag{fg},flag_col)) = {num2str(pdf_f(fg))};
end

%====== Service column calculation and replacement
for i=1:Nservice
        N(i) = sum(strcmp(service(i),service_col));
        pdf_s(i) = N(i)/row_count;
end

%% replace service by its probability: nominal --> numeric
for s = 1:length(service)
service_col(strcmp(service{s},service_col)) = {num2str(pdf_s(s))};
end
% Set all values back to the main cell matrix file
raw(:,2) = proto_col; % nominal column, set by hand
raw(:,4) = flag_col; % nominal column, set by hand
raw(:,3) = service_col; % nominal column, set by hand

%====== write the PDF file that the normalization step reads
fid=fopen(output_pdf{f},'wt');
   for i=1:row_count
     fprintf(fid,'%s,',raw{i,1:end-1});
     fprintf(fid,'%s\n',raw{i,end});
   end
   fclose(fid);
fprintf('-->  Currently generated is : %s',output_pdf{f});fprintf('\n');

%===================== End Converting ===============================%

end

%%=================== Start Normalization ===========================%

for f=1:NF
%===================== Load the intermediate PDF file ===================%
%data = xlsread ('PDF_most_valuable_and_relevant_features.xlsx');
data = load (output_pdf{f}); % load needs a purely numeric table: no header row, no text columns
row_count = size(data,1);   % how many rows
col_count = size(data,2);   % how many columns
%=========================================================================%

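raw_matrix = data;
matrix_normalized = zeros(row_count,col_count); % preallocate the output

% loop over the features and normalize each one individually
for i=1:col_count

    selected_column = raw_matrix(:,i);
    maximum = max(selected_column);
    minimum = min(selected_column);
    if maximum > 1
        for j=1:size(selected_column,1)
            if selected_column(j) == 0
                matrix_normalized(j,i) = 0;
            else
                % assumed min-max scaling: this branch is missing in the source
                % and is inferred from the maximum/minimum computed above
                matrix_normalized(j,i) = (selected_column(j) - minimum) / (maximum - minimum);
            end
        end
    else
        for j=1:size(selected_column,1)
            matrix_normalized(j,i) = selected_column(j); % do not normalize and save the values directly
        end
    end
end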

%   % write to a csv file 
csvwrite(output_csv{f},matrix_normalized); % output file

fprintf('==> Finished is: %s',output_csv{f});fprintf('\n');

end
%==================== End Normalizing =================================%

fprintf('Total execution time is: %f \n', cputime-t);
fclose all;
clear;
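The nominal-to-numeric step of the script replaces every category by its relative frequency in the column. As a compact illustration of that idea, the sketch below does the same on a toy column without listing the categories by hand. The `flag_col` values are hypothetical, and this is one possible formulation rather than the script's own code.

```matlab
% Toy flag column (hypothetical values)
flag_col = {'SF'; 'S0'; 'SF'; 'RSTOS0'; 'SF'; 'S0'};

[values, ~, idx] = unique(flag_col);          % distinct categories and per-row index
freq = accumarray(idx, 1) / numel(flag_col);  % relative frequency of each category
encoded = freq(idx);                          % each entry replaced by its category's frequency

for k = 1:numel(values)
    fprintf('%-8s -> %.4f\n', values{k}, freq(k));
end
disp(encoded');
```

Because the categories are matched through `idx` rather than by substring, a name such as `RSTOS0` cannot be partially rewritten by the shorter `S0`. Note that two categories with the same frequency receive the same code, which is a known trade-off of frequency encoding.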