Series
KDDCup99 Network Intrusion Detection
Preface
This article is written from the author's own understanding; if you see it differently, feel free to comment.
I. Steps
1. Importing the data
import pandas as pd
data = pd.read_csv(r'D:/jixiexuexi/data/KDDCup.csv',engine='python')
data.head()
2. Reading and exploring the data
data.isnull().sum() # count missing values per column
data['label'].value_counts() # counts of each label value
The data has no missing values, so we proceed to preprocessing.
3. Data preprocessing
3.1 Cleaning the labels
data['label'] = data['label'].apply(lambda x: x[:-1]) # 1. strip the trailing '.' from each label value
data_new = data.iloc[:, :31] # 2. keep only the first 31 columns; online references indicate the rest can be ignored
data_new.columns # inspect the column names
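The slicing lambda above assumes every raw label really does end in a '.'. pandas' string accessor does the same cleanup in one vectorized pass and also tolerates values without the dot; a minimal sketch on a toy frame (the three labels are illustrative, not the full dataset):

```python
import pandas as pd

# Toy labels mimicking the raw KDDCup99 format, where each
# value carries a trailing '.' (e.g. 'normal.', 'smurf.').
toy = pd.DataFrame({'label': ['normal.', 'smurf.', 'neptune.']})

# str.rstrip('.') removes trailing dots vectorized, with no
# risk of chopping the last real character off a clean label.
toy['label'] = toy['label'].str.rstrip('.')
print(toy['label'].tolist())  # ['normal', 'smurf', 'neptune']
```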
3.2 Removing uninformative features
# According to online references, 'num_outbound_cmds', 'is_host_login' and 'service' have no effect on the label, so we drop them
not_use = ['num_outbound_cmds', 'is_host_login', 'service'] # features to drop
features = [i for i in data_new.columns if i not in not_use] # names of the remaining features
data_new = data_new.loc[:, features] # feature matrix
data_new['label'] = data['label'] # re-attach the label column
3.3 Grouping the labels into five classes
DOS = ['back', 'land', 'neptune', 'pod', 'smurf', 'teardrop']
Probin = ['ipsweep', 'nmap', 'portsweep', 'satan']
R2L = ['ftp_write', 'guess_passwd', 'imap', 'multihop', 'phf', 'spy', 'warezclient', 'warezmaster']
U2R = ['buffer_overflow', 'loadmodule', 'perl', 'rootkit']
# boolean masks marking the rows of each attack category (everything else stays 'normal')
index_DOS = data_new['label'].apply(lambda x: x in DOS)
index_Probin = data_new['label'].apply(lambda x: x in Probin)
index_R2L = data_new['label'].apply(lambda x: x in R2L)
index_U2R = data_new['label'].apply(lambda x: x in U2R)
# relabel the rows using the masks
data_new.loc[index_DOS, 'label'] = 'DOS'
data_new.loc[index_Probin, 'label'] = 'Probin'
data_new.loc[index_R2L, 'label'] = 'R2L'
data_new.loc[index_U2R, 'label'] = 'U2R'
data_new['label'].value_counts() # distribution of the five classes
# the label distribution is highly imbalanced
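Because the classes are this imbalanced (DOS and normal dominate while U2R has only a handful of rows), two common mitigations are a stratified train/test split and a class-weighted tree. The sketch below shows both knobs on synthetic data; it is not the article's actual training run:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class imbalanced data standing in for the KDDCup99 labels.
X_demo, y_demo = make_classification(n_samples=1000, n_classes=3,
                                     n_informative=4,
                                     weights=[0.8, 0.15, 0.05],
                                     random_state=0)

# stratify keeps the class proportions identical in train and test,
# so the rare class cannot vanish from either split by chance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# class_weight='balanced' reweights samples by inverse class frequency,
# so splits are not dominated by the majority class.
clf = DecisionTreeClassifier(class_weight='balanced', random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```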
3.4 Separating discrete and continuous features
features_discrete = ['land', 'protocol_type', 'flag', 'su_attempted', 'is_guest_login'] # discrete features
features_consecutive = [i for i in features if i not in features_discrete] # continuous features
4. Feature selection (using a tree model)
from sklearn.ensemble import ExtraTreesClassifier
data_consecutive = data_new.loc[:, features_consecutive] # continuous features
y = data_new['label'] # target labels
model = ExtraTreesClassifier() # extra-trees ensemble
model.fit(data_consecutive, y) # fit the model
# importance score of each feature; larger means more important
model.feature_importances_
# indices of features with importance above 0.01; enumerate avoids the
# pitfall of list.index(), which returns only the first match when two
# features happen to share the same importance value
col = [i for i, imp in enumerate(model.feature_importances_) if imp > 0.01]
feature_select = data_consecutive.columns[col] # names of the selected features
feature_select
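The manual threshold filter can also be expressed with scikit-learn's SelectFromModel, which wraps the same importance-based selection behind one object; a sketch on synthetic data (the toy feature matrix is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data: 10 features, of which only 3 are informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, random_state=0)

tree = ExtraTreesClassifier(n_estimators=100, random_state=0)
# threshold=0.01 keeps features whose importance exceeds 0.01,
# mirroring the manual filter above.
selector = SelectFromModel(tree, threshold=0.01).fit(X_demo, y_demo)
mask = selector.get_support()           # boolean mask over the 10 features
X_reduced = selector.transform(X_demo)  # keeps only the selected columns
print(mask.sum(), X_reduced.shape)
```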
5. Dummy-encoding the discrete features
data_discrete = data_new.loc[:, features_discrete] # discrete features
data_discrete_new = pd.get_dummies(data_discrete) # one-hot (dummy) encoding: one 0/1 indicator column per category
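To see what pd.get_dummies actually produces, here is a tiny example on one categorical column: each distinct category becomes its own 0/1 indicator column.

```python
import pandas as pd

toy = pd.DataFrame({'protocol_type': ['tcp', 'udp', 'icmp', 'tcp']})

# One indicator column per category, named <column>_<category>;
# categories are ordered alphabetically.
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
# ['protocol_type_icmp', 'protocol_type_tcp', 'protocol_type_udp']
```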
6. Concatenating discrete and continuous features
X = pd.concat([data_consecutive[feature_select], data_discrete_new], axis=1) # join the two blocks column-wise
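One thing to keep in mind with pd.concat(axis=1): it aligns rows by index, not by position, so the two blocks must share the same row index (here they do, since both were sliced from data_new). A tiny demo:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [3, 4]}, index=[0, 1])

# axis=1 pastes columns side by side, matching rows by index label;
# mismatched indices would introduce NaN rows instead of erroring.
joined = pd.concat([left, right], axis=1)
print(joined.shape)  # (2, 2)
```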
7. Building and evaluating the model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2) # split into training and test sets
model = DecisionTreeClassifier() # decision tree classifier
model.fit(X_tr, y_tr) # fit on the training set
pre = model.predict(X_te)
score = model.score(X_te, y_te) # accuracy on the test set
print('Test-set accuracy:', score)
print('Test-set confusion matrix:\n', confusion_matrix(y_te, pre))
print('Test-set classification report:\n', classification_report(y_te, pre))
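With labels this imbalanced, the per-class recall in the report matters more than overall accuracy: the model can score high overall while almost never recovering a rare class such as U2R. A toy illustration of reading the matrix (labels here are made up for the demo):

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = ['DOS', 'DOS', 'DOS', 'normal', 'U2R', 'U2R']
y_pred = ['DOS', 'DOS', 'DOS', 'normal', 'normal', 'DOS']

# Rows are true classes, columns predicted; labels= fixes the ordering.
cm = confusion_matrix(y_true, y_pred, labels=['DOS', 'U2R', 'normal'])
print(cm)  # the U2R row has no diagonal hit: recall is 0 for U2R
# Accuracy is still 4/6 despite every U2R sample being missed.
print(classification_report(y_true, y_pred, zero_division=0))
```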