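Network traffic identification with neural networks: feature extraction either (1) takes the first 24 bytes of each raw packet and arranges 24 packets into a 576-pixel (24 × 24) image for CNN classification, or strips the headers and takes the first 1024 bytes of the payload; (2) the sizes of the transmitted packets...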

Latest recommended article published on 2024-08-10 14:52:04

djph26741


Article tags: artificial intelligence, network, php

Original link: http://www.cnblogs.com/bonelee/p/10303108.html


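This article discusses network traffic identification with deep learning, in particular convolutional neural networks (CNNs). The research surveyed here indicates that the first few packets of a flow, especially the early packets of a TCP connection, carry the information that matters most for identification. Extracting the first 24 bytes or the first 1024 bytes of each packet, and either converting them to an image or feeding them directly to a CNN, allows high-accuracy classification of application protocols and malware traffic. The reported experiments show that this approach also handles encrypted traffic and avoids the limits of hand-crafted feature engineering in traditional methods.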

Summary generated by CSDN using intelligent technology.

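A survey of the international literature:

"Network Traffic Classification via Neural Networks" uses a fully connected network on features produced by traditional machine-learning feature engineering. Its top 10 features are: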

List of Attributes

1. Port number server
2. Minimum segment size client→server
3. First quartile of number of control bytes in each packet client→server
4. Maximum number of bytes in IP packets server→client
5. Maximum number of bytes in Ethernet package server→client
6. Maximum segment size server→client
7. Mean segment size server→client
8. Median number of control bytes in each packet bidirectional
9. Number of bytes sent in initial window client→server
10. Minimum segment size server→client

Table 7: Top 10 attributes as determined by connection weights
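The caption says the attributes were ranked "by connection weights", but the exact procedure is not given here. As a rough illustration only, one common heuristic scores each input by the summed absolute weight of its connections into the first hidden layer. The function name `rank_inputs_by_weight` and the scoring rule are assumptions for this sketch, not the paper's method.

```python
from typing import List, Sequence


def rank_inputs_by_weight(
    weights: Sequence[Sequence[float]],
    names: Sequence[str],
    top: int = 10,
) -> List[str]:
    """Rank input features by summed absolute first-layer weight.

    weights[i][j] is the weight from input feature i to hidden unit j.
    This scoring rule is an assumed heuristic, not taken from the paper.
    """
    scores = [sum(abs(w) for w in row) for row in weights]
    # Sort feature indices from the largest score to the smallest.
    order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order[:top]]
```

With three features and two hidden units, the feature whose connections have the largest magnitudes comes first, regardless of the sign of its weights.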

"Deep Learning for Encrypted Traffic Classification: An Overview", a 2018 article, describes how traffic classification techniques have evolved:

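Case study: traffic identification tasks (classes such as Skype, WeChat, and BT).

1. The simplest method uses port numbers. Its accuracy has been declining, however, because newer applications either use well-known port numbers to disguise their traffic or do not use standard registered port numbers.
2. Payload inspection, or deep packet inspection (DPI), looks for patterns or keywords inside packets. These methods work only on unencrypted traffic and have high computational overhead.
3. ML methods rely on statistical or time-series features, which lets them handle both encrypted and unencrypted traffic. They usually use classical machine learning (ML) algorithms such as random forest (RF) and k-nearest neighbors (KNN). Their performance depends heavily on hand-designed features, which limits how well they generalize.
4. DL methods do not need a domain expert to select features, because deep learning selects features automatically through training. This makes deep learning a very attractive approach to traffic classification, especially when new classes keep appearing and the patterns of old classes evolve. Deep learning also has a considerably higher learning capacity than traditional ML methods, so it can learn highly complex patterns. Combining these two properties, deep learning works as an end-to-end method that learns the nonlinear relationship between the raw input and the corresponding output.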
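The port-number approach described above amounts to a table lookup, which also makes its weakness easy to see. The sketch below is illustrative only; the table `WELL_KNOWN_PORTS` and the function `classify_by_port` are names made up for this example, and the table holds just a few standard port assignments.

```python
# A few standard server ports; a real table would be much larger.
WELL_KNOWN_PORTS = {22: "SSH", 53: "DNS", 80: "HTTP", 443: "HTTPS"}


def classify_by_port(server_port: int) -> str:
    """Label a flow by its server port alone; unknown ports get "unknown"."""
    return WELL_KNOWN_PORTS.get(server_port, "unknown")
```

Any application that tunnels over port 443 is labelled "HTTPS" here, and any application on an unregistered port is labelled "unknown", which is exactly why the accuracy of this method keeps falling.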

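The article points out that in many studies these features are representative, and that even for encrypted traffic the first packets of a flow, up to the first 20, are enough for reasonable accuracy ("the first few packets up to first 20 packets have been shown to be enough for reasonable accuracy even for encrypted traffic"). In other words, feature extraction needs only the first 20 packets of a flow. This is a very important point!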
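Truncating a flow to its first packets, and each packet to its first bytes, can be sketched as follows. The defaults of 24 packets and 24 bytes follow the title of this post; the function name `flow_to_image`, the zero padding of short packets and short flows, and the scaling to [0, 1] are assumptions made for illustration, not details confirmed by the cited papers.

```python
from typing import List


def flow_to_image(
    packets: List[bytes],
    n_packets: int = 24,
    n_bytes: int = 24,
) -> List[List[float]]:
    """Build an n_packets x n_bytes matrix with values in [0, 1].

    Row i holds the first n_bytes bytes of packet i. Short packets and
    missing packets are zero-padded (an assumption of this sketch).
    """
    rows = []
    for pkt in packets[:n_packets]:
        head = pkt[:n_bytes]
        head = head + bytes(n_bytes - len(head))  # zero-pad short packets
        rows.append([b / 255.0 for b in head])
    # Flows with fewer than n_packets packets get all-zero rows.
    while len(rows) < n_packets:
        rows.append([0.0] * n_bytes)
    return rows
```

The resulting 24 × 24 matrix has 576 entries and can be treated as a single-channel grayscale image by a 2D CNN.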

A summary of the literature that uses deep learning for detection:

| Paper | Category | DL method | Online | Features | Year |
| --- | --- | --- | --- | --- | --- |
| Wang-2018 [8] | Intrusion detection | CNN+LSTM | ✓ | Header+payload | 2018 |
| Aceto [5] | APP classification | CNN/LSTM/SAE/MLP | ✓ | Header+payload | - |
| Vu [9] | Traffic identification | AC-GAN | ✕ | Statistical | 2017 |
| Wang-2017 [6] | Traffic identification | CNN | ✓ | Header+payload | 2017 |
| Seq2Img [7] | APP/protocol identification | RKHS+CNN | ✓ | Time series | 2017 |
| Lotfollahi [10] | APP/traffic identification | CNN/SAE | ✕ | Header+payload | 2017 |
| Lopez-Martin [11] | Mixed-type classification | CNN+LSTM | ✓ | Header+time series | 2017 |
| Hochst [12] | Traffic identification | Autoencoder | ✕ | Statistical+header | 2017 |

TABLE I OVERVIEW OF DEEP LEARNING METHODS USED FOR TRAFFIC CLASSIFICATION.

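"Byte Segment Neural Network for Network Traffic Classification" essentially performs classification with an LSTM plus attention.
It also notes that the features of the first few packets of a network flow are the most important: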
Features are collected from different levels of traffic and calculated on full flow. Bernaille et al. observed that the size and the direction of the first few packets of the TCP connection are very considerable. According to this, they proposed a model based on simple K-Means [12].
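To make the idea of clustering on the size and direction of the first few packets concrete, here is a minimal sketch. It encodes each flow as signed packet sizes (positive for client→server, negative for server→client) and runs a plain K-Means on those vectors. The signed-size encoding, the naive initialization, and the names `signed_sizes` and `kmeans` are assumptions for illustration; the details of the model by Bernaille et al. are not given in the passage.

```python
from typing import List, Sequence, Tuple

# (size in bytes, True if the packet goes client -> server)
Packet = Tuple[int, bool]


def signed_sizes(flow: Sequence[Packet], n_first: int = 5) -> List[float]:
    """Encode the first n_first packets as signed sizes, zero-padded."""
    feats = [float(size if c2s else -size) for size, c2s in flow[:n_first]]
    return feats + [0.0] * (n_first - len(feats))


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, n_iter: int = 20) -> List[int]:
    """Plain K-Means; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Flows with a small request followed by large responses end up in a different cluster from flows that exchange only small packets in both directions, even though no payload byte is inspected.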

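"A novel QUIC traffic Classifier based on Convolutional Neural Networks": note that the QUIC protocol is also encrypted!
This paper likewise uses a CNN, specifically a one-dimensional CNN, to identify the type of application running on top of QUIC. Its input is the payload of the QUIC packets, which run over UDP (50 to 1392 bytes). The original text: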
There are four main steps in the pre-processing phase including data link header removal, byte conversation, normalization and zero padding. The data-link header contains some information related to the physic layer which plays an important role in forwarding the frames in the network. However, this information is useless for traffic classification, so the data-link header will be filtered in the data link header removal step. Besides, we only use the payload of QUIC packet because we found that other information in QUIC packet is useless for the classification. Then the packet in the dataset will be converted from bit to byte in order to reduce the input size. For better performance, all packet bytes are normalized using dividing by 255, the maximum value for a byte. The CNN requires the same input length while the packet length in the dataset varies from over 50 to 1392 bytes. Therefore, the dataset will be added some zero values in the zero padding step to have the similar length of each packet. The packet with packet length less than 1400, are padded zero at the end. Finally, each packet comprises 1400 values corresponding to 1400 features.
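The normalization and zero-padding steps from the quoted passage can be sketched in a few lines. The sketch assumes the data-link header has already been removed and the QUIC payload is available as a `bytes` object; the function name `preprocess_payload` is made up for this example, and truncating payloads longer than 1400 bytes is an assumption, since the passage only mentions lengths up to 1392.

```python
from typing import List

MAX_LEN = 1400  # fixed input length stated in the quoted passage


def preprocess_payload(payload: bytes, max_len: int = MAX_LEN) -> List[float]:
    """Normalize payload bytes to [0, 1] and zero-pad to max_len values."""
    # Truncation is an assumption of this sketch; the passage does not
    # describe payloads longer than the fixed input length.
    values = [b / 255.0 for b in payload[:max_len]]
    # Zero padding at the end, as the passage describes.
    return values + [0.0] * (max_len - len(values))
```

Each packet then becomes a vector of 1400 features, which matches the fixed-length input a one-dimensional CNN expects.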