SplitCap: splitting session flows and extracting payloads

This post describes how to process network traffic data with the SplitCap tool: splitting session flows, extracting session content, converting the sessions into PNG images, and further converting them into the MNIST format so they can be used to train a deep-learning model. The pipeline covers data preprocessing, imaging, and format conversion; the end goal is to turn raw network communication data into input suitable for CNN training.

SplitCap is a tool that splits traffic by session flow. A session flow is identified by four fields: source IP, destination IP, source port, and destination port, and it groups both directions of the same connection (the commented -s flow lines in the script below split by unidirectional flow instead).
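As a rough illustration of how both directions of a connection end up in the same session file, the following Python sketch builds a direction-independent session key (the session_key function and the sample addresses are made up for illustration; SplitCap does this grouping internally):

def session_key(proto, src_ip, src_port, dst_ip, dst_port):
    # Sort the two endpoints so that A->B and B->A packets map to the same key,
    # which is how a bidirectional session is grouped.
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (proto,) + tuple(endpoints)

# Both directions of the same conversation produce the same key:
k1 = session_key('TCP', '10.0.0.1', 51322, '93.184.216.34', 80)
k2 = session_key('TCP', '93.184.216.34', 80, '10.0.0.1', 51322)
assert k1 == k2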

Reference GitHub link:

https://github.com/echowei/DeepTraffic/tree/master/1.malware_traffic_classification

This repository includes the SplitCap tool.

1. Splitting session flows with SplitCap

The PowerShell script that splits the pcap files with SplitCap is as follows:

# For every pcap file in 1_Pcap, split it into per-session files
foreach($f in gci 1_Pcap *.pcap)
{
    # All-layers output: each session keeps its full packet data (pcap files)
    0_Tool\SplitCap_2-1\SplitCap -p 100000 -b 100000 -r $f.FullName -o 2_Session\AllLayers\$($f.BaseName)-ALL
    # Alternative: split by unidirectional flow instead of bidirectional session
    # 0_Tool\SplitCap_2-1\SplitCap -p 100000 -b 100000 -r $f.FullName -s flow -o 2_Session\AllLayers\$($f.BaseName)-ALL
    gci 2_Session\AllLayers\$($f.BaseName)-ALL | ?{$_.Length -eq 0} | del    # drop empty outputs

    # L7 output: -y L7 keeps only the application-layer payload of each session
    0_Tool\SplitCap_2-1\SplitCap -p 100000 -b 100000 -r $f.FullName -o 2_Session\L7\$($f.BaseName)-L7 -y L7
    # 0_Tool\SplitCap_2-1\SplitCap -p 100000 -b 100000 -r $f.FullName -s flow -o 2_Session\L7\$($f.BaseName)-L7 -y L7
    gci 2_Session\L7\$($f.BaseName)-L7 | ?{$_.Length -eq 0} | del    # drop empty outputs
}

0_Tool\finddupe -del 2_Session\AllLayers
0_Tool\finddupe -del 2_Session\L7
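finddupe deletes byte-identical session files so that duplicated traffic does not bias the dataset. A rough Python equivalent (a sketch only, not finddupe's actual implementation) that removes duplicates by content hash:

import hashlib
import os
import sys

def remove_duplicates(root):
    # Delete every file whose content hash has already been seen under root
    seen = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)       # duplicate content: drop it
            else:
                seen[digest] = path   # first occurrence: keep it

if __name__ == '__main__':
    remove_duplicates(sys.argv[1])    # e.g. 2_Session\AllLayers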

2. Extracting session content

Next, preprocess the sessions produced by SplitCap: take the payload above the transport layer (the L7 output), select sessions per class, split them into training and test sets, and trim or pad each file to a fixed length. The PowerShell script is as follows:

$SESSIONS_COUNT_LIMIT_MIN = 0
$SESSIONS_COUNT_LIMIT_MAX = 6000
$TRIMED_FILE_LEN = 784
$SOURCE_SESSION_DIR = "2_Session\L7"

echo "If Sessions more than $SESSIONS_COUNT_LIMIT_MAX we only select the largest $SESSIONS_COUNT_LIMIT_MAX."
echo "Finally Selected Sessions:"

$dirs = gci $SOURCE_SESSION_DIR -Directory
foreach($d in $dirs)
{
    $files = gci $d.FullName
    $count = $files.count
    if($count -gt $SESSIONS_COUNT_LIMIT_MIN)
    {             
        echo "$($d.Name) $count"        
        if($count -gt $SESSIONS_COUNT_LIMIT_MAX)
        {
            $files = $files | sort Length -Descending | select -First $SESSIONS_COUNT_LIMIT_MAX
            $count = $SESSIONS_COUNT_LIMIT_MAX
        }

        $files = $files | resolve-path
        # Randomly pick 1/10 of the files as the test set; the rest become the training set
        $test  = $files | get-random -count ([int]($count/10))
        $train = $files | ?{$_ -notin $test}     

        $path_test  = "3_ProcessedSession\FilteredSession\Test\$($d.Name)"
        $path_train = "3_ProcessedSession\FilteredSession\Train\$($d.Name)"
        ni -Path $path_test -ItemType Directory -Force
        ni -Path $path_train -ItemType Directory -Force    

        cp $test -destination $path_test        
        cp $train -destination $path_train
    }
}

echo "All files will be trimed to $TRIMED_FILE_LEN length and if it's even shorter we'll fill the end with 0x00..."

$paths = @(('3_ProcessedSession\FilteredSession\Train', '3_ProcessedSession\TrimedSession\Train'), ('3_ProcessedSession\FilteredSession\Test', '3_ProcessedSession\TrimedSession\Test'))
foreach($p in $paths)
{
    foreach ($d in gci $p[0] -Directory) 
    {
        ni -Path "$($p[1])\$($d.Name)" -ItemType Directory -Force
        foreach($f in gci $d.fullname)
        {
            $content = [System.IO.File]::ReadAllBytes($f.FullName)
            $len = $f.length - $TRIMED_FILE_LEN
            if($len -gt 0)
            {        
                $content = $content[0..($TRIMED_FILE_LEN-1)]        
            }
            elseif($len -lt 0)
            {        
                $padding = [Byte[]] (,0x00 * ([math]::abs($len)))
                $content = $content + $padding
            }
            # Windows PowerShell 5.x syntax; on PowerShell 7+ use -AsByteStream instead of -encoding byte
            Set-Content -value $content -encoding byte -path "$($p[1])\$($d.Name)\$($f.Name)"
        }        
    }
}

The result is shown below: the .bin files on the left are the extracted session flows, and the diagram on the right shows the protocol layering of the network traffic. As can be seen, a session flow is simply the payload above the TCP layer.
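To make "payload above the TCP layer" concrete, the sketch below reassembles the application-layer bytes of one session with scapy (this only illustrates what a SplitCap -y L7 output file contains and is not part of the original pipeline; the input file name is hypothetical):

from scapy.all import rdpcap, TCP  # assumes scapy is installed

def l7_payload(pcap_path):
    # Concatenate the bytes above the TCP layer, packet by packet
    payload = b''
    for pkt in rdpcap(pcap_path):
        if TCP in pkt:
            payload += bytes(pkt[TCP].payload)
    return payload

print(len(l7_payload('example_session.pcap')))  # hypothetical input file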

3. Converting session flows to PNG images

The content of each file is just a byte sequence. Hex-encode the bytes, take every two hex characters as one byte, convert each byte to an integer and store it in a numpy array, then reshape the array so that each row holds 28 values. The Python code is as follows:

import binascii
import numpy

def getMatrixfrom_pcap(filename, width):
    # Read the raw bytes of the session file
    with open(filename, 'rb') as f:
        content = f.read()
    # Hex-encode the content, then take every two hex characters as one byte value (0-255)
    hexst = binascii.hexlify(content)
    fh = numpy.array([int(hexst[i:i+2], 16) for i in range(0, len(hexst), 2)])
    # Keep only full rows and reshape so that each row holds `width` values
    rn = int(len(fh)/width)
    fh = numpy.reshape(fh[:rn*width], (-1, width))
    fh = numpy.uint8(fh)
    return fh

Then batch-process the session flows generated in the previous step; the code is as follows:

import os
from PIL import Image

PNG_SIZE = 28  # 784 bytes per trimmed session -> 28 x 28 image

paths = [[r'3_ProcessedSession\TrimedSession\Train', r'4_Png\Train'],
         [r'3_ProcessedSession\TrimedSession\Test', r'4_Png\Test']]
for p in paths:
    for i, d in enumerate(os.listdir(p[0])):
        dir_full = os.path.join(p[1], str(i))
        os.makedirs(dir_full, exist_ok=True)  # replaces the original mkdir_p helper
        for f in os.listdir(os.path.join(p[0], d)):
            bin_full = os.path.join(p[0], d, f)
            print(bin_full)
            im = Image.fromarray(getMatrixfrom_pcap(bin_full, PNG_SIZE))
            png_full = os.path.join(dir_full, os.path.splitext(f)[0] + '.png')
            im.save(png_full)

4. Converting the PNG images to MNIST format

First, for the MNIST dataset format, see the following link:

https://blog.csdn.net/qq_20936739/article/details/82011320

File header (excerpt from the conversion script):

	# header for label array
	header = array('B')
	header.extend([0,0,8,1])
	header.append(int('0x'+hexval[2:][0:2],16))
	header.append(int('0x'+hexval[2:][2:4],16))
	header.append(int('0x'+hexval[2:][4:6],16))
	header.append(int('0x'+hexval[2:][6:8],16))	
	data_label = header + data_label

	# additional header for images array	
	if max([width,height]) <= 256:
		header.extend([0,0,0,width,0,0,0,height])
	else:
		raise ValueError('Image exceeds maximum size: 256x256 pixels');

	header[3] = 3 # Changing MSB for image data (0x00000803)

Data section (excerpt from the conversion script):

	for filename in FileList:
		print(filename)
		label = int(filename.split('\\')[2])
		Im = Image.open(filename)
		pixel = Im.load()
		width, height = Im.size
		for x in range(0,width):
			for y in range(0,height):
				data_image.append(pixel[y,x])
		data_label.append(label) # labels start (one unsigned byte each)
	hexval = "{0:#0{1}x}".format(len(FileList),6) # number of files in HEX
	hexval = '0x' + hexval[2:].zfill(8)

Names needs to be edited to point at the Test and Train sets in turn (run the conversion once for each), for example:

Names = [[r'4_Png\Test', r'5_Mnist\t10k']]
# Names = [[r'4_Png\Train', r'5_Mnist\train']]

Once the PNG data has been processed, datasets in idx-ubyte format are generated, which can then be fed to the neural network for training.
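A quick way to sanity-check the generated files is to parse the idx-ubyte header fields (big-endian magic number, item count, and for images the row/column counts). A minimal sketch, assuming the files follow the standard MNIST layout (the file name below is hypothetical):

import struct

def read_idx_header(path):
    # Image files start with magic 0x00000803, then count, rows, cols;
    # label files start with magic 0x00000801, then count only.
    with open(path, 'rb') as f:
        magic, count = struct.unpack('>II', f.read(8))
        if magic == 0x803:
            rows, cols = struct.unpack('>II', f.read(8))
            print(path, 'images:', count, 'size:', rows, 'x', cols)
        elif magic == 0x801:
            print(path, 'labels:', count)

read_idx_header(r'5_Mnist\train-images-idx3-ubyte')  # hypothetical file name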

It is worth noting that even if the dataset contains only 7 classes, the label dimension is still 10: the labels are one-hot encoded, and positions 7, 8 and 9 are simply always 0.
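For example, with 7 classes the one-hot labels still have 10 dimensions; a small numpy sketch of that encoding:

import numpy as np

def one_hot(label, depth=10):
    # Encode an integer label as a 10-dimensional one-hot vector;
    # for a 7-class dataset, positions 7, 8 and 9 simply stay 0.
    vec = np.zeros(depth, dtype=np.float32)
    vec[label] = 1.0
    return vec

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]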

A CNN is then trained on the result.

The code targets TensorFlow 1.x (it uses tf.InteractiveSession and the tensorflow.examples.tutorials.mnist input pipeline, both of which were removed in TensorFlow 2); judging by the path in the comment below, it was run under Python 2.7. The TensorFlow source code is as follows:

import time
import sys
import numpy as np
import os

from tensorflow.examples.tutorials.mnist import input_data
# start tensorflow interactiveSession
import tensorflow as tf

# Note: if class number is 2 or 20, please edit the variable named "num_classes" in /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py
DATA_DIR = sys.argv[1]
CLASS_NUM = int(sys.argv[2])
TRAIN_ROUND = int(sys.argv[3])

folder = os.path.split(DATA_DIR)[1]

sess = tf.InteractiveSession()

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('data_dir', DATA_DIR, 'Directory for storing data')

mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

# function: find a element in a list
def find_element_in_list(element, list_element):
    try:
        index_element = list_element.index(element)
        return index_element
    except ValueError:
        return -1

# weight initialization
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape = shape)
    return tf.Variable(initial)

# convolution
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
# pooling
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# Create the model
# placeholder
x = tf.placeholder("float", [None, 784])
y_ = tf.placeholder("float", [None, CLASS_NUM])

# first convolutinal layer
w_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, w_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# second convolutional layer
w_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, w_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# densely connected layer
w_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, w_fc1) + b_fc1)

# dropout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# readout layer
w_fc2 = weight_variable([1024, CLASS_NUM])
b_fc2 = bias_variable([CLASS_NUM])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, w_fc2) + b_fc2)

# define var&op of training&testing
actual_label = tf.argmax(y_, 1)
label,idx,count = tf.unique_with_counts(actual_label)
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
train_step = tf.train.GradientDescentOptimizer(1e-4).minimize(cross_entropy)
predict_label = tf.argmax(y_conv, 1)
label_p,idx_p,count_p = tf.unique_with_counts(predict_label)
correct_prediction = tf.equal(predict_label, actual_label)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
correct_label=tf.boolean_mask(actual_label,correct_prediction)
label_c,idx_c,count_c=tf.unique_with_counts(correct_label)

# if model exists: restore it
# else: train a new model and save it
saver = tf.train.Saver()
model_name = "model_" + str(CLASS_NUM) + "class_" + folder
model =  model_name + '/' + model_name + ".ckpt"
if not os.path.exists(model):
    sess.run(tf.global_variables_initializer())  # tf.initialize_all_variables() is deprecated
    if not os.path.exists(model_name):
        os.makedirs(model_name)
    # with open('out.txt','a') as f:
    #     f.write(time.strftime('%Y-%m-%d %X',time.localtime()) + "\n")
    #     f.write('DATA_DIR: ' + DATA_DIR+ "\n")
    for i in range(TRAIN_ROUND+1):
        batch = mnist.train.next_batch(50)
        if i%100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_:batch[1], keep_prob:1.0})
            s = "step %d, train accuracy %g" %(i, train_accuracy)
            print (s)
            # if i%2000 == 0:
            #     with open('out.txt','a') as f:
            #         f.write(s + "\n")
        train_step.run(feed_dict={x:batch[0], y_:batch[1], keep_prob:0.5})
    
    save_path = saver.save(sess, model)
    print("Model saved in file:", save_path)
else:        
    saver.restore(sess, model)
    print("Model restored: " + model)

 
