《Web安全之深度学习实战》Notes: Chapter 10, User Behavior Analysis and Malicious Behavior Detection

This post describes how UBA (User Behavior Analytics) can be used to detect malicious operations, based on a case study built on the SEA dataset. The dataset contains behavior logs from more than 70 UNIX users. Features are extracted with bag-of-words (wordbag), n-gram, Word2Vec, and word-set methods, and then NB, XGBoost, MLP, CNN, and RNN models are trained and tested. The experiments show that the models perform differently on malicious behavior detection, with XGBoost standing out.

        This chapter uses the SEA dataset to present a typical UBA application scenario: detecting malicious operation behavior. We have in fact already met this dataset in 《Web安全之机器学习入门》.

        We refer collectively to malicious insiders and anomalous operations by internal employees as malicious operations. Detecting them requires advanced techniques such as User Behavior Analytics (UBA), an emerging technology that provides data-protection and fraud-detection capabilities that were previously missed. Working alongside the systems users operate every day, UBA applies dedicated security-analytics algorithms that track not only the initial login but every subsequent action a user takes. UBA has two main functions: it helps establish a baseline of each user's normal activity, and it quickly flags behavior that deviates from that baseline so a security analyst can investigate. Some anomalies may not look malicious at first glance, which is exactly why an analyst needs to dig further and determine whether the behavior is legitimate or malicious.

1. Dataset

        The SEA dataset contains behavior logs of more than 70 UNIX users, collected from the shell commands recorded by UNIX systems. 15,000 commands were collected for each user. Fifty users were randomly chosen from the user pool as normal users, and commands from the remaining users were randomly inserted into their data to simulate attacks launched by internal masqueraders.

        Each user's data is divided into 150 blocks of 100 consecutive commands. The first third of the blocks (50 blocks) is used to train that user's normal-behavior model, while malicious data for testing is randomly inserted into the remaining two thirds. The distribution of malicious data in SEA follows a statistical pattern: any given test command block contains malicious commands with probability 1%, and once a block contains malicious commands, the probability that the following block also does reaches 80% [2]. In other words, SEA treats consecutive blocks as a session and can only simulate attacks that span consecutive, correlated sessions. In addition, the dataset contains relatively few malicious (black) samples. That is closer to reality, but it makes a random train/test split risky: with the usual splitting methods there is a fair chance the training set would contain only benign (white) samples, so the split in this chapter needs special handling to guarantee that the training set contains enough malicious samples (see the check sketched below).
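        In this chapter the "special handling" amounts to an order-preserving cut-off (index = 80 in the code below) rather than a random shuffle, so that some of the labelled malicious blocks fall into the training portion. A quick sanity check, assuming the get_cmdlines() helper defined in the next listing:

# Sketch: verify that malicious blocks appear on both sides of the fixed split.
x, y = get_cmdlines()
index = 80
print("malicious blocks in train:", int(y[:index].sum()))
print("malicious blocks in test:",  int(y[index:].sum()))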

The data-loading source code is shown below:

import numpy as np

cmdlines_file="../data/uba/MasqueradeDat/User7"
labels_file="../data/uba/MasqueradeDat/label.txt"
word2ver_bin="uba_word2vec.bin"
max_features=300
index = 80

def get_cmdlines():
    # User7 holds 15000 commands, one per line; group them into
    # 150 blocks of 100 consecutive commands each.
    x=np.loadtxt(cmdlines_file,dtype=str)
    x=x.reshape((150,100))
    # label.txt has 100 rows x 50 columns: one row per test block (blocks 51-150)
    # and one column per user; column 6 (0-based) corresponds to User7.
    y=np.loadtxt(labels_file, dtype=int,usecols=6)
    y=y.reshape((100, 1))
    # The first 50 blocks of every user are guaranteed to be normal,
    # so prepend 50 zero labels to obtain one label per block.
    y_train=np.zeros([50,1],int)
    y=np.concatenate([y_train,y])
    y=y.reshape((150, ))

    return x,y

For more details on how this dataset is processed, see my earlier notes:

《Web安全之机器学习入门》Notes: Chapter 5, Section 5.3, Detecting Anomalous Operations with K-Nearest Neighbors (Part 1) (mooyuan, CSDN blog)

2. Feature Extraction

(1) Wordbag

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def get_features_by_wordbag():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    # Join each block of 100 commands into one space-separated string
    # so it can be fed to CountVectorizer.
    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

    # Order-preserving split: the first `index` blocks for training,
    # the remaining blocks for testing.
    x_train=x[0:index,]
    x_test=x[index:,]
    y_train=y[0:index,]
    y_test=y[index:,]

    # TF-IDF weighting fitted on the full corpus, then applied to both splits.
    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)

    return x_train, x_test, y_train, y_test

Take the first element as an example. Load the dataset by calling:

x_arr,y=get_cmdlines()

At this point the first element looks like this:

['cpp' 'sh' 'xrdb' 'cpp' 'sh' 'xrdb' 'mkpts' 'test' 'stty' 'hostname'
 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'env' 'echo' 'sh' 'userenv'
 'wait4wm' 'xhost' 'xsetroot' 'reaper' 'xmodmap' 'sh' '[' 'cat' 'stty'
 'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh'
 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'launchef' 'launchef'
 'sh' '9term' 'sh' 'launchef' 'sh' 'launchef' 'hostname' '[' 'cat' 'stty'
 'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
 'more' 'sh' 'ex' 'sendmail' 'sendmail' 'sh' 'MediaMai' 'sendmail' 'sh'
 'rm' 'MediaMai' 'sh' 'rm' 'MediaMai' 'launchef' 'launchef']

Next, join the commands of each block into a single string:

    x=[]
    print(x_arr[0])
    print(np.array(x_arr).shape)
    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

The first element then becomes:

cpp sh xrdb cpp sh xrdb mkpts test stty hostname date echo [ find chmod tty echo env echo sh userenv wait4wm xhost xsetroot reaper xmodmap sh [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh launchef launchef sh 9term sh launchef sh launchef hostname [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh ex sendmail sendmail sh MediaMai sendmail sh rm MediaMai sh rm MediaMai launchef launchef
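Note that with its default token_pattern (r"(?u)\b\w\w+\b"), CountVectorizer only keeps tokens of at least two word characters and lowercases them, so single-character commands such as '[' never make it into the vocabulary. A quick check:

# Sketch: how CountVectorizer's default analyzer tokenizes a command string.
from sklearn.feature_extraction.text import CountVectorizer
print(CountVectorizer().build_analyzer()("sh [ MediaMai cpp"))
# -> ['sh', 'mediamai', 'cpp']   ('[' dropped, tokens lowercased)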

Next comes the bag-of-words step:

    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

After the bag-of-words transform, the result is a sparse matrix; each entry below is (row, column index) followed by the count:

 (0, 13)	2
  (0, 90)	25
  (0, 129)	2
  (0, 67)	1
  (0, 103)	1
  (0, 95)	3
  (0, 46)	4
  (0, 14)	3
  (0, 25)	7
  (0, 9)	3
  (0, 109)	3
  (0, 28)	1
  (0, 114)	1
  (0, 117)	1
  (0, 123)	1
  (0, 132)	1
  (0, 82)	1
  (0, 127)	1
  (0, 8)	2
  (0, 51)	6
  (0, 1)	1
  (0, 30)	1
  (0, 89)	3
  (0, 64)	3
  (0, 84)	2

Then the TF-IDF transform is applied:

    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_train = transformer.transform(x_train)

After the transform the first row looks like this:

  (0, 132)	0.07590139034306102
  (0, 129)	0.15180278068612205
  (0, 127)	0.08038431251734222
  (0, 123)	0.07590139034306102
  (0, 117)	0.08038431251734222
  (0, 114)	0.07590139034306102
  (0, 109)	0.15401560934616065
  (0, 103)	0.07805309706428058
  (0, 95)	0.15273504757432566
  (0, 90)	0.7130806021902251
  (0, 89)	0.160729204804717
  (0, 84)	0.06701351493784334
  (0, 82)	0.08038431251734222
  (0, 67)	0.07590139034306102
  (0, 64)	0.12397049936474483
  (0, 51)	0.29322010438631935
  (0, 46)	0.1367884925425544
  (0, 30)	0.04484290866252592
  (0, 28)	0.07590139034306102
  (0, 25)	0.35638177767342655
  (0, 14)	0.09652546185265033
  (0, 13)	0.149769219205087
  (0, 9)	0.15147374310180983
  (0, 8)	0.06933404617293713
  (0, 1)	0.16804767932167497

The shapes of x_train and x_test are shown below. The feature vector length is 136 even though max_features=300: CountVectorizer only allocates one dimension per distinct token, and this user's command history yields only 136 distinct tokens after tokenization, so the vocabulary, and hence the feature length, stops at 136.

max_features=300
x_train (80, 136)
x_test (70, 136)
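To confirm that the limiting factor is the vocabulary rather than max_features, the fitted vectorizer can be inspected directly (a small sketch; it assumes the vectorizer object from get_features_by_wordbag is still in scope):

# Sketch: the vocabulary learned from User7 has only 136 distinct tokens.
print(len(vectorizer.vocabulary_))    # -> 136
print(x_train.shape, x_test.shape)    # -> (80, 136) (70, 136)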

(2) n-gram

def get_features_by_ngram():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Same pipeline as the wordbag version, but counting 2- to 4-grams of
    # commands; token_pattern=r'\b\w+\b' keeps single-character tokens too.
    vectorizer = CountVectorizer(
                                 ngram_range=(2, 4),
                                 token_pattern=r'\b\w+\b',
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

    x_train=x[0:index,]
    x_test=x[index:,]
    y_train=y[0:index,]
    y_test=y[index:,]

    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)

    return x_train, x_test, y_train, y_test
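Unlike the wordbag model, each feature here is a short sequence of two to four consecutive commands, which captures some ordering information. A small sketch for peeking at the learned features (the example strings are illustrative, not actual output):

# Sketch: inspect a few learned n-gram features (assumes the vectorizer
# fitted inside get_features_by_ngram is accessible). Each key is a
# command n-gram such as "sh more" or "hostname date echo".
for feature in list(vectorizer.vocabulary_)[:5]:
    print(feature)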

(3) Word2Vec

import os
import multiprocessing
import gensim
from sklearn.preprocessing import scale

def get_features_by_word2vec():
    global word2ver_bin
    global index
    global max_features

    x_all=[]
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Build the Word2Vec training corpus from the raw command files of
    # users 1-29; each file becomes one list of command tokens.
    for i in range(1,30):
        filename="../data/uba/MasqueradeDat/User%d" % i
        with open(filename) as f:
            x_all.append([w.strip('\n') for w in f.readlines()])

    cores=multiprocessing.cpu_count()

    # Note: size= and iter= are the gensim < 4.0 parameter names
    # (vector_size= and epochs= in gensim >= 4.0).
    if os.path.exists(word2ver_bin):
        print ("Find cache file %s" % word2ver_bin)
        model=gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model=gensim.models.Word2Vec(size=max_features, window=5, min_count=1, iter=60, workers=cores)
        model.build_vocab(x_all)
        model.train(x_all, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)

    # Each command block is mapped to the average of its commands' vectors
    # (see the buildWordVector sketch below), then standardized.
    x = np.concatenate([buildWordVector(model, z, max_features) for z in x])
    x = scale(x)

    x_train = x[0:index,]
    x_test = x[index:,]

    y_train = y[0:index,]
    y_test = y[index:,]

    return x_train, x_test, y_train, y_test
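The buildWordVector helper used above is not included in this excerpt. A minimal sketch of what it is assumed to do, namely average the Word2Vec vectors of all commands in one block (the string-splitting branch is my addition to cope with the space-joined blocks built above):

def buildWordVector(model, text, size):
    # Average the Word2Vec vectors of the commands in one block (sketch).
    # Accept either a list of command tokens or a space-joined string.
    words = text.split() if isinstance(text, str) else text
    vec = np.zeros((1, size))
    count = 0.
    for word in words:
        try:
            vec += model.wv[word].reshape((1, size))
            count += 1.
        except KeyError:
            # Commands missing from the Word2Vec vocabulary are skipped.
            continue
    if count != 0:
        vec /= count
    return vec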

(4) Word set (index sequences)

import tflearn

def get_features_by_wordseq():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Map every command to an integer index; each block becomes a fixed-length
    # sequence of max_features (300) indices, zero-padded after the 100 commands.
    vp=tflearn.data_utils.VocabularyProcessor(max_document_length=max_features,
                                              min_frequency=0,
                                              vocabulary=None,
                                              tokenizer_fn=None)
    x=vp.fit_transform(x, unused_y=None)
    x = np.array(list(x))

    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]

    return x_train, x_test, y_train, y_test
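The output here is not a count vector but a fixed-length sequence of integer word IDs, which is exactly the input format the embedding layers of the CNN and RNN models below expect. A small usage sketch (the printed values are illustrative, not actual output):

# Sketch: inspect the word-set features (values shown are illustrative).
x_train, x_test, y_train, y_test = get_features_by_wordseq()
print(x_train.shape)      # (80, 300): 100 command IDs followed by 200 zero pads
print(x_train[0][:8])     # e.g. [1 2 3 1 2 3 4 5] -- small integer word IDs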

3. Model Construction

(1) NB

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn import metrics

def do_nb(x_train, x_test, y_train, y_test):
    # GaussianNB expects dense arrays; sparse TF-IDF matrices need .toarray().
    gnb = GaussianNB()
    gnb.fit(x_train,y_train)
    y_pred=gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))
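GaussianNB only accepts dense arrays, so the sparse TF-IDF matrices produced by get_features_by_wordbag have to be densified first. A hedged usage sketch (this driver code is assumed and not part of the excerpt above):

# Hypothetical driver code: naive Bayes on bag-of-words TF-IDF features.
print("nb and wordbag")
x_train, x_test, y_train, y_test = get_features_by_wordbag()
do_nb(x_train.toarray(), x_test.toarray(), y_train, y_test)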

Run result:

nb and wordbag
              precision    recall  f1-score   support

           0       0.98      0.97      0.98        64
           1       0.71      0.83      0.77         6

    accuracy                           0.96        70
   macro avg       0.85      0.90      0.87        70
weighted avg       0.96      0.96      0.96        70

[[62  2]
 [ 1  5]]

(2) XGBoost

import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    # Gradient-boosted trees with default hyperparameters.
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))

Run result:

xgboost and wordbag
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        64
           1       1.00      0.50      0.67         6

    accuracy                           0.96        70
   macro avg       0.98      0.75      0.82        70
weighted avg       0.96      0.96      0.95        70

[[64  0]
 [ 3  3]]
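The confusion matrix shows that half of the malicious blocks are missed (recall 0.50 for class 1), which is unsurprising given how few black samples the split contains. One optional tweak, not part of the book's code, is to compensate for the imbalance via XGBoost's scale_pos_weight parameter:

# Sketch (my addition): up-weight the rare malicious class during training.
ratio = float((y_train == 0).sum()) / max(int((y_train == 1).sum()), 1)
xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(x_train, y_train)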

(3) MLP

Source code:

from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    global max_features
    # A small fully connected network: two hidden layers of 5 and 2 units.
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes = (5, 2),
                        random_state = 1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))

Run result:

mlp and wordbag
              precision    recall  f1-score   support

           0       0.91      1.00      0.96        64
           1       0.00      0.00      0.00         6

    accuracy                           0.91        70
   macro avg       0.46      0.50      0.48        70
weighted avg       0.84      0.91      0.87        70

[[64  0]
 [ 6  0]]

(4) CNN

import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical

def do_cnn(trainX, testX, trainY, testY):
    global max_features
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Building convolutional network: an embedding over word IDs feeds three
    # parallel 1-D convolutions (kernel sizes 2, 3, 4), followed by global
    # max pooling and a softmax output.
    network = input_data(shape=[None,max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000, output_dim=128,validate_indices=False)
    branch1 = conv_1d(network, 128, 2, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 1)  # keep_prob=1, i.e. dropout effectively disabled
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=10, shuffle=True, validation_set=0,
              show_metric=True, batch_size=10,run_id="uba")

    # Convert softmax probabilities back to 0/1 labels.
    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))
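Since the first layer is an embedding over integer word IDs, do_cnn (and the RNN models that follow) should be paired with the word-set features from get_features_by_wordseq rather than the TF-IDF matrices; the commented-out pad_sequences calls can truncate or pad the sequences when a different input length is wanted. A usage sketch (driver code assumed, not shown in the original):

# Hypothetical driver code: CNN on word-index sequences of length max_features.
print("cnn and wordseq")
x_train, x_test, y_train, y_test = get_features_by_wordseq()
do_cnn(x_train, x_test, y_train, y_test)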

(5) RNN

def do_rnn_wordbag(trainX, testX, trainY, testY):
    y_test=testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building: embedding -> single-layer LSTM (128 units) -> softmax
    net = tflearn.input_data([None, 100])
    net = tflearn.embedding(net, input_dim=1000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.1)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.005,
                             loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=0.1, show_metric=True,
              batch_size=1,run_id="uba",n_epoch=10)

    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))

    # Debug output: compare true labels with predictions.
    print ("true")
    print (y_test)
    print ("pred")
    print (y_predict)

(6) Bi-RNN

from tflearn.layers.recurrent import BasicLSTMCell

def do_birnn_wordbag(trainX, testX, trainY, testY):
    y_test=testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building: embedding -> bidirectional LSTM -> dropout -> softmax
    net = input_data(shape=[None, 100])
    net = tflearn.embedding(net, input_dim=10000, output_dim=128)
    net = tflearn.bidirectional_rnn(net, BasicLSTMCell(128), BasicLSTMCell(128))
    net = dropout(net, 0.5)
    net = fully_connected(net, 2, activation='softmax')
    net = regression(net, optimizer='adam', loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=1,run_id="uba",n_epoch=10)

    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))

4. Summary

        Admittedly this dataset is too small to carry much practical weight. Even so, this section demonstrates how to apply machine-learning algorithms to malicious behavior detection, from feature extraction through model construction, training, and testing.
