《Web安全之深度学习实战》Notes: Chapter 10, User Behavior Analysis and Malicious Behavior Detection

This post describes how UBA (User Behavior Analytics) can be used to detect malicious operations, based on a case study built on the SEA dataset. The dataset contains behavior logs from more than 70 UNIX users. Features are extracted with bag-of-words (wordbag), n-gram, Word2Vec, and word-set methods, and then NB, XGBoost, MLP, CNN, and RNN models are trained and tested. The experiments show that the models perform differently on malicious behavior detection, with XGBoost standing out.

        This chapter uses the SEA dataset to present a typical UBA application scenario: detecting malicious operation behavior. We have in fact already met this dataset in 《Web安全之机器学习入门》.

        We refer collectively to malicious insiders and anomalous operations by internal employees as malicious operations. Detecting them requires advanced techniques such as User Behavior Analytics (UBA), an emerging technology that provides data-protection and fraud-detection capabilities that were previously missed. Working alongside the systems users operate every day, UBA applies dedicated security-analytics algorithms that track not only the initial login but every subsequent action a user takes. UBA has two main functions: it helps establish a baseline of each user's normal activity, and it quickly flags behavior that deviates from that baseline so a security analyst can investigate. Some anomalies may not look malicious at first glance, which is exactly why an analyst needs to dig further and determine whether the behavior is legitimate or malicious.

1. Dataset

        The SEA dataset contains behavior logs of more than 70 UNIX users, collected from the shell commands recorded by UNIX systems. 15,000 commands were collected for each user. Fifty users were randomly chosen from the user pool as normal users, and commands from the remaining users were randomly inserted into their data to simulate attacks launched by internal masqueraders.

        Each user's data is divided into 150 blocks of 100 consecutive commands. The first third of the blocks (50 blocks) is used to train that user's normal-behavior model, while malicious data for testing is randomly inserted into the remaining two thirds. The distribution of malicious data in SEA follows a statistical pattern: any given test command block contains malicious commands with probability 1%, and once a block contains malicious commands, the probability that the following block also does reaches 80% [2]. In other words, SEA treats consecutive blocks as a session and can only simulate attacks that span consecutive, correlated sessions. In addition, the dataset contains relatively few malicious (black) samples. That is closer to reality, but it makes a random train/test split risky: with the usual splitting methods there is a fair chance the training set would contain only benign (white) samples, so the split in this chapter needs special handling to guarantee that the training set contains enough malicious samples (see the check sketched below).
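        In this chapter the "special handling" amounts to an order-preserving cut-off (index = 80 in the code below) rather than a random shuffle, so that some of the labelled malicious blocks fall into the training portion. A quick sanity check, assuming the get_cmdlines() helper defined in the next listing:

# Sketch: verify that malicious blocks appear on both sides of the fixed split.
x, y = get_cmdlines()
index = 80
print("malicious blocks in train:", int(y[:index].sum()))
print("malicious blocks in test:",  int(y[index:].sum()))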

The data-loading source code is shown below:

import numpy as np

cmdlines_file="../data/uba/MasqueradeDat/User7"
labels_file="../data/uba/MasqueradeDat/label.txt"
word2ver_bin="uba_word2vec.bin"
max_features=300
index = 80

def get_cmdlines():
    # User7 holds 15000 commands, one per line; group them into
    # 150 blocks of 100 consecutive commands each.
    x=np.loadtxt(cmdlines_file,dtype=str)
    x=x.reshape((150,100))
    # label.txt has 100 rows x 50 columns: one row per test block (blocks 51-150)
    # and one column per user; column 6 (0-based) corresponds to User7.
    y=np.loadtxt(labels_file, dtype=int,usecols=6)
    y=y.reshape((100, 1))
    # The first 50 blocks of every user are guaranteed to be normal,
    # so prepend 50 zero labels to obtain one label per block.
    y_train=np.zeros([50,1],int)
    y=np.concatenate([y_train,y])
    y=y.reshape((150, ))

    return x,y

For more details on how this dataset is processed, see my earlier notes:

《Web安全之机器学习入门》Notes: Chapter 5, Section 5.3, Detecting Anomalous Operations with K-Nearest Neighbors (Part 1) (mooyuan, CSDN blog)

2. Feature Extraction

(1) Wordbag

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def get_features_by_wordbag():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    # Join each block of 100 commands into one space-separated string
    # so it can be fed to CountVectorizer.
    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

    # Order-preserving split: the first `index` blocks for training,
    # the remaining blocks for testing.
    x_train=x[0:index,]
    x_test=x[index:,]
    y_train=y[0:index,]
    y_test=y[index:,]

    # TF-IDF weighting fitted on the full corpus, then applied to both splits.
    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)

    return x_train, x_test, y_train, y_test

Take the first element as an example. Load the dataset by calling:

x_arr,y=get_cmdlines()

At this point the first element looks like this:

['cpp' 'sh' 'xrdb' 'cpp' 'sh' 'xrdb' 'mkpts' 'test' 'stty' 'hostname'
 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'env' 'echo' 'sh' 'userenv'
 'wait4wm' 'xhost' 'xsetroot' 'reaper' 'xmodmap' 'sh' '[' 'cat' 'stty'
 'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh'
 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'launchef' 'launchef'
 'sh' '9term' 'sh' 'launchef' 'sh' 'launchef' 'hostname' '[' 'cat' 'stty'
 'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
 'more' 'sh' 'ex' 'sendmail' 'sendmail' 'sh' 'MediaMai' 'sendmail' 'sh'
 'rm' 'MediaMai' 'sh' 'rm' 'MediaMai' 'launchef' 'launchef']

Next, join the commands of each block into a single string:

    x=[]
    print(x_arr[0])
    print(np.array(x_arr).shape)
    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

The first element then becomes:

cpp sh xrdb cpp sh xrdb mkpts test stty hostname date echo [ find chmod tty echo env echo sh userenv wait4wm xhost xsetroot reaper xmodmap sh [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh launchef launchef sh 9term sh launchef sh launchef hostname [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh ex sendmail sendmail sh MediaMai sendmail sh rm MediaMai sh rm MediaMai launchef launchef
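Note that with its default token_pattern (r"(?u)\b\w\w+\b"), CountVectorizer only keeps tokens of at least two word characters and lowercases them, so single-character commands such as '[' never make it into the vocabulary. A quick check:

# Sketch: how CountVectorizer's default analyzer tokenizes a command string.
from sklearn.feature_extraction.text import CountVectorizer
print(CountVectorizer().build_analyzer()("sh [ MediaMai cpp"))
# -> ['sh', 'mediamai', 'cpp']   ('[' dropped, tokens lowercased)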

Next comes the bag-of-words step:

    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

After the bag-of-words transform, the result is a sparse matrix; each entry below is (row, column index) followed by the count:

 (0, 13)	2
  (0, 90)	25
  (0, 129)	2
  (0, 67)	1
  (0, 103)	1
  (0, 95)	3
  (0, 46)	4
  (0, 14)	3
  (0, 25)	7
  (0, 9)	3
  (0, 109)	3
  (0, 28)	1
  (0, 114)	1
  (0, 117)	1
  (0, 123)	1
  (0, 132)	1
  (0, 82)	1
  (0, 127)	1
  (0, 8)	2
  (0, 51)	6
  (0, 1)	1
  (0, 30)	1
  (0, 89)	3
  (0, 64)	3
  (0, 84)	2

Then the TF-IDF transform is applied:

    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_train = transformer.transform(x_train)

After the transform the first row looks like this:

  (0, 132)	0.07590139034306102
  (0, 129)	0.15180278068612205
  (0, 127)	0.08038431251734222
  (0, 123)	0.07590139034306102
  (0, 117)	0.08038431251734222
  (0, 114)	0.07590139034306102
  (0, 109)	0.15401560934616065
  (0, 103)	0.07805309706428058
  (0, 95)	0.15273504757432566
  (0, 90)	0.7130806021902251
  (0, 89)	0.160729204804717
  (0, 84)	0.06701351493784334
  (0, 82)	0.08038431251734222
  (0, 67)	0.07590139034306102
  (0, 64)	0.12397049936474483
  (0, 51)	0.29322010438631935
  (0, 46)	0.1367884925425544
  (0, 30)	0.04484290866252592
  (0, 28)	0.07590139034306102
  (0, 25)	0.35638177767342655
  (0, 14)	0.09652546185265033
  (0, 13)	0.149769219205087
  (0, 9)	0.15147374310180983
  (0, 8)	0.06933404617293713
  (0, 1)	0.16804767932167497

The shapes of x_train and x_test are shown below. The feature vector length is 136 even though max_features=300: CountVectorizer only allocates one dimension per distinct token, and this user's command history yields only 136 distinct tokens after tokenization, so the vocabulary, and hence the feature length, stops at 136.

max_features=300
x_train (80, 136)
x_test (70, 136)
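To confirm that the limiting factor is the vocabulary rather than max_features, the fitted vectorizer can be inspected directly (a small sketch; it assumes the vectorizer object from get_features_by_wordbag is still in scope):

# Sketch: the vocabulary learned from User7 has only 136 distinct tokens.
print(len(vectorizer.vocabulary_))    # -> 136
print(x_train.shape, x_test.shape)    # -> (80, 136) (70, 136)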

(2) n-gram

def get_features_by_ngram():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Same pipeline as the wordbag version, but counting 2- to 4-grams of
    # commands; token_pattern=r'\b\w+\b' keeps single-character tokens too.
    vectorizer = CountVectorizer(
                                 ngram_range=(2, 4),
                                 token_pattern=r'\b\w+\b',
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

    x_train=x[0:index,]
    x_test=x[index:,]
    y_train=y[0:index,]
    y_test=y[index:,]

    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)

    return x_train, x_test, y_train, y_test
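Unlike the wordbag model, each feature here is a short sequence of two to four consecutive commands, which captures some ordering information. A small sketch for peeking at the learned features (the example strings are illustrative, not actual output):

# Sketch: inspect a few learned n-gram features (assumes the vectorizer
# fitted inside get_features_by_ngram is accessible). Each key is a
# command n-gram such as "sh more" or "hostname date echo".
for feature in list(vectorizer.vocabulary_)[:5]:
    print(feature)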

(3) Word2Vec

import os
import multiprocessing
import gensim
from sklearn.preprocessing import scale

def get_features_by_word2vec():
    global word2ver_bin
    global index
    global max_features

    x_all=[]
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Build the Word2Vec training corpus from the raw command files of
    # users 1-29; each file becomes one list of command tokens.
    for i in range(1,30):
        filename="../data/uba/MasqueradeDat/User%d" % i
        with open(filename) as f:
            x_all.append([w.strip('\n') for w in f.readlines()])

    cores=multiprocessing.cpu_count()

    # Note: size= and iter= are the gensim < 4.0 parameter names
    # (vector_size= and epochs= in gensim >= 4.0).
    if os.path.exists(word2ver_bin):
        print ("Find cache file %s" % word2ver_bin)
        model=gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model=gensim.models.Word2Vec(size=max_features, window=5, min_count=1, iter=60, workers=cores)
        model.build_vocab(x_all)
        model.train(x_all, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)

    # Each command block is mapped to the average of its commands' vectors
    # (see the buildWordVector sketch below), then standardized.
    x = np.concatenate([buildWordVector(model, z, max_features) for z in x])
    x = scale(x)

    x_train = x[0:index,]
    x_test = x[index:,]

    y_train = y[0:index,]
    y_test = y[index:,]

    return x_train, x_test, y_train, y_test
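The buildWordVector helper used above is not included in this excerpt. A minimal sketch of what it is assumed to do, namely average the Word2Vec vectors of all commands in one block (the string-splitting branch is my addition to cope with the space-joined blocks built above):

def buildWordVector(model, text, size):
    # Average the Word2Vec vectors of the commands in one block (sketch).
    # Accept either a list of command tokens or a space-joined string.
    words = text.split() if isinstance(text, str) else text
    vec = np.zeros((1, size))
    count = 0.
    for word in words:
        try:
            vec += model.wv[word].reshape((1, size))
            count += 1.
        except KeyError:
            # Commands missing from the Word2Vec vocabulary are skipped.
            continue
    if count != 0:
        vec /= count
    return vec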

(4) Word set (index sequences)

import tflearn

def get_features_by_wordseq():
    global max_features
    global  index
    x_arr,y=get_cmdlines()
    x=[]

    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

    # Map every command to an integer index; each block becomes a fixed-length
    # sequence of max_features (300) indices, zero-padded after the 100 commands.
    vp=tflearn.data_utils.VocabularyProcessor(max_document_length=max_features,
                                              min_frequency=0,
                                              vocabulary=None,
                                              tokenizer_fn=None)
    x=vp.fit_transform(x, unused_y=None)
    x = np.array(list(x))

    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]

    return x_train, x_test, y_train, y_test
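The output here is not a count vector but a fixed-length sequence of integer word IDs, which is exactly the input format the embedding layers of the CNN and RNN models below expect. A small usage sketch (the printed values are illustrative, not actual output):

# Sketch: inspect the word-set features (values shown are illustrative).
x_train, x_test, y_train, y_test = get_features_by_wordseq()
print(x_train.shape)      # (80, 300): 100 command IDs followed by 200 zero pads
print(x_train[0][:8])     # e.g. [1 2 3 1 2 3 4 5] -- small integer word IDs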

3. Model Construction

(1) NB

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn import metrics

def do_nb(x_train, x_test, y_train, y_test):
    # GaussianNB expects dense arrays; sparse TF-IDF matrices need .toarray().
    gnb = GaussianNB()
    gnb.fit(x_train,y_train)
    y_pred=gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))
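GaussianNB only accepts dense arrays, so the sparse TF-IDF matrices produced by get_features_by_wordbag have to be densified first. A hedged usage sketch (this driver code is assumed and not part of the excerpt above):

# Hypothetical driver code: naive Bayes on bag-of-words TF-IDF features.
print("nb and wordbag")
x_train, x_test, y_train, y_test = get_features_by_wordbag()
do_nb(x_train.toarray(), x_test.toarray(), y_train, y_test)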

Run result:

nb and wordbag
              precision    recall  f1-score   support

           0       0.98      0.97      0.98        64
           1       0.71      0.83      0.77         6

    accuracy                           0.96        70
   macro avg       0.85      0.90      0.87        70
weighted avg       0.96      0.96      0.96        70

[[62  2]
 [ 1  5]]

(2) XGBoost

import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    # Gradient-boosted trees with default hyperparameters.
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))

Run result:

xgboost and wordbag
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        64
           1       1.00      0.50      0.67         6

    accuracy                           0.96        70
   macro avg       0.98      0.75      0.82        70
weighted avg       0.96      0.96      0.95        70

[[64  0]
 [ 3  3]]
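The confusion matrix shows that half of the malicious blocks are missed (recall 0.50 for class 1), which is unsurprising given how few black samples the split contains. One optional tweak, not part of the book's code, is to compensate for the imbalance via XGBoost's scale_pos_weight parameter:

# Sketch (my addition): up-weight the rare malicious class during training.
ratio = float((y_train == 0).sum()) / max(int((y_train == 1).sum()), 1)
xgb_model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(x_train, y_train)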

(3) MLP

Source code:

from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    global max_features
    # A small fully connected network: two hidden layers of 5 and 2 units.
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes = (5, 2),
                        random_state = 1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print (metrics.confusion_matrix(y_test, y_pred))

Run result:

mlp and wordbag
              precision    recall  f1-score   support

           0       0.91      1.00      0.96        64
           1       0.00      0.00      0.00         6

    accuracy                           0.91        70
   macro avg       0.46      0.50      0.48        70
weighted avg       0.84      0.91      0.87        70

[[64  0]
 [ 6  0]]

(4) CNN

import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical

def do_cnn(trainX, testX, trainY, testY):
    global max_features
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Building convolutional network: an embedding over word IDs feeds three
    # parallel 1-D convolutions (kernel sizes 2, 3, 4), followed by global
    # max pooling and a softmax output.
    network = input_data(shape=[None,max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000, output_dim=128,validate_indices=False)
    branch1 = conv_1d(network, 128, 2, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 1)  # keep_prob=1, i.e. dropout effectively disabled
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=10, shuffle=True, validation_set=0,
              show_metric=True, batch_size=10,run_id="uba")

    # Convert softmax probabilities back to 0/1 labels.
    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))
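Since the first layer is an embedding over integer word IDs, do_cnn (and the RNN models that follow) should be paired with the word-set features from get_features_by_wordseq rather than the TF-IDF matrices; the commented-out pad_sequences calls can truncate or pad the sequences when a different input length is wanted. A usage sketch (driver code assumed, not shown in the original):

# Hypothetical driver code: CNN on word-index sequences of length max_features.
print("cnn and wordseq")
x_train, x_test, y_train, y_test = get_features_by_wordseq()
do_cnn(x_train, x_test, y_train, y_test)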

(5) RNN

def do_rnn_wordbag(trainX, testX, trainY, testY):
    y_test=testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building: embedding -> single-layer LSTM (128 units) -> softmax
    net = tflearn.input_data([None, 100])
    net = tflearn.embedding(net, input_dim=1000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.1)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.005,
                             loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=0.1, show_metric=True,
              batch_size=1,run_id="uba",n_epoch=10)

    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))

    # Debug output: compare true labels with predictions.
    print ("true")
    print (y_test)
    print ("pred")
    print (y_predict)

(6) Bi-RNN

from tflearn.layers.recurrent import BasicLSTMCell

def do_birnn_wordbag(trainX, testX, trainY, testY):
    y_test=testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building: embedding -> bidirectional LSTM -> dropout -> softmax
    net = input_data(shape=[None, 100])
    net = tflearn.embedding(net, input_dim=10000, output_dim=128)
    net = tflearn.bidirectional_rnn(net, BasicLSTMCell(128), BasicLSTMCell(128))
    net = dropout(net, 0.5)
    net = fully_connected(net, 2, activation='softmax')
    net = regression(net, optimizer='adam', loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=1,run_id="uba",n_epoch=10)

    y_predict_list = model.predict(testX)

    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)

    print(classification_report(y_test, y_predict))
    print (metrics.confusion_matrix(y_test, y_predict))

4. Summary

        Admittedly this dataset is too small to carry much practical weight. Even so, this section demonstrates how to apply machine-learning algorithms to malicious behavior detection, from feature extraction through model construction, training, and testing.
