This chapter uses the SEA dataset to introduce a typical UBA application scenario: detecting malicious operations. We already encountered this dataset in 《Web安全之机器学习入门》.
We refer to both the actions of malicious insiders and the abnormal operations of ordinary employees collectively as malicious operations. Detecting them calls for advanced techniques such as user behavior analytics (UBA), an emerging technology that provides data-protection and fraud-detection capabilities that were previously missed. Integrated with the systems users work on every day, UBA applies dedicated security-analytics algorithms that track not only the initial login but every subsequent action. UBA has two main functions: it helps establish a baseline of each user's normal activity, and it quickly flags behavior that deviates from that baseline so that security analysts can investigate. Some anomalies may not look malicious at first glance, which is exactly why an analyst needs to dig further and decide whether the activity is legitimate or malicious.
1. Dataset
The SEA dataset contains behavior logs for more than 70 UNIX users, collected from the shell commands recorded by the UNIX systems they used. 15,000 commands were collected per user; 50 users were randomly chosen as normal users, and commands from the remaining users were randomly inserted into their data to simulate attacks launched by internal masqueraders.
Each user's data is split into 150 blocks of 100 consecutive commands. The first third of the blocks is used to train that user's normal-behavior model, and malicious test data is randomly inserted into the remaining two thirds. The distribution of the malicious data follows a statistical rule: for any given test block, the probability that it contains malicious commands is 1%, and once a block contains malicious commands, the probability that the following block also does reaches 80% [2]. In other words, SEA treats consecutive blocks as a session and can only simulate attacks that are correlated across consecutive sessions. In addition, black (malicious) samples are scarce in SEA; that is closer to reality, but it complicates a random train/test split, because with a naive split there is a good chance the training set would contain only white samples. The split in this chapter therefore has to be handled specially to make sure the training set contains enough black samples (see the sketch below).
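One generic way to guarantee that is a stratified split, sketched here for reference with scikit-learn (the chapter's code below instead uses a fixed split index of 80, so the first 80 blocks train the model):

from sklearn.model_selection import train_test_split

# keep the rare malicious blocks proportionally represented in both splits;
# x and y are the 150 command blocks and their labels loaded below
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y)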
The source code is as follows:
import numpy as np

# Command blocks for User7 and the per-block labels (label.txt has one column per user)
cmdlines_file = "../data/uba/MasqueradeDat/User7"
labels_file = "../data/uba/MasqueradeDat/label.txt"
word2ver_bin = "uba_word2vec.bin"
max_features = 300
index = 80   # first 80 blocks are used for training, the remaining 70 for testing

def get_cmdlines():
    # 15000 commands -> 150 blocks of 100 consecutive commands
    x = np.loadtxt(cmdlines_file, dtype=str)
    x = x.reshape((150, 100))
    # label.txt covers test blocks 51-150; column index 6 corresponds to User7
    y = np.loadtxt(labels_file, dtype=int, usecols=6)
    y = y.reshape((100, 1))
    # the first 50 blocks are guaranteed normal, so label them 0
    y_train = np.zeros([50, 1], int)
    y = np.concatenate([y_train, y])
    y = y.reshape((150, ))
    return x, y
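A quick sanity check of the loaded data (not part of the original code):

x, y = get_cmdlines()
print(x.shape)  # (150, 100): 150 blocks of 100 commands each
print(y.shape)  # (150,): one label per block, 0 = normal, 1 = contains masquerade commands
print(y.sum())  # number of malicious blocks for this user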
For more details on this part of the dataset processing, see my earlier notes:
《Web安全之机器学习入门》笔记:第五章 5.3 K近邻检测异常操作(一)_mooyuan的博客-CSDN博客
2. Feature Extraction
(1) Bag-of-words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def get_features_by_wordbag():
    global max_features
    global index
    x_arr, y = get_cmdlines()
    # join each 100-command block into one space-separated "document"
    x = []
    for i, v in enumerate(x_arr):
        v = " ".join(v)
        x.append(v)
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    x = vectorizer.fit_transform(x)
    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]
    # note: the IDF statistics are fitted on the full matrix (training and test blocks)
    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)
    return x_train, x_test, y_train, y_test
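Note that the TfidfTransformer above is fitted on the full matrix x, i.e. on training and test blocks together. If the IDF statistics should come from the training split only, a variant (a sketch, not the original code) is:

transformer = TfidfTransformer(smooth_idf=False)
transformer.fit(x[0:index, ])                      # fit IDF on the training blocks only
x_train = transformer.transform(x[0:index, ])
x_test = transformer.transform(x[index:, ])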
Taking the first element as an example, load the dataset by calling:
x_arr,y=get_cmdlines()
The first element now looks like this:
['cpp' 'sh' 'xrdb' 'cpp' 'sh' 'xrdb' 'mkpts' 'test' 'stty' 'hostname'
'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'env' 'echo' 'sh' 'userenv'
'wait4wm' 'xhost' 'xsetroot' 'reaper' 'xmodmap' 'sh' '[' 'cat' 'stty'
'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh'
'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'launchef' 'launchef'
'sh' '9term' 'sh' 'launchef' 'sh' 'launchef' 'hostname' '[' 'cat' 'stty'
'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
'more' 'sh' 'ex' 'sendmail' 'sendmail' 'sh' 'MediaMai' 'sendmail' 'sh'
'rm' 'MediaMai' 'sh' 'rm' 'MediaMai' 'launchef' 'launchef']
Join it into a single string with the following code:
x = []
print(x_arr[0])
print(np.array(x_arr).shape)
for i, v in enumerate(x_arr):
    v = " ".join(v)
    x.append(v)
The first element becomes:
cpp sh xrdb cpp sh xrdb mkpts test stty hostname date echo [ find chmod tty echo env echo sh userenv wait4wm xhost xsetroot reaper xmodmap sh [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh launchef launchef sh 9term sh launchef sh launchef hostname [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh ex sendmail sendmail sh MediaMai sendmail sh rm MediaMai sh rm MediaMai launchef launchef
Next, apply bag-of-words vectorization:
vectorizer = CountVectorizer(
    decode_error='ignore',
    strip_accents='ascii',
    max_features=max_features,
    stop_words='english',
    max_df=1.0,
    min_df=1)
x = vectorizer.fit_transform(x)
The result is a sparse matrix in (block index, feature index) → count format; for the first block it looks like this:
(0, 13) 2
(0, 90) 25
(0, 129) 2
(0, 67) 1
(0, 103) 1
(0, 95) 3
(0, 46) 4
(0, 14) 3
(0, 25) 7
(0, 9) 3
(0, 109) 3
(0, 28) 1
(0, 114) 1
(0, 117) 1
(0, 123) 1
(0, 132) 1
(0, 82) 1
(0, 127) 1
(0, 8) 2
(0, 51) 6
(0, 1) 1
(0, 30) 1
(0, 89) 3
(0, 64) 3
(0, 84) 2
Then apply the TF-IDF transform:
transformer = TfidfTransformer(smooth_idf=False)
transformer.fit(x)
x_train = transformer.transform(x_train)
After the transform, the first row looks like this:
(0, 132) 0.07590139034306102
(0, 129) 0.15180278068612205
(0, 127) 0.08038431251734222
(0, 123) 0.07590139034306102
(0, 117) 0.08038431251734222
(0, 114) 0.07590139034306102
(0, 109) 0.15401560934616065
(0, 103) 0.07805309706428058
(0, 95) 0.15273504757432566
(0, 90) 0.7130806021902251
(0, 89) 0.160729204804717
(0, 84) 0.06701351493784334
(0, 82) 0.08038431251734222
(0, 67) 0.07590139034306102
(0, 64) 0.12397049936474483
(0, 51) 0.29322010438631935
(0, 46) 0.1367884925425544
(0, 30) 0.04484290866252592
(0, 28) 0.07590139034306102
(0, 25) 0.35638177767342655
(0, 14) 0.09652546185265033
(0, 13) 0.149769219205087
(0, 9) 0.15147374310180983
(0, 8) 0.06933404617293713
(0, 1) 0.16804767932167497
The shapes of x_train and x_test are shown below. The feature vector length is 136 even though max_features=300. This is because the corpus is so small that CountVectorizer finds only 136 distinct command tokens, so the vocabulary, and therefore the feature length, is 136 rather than 300.
max_features=300
x_train (80, 136)
x_test (70, 136)
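This can be verified directly (an illustrative check, not part of the original code): the CountVectorizer vocabulary holds fewer distinct command tokens than max_features allows.

print(len(vectorizer.vocabulary_))   # 136 distinct tokens kept by CountVectorizer
print(x_train.shape, x_test.shape)   # (80, 136) (70, 136)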
(2) N-gram
def get_features_by_ngram():
    global max_features
    global index
    x_arr, y = get_cmdlines()
    x = []
    for i, v in enumerate(x_arr):
        v = " ".join(v)
        x.append(v)
    # same as the bag-of-words model, but counting 2- to 4-command sequences
    vectorizer = CountVectorizer(
        ngram_range=(2, 4),
        token_pattern=r'\b\w+\b',
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    x = vectorizer.fit_transform(x)
    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]
    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_test = transformer.transform(x_test)
    x_train = transformer.transform(x_train)
    return x_train, x_test, y_train, y_test
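To make the n-gram features concrete, here is a small illustrative snippet (not part of the original code) showing the 2- to 4-command sequences that ngram_range=(2, 4) extracts from a short command string:

from sklearn.feature_extraction.text import CountVectorizer

demo = ["sh more sh more sh launchef"]
cv = CountVectorizer(ngram_range=(2, 4), token_pattern=r'\b\w+\b')
cv.fit(demo)
print(sorted(cv.vocabulary_))
# ['more sh', 'more sh launchef', 'more sh more', 'more sh more sh',
#  'sh launchef', 'sh more', 'sh more sh', 'sh more sh launchef', 'sh more sh more']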
(3) Word2Vec
import os
import multiprocessing
import gensim
from sklearn.preprocessing import scale

def get_features_by_word2vec():
    global word2ver_bin
    global index
    global max_features
    x_all = []
    x_arr, y = get_cmdlines()
    x = []
    for i, v in enumerate(x_arr):
        v = " ".join(v)
        x.append(v)
    # use the command logs of users 1-29 as the word2vec corpus, one user file per "sentence"
    for i in range(1, 30):
        filename = "../data/uba/MasqueradeDat/User%d" % i
        with open(filename) as f:
            x_all.append([w.strip('\n') for w in f.readlines()])
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        # gensim 3.x parameter names (size/iter); gensim 4 renamed them to vector_size/epochs
        model = gensim.models.Word2Vec(size=max_features, window=5, min_count=1, iter=60, workers=cores)
        model.build_vocab(x_all)
        model.train(x_all, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    # one averaged vector per 100-command block
    x = np.concatenate([buildWordVector(model, z, max_features) for z in x])
    x = scale(x)
    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]
    return x_train, x_test, y_train, y_test
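buildWordVector is called above but not listed in this section. A common implementation, assumed here as a sketch (the original helper is not shown), averages the Word2Vec vectors of all commands in a block; since x holds space-joined strings, the sketch splits them back into commands:

def buildWordVector(model, text, size):
    # average the vectors of all commands in one block (zero vector if none are in the vocabulary)
    vec = np.zeros((1, size))
    count = 0.
    for word in text.split():
        try:
            vec += model[word].reshape((1, size))   # gensim 3.x item lookup
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec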
(4) Word set
import tflearn

def get_features_by_wordseq():
    global max_features
    global index
    x_arr, y = get_cmdlines()
    x = []
    for i, v in enumerate(x_arr):
        v = " ".join(v)
        x.append(v)
    # map every command to an integer id; each block becomes a fixed-length id sequence
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_features,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x = vp.fit_transform(x, unused_y=None)
    x = np.array(list(x))
    x_train = x[0:index, ]
    x_test = x[index:, ]
    y_train = y[0:index, ]
    y_test = y[index:, ]
    return x_train, x_test, y_train, y_test
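As a small illustrative example (not part of the original code), VocabularyProcessor maps each command to an integer id and pads every block to max_document_length:

import numpy as np
import tflearn

vp = tflearn.data_utils.VocabularyProcessor(max_document_length=10)
demo = ["sh more sh more launchef"]
ids = np.array(list(vp.fit_transform(demo)))
print(ids)
# expected output of the form [[1 2 1 2 3 0 0 0 0 0]]: each distinct command gets an id,
# and the remaining positions are zero padding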
3. Model Construction
(1) Naive Bayes (NB)
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import classification_report

def do_nb(x_train, x_test, y_train, y_test):
    # GaussianNB expects a dense array; if the features are a scipy sparse matrix,
    # convert them first (e.g. x_train.toarray())
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Run result:
nb and wordbag
precision recall f1-score support
0 0.98 0.97 0.98 64
1 0.71 0.83 0.77 6
accuracy 0.96 70
macro avg 0.85 0.90 0.87 70
weighted avg 0.96 0.96 0.96 70
[[62 2]
[ 1 5]]
(2) XGBoost
import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Run result:
xgboost and wordbag
precision recall f1-score support
0 0.96 1.00 0.98 64
1 1.00 0.50 0.67 6
accuracy 0.96 70
macro avg 0.98 0.75 0.82 70
weighted avg 0.96 0.96 0.95 70
[[64 0]
[ 3 3]]
(3) MLP
The source code is as follows:
from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    global max_features
    # a small multilayer perceptron with two hidden layers (5 and 2 units)
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The run result is as follows:
mlp and wordbag
precision recall f1-score support
0 0.91 1.00 0.96 64
1 0.00 0.00 0.00 6
accuracy 0.91 70
macro avg 0.46 0.50 0.48 70
weighted avg 0.84 0.91 0.87 70
[[64 0]
[ 6 0]]
(4) CNN
import tensorflow as tf
from tflearn.data_utils import to_categorical
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression

def do_cnn(trainX, testX, trainY, testY):
    global max_features
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network: three parallel 1-D convolutions
    # with window sizes 2, 3 and 4 over the embedded command sequence
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 2, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    # keep_prob=1 means dropout is effectively disabled here
    network = dropout(network, 1)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=10, shuffle=True, validation_set=0,
              show_metric=True, batch_size=10, run_id="uba")
    y_predict_list = model.predict(testX)
    y_predict = []
    for i in y_predict_list:
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
(5) RNN
def do_rnn_wordbag(trainX, testX, trainY, testY):
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building: embedding + single LSTM layer
    net = tflearn.input_data([None, 100])
    net = tflearn.embedding(net, input_dim=1000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.1)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.005,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=0.1, show_metric=True,
              batch_size=1, run_id="uba", n_epoch=10)
    y_predict_list = model.predict(testX)
    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
    # debug output: training labels, then true vs. predicted test labels
    print(trainY)
    print("true")
    print(y_test)
    print("pre")
    print(y_predict)
(6) Bi-RNN
from tflearn.layers.recurrent import bidirectional_rnn, BasicLSTMCell

def do_birnn_wordbag(trainX, testX, trainY, testY):
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=100, value=0.)
    #testX = pad_sequences(testX, maxlen=100, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building: embedding + bidirectional LSTM
    net = input_data(shape=[None, 100])
    net = tflearn.embedding(net, input_dim=10000, output_dim=128)
    net = tflearn.bidirectional_rnn(net, BasicLSTMCell(128), BasicLSTMCell(128))
    net = dropout(net, 0.5)
    net = fully_connected(net, 2, activation='softmax')
    net = regression(net, optimizer='adam', loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=1, run_id="uba", n_epoch=10)
    y_predict_list = model.predict(testX)
    y_predict = []
    for i in y_predict_list:
        if i[0] >= 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
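The main routine that wires these pieces together is not shown in this section. A minimal sketch follows; pairing the classical models with the bag-of-words features matches the printed result headers above, while pairing the CNN with the word-set features is my assumption based on its input shape [None, max_features]:

if __name__ == "__main__":
    # TF-IDF bag-of-words features for the classical models
    x_train, x_test, y_train, y_test = get_features_by_wordbag()
    print("nb and wordbag")
    do_nb(x_train.toarray(), x_test.toarray(), y_train, y_test)  # GaussianNB needs dense input
    print("xgboost and wordbag")
    do_xgboost(x_train, x_test, y_train, y_test)
    print("mlp and wordbag")
    do_mlp(x_train, x_test, y_train, y_test)
    # integer command sequences for the CNN (sequence length max_features matches its input layer)
    x_train, x_test, y_train, y_test = get_features_by_wordseq()
    do_cnn(x_train, x_test, y_train, y_test)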
4. Summary
Frankly, this dataset is too small to have much practical significance. Even so, this section applies machine learning algorithms to malicious operation detection end to end: extracting features from the command blocks and building models for training and testing.