[Anomaly Detection] A walkthrough of the anomaly detection source code (https://github.com/pnnl/safekit) for the widely used LANL and CERT datasets

Article link: https://blog.csdn.net/qq_38391210/article/details/104880592

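This article walks through anomaly detection source code built on the LANL and CERT datasets, focusing on the code structure of the `safekit` library and its key files, including the LSTM and DNN model implementations. The `examples` folder contains hands-on code, such as dnn_agg.ipynb for DNN training and simple_lm.ipynb for the LSTM model. The `test` folder contains test code, such as agg_tests.py and lanl_lm_tests.py. The article also discusses the model training process and how the loss is computed.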


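Preface

I have recently been working on an anomaly detection project that uses the LANL dataset, and I came across source code for anomaly detection on the LANL and CERT datasets. I spent some time studying it and am recording my findings here.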

Source code

The source code is available at https://github.com/pnnl/safekit. Two anomaly detection papers are based on this code, and they are worth downloading and reading if you are interested. One is Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection; the other is Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams.

Code structure

Walkthrough of key files

The examples folder

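(1) dnn_agg.ipynb
The notebook first loads lanl_count_in_count_out_agg.json, the feature specification file for the LANL dataset, and then reads the index at which the event-count features start within the feature vector. This starting position is stored in datastart_index.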

import json
with open('../safekit/features/specs/agg/lanl_count_in_count_out_agg.json', 'r') as spec_file:
    dataspecs = json.load(spec_file)
datastart_index = dataspecs['counts']['index'][0]  # first column of the event-count features
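
To make the role of datastart_index concrete, here is a minimal sketch of how a specification of this shape can be used to split one row into metadata columns and event counts. It does not read the real file: the inline spec is a hypothetical, abbreviated stand-in for lanl_count_in_count_out_agg.json, the counts index range 3..136 is an assumption derived from num_features = 137, and the row values are made up for illustration.

```python
import json

# Hypothetical, abbreviated stand-in for lanl_count_in_count_out_agg.json.
# Only the "index" entries are kept; the counts range 3..136 is an assumption
# derived from num_features = 137 and three leading metadata columns.
spec_text = json.dumps({
    "num_features": 137,
    "time": {"index": [0]},
    "user": {"index": [1]},
    "redteam": {"index": [2]},
    "counts": {"index": list(range(3, 137))},
})

dataspecs = json.loads(spec_text)
datastart_index = dataspecs["counts"]["index"][0]

# One made-up row: time, user id, redteam flag, then 134 event counts.
row = [1.0, 42.0, 0.0] + [float(i) for i in range(134)]

# Everything before datastart_index is metadata; the rest are event counts.
meta_columns = row[:datastart_index]
count_columns = row[datastart_index:]

print(datastart_index)     # 3
print(len(count_columns))  # 134
```

Under these assumptions, slicing at datastart_index separates the three metadata columns from the count features without hard-coding column positions in the training code.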

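Let's look at what lanl_count_in_count_out_agg.json contains.
num_features is the dimensionality of the feature vector, that is, the number of features; time is the time at which the event occurred; user is the user identifier, with 30000 user classes here; redteam indicates whether the event is a red-team event; counts holds the event count values. Together these make up the 137-dimensional feature vector.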

{
  "num_features": 137,
  "time": {
    "index": [0],
    "num_classes": 0,
    "meta": 1,
    "feature": 0,
    "target": 0 },
  "user": {
    "index": [1],
    "num_classes": 30000,
    "meta": 1,
    "feature": 0,
    "target": 0 },
  "redteam": {
    "index": [2],
    "num_classes": 0,
    "meta": 1,
    "feature": 0,
    "target": 0