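Notes on 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice), Chapter 14: Malware Classification

This section uses the MIST dataset as an example to introduce malware classification. Features are extracted with 2-gram and TF-IDF models, and the classifiers covered include support vector machines (SVM), XGBoost, and multilayer perceptrons (MLP).

1. Malware

Common malware detection methods rely mainly on static file signatures and high-risk dynamic behavior features. As the number of malware samples grows exponentially, traditional rule-based detection can no longer cover all of them. Endpoint security vendors have therefore invested heavily in sandboxing and machine learning, hoping to improve their ability to identify malware.

2. Dataset

The test data comes from Marco Ramilli's MIST dataset (Malware Instruction Set for Behaviour Analysis). MIST analyzes a large number of malware samples and extracts both static file features and dynamic program behavior features. The feature acquisition process is shown in Figure 14-2.

The loading code is as follows: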

def load_files():
    # The position of each family in this list is used as its class label
    malware_class = ['APT1', 'Crypto', 'Locker', 'Zeus']
    x = []
    y = []
    for i, family in enumerate(malware_class):
        pattern = "../data/malware/MalwareTrainingSets-master/trainingSets/%s/*" % family
        print("Load files from %s index %d" % (pattern, i))
        v = load_files_from_dir(pattern)
        x += v
        y += [i] * len(v)
    print("Loaded files %d" % len(x))
    return x, y

Each file is processed as follows:

import glob
import re

def load_files_from_dir(pattern):
    files = glob.glob(pattern)
    result = []
    for file in files:
        #print ("Load file %s" % file)
        with open(file) as f:
            # Join all lines of the report into a single string
            lines_to_line = " ".join(f.readlines())
            # Remove the family names so the label does not leak into the features
            lines_to_line = re.sub(r"APT|Crypto|Locker|Zeus", ' ', lines_to_line, flags=re.I)
            result.append(lines_to_line)
    return result
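
When scrubbing family names from the reports, it is worth checking how the pattern is written: square brackets define a character class that matches single characters, whereas a plain alternation matches whole names. The following is a minimal sketch on a made-up sample string (the sample text and the names `NAME_RE` and `CHAR_CLASS_RE` are illustrative assumptions, not part of the dataset or the original code):

```python
import re

FAMILIES = ["APT", "Crypto", "Locker", "Zeus"]

# Alternation: matches each family name as a whole, case-insensitively
NAME_RE = re.compile("|".join(re.escape(name) for name in FAMILIES), flags=re.I)

# Character class: matches any single character that appears between the brackets
CHAR_CLASS_RE = re.compile(r"[APT|Crypto|Locker|Zeus]", flags=re.I)

sample = "zeus load dll kernel32"  # hypothetical report fragment

scrubbed = NAME_RE.sub(" ", sample)        # only the family name is blanked out
per_char = CHAR_CLASS_RE.sub(" ", sample)  # many ordinary letters are blanked out too

print(repr(scrubbed))
print(repr(per_char))
```

With the alternation, the rest of the report text survives intact, which matters because the 2-gram features are built from exactly those tokens.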

3. Feature Extraction

(1) Ngram-TFIDF

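from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

def get_feature_text():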
    x,y=load_files()
    max_features=1000

    vectorizer = CountVectorizer(
            decode_error='ignore',
            ngram_range=(2, 2),
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1,
            token_pattern=r'\b\w+\b',
            binary=True)
    print (vectorizer)
    x=vectorizer.fit_transform(x)

    transformer = TfidfTransformer(smooth_idf=False)
    x = transformer.fit_transform(x)

    # Important: convert the sparse matrix to a dense array
    x = x.toarray()

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
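
To make the 2-gram plus TF-IDF step more concrete, here is a dependency-free sketch of roughly what the pipeline computes when `binary=True` and `smooth_idf=False`: each document becomes a set of word bigrams, each bigram is weighted by `ln(n / df) + 1`, and each row is L2-normalized. The toy documents and the helper names `bigrams` and `tfidf_binary` are assumptions for illustration. The sketch ignores stop-word removal and the `max_features` cut-off, so it is not a drop-in replacement for the scikit-learn classes.

```python
import math
import re

def bigrams(text):
    """Split text into word tokens and return adjacent pairs joined by a space."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_binary(docs):
    """Binary 2-gram presence weighted by idf = ln(n / df) + 1, then L2-normalized."""
    doc_grams = [set(bigrams(d)) for d in docs]
    vocab = sorted(set().union(*doc_grams))
    n = len(docs)
    # Document frequency: in how many documents each bigram occurs
    df = {g: sum(g in grams for grams in doc_grams) for g in vocab}
    rows = []
    for grams in doc_grams:
        row = [(math.log(n / df[g]) + 1.0) if g in grams else 0.0 for g in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical behaviour traces, one string per sample
docs = ["load dll open file", "load dll write file", "open file close file"]
vocab, rows = tfidf_binary(docs)

for gram, weight in zip(vocab, rows[0]):
    print("%-12s %.3f" % (gram, weight))
```

In the first document, the bigram that appears in only one sample (`dll open`) gets a higher weight than the ones shared with other samples (`load dll`, `open file`), which is the effect TF-IDF is meant to have.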

(2) Ngram-2D

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def get_feature_pe_picture():
    # Load the raw files
    x, y = load_files()
    max_features = 1024
    vectorizer = CountVectorizer(
            decode_error='ignore',
            ngram_range=(2, 2),
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1,
            dtype=np.int64,
            token_pattern=r'\b\w+\b',
            binary=False)
    print(vectorizer)
    x = vectorizer.fit_transform(x)
    # Important: convert the sparse matrix to a dense array
    x = x.toarray()
    x_pic = []
    for i in range(len(x)):
        # Reshape each 1024-dimensional count vector into a (32, 32, 1) array
        pic = np.reshape(x[i], (32, 32, 1))
        x_pic.append(pic)
        #save_image(pic,i)
    # Randomly split the samples into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x_pic, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
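
The reshape step above lays the 1024 counts out row by row, so feature index `k` ends up at row `k // 32`, column `k % 32` of the "image". The following is a small plain-Python sketch of that layout; the helper name `to_grid` and the dummy vector are assumptions for illustration, and the length check is an extra safeguard that is not in the original code.

```python
def to_grid(vec, side):
    """Lay a flat vector out row by row into a side x side grid."""
    if len(vec) != side * side:
        # A vocabulary smaller than side * side would not fill the grid
        raise ValueError("expected %d values, got %d" % (side * side, len(vec)))
    return [list(vec[r * side:(r + 1) * side]) for r in range(side)]

counts = list(range(1024))   # dummy vector: the value equals the feature index
grid = to_grid(counts, 32)

print(grid[0][:4])    # first four features sit at the start of row 0
print(grid[1][0])     # feature 32 starts row 1
```

Because the grid position depends only on the feature index, neighbouring pixels correspond to neighbouring vocabulary entries rather than to any spatial structure in the sample.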

4. Model Construction

(1) XGBoost

import xgboost as xgb
from sklearn.metrics import classification_report

def do_xgboost(x_train, x_test, y_train, y_test):
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))

(2) SVM

def do_svm(x_train, x_test, y_train, y_test):
    from sklearn.svm import SVC
    # Linear-kernel SVM with regularization parameter C=1.0
    clf = SVC(kernel='linear', C=1.0)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))

(3) MLP

from sklearn import metrics
from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):

    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes = (10, 4),
                        random_state = 1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
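
For readers unfamiliar with the two reports printed above, the sketch below shows in plain Python how a confusion matrix (rows are true classes, columns are predicted classes) leads to per-class precision and recall. The label lists are made up for illustration and the helper names are assumptions; in practice the scikit-learn functions do this work.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """m[t][p] counts samples of true class t that were predicted as class p."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def precision_recall(m):
    """Return a list of (precision, recall) pairs, one per class."""
    n = len(m)
    scores = []
    for c in range(n):
        tp = m[c][c]
        predicted = sum(m[r][c] for r in range(n))  # column sum
        actual = sum(m[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores.append((precision, recall))
    return scores

# Hypothetical labels for the four families (0..3)
y_true = [0, 0, 1, 1, 1, 2, 3, 3]
y_pred = [0, 1, 1, 1, 3, 2, 3, 3]

m = confusion_matrix(y_true, y_pred, 4)
scores = precision_recall(m)
for row in m:
    print(row)
for c, (p, r) in enumerate(scores):
    print("class %d precision %.2f recall %.2f" % (c, p, r))
```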

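(4) CNN_2d

import tflearn
from tflearn.data_utils import to_categorical
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from tflearn.layers.normalization import local_response_normalization

def do_cnn_2d(trainX, testX, trainY, testY):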
    print("text feature and cnn 2d")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=4)
    testY = to_categorical(testY, nb_classes=4)
    # Building convolutional network
    network = input_data(shape=[None, 32, 32,1], name='input')
    network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 16, activation='tanh')
    network = dropout(network, 0.1)
    network = fully_connected(network, 16, activation='tanh')
    network = dropout(network, 0.1)
    network = fully_connected(network, 4, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')

    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),show_metric=True, run_id="malware")
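
As a sanity check on the network above, the feature-map sizes can be worked out by hand. The sketch below assumes TFLearn's defaults for these layers ('same' padding, stride 1 for `conv_2d`, and a pooling stride equal to the kernel size for `max_pool_2d`); the helper name `same_out` is an assumption for illustration.

```python
import math

def same_out(size, stride):
    """With 'same' padding, the output size is ceil(size / stride)."""
    return math.ceil(size / stride)

side = 32       # input image is 32 x 32
channels = 1    # single channel of 2-gram counts

for _ in range(2):                # two conv + pool stages
    side = same_out(side, 1)      # conv_2d, stride 1: spatial size unchanged
    channels = 16                 # 16 filters per convolution
    side = same_out(side, 2)      # max_pool_2d with kernel 2: size halves

flattened = side * side * channels
print(side, channels, flattened)
```

Under these assumptions the first fully connected layer receives 8 x 8 x 16 = 1024 values and compresses them to 16 units.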

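(5) CNN_1d

import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from tflearn.layers.merge_ops import merge

def do_cnn_1d(trainX, testX, trainY, testY):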
    print("text feature and cnn")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=4)
    testY = to_categorical(testY, nb_classes=4)

    # Building convolutional network
    network = input_data(shape=[None,1000], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128,validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 4, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100,run_id="malware")
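
The tensor shapes in the three-branch network above can be traced with simple arithmetic. The sketch assumes stride 1 for `conv_1d` (the TFLearn default) and omits the batch dimension; the helper name `valid_out` is an assumption for illustration.

```python
def valid_out(length, kernel, stride=1):
    """With 'valid' padding, the output length is (length - kernel) // stride + 1."""
    return (length - kernel) // stride + 1

seq_len = 1000    # input features per sample
filters = 128     # filters per convolution branch

# Three branches with kernel sizes 3, 4 and 5
branch_lengths = [valid_out(seq_len, k) for k in (3, 4, 5)]

# merge(..., mode='concat', axis=1) stacks the branches along the length axis
merged_shape = (sum(branch_lengths), filters)

# tf.expand_dims(network, 2) inserts a dummy axis so the tensor is 4-D with the batch
expanded_shape = (merged_shape[0], 1, filters)

# global_max_pool keeps one maximum per filter
pooled_shape = (filters,)

print(branch_lengths, merged_shape, expanded_shape, pooled_shape)
```

Each sample is therefore reduced to 128 values before the final softmax layer. Note also that an embedding layer looks up rows by integer index, so feeding it TF-IDF weights (floats between 0 and 1) rather than token ids may be part of why this model underperforms; that is a guess, not something the results establish.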

5. Results

Results with the 1D (2-gram TF-IDF) features are shown below. The CNN performs far worse than the other models.

xgboost
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       113
           1       0.97      0.95      0.96       803
           2       0.96      0.90      0.93       205
           3       0.94      0.98      0.96       784

    accuracy                           0.95      1905
   macro avg       0.96      0.94      0.95      1905
weighted avg       0.96      0.95      0.95      1905
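
The `macro avg` and `weighted avg` rows of the XGBoost report can be reproduced from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The short check below uses the precision and support columns from the table. Because those inputs are already rounded to two decimals, the results only match the report approximately.

```python
# Per-class precision and support, copied from the XGBoost report
precision = [0.98, 0.97, 0.96, 0.94]
support = [113, 803, 205, 784]

total = sum(support)

# Macro average: every class counts equally
macro_precision = sum(precision) / len(precision)

# Weighted average: every class counts in proportion to its number of samples
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total

print("total samples      %d" % total)
print("macro precision    %.2f" % macro_precision)
print("weighted precision %.2f" % weighted_precision)
```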

svm
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       113
           1       0.95      0.96      0.95       803
           2       0.91      0.87      0.89       205
           3       0.94      0.94      0.94       784

    accuracy                           0.94      1905
   macro avg       0.94      0.92      0.93      1905
weighted avg       0.94      0.94      0.94      1905

cnn
| Adam | epoch: 001 | loss: 1.09685 - acc: 0.4432 | val_loss: 1.13283 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 002 | loss: 1.09272 - acc: 0.4425 | val_loss: 1.12148 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 003 | loss: 1.11942 - acc: 0.4117 | val_loss: 1.11967 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 004 | loss: 1.12596 - acc: 0.4221 | val_loss: 1.12072 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 005 | loss: 1.11561 - acc: 0.4272 | val_loss: 1.12084 - val_acc: 0.4089 -- iter: 2857/2857

The CNN 2D results are shown below. They are not much better.

| Adam | epoch: 001 | loss: 1.23541 - acc: 0.4109 | val_loss: 1.11576 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 002 | loss: 1.16763 - acc: 0.4203 | val_loss: 1.11529 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 003 | loss: 1.12465 - acc: 0.4194 | val_loss: 1.11524 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 004 | loss: 1.11964 - acc: 0.4281 | val_loss: 1.11697 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 005 | loss: 1.11276 - acc: 0.4242 | val_loss: 1.11429 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 006 | loss: 1.11595 - acc: 0.4346 | val_loss: 1.11510 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 007 | loss: 1.10915 - acc: 0.4170 | val_loss: 1.10926 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 008 | loss: 1.11696 - acc: 0.4282 | val_loss: 1.10626 - val_acc: 0.4268 -- iter: 2857/2857
| Adam | epoch: 009 | loss: 1.14538 - acc: 0.4108 | val_loss: 1.08093 - val_acc: 0.4268 -- iter: 2857/2857
| Adam | epoch: 010 | loss: 1.09208 - acc: 0.4215 | val_loss: 1.08760 - val_acc: 0.4241 -- iter: 2857/2857

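XGBoost, by contrast, performs well, with an accuracy of 0.95, and the SVM reaches 0.94. CNNs generally do well on image tasks, but in this malware classification task their validation accuracy stays between roughly 0.41 and 0.43, which is far too low to be useful.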
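
One way to put the CNN accuracy in context is to compare it with a majority-class baseline, that is, the accuracy of always predicting the most frequent family. The sketch below uses the support column from the XGBoost report. Since each run draws a fresh random split, the class proportions in the CNN runs are only assumed to be similar.

```python
# Test-set support per class, copied from the XGBoost report
support = [113, 803, 205, 784]

# Accuracy obtained by always predicting the largest class
baseline = max(support) / sum(support)

# Validation accuracies after the last epoch of the two CNN runs
cnn_1d_val_acc = 0.4089
cnn_2d_val_acc = 0.4241

print("majority-class baseline %.4f" % baseline)
print("CNN 1D gap %+.4f" % (cnn_1d_val_acc - baseline))
print("CNN 2D gap %+.4f" % (cnn_2d_val_acc - baseline))
```

Both CNN runs land within about two percentage points of this baseline, which suggests the networks may be predicting close to a single class rather than learning the families.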
