Notes on the paper "Improved MalGAN: Avoiding Malware Detector by Leaning Cleanware Features"

Information about the authors

[Figure: author information]

Abstract

Improved-MalGAN builds on MalGAN to improve its ability to evade malware detectors, and achieves better performance.

Introduction

The introduction of MalGAN may be rather complex, but I feel it is necessary and worthwhile to record the information below.


  • The previous research
    • [4] evades CNN-based image classification
      by applying small perturbations that are invisible to the human eye
    • [5] Equation-Solving Attacks and Path-Finding Attacks, which can steal a machine learning model (just like MalGAN).
    • [6] MalGAN
      • Overview
        [Figure: overview of MalGAN]
      • Generator of MalGAN
        [Figure: Generator of MalGAN]
      • Substitute Detector of MalGAN
        [Figure: Substitute Detector of MalGAN]
      • Malware Detector (black box) of MalGAN
        [Figure: black-box Malware Detector of MalGAN]

Issues

This is the most important part of the whole research, because each issue is an opportunity for improvement.


  • Training and prediction of the detector should be done externally, because that is more convenient for an attacker.
  • The number of API features in MalGAN is reduced to 128, which is not enough.
  • $G(\theta,x)$ and $D(\theta,x)$ use the same API list for both MalGAN and $B$ (the black-box detector), which means the attackers know the datasets used to train $B$. That seems impossible in practice.
  • MalGAN is trained on multiple malware samples, which may affect evasion performance. (Maybe this is right, but the claim still needs to be proved.) Attackers also want to use as few malware samples as possible for training, preferably only one, because it is more convenient to obtain a single malware sample.

Proposed Method

Improvements that address the issues mentioned above.


  • Overview of Improved-MalGAN

    • Execute $B(\theta,x)$ externally by using Python's subprocess library
      Generate 200 files per epoch, then use Python's subprocess library to feed the generated files to the malware detector and label them. (I don't know how to implement this yet; I need to look up some related information.)
    • Use all APIs instead of selecting API features by RF (random forest) feature importance
      However, in this paper the number of files has to be restricted so that the dimension of the API feature vector does not become too large.
    • The API lists of MalGAN and $B(\theta,x)$ come from different training datasets
  • Improved $G(\theta,x)$ and $D(\theta,x)$
    The models are based on DCGAN [10]

    • $G(\theta,x)$ of MalGAN
      [Figure: Generator of MalGAN]
    • $G(\theta,x)$ of Improved-MalGAN
      [Figure: Generator of Improved-MalGAN]
      The noise consists of values in the range $(-1, 1)$, one for each API in the API list, and is fed to the Generator. The Generator's final output is also in the range $(-1, 1)$, produced by the Tanh function. Tanh is used instead of Sigmoid so that values lie in $(-1, 1)$, which makes it possible to use the Parametric ReLU (PReLU) function described later.
    • $D(\theta,x)$
      [Figure: $D(\theta,x)$]
      The design follows Deep Convolutional GAN (DCGAN) [10], which is used for image generation,
      but in Improved-MalGAN the convolutional layers are replaced with dense layers,
      and the activation function is $f(\mathrm{PReLU})$ (PReLU), though DCGAN usually uses both $f(\mathrm{PReLU})$ and $f(\mathrm{Leaky\ ReLU})$ (Leaky ReLU)
      [Figure]
  • $B$: Black-box detector
    $B$ consists of one or more machine learning algorithms;
    the algorithms below are used in Improved-MalGAN.
    • RF (n_estimators = 1000)
    • MLP (hidden_layer_sizes = (64,))

  • Substitute Detector

    • $D(\theta,x)$ of MalGAN
      [Figure: Substitute Detector of MalGAN]
    • $D(\theta,x)$ of Improved-MalGAN
      [Figure: Substitute Detector of Improved-MalGAN]
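
The activation functions named in this section (Tanh, Sigmoid, PReLU, Leaky ReLU) can be sketched in plain Python as follows. This is only an illustration, not the paper's Keras code: the slope values `alpha` and the helper `to_unit_range` are assumptions added here. It also hints at why values are kept in (-1, 1): PReLU differs from the identity only for negative inputs, so with values confined to (0, 1) its learned slope would presumably never be used.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, fixed small slope alpha otherwise.

    alpha=0.01 is an illustrative value, not taken from the paper.
    """
    return x if x >= 0 else alpha * x


def prelu(x, alpha):
    """Parametric ReLU: same shape as Leaky ReLU, but alpha is learned
    during training instead of being fixed in advance."""
    return x if x >= 0 else alpha * x


def to_unit_range(t):
    """Hypothetical helper: map a Tanh output in (-1, 1) to (0, 1)."""
    return (t + 1.0) / 2.0


if __name__ == "__main__":
    for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(v, sigmoid(v), tanh(v), leaky_relu(v), prelu(v, alpha=0.25))
```

For non-negative inputs `prelu` and `leaky_relu` both return the input unchanged; they only differ from plain ReLU on the negative side.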

Experiment

Experiment setup

  • Python 2.5
  • Tensorflow 1.11.0 and Keras 2.2.4
    Used to build MalGAN
  • Scikit-learn 0.20.0, used to build $B$

Dataset

Public dataset : FFRI Dataset 2018

FFRI Dataset:
part of the research datasets of the anti-Malware engineering WorkShop (MWS) [12]

  • For $B$
    • training data
      36265 malware and 26127 cleanware
    • testing data
      36048 malware and 26590 cleanware
  • For MalGAN
    • training data
      1 malware and 44 cleanware

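Different patterns of API lists

  • For $B$
    training data
    [Figure: API list for the training data of $B$]
  • For MalGAN
    training data
    [Figure: API list for the training data of MalGAN]

Classification results of $B$

[Figure: classification results of $B$]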

Experiment results

Preliminary experiment results

[Figures: four plots of the preliminary results]

  • About the figures
    • Y-axis
      the number of files classified as cleanware
    • X-axis
      epochs

The experiment generates 200 files per epoch, so the maximum value on the Y-axis is 200.

  • Conclusion
    1. The more features are used, the more easily MalGAN evades detection; but if the average number of APIs is much larger than in the original malware, the result is not realistic (this is an issue*).
    2. The RF algorithm is more robust than MLP.

Approach to solving the issue* above

  • Append to the end of $G$ a layer that only performs a calculation, based on the implementation of a Variational Autoencoder (what is this?).
  • RMSE

$n$ is the number of files generated in each epoch
$m$ is the number of features
$\hat{x}$ is the output data converted to the range $(0,1)$
$x$ is the original malware

The customized Root Mean Square Error (RMSE) is calculated between the output data converted to the range $(0,1)$ and the features of the original malware. Weighting the RMSE by 0.05 and the binary cross entropy by 0.95 was the best setup for reducing the number of APIs.
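
A minimal sketch of how the weighted loss described above could be computed. The exact form of the paper's customized RMSE is not reproduced in these notes, so the formula used here, sqrt(1/(n·m) · Σ_i Σ_j (x̂_ij − x_j)²), is an assumption, as are all function names; only the 0.05 and 0.95 weights come from the text.

```python
import math


def rmse_to_original(x_hat, x):
    """RMSE between n generated vectors and one original malware vector.

    x_hat: list of n lists, each with m values in (0, 1)
    x:     list of m feature values of the original malware
    Assumed form: sqrt(sum of squared differences / (n * m)).
    """
    n, m = len(x_hat), len(x)
    total = sum((row[j] - x[j]) ** 2 for row in x_hat for j in range(m))
    return math.sqrt(total / (n * m))


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(bce, rmse, w_bce=0.95, w_rmse=0.05):
    """Weighted sum using the 0.95 / 0.05 split reported in the notes."""
    return w_bce * bce + w_rmse * rmse


if __name__ == "__main__":
    x = [1, 0, 0, 1]                        # toy original malware features
    x_hat = [[1, 0, 0, 1], [1, 1, 0, 1]]    # two toy generated samples
    rmse = rmse_to_original(x_hat, x)
    bce = binary_cross_entropy([0, 0], [0.5, 0.5])
    print(rmse, bce, combined_loss(bce, rmse))
```

The RMSE term grows as generated samples drift away from the original malware's features, which is how a small weight on it can discourage adding too many APIs.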

Experiment results after solving the issue* above

[Figures: four plots of the results]
OK, we can see that the result is not robust, but the number of APIs is reduced to about 250 at 30 epochs, so the RMSE works.

  • Conclusion
    1. The customized RMSE can reduce the number of APIs required to evade $B$.
    2. Learning becomes unstable when the number of APIs is reduced, and more unstable when RF feature importance is used.

Conclusion

  • Besides APIs, there are other features that can be easily generated, such as strings.
  • How easily MalGAN evades detection depends on the algorithm used by $B$; hence we can improve the malware detector to protect our anti-virus models.
  • As ML generation technology improves, it may become possible to generate the binary itself.

Some of the details

This part records some interesting things and some knowledge I had not come across before.


  • Tanh function
    Used to handle values in the range $(-1, 1)$ so that the Parametric ReLU (PReLU) function can be used.
    The graph of the function:
    [Figure: graph of the Tanh function]

  • Sigmoid function
    [Figures: Sigmoid function]

  • Parametric ReLU function (PReLU)
    [Figure: PReLU function]
    The graph of the function:
    [Figures: graphs of the PReLU function]

  • Leaky ReLU function (Leaky ReLU)
    [Figure: Leaky ReLU function]

  • What is the meaning of "Execute the malware detectors externally using Python's subprocess library, and import the detection results into MalGAN"?
    TODO
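
One possible reading of the sentence quoted above: the generated samples are written to disk, a separate detector program is launched on each file with `subprocess.run`, and the label it prints is read back into the training loop. The sketch below (Python 3.7+) assumes a detector that takes a file path and prints `malware` or `cleanware`; `TOY_DETECTOR`, the file format, and the helper names are hypothetical stand-ins, not the paper's setup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the real black-box detector B: a tiny script
# that calls a file "cleanware" once it contains at least three "1" flags.
TOY_DETECTOR = """
import sys
features = open(sys.argv[1]).read().split()
print("malware" if features.count("1") < 3 else "cleanware")
"""


def write_samples(samples, directory):
    """Write each generated feature vector to its own file."""
    paths = []
    for i, vec in enumerate(samples):
        path = Path(directory) / f"sample_{i}.txt"
        path.write_text(" ".join(str(v) for v in vec))
        paths.append(path)
    return paths


def label_with_external_detector(paths, detector_cmd):
    """Run the external detector once per file; 1 = malware, 0 = cleanware."""
    labels = []
    for path in paths:
        result = subprocess.run(
            detector_cmd + [str(path)],
            capture_output=True, text=True, check=True,
        )
        labels.append(1 if result.stdout.strip() == "malware" else 0)
    return labels


def run_demo():
    """Label three toy samples with the stand-in detector."""
    samples = [[1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_samples(samples, tmp)
        cmd = [sys.executable, "-c", TOY_DETECTOR]
        return label_with_external_detector(paths, cmd)


if __name__ == "__main__":
    labels = run_demo()
    print(labels, "classified as cleanware:", labels.count(0))
```

In the setting described in these notes, 200 files would be written per epoch, and the number of `0` labels is what the result figures plot on the Y-axis.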

References

  • [4] Nina Narodytska, and Shiva Kasiviswanathan, “Simple Black-Box Adversarial Perturbations for Deep Networks”, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1310-1318, 2017.
  • [5] Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart, “Stealing Machine Learning Models via Prediction APIs”, 25th USENIX Security Symposium, pp.601-618, 2016.
  • [6] Weiwei Hu, and Ying Tan, “Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN”, arXiv preprint arXiv:1702.05983, 2017.
  • [10] jacobgil, “keras-dcgan”, https://github.com/jacobgil/keras-dcgan (accessed 2018-12-1).
  • [12] Yuta Takata, Masato Terada, Takahiro Matsuki, Takahiro Kasama, Shoko Araki, and Mitsuhiro Hatada, “Datasets for Anti-Malware Research MWS Datasets 2018”, Information Processing Society of Japan, Vol.2018-CSEC-82, No.38, 2018.
  • The function figures in this note are from 🔗; thanks to the author!