Notes on the paper "Improved MalGAN: Avoiding Malware Detector by Leaning Cleanware Features"

猫咪钓鱼

Category: Paper Notes

Article link: https://blog.csdn.net/weixin_43655282/article/details/103943344

Information about the authors

[Image: author information]

Abstract

Improved-MalGAN builds on MalGAN to improve its ability to evade malware detectors, and achieves better performance.

Introduction

The introduction of MalGAN may be overly complex, but I feel it is necessary and worthwhile to record the information below.

The previous research
- [4] evaded CNN-based image classification
  by applying small perturbations that are invisible to the human eye
- [5] Equation-solving attacks and path-finding attacks, which can steal a machine learning model (just like MalGAN).
- [6] MalGAN
  - Overview
  - Generator of MalGAN
  - Substitute Detector of MalGAN
  - Malware Detector (black box) of MalGAN

Issues

This part is the most important part of the whole research, because issues mean you can make some improvements.

- Training and prediction should be done externally, because that is more convenient for the attacker.
- MalGAN reduces the number of API features to 128, which is not enough.
- $G(\theta,x)$ and $D(\theta,x)$ use the same API list for both MalGAN and $B$ (the black-box detector), which means the attackers know the datasets used to train $B$; this seems impossible in practice.
- MalGAN learns from multiple malware samples, which may affect evasion performance. (Maybe this is right, but the claim still needs to be proved.) Attackers want to use as few samples as possible, preferably a single malware sample, as training data, because obtaining one sample is more convenient.

Proposed Method

Some improvements for the issues mentioned above.

Overview of Improved-MalGAN
- Execute $B(\theta,x)$ externally by using Python’s subprocess library
  Generate 200 files for each epoch, then use Python’s subprocess library to feed the generated files to the malware detector and label them. I don't know how to implement this yet; I need to search for related information.
- Use all APIs, instead of selecting API features according to RF importance
  But in this paper, the number of files has to be restricted to keep the dimension of the API feature vector from growing too large.
- The API lists of MalGAN and $B(\theta,x)$ come from different training datasets
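As a rough idea of what "executing $B$ externally" could look like, the sketch below sends generated feature vectors to a separate Python process and reads the labels back. It is only an illustration: the inline `DETECTOR_SOURCE` script, its toy rule, the JSON-over-stdin exchange, and the function name `label_with_external_detector` are all assumptions made here, not the paper's implementation (the paper writes generated files and passes them to the real detector).

```python
import json
import subprocess
import sys

# Hypothetical stand-in for an external detector: it reads feature vectors as
# JSON on stdin and prints one label per vector (1 = malware, 0 = cleanware).
DETECTOR_SOURCE = """
import json, sys
vectors = json.load(sys.stdin)
# Toy rule for illustration only: flag a file that uses fewer than 3 APIs.
print(json.dumps([1 if sum(v) < 3 else 0 for v in vectors]))
"""


def label_with_external_detector(vectors):
    """Run the detector in a separate process and return its labels."""
    result = subprocess.run(
        [sys.executable, "-c", DETECTOR_SOURCE],
        input=json.dumps(vectors),
        capture_output=True,
        text=True,
        check=True,  # raise if the detector process exits with an error
    )
    return json.loads(result.stdout)


# Three made-up API feature vectors standing in for one batch of generated files.
generated = [[1, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
labels = label_with_external_detector(generated)
print(labels)
```

The returned labels can then be used as training targets for the substitute detector, so the GAN side never needs access to the detector's internals.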
Improved $G(\theta,x)$ and $D(\theta,x)$
Models are based on DCGAN[10]
- $G(\theta,x)$ of MalGAN
- $G(\theta,x)$ of Improved-MalGAN
  
  The noise input takes values in the range $(-1, 1)$, one per API in the API list, and is fed to the Generator. The Generator finally outputs values in the range $(-1, 1)$ through the Tanh function. The reason for using the Tanh function instead of the Sigmoid function, and so working with values in the range $(-1, 1)$, is to make use of the Parametric ReLU (PReLU) function described later.
- $D(\theta,x)$
  
  Based on Deep Convolutional GAN (DCGAN) [10], which is used for image generation,
  but in Improved-MalGAN the convolutional layers are replaced with dense layers,
  and the activation function is $f_{\mathrm{PReLU}}$ (PReLU), though DCGAN usually uses both $f_{\mathrm{PReLU}}$ and $f_{\mathrm{LeakyReLU}}$ (Leaky ReLU)
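To make the Tanh/Sigmoid/PReLU remark more concrete, here is a small sketch of the four activation functions in plain Python. One reading of the note is that Tanh outputs can be negative, so the negative-side slope of PReLU actually gets used, whereas Sigmoid outputs never are. The slope values `0.01` and `0.25` are illustrative assumptions, not values from the paper; in a real PReLU layer the slope is a learned parameter.

```python
import math


def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x, slope=0.01):
    """Identity for x >= 0, fixed small slope for x < 0."""
    return x if x >= 0 else slope * x


def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is learned during training."""
    return x if x >= 0 else alpha * x


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(
        f"x={x:+.1f}  tanh={math.tanh(x):+.3f}  sigmoid={sigmoid(x):.3f}  "
        f"leaky_relu={leaky_relu(x):+.3f}  prelu(alpha=0.25)={prelu(x, 0.25):+.3f}"
    )
```

Plotting these values over a wider range of `x` gives the usual graphs of the functions.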
$B$ : Black-box detector
Concretely, $B$ is one or more machine learning algorithms,
and the algorithms below are used in Improved-MalGAN.
• RF (n_estimators = 1000)
• MLP (hidden_layer_sizes = (64,))
Substitute Detector
- $D(\theta,x)$ of MalGAN
- $D(\theta,x)$ of Improved-MalGAN

Experiment

Experiment setup

Python 2.5
TensorFlow 1.11.0 and Keras 2.2.4, used to create MalGAN
Scikit-learn 0.20.0, used to create $B$

Dataset

Public dataset : FFRI Dataset 2018

FFRI:
a part of the research datasets of the anti-Malware engineering WorkShop (MWS) [12]

For $B$
- training data
  36265 malware and 26127 cleanware
- testing data
  36048 malware and 26590 cleanware
For MalGAN
- training data
  1 malware and 44 cleanware

Different pattern of API list

For $B$
training data
For MalGAN
training data
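A toy illustration of why separate training data leads to separate API lists: each side builds its list from the files it has seen, so the same malware maps to different feature vectors. This assumes binary "API is used / not used" features as in the original MalGAN; the helper names and the tiny example datasets are made up for illustration.

```python
def build_api_list(samples):
    """Collect every API name seen in a dataset, in sorted order."""
    return sorted({api for sample in samples for api in sample})


def to_feature_vector(sample, api_list):
    """1 if the sample uses the API, else 0; APIs outside the list are dropped."""
    return [1 if api in sample else 0 for api in api_list]


# Hypothetical datasets: each sample is the set of APIs one file uses.
detector_training_set = [{"CreateFileW", "ReadFile"}, {"WriteFile", "RegOpenKeyExW"}]
malgan_training_set = [{"CreateFileW", "VirtualAlloc"}, {"ReadFile", "Sleep"}]

detector_api_list = build_api_list(detector_training_set)
malgan_api_list = build_api_list(malgan_training_set)

malware = {"CreateFileW", "VirtualAlloc", "ReadFile"}
vec_for_detector = to_feature_vector(malware, detector_api_list)
vec_for_malgan = to_feature_vector(malware, malgan_api_list)

print(detector_api_list, vec_for_detector)
print(malgan_api_list, vec_for_malgan)
```

Here `VirtualAlloc` is visible to MalGAN but not to the detector, because it never appeared in the detector's training files.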

Classification results of $B$

[Figure: classification results of $B$]

Experiment results

Preliminary experiment result

[Figure: preliminary experiment results]

About the figures
- Y-axis
  the number of files classified as cleanware
- X-axis
  epochs

In this experiment, 200 files are generated for each epoch, so the maximum value on the Y-axis is 200.

Conclusion

1	The more feature quantities are used, the more easily MalGAN can evade detection, but if the average number of APIs is much larger than in the original malware, the result is not realistic (this is an issue*)
2	The RF algorithm is more robust than MLP

The approach to solving the issue* above

Append a layer that only calculates a loss, based on the implementation of the Variational Autoencoder (what is this?), to the end of $G$.
RMSE

$n$ is the number of files generated each epoch
$m$ is the number of feature quantities
$\hat{x}$ is the output data, rescaled to the range $(0, 1)$
$x$ is the original malware
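The formula itself is missing from these notes. Under the definitions above, a plain RMSE would read as follows; this is only an assumption about its general shape, and the paper's customized version may differ in its details:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\hat{x}_{ij} - x_{j}\right)^{2}}
$$

Here $\hat{x}_{ij}$ is feature $j$ of the $i$-th generated file and $x_{j}$ is feature $j$ of the original malware.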

The customized Root Mean Square Error is calculated between the output data, rescaled to the range $(0, 1)$, and the feature quantities of the original malware. Multiplying the RMSE by 0.05 and the binary cross entropy by 0.95 is the best setup for reducing the number of APIs.
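A minimal sketch of such a weighted loss in plain Python, assuming an ordinary RMSE (the paper's customized variant may differ) and a standard binary cross entropy. The function names, the toy inputs, and the convention that target 0 means "cleanware" are assumptions made for this example; only the 0.05 and 0.95 weights come from the notes.

```python
import math


def rmse(generated, original):
    """Plain RMSE between each generated vector and the original malware vector."""
    n, m = len(generated), len(original)
    total = sum((row[j] - original[j]) ** 2 for row in generated for j in range(m))
    return math.sqrt(total / (n * m))


def binary_cross_entropy(targets, probabilities, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(targets)


def combined_loss(generated, original, targets, probabilities,
                  w_rmse=0.05, w_bce=0.95):
    """Weighted sum: stay close to the original malware while fooling the detector."""
    return (w_rmse * rmse(generated, original)
            + w_bce * binary_cross_entropy(targets, probabilities))


original = [1, 0]                 # API features of the original malware
generated = [[1, 0], [1, 1]]      # two generated files, already in (0, 1)
targets = [0, 0]                  # the generator wants "cleanware" (assumed label 0)
probabilities = [0.5, 0.5]        # hypothetical substitute-detector outputs

rmse_value = rmse(generated, original)
loss = combined_loss(generated, original, targets, probabilities)
print(rmse_value, loss)
```

The RMSE term grows as more APIs are added relative to the original malware, so even a small weight pushes the generator toward adding fewer APIs.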

Experiment results after solving the issue* above

[Figure: experiment results with the customized RMSE]

OK, we can see that it is not robust, but the number of APIs is reduced to about 250 by epoch 30, so the RMSE works.

Conclusion

1	The customized RMSE can reduce the number of APIs required to evade $B$
2	Learning becomes unstable when the number of APIs is reduced, and even more unstable when RF importance is used

Conclusion

There are other features besides APIs that can be easily generated, such as strings…
The ease of evasion by MalGAN depends on the algorithm of $B$; hence, we can improve the malware detector to protect our anti-virus models.
As ML generation technology improves, maybe we can generate the binary itself.

Some of the details

This part records some interesting things and some knowledge I did not know before.

Tanh function
Keeps values in the range (-1, 1) so that the Parametric ReLU function (PReLU) can be used.
the graph of the function?
Sigmoid function
Parametric ReLU function (PReLU)

the graph of the function?
Leaky ReLU function (Leaky ReLU)
What is the meaning of “Execute the malware detectors externally using Python’s subprocess library, and import the detection results into MalGAN”???
TODO

References

[4] Nina Narodytska, and Shiva Kasiviswanathan, “Simple Black-Box Adversarial Perturbations for Deep Networks”, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1310-1318, 2017.
[5] Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart, “Stealing Machine Learning Models via Prediction APIs”, 25th USENIX Security Symposium, pp.601-618, 2016.
[6] Weiwei Hu, and Ying Tan, “Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN”, arXiv preprint arXiv:1702.05983, 2017.
[10] jacobgil, “keras-dcgan”, https://github.com/jacobgil/keras-dcgan (accessed 2018-12-1).
[12] Yuta Takata, Masato Terada, Takahiro Matsuki, Takahiro Kasama, Shoko Araki, and Mitsuhiro Hatada, “Datasets for Anti-Malware Research MWS Datasets 2018”, Information Processing Society of Japan, Vol.2018-CSEC-82, No.38, 2018.
The function figures in this note are from 🔗, and thanks to the author. Thanks!