Information about the authors
Abstract
Improved-MalGAN builds on MalGAN to improve its ability to evade detection, and achieves better performance.
Introduction
The introduction of MalGAN may be too detailed, but I feel it is necessary and worthwhile to record the information below.
- The previous research
  - [4] evaded CNN-based image classification by applying small perturbations that are invisible to the human eye
  - [5] Equation-Solving Attacks and Path-Finding Attacks, which can steal a machine learning model (just as MalGAN does)
  - [6] MalGAN
- Overview
  - Generator of MalGAN
  - Substitute Detector of MalGAN
  - Malware Detector (black box) of MalGAN
Issues
This is the most important part of the whole study, because each issue is an opportunity for improvement.
- Training and prediction can be done externally, because that is more convenient for an attacker
- The number of API features in MalGAN is reduced to 128, which is not enough
- $G(\theta,x)$ and $D(\theta,x)$ use the same API list for both MalGAN and $B$ (the black-box detector), which means the attacker knows the dataset used to train $B$; this seems unrealistic
- Training MalGAN on multiple malware samples may hurt evasion performance (this may be right, but the claim still needs to be proved), and attackers want to use as few malware samples as possible, preferably only one, as training data, because it is easier to obtain a single malware sample
Proposed Method
Some improvements for the issues mentioned above.
- The overview of Improved-MalGAN
  - Execute $B(\theta,x)$ externally by using Python's subprocess library
    Generate 200 files for each epoch, then use Python's subprocess library to feed the generated files to the malware detector and label them. I don't know how to implement this yet; let me look for some related information.
  - Use all APIs, instead of selecting API features according to RF importance
    But in this paper, the number of files has to be restricted so that the dimension of the API feature vector does not become too large.
  - The API lists of MalGAN and $B(\theta,x)$ come from different training datasets
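One way to read "execute $B$ externally" is that the detector runs as a separate program: each generated sample is written to a file, the detector is started on that file with `subprocess`, and its verdict is read back as a label. Below is a minimal sketch under that reading. The stand-in detector script, its `malware` / `cleanware` output format, and the one-file-per-sample layout are all assumptions for illustration, not details taken from the paper.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for the real black-box detector (hypothetical): a tiny script that
# reads a file of 0/1 API flags and prints "malware" or "cleanware".
DETECTOR_SOURCE = """
import sys
flags = open(sys.argv[1]).read().split()
print("malware" if flags.count("1") < 3 else "cleanware")
"""

def label_with_external_detector(feature_vectors, detector_cmd, workdir):
    """Write each generated sample to a file, run the detector on it in a
    separate process, and return 1 for malware / 0 for cleanware."""
    labels = []
    for i, vec in enumerate(feature_vectors):
        sample = Path(workdir) / f"sample_{i}.txt"
        sample.write_text(" ".join(str(v) for v in vec))
        result = subprocess.run(
            detector_cmd + [str(sample)],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,  # raise if the detector process fails
        )
        labels.append(1 if result.stdout.strip() == "malware" else 0)
    return labels

with tempfile.TemporaryDirectory() as tmp:
    detector = Path(tmp) / "detector.py"
    detector.write_text(DETECTOR_SOURCE)
    # Three toy "generated files"; the notes use 200 per epoch.
    generated = [[1, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
    labels = label_with_external_detector(
        generated, [sys.executable, str(detector)], tmp
    )
    print(labels)
```

The returned labels would then serve as the training targets for the substitute detector, so MalGAN never needs access to the internals of $B$.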
- Improved $G(\theta,x)$ and $D(\theta,x)$
  Models are based on DCGAN [10]
  - $G(\theta,x)$ of MalGAN
  - $G(\theta,x)$ of Improved-MalGAN
    The noise consists of values in the range $(-1, 1)$, one for each API in the API list, and is input to the Generator. The Generator finally outputs values in the range $(-1, 1)$ through the Tanh function. Tanh is used instead of Sigmoid so that values lie in $(-1, 1)$, which is needed for the Parametric ReLU (PReLU) function described later.
  - $D(\theta,x)$
    Based on Deep Convolutional GAN (DCGAN) [10], which is used for image generation, but in Improved-MalGAN the convolutional layers are replaced with dense layers, and the activation function is PReLU, though DCGAN usually uses both PReLU and Leaky ReLU.
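The Generator description above (noise in $(-1, 1)$, dense layers, PReLU, Tanh output) can be sketched as a plain forward pass. This is only an illustration in pure Python: the layer sizes, the random weights and the PReLU slope are hypothetical, and the real model is built with Keras and trained rather than randomly initialised.

```python
import math
import random

def prelu(x, alpha):
    """Parametric ReLU: identity for x >= 0, slope alpha for x < 0."""
    return x if x >= 0 else alpha * x

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def generator_forward(noise, params):
    """Dense -> PReLU -> Dense -> Tanh, so every output lies in (-1, 1)."""
    hidden = dense(noise, params["w1"], params["b1"])
    hidden = [prelu(h, params["alpha"]) for h in hidden]
    out = dense(hidden, params["w2"], params["b2"])
    return [math.tanh(o) for o in out]

rng = random.Random(0)
num_apis = 8       # length of the API list (hypothetical; the real list is far longer)
hidden_units = 16  # hypothetical layer width

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {
    "w1": random_matrix(hidden_units, num_apis), "b1": [0.0] * hidden_units,
    "w2": random_matrix(num_apis, hidden_units), "b2": [0.0] * num_apis,
    "alpha": 0.25,  # PReLU slope; learned during training in a real model
}

# One noise value per API in the API list, drawn from the range -1 to 1.
noise = [rng.uniform(-1.0, 1.0) for _ in range(num_apis)]
api_scores = generator_forward(noise, params)
print(api_scores)
```

Each output can be read as a score for one API in the list; how the scores are turned into a final feature vector is left out here because the notes do not spell it out at this point.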
- $B$: Black-box detector
  Strictly speaking, $B$ is one or more machine learning algorithms, and the algorithms below are used in Improved-MalGAN.
  • RF (n_estimators = 1000)
  • MLP (hidden_layer_sizes = (64,))
- Substitute Detector
  - $D(\theta,x)$ of MalGAN
  - $D(\theta,x)$ of Improved-MalGAN
Experiment
Experiment setup
- Python 2.5
- TensorFlow 1.11.0 and Keras 2.2.4
  Used to create MalGAN
- Scikit-learn 0.20.0 to create $B$
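The two black-box detectors named in the proposed method, RF with 1000 estimators and MLP with one hidden layer of 64 units, could be created with scikit-learn roughly as follows. This is a sketch only: the random binary feature matrix and the labelling rule are made-up stand-ins for the API feature vectors, and `random_state` and `max_iter` are choices made here for reproducibility, not settings from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

# Toy stand-in for API feature vectors (hypothetical data, not the FFRI dataset):
# each row is a binary "API used / not used" vector, label 1 = malware.
X = rng.randint(0, 2, size=(200, 20))
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

detectors = {
    "RF": RandomForestClassifier(n_estimators=1000, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), random_state=0, max_iter=500),
}

for name, clf in detectors.items():
    clf.fit(X[:150], y[:150])  # train on the first 150 rows
    print(name, "test accuracy:", clf.score(X[150:], y[150:]))
```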
Dataset
Public dataset: FFRI Dataset 2018
FFRI: part of the research datasets of the anti-Malware engineering WorkShop (MWS) [12]
- For $B$
  - training data
    36265 malware and 26127 cleanware
  - testing data
    36048 malware and 26590 cleanware
- For MalGAN
  - training data
    1 malware and 44 cleanware
Different pattern of API list
- For $B$
  training data
- For MalGAN
  training data
Classification results of $B$
Experiment results
Preliminary experiment result
- About the figures
  - Y-axis
    the number of files classified as cleanware
  - X-axis
    epochs
  In this experiment, 200 files are generated for each epoch, so the maximum value on the Y-axis is 200.
- Conclusion
1 | The more feature quantities are used, the more easily MalGAN evades detection; but if the average number of APIs is much larger than in the original malware, the result is not realistic (this is an issue*) |
---|---|
2 | RF algorithm is more robust than MLP |
The approach to solving the issue* above
- Append a layer, which only performs a calculation and is based on an implementation of the Variational Autoencoder (what is this?), to the end of $G$.
- RMSE
  $n$ is the number of files generated each epoch
  $m$ is the number of feature quantities
  $\hat{x}$ is the output data after replacement
  $x$ is the original malware
  Calculate the customized Root Mean Square Error between the output data mapped into the range $(0,1)$ and the feature quantities of the original malware. Multiplying the RMSE by 0.05 and the binary cross entropy by 0.95 is the best setup for "reducing the number of APIs".
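The exact form of the customized RMSE is not reproduced in these notes, so the sketch below assumes the plain definition over $n$ generated files and $m$ features, $\mathrm{RMSE} = \sqrt{\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}(\hat{x}_{ij} - x_j)^2}$, and assumes that the Tanh output is mapped to $(0,1)$ by $(o+1)/2$. Only the 0.05 and 0.95 weights come from the text; everything else, including the function names, is illustrative.

```python
import math

def to_unit_range(tanh_output):
    """Map Generator outputs from (-1, 1) to (0, 1). Assumed mapping."""
    return [(o + 1.0) / 2.0 for o in tanh_output]

def rmse(generated, original):
    """RMSE between n generated vectors (each of length m) and the single
    original malware vector of length m."""
    n, m = len(generated), len(original)
    total = sum((x_hat - x) ** 2
                for row in generated
                for x_hat, x in zip(row, original))
    return math.sqrt(total / (n * m))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy, with predictions clipped away from 0 and 1."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

def combined_loss(generated, original, y_true, y_pred,
                  rmse_weight=0.05, bce_weight=0.95):
    """Weighted sum using the 0.05 / 0.95 weights quoted in the notes."""
    return (rmse_weight * rmse(generated, original)
            + bce_weight * binary_cross_entropy(y_true, y_pred))

# Extreme values are used only to keep the arithmetic easy to follow.
original = [1.0, 0.0, 1.0, 0.0]
generated = [to_unit_range(v) for v in ([1.0, -1.0, 1.0, -1.0],
                                        [1.0, 1.0, 1.0, -1.0])]
loss = combined_loss(generated, original, y_true=[0, 0], y_pred=[0.5, 0.5])
print(rmse(generated, original), loss)
```

The RMSE term penalises generated vectors that drift far from the original malware, which is why a small weight on it is enough to keep the number of added APIs down.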
Experiment results after solving the issue* above
OK, we can see that it is not robust, but the number of APIs is reduced to about 250 at around epoch 30, so the RMSE works.
- Conclusion
1 | The customized RMSE can reduce the number of APIs required to evade $B$ |
---|---|
2 | Learning becomes unstable when the number of APIs is reduced, and even more unstable when RF importance is used |
Conclusion
- Besides APIs, there are other features that can be generated easily, such as strings…
- How easily MalGAN evades detection depends on the algorithm of $B$; hence, we can improve the malware detector to protect our anti-virus models.
- With improvements in ML generation technology, maybe we will be able to generate the binary itself.
Some of the details
This part records some interesting things and some knowledge that was new to me.
- Tanh function
  Used to handle values in the range $(-1,1)$ in order to use the Parametric ReLU function (PReLU).
  the graph of the function?
- Sigmoid function
- Parametric ReLU function (PReLU)
  the graph of the function?
- Leaky ReLU function
- What is the meaning of "Execute the malware detectors externally using Python's subprocess library, and import the detection results into MalGAN"???
  TODO
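For reference, the four activation functions listed above can be written in a few lines of Python. These are the standard textbook definitions; the slope values `0.01` and `0.25` are only example numbers, and in a real PReLU layer `alpha` is learned during training.

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any real number into (-1, 1), centred on 0."""
    return math.tanh(x)

def leaky_relu(x, slope=0.01):
    """ReLU with a small fixed slope for negative inputs."""
    return x if x >= 0 else slope * x

def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is a parameter learned in training."""
    return x if x >= 0 else alpha * x

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), leaky_relu(x), prelu(x, alpha=0.25))
```

The difference that matters for these notes is the output range: Sigmoid never produces negative values, while Tanh is symmetric around zero.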
References
- [4] Nina Narodytska, and Shiva Kasiviswanathan, “Simple Black-Box Adversarial Perturbations for Deep Networks”, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.1310-1318, 2017.
- [5] Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart, “Stealing Machine Learning Models via Prediction APIs”, 25th USENIX Security Symposium, pp.601-618, 2016.
- [6] Weiwei Hu, and Ying Tan, “Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN”, arXiv preprint arXiv:1702.05983, 2017.
- [10] jacobgil, “keras-dcgan”, https://github.com/jacobgil/keras-dcgan (accessed 2018-12-1).
- [12] Yuta Takata, Masato Terada, Takahiro Matsuki, Takahiro Kasama, Shoko Araki, and Mitsuhiro Hatada, “Datasets for Anti-Malware Research MWS Datasets 2018”, Information Processing Society of Japan, Vol.2018-CSEC-82, No.38, 2018.
- The function figures in this note are from 🔗, and thanks to the author. Thanks!