Using Markov Chains for Android Malware Detection

If you’re chatting with someone, and they tell you “aslkjeklvm,e,zk3l1” then they’re speaking gibberish. But how can you teach a computer to recognize gibberish, and more importantly, why bother? I’ve looked at a lot of Android malware, and I’ve noticed that many of them have gibberish strings either as literals in the code, as class names, in the signature, and so on. My hypothesis was that if you could quantify gibberishness, it would be a good feature in a machine learning model. I’ve tested this intuition, and in this post I’ll be sharing my results.

What is a Markov chain?

You could just read Wikipedia, but ain’t no one got time for that. Simply put, you show it lots of strings to train it. Once it’s trained, you can give it a string and it’ll tell you how much it looks like the strings it was trained on. This is called getting the probability of a string.

Markov chains are quite simple. The code I used is here: https://gist.github.com/CalebFenton/a129333dabc1cc346b0874407f92b568

It’s loosely based on Rob Neuhaus’s Gibberish Detector.

The Experiment

I’ve been trying to improve the model I use in the Judge Android malware detector. It would have been a lot easier to just create some Markov based features, throw them into the model, and roll with it. But I wanted to be scientific. Everyone thinks science is all fun and rainbows, but thats only in the highlight reel. The real bulk of science, the behind the scenes footage, is tedious and hard. In order to make a data-driven decision about which features worked best and if they helped at all, I took lots of notes of my experiment and preserved all the data. Hopefully, it’s helpful to others.

I started by collecting all of the strings from my data set. For reference, the data set contained ~185,000 samples, about half of which are malicious. Malicious apps are from all over, but many are from VirusTotal with 20+ positives. Benign apps are possibly more diverse and contain strings in several languages.

After collecting strings, I trained a Markov model on the strings in the benign apps. For each occurrence of a string, I trained the model on that string. This model would “know” what legitimate, non-malicious strings were probable. Then, I calculated the probability of each benign string and plotted the distribution on a curve.

Finally, I used the Markov model trained on the benign strings to calculate probabilities of all the malicious strings and also plotted them. Before I spent time writing any code for features, I wanted to know if there was even a significant difference in probability distribution. If there wasn’t, I would have probably written the code anyway because this is how I spend my weekends, but it turns out there was a significant difference and I felt much more confident the features would be useful.

The Markov chain was configured to use bigrams and strings were tokenized depending on their category. Below are the patterns used:

      
      
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
      
      
@Nullable
private static String getTokenSplitPattern(String stringType) {
switch (stringType) {
case "activityNames":
case "receiverNames":
case "serviceNames":
return null;
case "appLabels":
case "appNames":
case "stringLiterals":
return "(?:[\\s+\\p{Punct}]+)";
case "classNames":
case "fieldNames":
case "methodNames":
return null;
case "packageNames":
return "(?:\\.)";
case "packagePaths":
return "(?:/)";
case "issuerCommonNames":
case "issuerCountryNames":
case "issuerLocalities":
case "issuerOrganizationalUnits":
case "issuerOrganizations":
case "issuerStateNames":
case "subjectCommonNames":
case "subjectCountryNames":
case "subjectLocalities":
case "subjectOrganizationalUnits":
case "subjectOrganizations":
case "subjectStateNames":
return "(?:\\s+)";
default:
return null;
}
}

String categories and the probability distributions are listed below.

From the Android manifest:

From the Dalvik executable code:

From the signing certificate:

Comparing the Graphs

Let’s look at a specific example. One of the largest collections of strings were the string literals. These were all the strings that occurred in code. So in the case of String message = "hello,world!";, the string literal is hello,world!.

Below is the probability distribution of benign strings:

benign string literalsbenign string literals

Below is the probability distribution of malicious strings:

malicious string literalsmalicious string literals

Notice the scale along the x-axis is different between the two graphs, but also note this isn’t as important as the general shape.

In the benign graph, strings are generally of a higher probability, especially at the lower end between 0.01 and 0.05 probability. This would indicate that adding this information to the model would improve model performance.

To add this information as a feature, I collected the probabilities of each string and put them into different bins. For example, for every string with a probability <= 0.01, I incremented the count of strings in the “probability 0.01” bin. The bins were:

      
      
1
2
3
4
      
      
double[] bins = new double[]{
0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10,
0.20, 0.30, 0.40, 0.50, 1.0
};

Notice that bins were more numerous in the lower probabilities to better capture the difference in observed distribution between the two classes (malicious / benign). Similar bin distributions were made for other string categories.

Results and Conclusion

By incorporating Markov features, model performance was fairly improved. Below are the “v1” model scores from a 10-fold cross validation using the top 10,000 features:

      
      
1
2
3
      
      
recall: 0.938063119746
precision: 0.979244936027
accuracy: 0.965148875727

Below are the “v2” model scores measured in the same way:

      
      
1
2
3
      
      
recall: 0.958403081067
precision: 0.984061094076
accuracy: 0.97440419396

I regard this as pretty good because every percentage point over 90% is difficult. Also note that these model performance scores are not using the optimal algorithm. I just threw the top 10,000 features into an ExtraTreesClassifier because it’s fast and fairly good. I wasn’t interested in best possible score, only relative improvements. Using a single, well-tuned model, and smarter feature selection, performance reaches 99%.

But 99% is still relatively poor performance for anti-virus, especially for precision, because false positives are so bad. I think the reason for the poor performance is because the malicious test set included adware and injected malware. Adware apps are difficult to distinguish statically and it’s even hard with skilled, manual analysis. Usually, it can only be known by behavioral analysis, i.e. does running this app cause many “annoying” ads to be displayed. By focusing the data set (i.e. cherry picking) more overtly malicious apps, model performance may get a lot better.

So-called injected malware is an otherwise benign app which has had malicious code injected into it. This is uniquely easy for Android malware authors due to the ease of app disassembly and reassembly. Perhaps unsurprisingly, the top features in Judge’s model are from APKiD which can fingerprint the compiler as a proxy for if an app has had code injected into it. This type of malware is also difficult to detect because usually most of the code is legitimate and this means features will reflect a mostly legitimate app.

In any case, further experimentation need to be done to decide which class of malware is most often missed. Once this is known, more features should be added to specifically address that type of malware.

Another remediating factor to consider with using this model is that any real anti-virus solution will be able to generally white list several types of apps to greatly reduce the impact of false positives. For example, an anti-virus company could record which signatures are most common and whitelist all of them. This would preclude the possibility of false positive detections of any popular or system apps.

Future Work

This experiment took two days to run and most of the time was spent extracting features. Also, during that time, my main desktop was booted into Linux and was unavailable for gaming, which was a huge drag. It should be possible to use a smaller data set to reduce iteration time. Maybe I could try 20-50 thousand rather than 180 thousand. samples.

Markov tokenization and n-gram selection can also be adjusted. It’s not clear if bigrams are ideal here. Trigrams may be better, or maybe 1-grams. Maybe it would be better to tokenize string literals only on spaces rather than spacing and punctuation since a lot of obfuscated strings are heavy on punctuation, and it would be good to capture that.

Learning done on the Markov features is probably quite sensitive to how the bins are defined. It may be good to add more bins in the upper ranges, or perhaps even more bins in the lower ranges. Once there are a few data points for performance, it should be possible to algorithmically determine bin sizes and automate some of the guess work. This would make ongoing model updates easier.

It may be useful to filter out commonly occurring classes. Most apps are bloated with common libraries, and by filtering them, the features would more accurately reflect what’s unique to each app. I’ve already created the fingerprinting code, which I’ll write about in a later post. My concern with this is that it’ll require more work to maintain a list of common classes since major libraries are numerous and change frequently.

Reference Data

I’m a big believer in sharing data so people can play with it themselves. Below are links to most of the data for these experiments:

These are compressed JSON files which should be easy to inspect and understand. String counts were recorded rather than strings because the Markov chains were trained by adding each string to the chain for each time it was counted. So if a string was seen 100 times, the Markov chain would have that string added 100 times.

Here’s some example code to deserialize the Markov model. This assumes it’s part of the Jar as a resource.

      
      
1
2
3
4
5
6
7
8
9
10
11
12
      
      
InputStream is = MarkovLoader.class.getResourceAsStream( "/markov_chains.obj");
if (is == null) {
throw new IOException( "Markov chains resource not found");
}
ObjectInputStream ois = new ObjectInputStream(is);
Object obj = ois.readObject();
Map<String, MarkovChain> markovChains;
if (obj instanceof HashMap) {
markovChains = (HashMap) obj;
} else {
throw new IOException( "Weird markov chain resource");
}
Share Comments


原文地址: http://calebfenton.github.io/2017/08/23/using-markov-chains-for-android-malware-detection/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
This book is based on our years-long research conducted to systematically analyze emerging Android malware. Some of our earlier research results and findings were reported in an IEEE conference paper entitled Dissecting Android Malware: Characterization and Evolution, which was presented at the IEEE Symposium on Security and Privacy (often mentioned as Oakland conference in the security community) in May, 2012 [77]. During and after the conference, we were pleased to receive and hear inquiries from colleagues with encouraging comments on the systematization of knowledge work that has been conducted in our conference paper. Partially because of that, we are motivated to expand our work and hope such efforts will be of service to the security and privacy community. Further, as part of that, we have released corresponding malware dataset for our study under the name Android Malware Genome Projectto the community. With that, we want to take this opportunity to thank our collaborators, Dongyan Xu, Peng Ning, Xinyuan Wang, Shihong Zou, and others, whose valuable insights and comments greatly enriched our work. The authors are also grateful to colleagues in the Cyber Defense Lab at NC State University, especially Tyler Bletsch, Zhi Wang, Michael Grace, Deepa Srinivasan, Minh Q. Tran, Chiachih Wu, Wu Zhou, and Kunal Patel. Special thanks also go to Susan Lagerstrom-Fife and our publisher for their great help and patience! This research was supported in part by the US National Science Foundation (NSF) under Grants 0855297, 0855036, 0910767, and 0952640. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, for the NSF. 1 Introduction ........................................ 1 2 A Survey of Android Malware........................... 3 2.1 Malware Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Malware Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.1 Malware Installation . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.2 Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Malicious Payloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4 Permission Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Case Studies ........................................ 21 3.1 Malware I: Plankton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1 Phoning Home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.2 Dynamic Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Malware II: DroidKungFu . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2.1 Root Exploits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.2 Command and Control (C&C) Servers . . . . . . . . . . . . . 24 3.2.3 Payloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.4 Obfuscation, JNI, and Others . . . . . . . . . . . . . . . . . . . . 26 3.3 Malware III: AnserverBot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Anti-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.2 Command and Control (C&C) Servers . . . . . . . . . . . . . 28 4 Discussion.......................................... 31 5 Additional Reading................................... 33 5.1 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.1.1 Malware Detection and Defense . . . . . . . . . . . . . . . . . . 33 5.1.2 Smartphone (Apps) Security. . . . . . . . . . . . . . . . . . . . . 34 5.2 Conference and Workshop Proceedings . . . . . . . . . . . . . . . . . . 34 ix 6 Summary........................................... 37 References............................................ 39 Index ................................................ 43

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值