Analyzing PDF Malware - Part 1

Latest recommended article published on 2014-04-10 11:21:03

kezhen

http://blog.spiderlabs.com/2011/09/analyzing-pdf-malware-part-1.html

Background

I’d like to think that security awareness has gotten to the point where the average end user thinks twice before opening an ‘exe’ file sent to them as an email attachment. I like to think that. I really do. But when it comes to opening PDF documents, whether it be an email attachment or their latest online utility bill, I can’t even begin to convince myself that there is ever a moment of hesitation. And am I the only one who finds it ironic that security publications covering recent PDF attacks can often be downloaded in PDF form? How long would it take to live down being compromised by a document that is warning users about itself? Maybe it’s just my affinity for self-referential humor.

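I would like to think that security awareness has reached the point where every end user thinks twice before opening an exe file received as an email attachment. I like to think so, and I act that way myself. But when it comes to opening a PDF document, whether it arrives as an email attachment or as the latest utility bill, there is rarely any such hesitation.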

It is worth noting that the Portable Document Format is already a huge winner for presenting content, for a number of reasons: the proliferation of easily accessible cross-platform readers and relatively small file sizes are two obvious ones. Since many attackers tend to be opportunistic, PDF’s popularity among end users and its ability to perform dynamic actions make it a natural choice as an attack vector. Attackers go where the victims are, so to speak. Oftentimes it comes down to a simple numbers game: more users on a particular platform means more potential victims for the attacker.

If you pay attention to the news, you don’t have to think back too far to remember various incidents involving malformed PDF documents. From hackers who leverage malicious PDF documents to gain a foothold on the internal network of a major corporation, to reversers taking advantage of a weakness in a rendering engine to jailbreak their smartphones, PDFs are being used to bypass established security protections. How do we defend ourselves against maliciously crafted PDFs? There are a variety of methods that can be employed, but I think the best first move for those with the technical inclination is to understand the problem at hand by looking at a sample.

In this first part of a multi-part writeup we will analyze a sample PDF, aptly named sample1.pdf, and attempt to determine whether the file is malicious. We will analyze it using a blend of static and dynamic methodologies. If we determine that the file is malicious (spoiler alert: it is), we will dissect the attacks that were employed. We will trace the code of the document through various rounds of obfuscation, root out common techniques employed by the attackers, and identify the vulnerabilities that were targeted.

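In this first part we analyze a sample PDF, which we will call sample1.pdf, and determine whether it is malicious. We use a combination of static and dynamic methods. If we determine that the document is malicious, we will dissect the attacks it employs. We will trace the document’s code through its layers of obfuscation, track the common techniques the attackers used, and identify the vulnerabilities the document exploits.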

Don’t forget…

First things first. It may seem like it goes without saying, but in your zeal to dig into the tech, you may forget to check whether someone has already encountered your file and done the heavy lifting for you. Before you even try opening the file, run a quick MD5 sum and do an online search to see if you get any hits. If you know that there aren’t any confidentiality issues regarding your file, you may also want to submit it to any of the myriad online services. A fairly comprehensive list of online services, from anti-virus scanners to automated sandboxes, can be found over at cleanbytes.net. You may find that your answers are already well documented or easily detected through automated analysis. If this is your case, “Bob’s your uncle”, as they say. However, if you are not able to get a clear answer one way or the other through your searching, or if your particular file has the potential to contain sensitive information, you may need to take the next step in analysis.

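Before analyzing, compute the file’s MD5 value and check whether an existing analysis is available for reference. If none turns up, you can submit the file to one of the many online services. A fairly complete list of online anti-virus scanners and automated sandboxes can be found at cleanbytes.net. These automated analyses may already give a thorough answer, in which case you are done; however, if they do not, or if the file contains sensitive content, you need to continue with the analysis steps below.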
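The hashing step can be sketched in a few lines of Python. This is a minimal sketch, not part of the original toolchain: the function names stream_digests and file_digests are made up for illustration, and SHA-256 is computed alongside MD5 only because many lookup services accept either.

```python
import hashlib


def stream_digests(handle, chunk_size=65536):
    """Return (md5_hex, sha256_hex) for a binary file-like object.

    The data is read in chunks so that large samples are never held
    in memory all at once.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        md5.update(chunk)
        sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def file_digests(path):
    """Hash a file on disk without opening it in any PDF viewer."""
    with open(path, "rb") as handle:
        return stream_digests(handle)
```

Calling file_digests("sample1.pdf") would return the two hex strings to paste into a search engine or a scanning service.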

What would you say “ya do” here?

Since you are investigating the nature of your file, you will want to use a few tools to peek inside the file without dynamically executing the contents. There are a growing number of tools to choose from when analyzing PDFs. I will demonstrate a sampling of them throughout the post. To begin with, a simple strings dump will give us any printable characters in the file:

Many tools are available for analyzing PDF documents. To begin with, a simple strings dump shows us the printable characters in the document.
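For readers without the strings utility at hand, a rough equivalent can be sketched in Python. The function name printable_strings and the default minimum length of 4 are assumptions for illustration; the real tool has more options.

```python
import re


def printable_strings(data, min_length=4):
    """Yield runs of printable ASCII (plus tab) of at least min_length bytes.

    This only reads the bytes; nothing in the file is parsed or executed.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")
```

Reading the sample with open("sample1.pdf", "rb").read() and looping over printable_strings(...) gives a listing comparable to a plain strings dump.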

A quick look at this output gives a bit of helpful information immediately: we find JavaScript content mixed in with common PDF objects. JavaScript is supported by PDF and is often the workhorse that attackers use to set up and execute attacks. Many JavaScript obfuscation tricks commonly used in web browsers can also be used with success in a PDF. In addition to using the strings command, you may also want to use your favorite hex editor to statically view the contents of the PDF. In some cases, such as with the 010 Editor, there are templates that can be used to do some minimal parsing of the file’s structure to wrap a bit of context around the printable characters, giving you a sense of the object’s overall structure.

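A quick look at the output yields some useful information: JavaScript content is mixed in with the ordinary PDF objects of this document. PDF supports JavaScript, which often serves as the main vehicle for attacks. Many JavaScript obfuscation techniques used in web browsers also work in PDFs.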

Running a second tool, PDFiD from Didier Stevens, confirms what we are seeing in the previous strings output by displaying the structure of the objects and actions:

PDFiD shows us that there are three objects, but more importantly it counts “/JS” and “/JavaScript” occurrences, which also match up with what we see in the strings dump.
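The idea of counting risky names can be illustrated with a short Python sketch. This is not PDFiD’s code: the keyword list, the function names, and the blanket decoding of #xx escapes across the whole file are simplifying assumptions made here.

```python
import re

# Illustrative subset of names worth counting (an assumption, not PDFiD's list).
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch"]


def decode_name_escapes(data):
    """Undo #xx hex escapes that PDF allows inside names (e.g. /J#61vaScript)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda match: bytes([int(match.group(1), 16)]),
        data,
    )


def count_keywords(data):
    """Return a dict mapping each keyword to its number of occurrences."""
    normalized = decode_name_escapes(data)
    counts = {}
    for keyword in KEYWORDS:
        # Require a non-alphanumeric byte (or end of data) after the name,
        # so that /JS is not counted inside a longer name such as /JSON.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, normalized))
    return counts
```

A non-zero count for /JS or /JavaScript is only a hint that the file deserves a closer look; plenty of benign PDFs carry JavaScript too.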

A third and final tool we can use to take a static look at our file is pdfscan.rb from Origami, a Ruby framework used to analyze PDF documents:

Again we get confirmation with pdfscan’s slightly more verbose layout that “/JavaScript” is present in our sample1.pdf.

In the past, much of the JavaScript encountered within PDF files was very straightforward in nature. Today, there is a veritable cornucopia of obfuscation techniques employed by attackers. In addition to the scripting capabilities of JavaScript, the current PDF specification supports a number of different encoding types in the form of “Standard Filters”. Described at a high level, Filters support actions such as data compression, modification of character representation, and encryption. Filters can be used by attackers to hide from anti-virus signatures and to create havoc for security analysts trying to manually untangle the gnarly mess, especially when they are called in succession or “cascaded”. Didier Stevens has a must-read, albeit old, write-up covering more of the detailed ways that PDFs can be encoded.
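To make cascading concrete, here is a minimal Python sketch that undoes two of the standard filters in sequence. Only ASCIIHexDecode and FlateDecode are covered, predictor parameters are ignored, and the names DECODERS and apply_filters are invented for this illustration.

```python
import binascii
import zlib


def ascii_hex_decode(data):
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks end of data."""
    hex_digits = b"".join(data.split(b">", 1)[0].split())
    if len(hex_digits) % 2:
        hex_digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(hex_digits)


def flate_decode(data):
    """FlateDecode: zlib/deflate decompression (no predictor handling here)."""
    return zlib.decompress(data)


DECODERS = {"ASCIIHexDecode": ascii_hex_decode, "FlateDecode": flate_decode}


def apply_filters(data, filter_names):
    """Decode a stream by applying the filters in the order they are listed."""
    for name in filter_names:
        data = DECODERS[name](data)
    return data
```

A stream declared with a filter array of ASCIIHexDecode followed by FlateDecode would be decoded with apply_filters(raw, ["ASCIIHexDecode", "FlateDecode"]). Each extra layer is trivial to undo on its own, but a signature written against the plain script no longer matches the bytes on disk.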

Get into the light where you belong.

Luckily there are tools available to help in the extraction of the JavaScript we noticed in our static analysis. First let’s take a look at the output of extractjs.rb that has been run against our sample1.pdf. This Ruby script is from the same Origami project as pdfscan.rb mentioned earlier:

All right, now we are getting somewhere! We’ve extracted the main chunk of JavaScript from the /JS tag in Object 1, but some key pieces are still missing from the code. One of the techniques that attackers have adapted over time is to hide snippets of code within document variables normally used to describe the PDF itself, such as the document’s title, or subject. These meta-fields are shown below with the help of yet another command line tool pdf.py from the jsunpack-n suite of scripts:

We see from the output that Object 3 contains three additional tags:

1) Producer = substr

2) Subject = spli

3) Title = [data 45194 bytes]

The “Title” tag actually contains a very large string of data 45,194 bytes long. We will come back to that later. You may notice that at the end of pdf.py’s output it says that it wrote JavaScript to ../sample1.pdf.out. If we view the beginning of that file we see that the script makes a good attempt to capture those extra Tag values and make them accessible. Very handy.
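A rough way to pull such metadata values out of an uncompressed Info dictionary is sketched below in Python. It is far simpler than pdf.py: it assumes the values are literal strings in parentheses, ignores hex strings, nested parentheses and object streams, and the 1024-byte threshold in oversized_fields is an arbitrary choice made for this example.

```python
import re

# Matches /Key (literal string), allowing backslash escapes inside the string.
INFO_ENTRY = re.compile(
    rb"/(Title|Subject|Producer)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL
)


def info_fields(data):
    """Return a dict of metadata name -> raw bytes value found in the data."""
    fields = {}
    for key, value in INFO_ENTRY.findall(data):
        fields[key.decode("ascii")] = value
    return fields


def oversized_fields(fields, limit=1024):
    """Return name -> length for values longer than limit bytes."""
    return {name: len(value) for name, value in fields.items() if len(value) > limit}
```

A title tens of thousands of bytes long, as in our sample, would stand out immediately in the result of oversized_fields.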

Once the three document variables from Object 3 have been combined with the JavaScript we extracted from Object 1, we can trace, de-obfuscate, and simplify the small amount of code by hand to produce the following readable output:

Untangling this code isn’t a required step, but it gives you a more complete view into what is going on under the hood, and can help prevent missing a branch of conditional code that might be hiding some unknown functionality.

We can see that the large amount of data stored in the title variable is being decoded and evaluated. The next step is to execute our code in a controlled manner to see what the code is doing with that data. There are a couple of ways to accomplish this. If your preference is for command line tools, SpiderMonkey is the way to go. Like many of these great analysis tools, it comes pre-compiled on Lenny Zeltser’s REMnux Linux distro. If you prefer a GUI for this stage, Malzilla or PDFStreamDumper are both nice visual solutions. We are going to mix it up a bit and check out one of the GUIs.

We have used Malzilla to run our JavaScript (top pane in image), which produces a second stage of JavaScript (lower pane in image). Fortunately this second stage code is much easier to read than the first and only contains some minor obfuscation. Malzilla conveniently saves this new output to a file for us. The initial line of the newly produced script is the variable ‘bjsg’ containing escaped shellcode. This will be a primary target for analysis later. After some beautification, a bit of formatting, and some renaming we can investigate the rest of the file:
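Escaped shellcode of this kind is usually written as a run of %uXXXX tokens that the script passes to unescape(). The Python sketch below turns such a string into raw bytes for later disassembly. It assumes the payload consists only of %uXXXX tokens and skips anything else; the function name is made up, and the example string in the usage note is a placeholder, not the content of ‘bjsg’.

```python
import re
import struct


def unescape_to_bytes(escaped):
    """Convert %uXXXX tokens into bytes as they would sit in memory on x86.

    Each token is one 16-bit character, which is stored little-endian,
    so %u1234 becomes the two bytes 0x34 0x12.
    """
    words = re.findall(r"%u([0-9A-Fa-f]{4})", escaped)
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)
```

For example, unescape_to_bytes("%u9090%uEB58") yields the four bytes 90 90 58 EB, which can then be written to a file and handed to a disassembler or shellcode emulator.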

It is interesting to note that the code attempts to detect the version of the PDF viewer and the version of the EScript plugin used to execute JavaScript within the PDF. It then uses that garnered info to specifically target ranges and combinations of those versions. There are also five additional functions defined:

1) function build_nop()

2) function collabExploit()

3) function printf()

4) function geticon()

5) function a()

The first function creates a NOP sled. The remaining functions exploit known vulnerabilities with PDF viewing software:

1) NOP sled
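When looking at decoded payload bytes, a long run of a single filler byte is a quick indicator of a sled. The Python sketch below finds the longest such run. It assumes the classic x86 NOP value 0x90 by default; the bytes produced by build_nop() in our sample are not shown here, and real sleds often use other filler values, so the byte is left as a parameter.

```python
def longest_run(data, byte_value=0x90):
    """Return (offset, length) of the longest run of byte_value in data.

    Returns (-1, 0) when the byte does not occur at all.
    """
    best_offset, best_length = -1, 0
    run_start, run_length = 0, 0
    for index, value in enumerate(data):
        if value == byte_value:
            if run_length == 0:
                run_start = index
            run_length += 1
            if run_length > best_length:
                best_offset, best_length = run_start, run_length
        else:
            run_length = 0
    return best_offset, best_length
```

A run of a few bytes means little, but a run of hundreds or thousands of identical bytes followed by code is a strong hint that a heap spray is being prepared.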

At this point our initial suspicions have been confirmed. Our sample1.pdf file is indeed of a malicious nature. But what is this malicious file attempting to do on our system after exploiting one of these known vulnerabilities? To find out that answer we need to investigate the contents of the shellcode we discovered in our second stage JavaScript, which we will do in the next post of the multi-part series.

To be continued…