Paper reading: Precise and Accurate Patch Presence Test for Binaries

桃子小迷妹

Article link: https://blog.csdn.net/weixin_43846270/article/details/118109323

FIBER：Precise and Accurate Patch Presence Test for Binaries

source to binary

1. Introduction

Inspired by the way human analysts inspect only small, localized code areas, we present FIBER, an automated system that leverages this observation in its core design.

Patch presence test vs. bug search

Patch presence test: checks whether a specific patch has been applied to an unknown target, assuming knowledge of the affected function(s) and of the patch itself, e.g., “whether the heartbleed vulnerability of an openssl library has been patched in the $tls1\_process\_heartbeat()$ function”.

Bug search, on the other hand, makes no assumption about which target functions are affected and simply looks for all functions or code snippets that are similar to the vulnerable one, e.g., “which of the functions in a software distribution looks like a vulnerable version of $tls1\_process\_heartbeat()$”.

Existing solutions take the whole function for comparison.
FIBER addresses the following technical problem: “how do we generate binary signatures that well represent the source-level patch?”
The signatures are then used to test whether the specific affected function is patched in the target binary.
We address this problem in two steps:
First, inspired by typical human analysts’ behavior, we pick the most suitable parts of a patch as candidates for binary signature generation.
Second, we generate binary signatures that preserve as much source-level information as possible, including the patch and the corresponding function as a whole.

2. Overview

A motivating example.

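The example is the security patch for CVE-2015-8955, a Linux kernel vulnerability.
[Fig 1: the patch for CVE-2015-8955, part of the CFG generated from a patched Android kernel image, and the register origins]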
Step 1: Pick a change site (i.e., sequence of changed statements).

The patch introduces multiple change sites. However, not all of them are ideal for the patch presence test.

Lines 1-5: add a new parameter “pmu” to the original function, which is used by the added “if” statement at line 11.
Another change moves the assignment of “armpmu” from line 7 to line 17. The “to_arm_pmu()” used by the assignment is a small utility macro, which results in only a few instructions and no change to the control flow graph (CFG), making it difficult to locate at the binary level.
The added “if” statement at line 11 introduces a structural change to the CFG; it also has unique semantics because it involves the newly added function parameter.
Line 11 is therefore chosen as the signature.

Step 2: Rough matching.
Now that we have decided to search the target binary function for the existence of line 11 in Fig 1, we typically start by matching the CFG structure, since that is easy and fast.

Specifically, one condition in the “if” statement will generally lead to a basic block with two successors. Thus for line 11, we first try to locate the basic blocks with an out-degree of 2. In addition, one successor of the basic block should be the function epilogue, since at line 12 the function returns if the checks at line 11 pass. Fig 1 also shows part of the CFG generated from a patched Android kernel image; both the bolded basic block and the basic block immediately to its right satisfy this requirement.
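The rough matching step can be illustrated with a minimal sketch. It assumes a toy CFG stored as a dictionary from block name to successor list; the block names and the `rough_candidates` helper are hypothetical and are not taken from FIBER's implementation.

```python
# Hypothetical toy CFG: block name -> list of successor block names.
# "epilogue" is an assumed label for the block where the function returns.
def rough_candidates(cfg, epilogue):
    """Return blocks with out-degree 2 that have the epilogue as one successor."""
    return sorted(
        block
        for block, successors in cfg.items()
        if len(set(successors)) == 2 and epilogue in successors
    )


toy_cfg = {
    "entry": ["check_a"],
    "check_a": ["check_b", "epilogue"],
    "check_b": ["body", "epilogue"],
    "body": ["epilogue"],
    "epilogue": [],
}

candidates = rough_candidates(toy_cfg, "epilogue")
print(candidates)
```

As in the example from the paper, the structural filter alone leaves two candidates here, which is why a semantic step is still needed afterwards.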

Step 3: Precise matching.
Out of the two candidate basic blocks in the target binary, we now need some semantic information to distinguish them. With limited information at the binary level, we need to map the binary instructions to source-level statements somehow. Specifically, by tracking each register’s origin (listed at the bottom of Fig 1), we can finally tell the two “cmp” instructions apart and correctly decide that the bolded basic block is the one that maps back to line 11.

System architecture.

FIBER has four primary inputs:
(1) the source-level patch information;
(2) the complete source code of a reference;
(3) the affected function(s) in the compiled reference binary;
(4) the affected function(s) in the target binary.

Three major components in FIBER:
(1) Change site analyzer.

A single patch may introduce more than one change site in different functions, and one change site can also span multiple lines of source code.
The change site analyzer picks out the most representative, unique and easy-to-match source changes by carefully analyzing each change site and the corresponding reference function(s), mimicking what a real analyst would do.
During this process, we also obtain useful source-level insight about the change sites (e.g., the types of statements and the variables involved), which can guide the later signature generation and matching process.

(2) Signature generator.
This component is responsible for translating source-level change sites into binary-level signatures. Essentially, this step requires an analysis that lets us map binary instructions to source-level statements, which is challenging because of the information loss during compilation. The key building block we leverage for this purpose is binary symbolic execution.

(3) Matching engine.
The matching engine’s task is to search for a given signature in the target binary. To do that, we first need to locate the affected function(s) in the target binary with the help of the symbol table. The search then first matches the syntax, represented by the topology of a localized CFG related to the patch (a much quicker process), and then the semantic formulas (slower because of the symbolic execution).

It is worth noting that once a signature has been generated for a particular security patch, it can be saved and reused for multiple target binaries, so we only need to run the analyzer and generator once for each patch.

Scope: (1) FIBER naturally supports analyzing binaries of different architectures and compiled with different compiler options (the source code is available, so the reference can be compiled for any architecture and with any compiler options).
(2) FIBER is inherently not tied to any source language, although currently it works on C code.

4. System Design

Signature

The signature represents the patch.
In general, we have two criteria for an “ideal” signature:
（1）Unique
The signature should not be found in places other than the patch itself. Otherwise, it is not unique to the patch. Specifically, it should not exist in both the patched and un-patched versions. This means that the signature should not be overly simple, which may cause it to appear in places unrelated to the patch.
（2）Stable
The signature should be robust to benign evolution of the code base, e.g., the target function may look different from the reference due to version differences. This means that the signature should not be overly complex (related to too many source lines): such a signature is more likely to encounter benign changes in the target, so it may fail to match a target that is actually patched.

Change Site Analyzer
For each added statement in the patch, the following steps will be performed:
(1) Uniqueness test.
Basically, a statement has to exist only in the added lines of the patch and nowhere else (e.g., not in un-patched code bases).
(2) (optional) Context addition.
If no single statement is unique, we consider all its adjacent statements as potential context choices. The “adjacent” is bi-directional and on the control flow level (e.g., the “if” statement has two successors and both of which can be considered the context), thus there can be multiple context statements.
(3) Fine-grained change detection.
By convention, patches are distributed in the form of source line changes. Even when a line is partially modified, the corresponding patch will still show one deleted and one added line. We detect such fine-grained changes within a single statement / source line, by comparing it with its neighbouring deleted/added lines.
(4) Type insight.
The types of the variables involved in source statements are also important, since they guide the later binary signature generation and matching.
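The uniqueness test in step (1) can be sketched as a simple text comparison. This is a simplified illustration only: the statements, the whitespace normalization, and the `unique_added_statements` helper are assumptions, and a real analyzer would compare statements rather than raw lines.

```python
def normalize(stmt):
    """Collapse whitespace so formatting differences do not affect comparison."""
    return " ".join(stmt.split())


def unique_added_statements(added_lines, unpatched_source):
    """Keep added statements that appear nowhere in the un-patched code."""
    existing = {normalize(line) for line in unpatched_source.splitlines()}
    return [s for s in map(normalize, added_lines) if s and s not in existing]


# Hypothetical un-patched function body and hypothetical added lines.
unpatched = """
count = count + 1;
buf = alloc(len);
"""
added = [
    "if (len > max_len)",
    "    count = count + 1;",
]

unique = unique_added_statements(added, unpatched)
print(unique)
```

The second added line also occurs in the un-patched code, so only the new “if” statement survives; if nothing survived, step (2) would add context statements.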

Source Change Selection
The previous step may generate multiple candidate unique source changes for a single patch. They are compared on the following factors:
(1) Distance to function entrance.
A short distance between the statements in the source-level signature and the function entrance accelerates the signature generation process, because of how the generator is designed.
(2) Function size.
If the source code signature is located in a smaller function, the matching engine benefits, since the search space is reduced and it is less likely to encounter “noise”. In addition, matching is faster. Note that this is more important than (1), because signature generation is only a one-time effort while matching may be repeated for different target binaries.
(3) Change type.
If the change involves structural/control-flow changes (e.g., an “if” statement), we can quickly narrow the search down to structurally similar candidates in the target binary, which speeds up matching.
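One way to combine these three factors is a lexicographic sort, sketched below. The `Candidate` fields and the exact priority order are assumptions: the text only states that function size matters more than distance to the entrance, so the position of the change type in the key is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    distance_to_entry: int  # statements between function entrance and the change
    function_size: int      # size of the enclosing function
    structural: bool        # True if the change alters control flow (e.g. "if")


def rank(candidates):
    """Smaller function first, then structural changes, then shorter distance.

    The ordering of the second key is an assumption made for this sketch.
    """
    return sorted(
        candidates,
        key=lambda c: (c.function_size, not c.structural, c.distance_to_entry),
    )


candidates = [
    Candidate("assign_in_big_fn", 2, 400, False),
    Candidate("if_in_small_fn", 30, 50, True),
    Candidate("call_in_small_fn", 5, 50, False),
]

ranked = [c.name for c in rank(candidates)]
print(ranked)
```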

We categorize the source changes into several general types:
(1) function invocations (new function call or argument change to an existing call)
(2) condition related (new conditional statement or condition change in an existing statement)
(3) assignments (which may involve arithmetic operations).

Note that certain source-level changes are simply not visible at the binary level (e.g., source code comments) or are difficult to locate (e.g., variable declarations).

Signature Generator
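The source code involved is first compiled into a binary; the binary signature is then generated from that binary, based on the unique source changes selected in the previous step.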

Binary Signature Generation

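Identify and organize the instructions related to the source change (using debug information, which provides the mapping from source lines to binary instructions), then build a local CFG from the nodes that contain these instructions (if some of these nodes are not connected by edges, padding nodes are added). The local CFG excludes unrelated code.
Identify root instructions. In principle, every instruction in the local CFG is part of the binary signature. However, only a subset of these instructions is needed to summarize the key behavior (the data-flow semantics); these are called “root instructions”. (For example, the compiler may insert extra “intermediate” instructions that free up registers by saving their values to memory.)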

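Figure 3
The first statement (an assignment):
It compiles to three binary instructions.
However, capturing only the last instruction is sufficient, because data-flow analysis tells us that X1 equals X0+0x4, so the first and second instructions can be discarded.

Similarly, instructions 03 and 04, which correspond to the second statement, already fully capture its semantics, because the outputs of instructions 00, 01 and 02 are later consumed by other instructions.

A “root instruction” is defined as the last instruction in a data-flow chain (no other instruction propagates its data any further), together with some complementary instructions that complete the source-level semantics.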

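For example, under this definition a cmp instruction is a root instruction. However, it must be complemented with the conditional jump that follows it in order to complete the semantics of the conditional statement.

For a function call, the root instructions include the push instructions for the arguments (assuming x86) and the call instruction itself, which completes the function invocation.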
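The "last instruction in a data-flow chain" idea can be sketched on a straight-line instruction list. This is a simplified model under stated assumptions: the `Insn` record and the instruction texts are hypothetical, each instruction writes at most one register, and the complementary instructions (such as the conditional jump after a cmp) are not handled.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    text: str
    dest: Optional[str]    # register written, or None (e.g. a store or cmp)
    srcs: Tuple[str, ...]  # registers read


def root_instructions(insns):
    """Return instructions whose result is not consumed by any later instruction."""
    roots = []
    for i, insn in enumerate(insns):
        consumed = False
        if insn.dest is not None:
            for later in insns[i + 1:]:
                if insn.dest in later.srcs:
                    consumed = True  # value flows onward, so not a chain end
                    break
                if later.dest == insn.dest:
                    break  # overwritten before being read
        if not consumed:
            roots.append(insn.text)
    return roots


# Hypothetical aarch64-style snippet for one assignment statement.
snippet = [
    Insn("add x1, x0, #0x4", "x1", ("x0",)),
    Insn("ldr x2, [sp, #8]", "x2", ("sp",)),
    Insn("str x2, [x1]", None, ("x2", "x1")),
]

roots = root_instructions(snippet)
print(roots)
```

Only the final store remains, because the values produced by the first two instructions are consumed by it.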

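Note that the compiler may still generate slightly different root instructions for the same statement (because of compiler optimizations, for example). To ease signature matching, root instructions are treated as equivalent as long as they have the same type (root instruction normalization). Table 1 illustrates this by listing the different types of instructions that may be generated from the same source change. For example, the compiler may use bit operations instead of a multiplication in an assignment statement.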
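Normalization by type can be sketched as a lookup from mnemonic to statement type. The mapping below is a hypothetical illustration and does not reproduce Table 1 from the paper; the mnemonics chosen are common aarch64 and x86 ones.

```python
# Hypothetical mnemonic -> statement-type mapping (not the paper's Table 1).
ROOT_TYPE = {
    "str": "assignment",
    "mov": "assignment",
    "mul": "assignment",
    "lsl": "assignment",
    "shl": "assignment",
    "cmp": "condition",
    "tst": "condition",
    "cbz": "condition",
    "bl": "call",
    "call": "call",
}


def root_type(insn_text):
    """Return the normalized type of an instruction, or None if unknown."""
    mnemonic = insn_text.split()[0].lower()
    return ROOT_TYPE.get(mnemonic)


def equivalent(insn_a, insn_b):
    """Two root instructions are equivalent when their known types match."""
    type_a, type_b = root_type(insn_a), root_type(insn_b)
    return type_a is not None and type_a == type_b


# A multiplication and a shift can come from the same assignment statement.
same = equivalent("mul x0, x1, x2", "lsl x0, x1, #1")
different = equivalent("cmp x0, x1", "bl foo")
print(same, different)
```

Unknown mnemonics are deliberately never treated as equivalent, so an incomplete table produces misses rather than spurious matches.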

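Annotate root instructions. We now need to make sure that the root instructions (which form our binary signature) are sufficiently labeled so that they can be uniquely mapped to the source change.
The target function and the reference function should share variable-level semantics (they are just different versions of the same function), so the goal is to map the operands of root instructions (registers or memory locations) back to source-level variables.

This is because, if the target function really has the patch applied, the variables related to the patch should be the same as those seen in the reference function.

To this end, we compute a complete function-level semantic formula for each operand (up to the point of the root instruction).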

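Note that, from the function’s point of view, any operand of an instruction can only be derived from four sources:
(1) a function parameter (external input), e.g., ebp+0x4 on x86, or X0 or X1 on aarch64;
(2) a local variable (defined inside the function), e.g., ebp-0x8 on x86 or sp+0x4 on aarch64 (which passes arguments in registers);
(3) the return value of a function call (external source), e.g., the register holding the call’s return value;
(4) an immediate number (constant), e.g., an instruction/data address (including global variables), an offset, or another constant;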
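The four origins can be sketched as a classifier over the leaves of an operand's formula. The notation is an assumption made for this sketch: leaves are aarch64-style values at function entry, `x0` to `x7` stand for parameters, `sp+<offset>` for locals, and the `ret:<callee>` tag for a call's return value is a hypothetical convention, not FIBER's format.

```python
import re


def classify_origin(leaf):
    """Classify a formula leaf into one of the four operand origins."""
    leaf = leaf.strip().lower()
    if re.fullmatch(r"0x[0-9a-f]+|\d+", leaf):
        return "constant"
    if re.fullmatch(r"x[0-7]", leaf):
        return "parameter"      # argument register at function entry
    if re.fullmatch(r"sp\+(0x[0-9a-f]+|\d+)", leaf):
        return "local"          # stack slot inside the function's frame
    if leaf.startswith("ret:"):
        return "return_value"   # hypothetical tag for a callee's result
    return "unknown"


origins = [classify_origin(leaf) for leaf in ["X0", "sp+0x4", "ret:kmalloc", "0x10"]]
print(origins)
```

Labeling leaves this way is what allows two formulas from different binaries to be compared in terms of source-level variables rather than concrete registers.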