DEEPBINDIFF: Learning Program-Wide Code Representations for Binary Diffing

Existing techniques suffer from low accuracy, poor scalability, or coarse granularity, or they require extensive labeled training data to function.

Limitations of Learning-based Approaches.

  1. First, no existing learning-based technique can perform efficient program-wide binary diffing at a fine-grained basic block level.
  2. Second, none of the learning-based techniques considers both program-wide dependency information and basic block semantic information during analysis.
    Program-wide dependency information: this information, which can be extracted from the inter-procedural control flow graph (ICFG), provides the contextual information of a basic block. It is particularly useful in binary diffing because one binary could contain multiple very similar functions, and the program-wide contextual information can be vital to differentiate these functions as well as the blocks within them.
    Basic block semantic information: characterizes the uniqueness of each basic block.
  3. Third, most of the existing learning-based techniques are built on top of supervised learning, so their performance is heavily dependent on the quality of the training data.
    To begin with, we argue that a large, representative, and balanced training dataset can be very hard to collect because of the extreme diversity of binary programs. Also, supervised learning can suffer from overfitting. Moreover, the state-of-the-art supervised learning technique InnerEye considers a whole instruction (opcode + operands) as one word and may therefore run into a serious out-of-vocabulary (OOV) problem.

Problem Statement

A. Problem Definition

[Figure: problem definition] Given two input binaries, the problem is to find an optimal basic block level matching that maximizes the similarity between them.

B. Assumptions

  1. The input binaries are stripped (only stripped binaries are considered).
  2. The binaries are not packed.
  3. The two input binaries are for the same architecture.

Approach

[Figure: DEEPBINDIFF workflow]
As shown, the system takes two binaries as input and outputs basic block level diffing results.

Pre-Processing

Pre-processing analyzes the binaries and produces the inputs for embedding generation. More specifically, it produces inter-procedural CFGs for the binaries and applies a token embedding generation model to generate an embedding for each token (opcode or operand). The generated token embeddings are then transformed into basic block level feature vectors.

A. CFG Generation

DEEPBINDIFF leverages IDA Pro to extract basic block information and, by combining the call graph with the control-flow graphs of the individual functions, generates an inter-procedural CFG (ICFG) that provides program-wide contextual information.

B. Feature Vector Generation

[Figure 2: feature vector generation]
Step 1. Random Walks.
DEEPBINDIFF generates random walks in the ICFGs so that each walk contains one possible execution path of the binary. To ensure the completeness of basic block coverage, the walking engine is configured so that every basic block is guaranteed to be contained in at least 2 random walks. Further, each random walk is set to have a length of 5 basic blocks so that it carries enough control flow information. The random walks are then put together to generate a complete instruction sequence for training.
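
The paper does not spell out the walking engine, so the following is only a minimal sketch of one possible implementation over a networkx ICFG whose nodes are basic block IDs (the function name and the least-covered-first heuristic are assumptions, not the paper's design):

```python
import random
import networkx as nx

def generate_random_walks(icfg: nx.DiGraph, walk_length: int = 5,
                          min_cover: int = 2):
    """Generate random walks over an ICFG so that every basic block
    appears in at least `min_cover` walks of length `walk_length`."""
    coverage = {node: 0 for node in icfg.nodes}
    walks = []
    while any(c < min_cover for c in coverage.values()):
        # Start each new walk from the currently least-covered block.
        cur = min(coverage, key=coverage.get)
        walk = [cur]
        coverage[cur] += 1
        for _ in range(walk_length - 1):
            succs = list(icfg.successors(cur))
            if not succs:          # dead end: the walk terminates early
                break
            cur = random.choice(succs)
            walk.append(cur)
            coverage[cur] += 1
        walks.append(walk)
    return walks
```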
Step 2. Normalization.

  1. All numeric constant values are replaced with the string ‘im’.
  2. All general registers are renamed according to their lengths.
  3. Pointers are replaced with the string ‘ptr’.
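
A minimal sketch of these rules for x86 tokens; the register table is abbreviated, and the matching heuristics (hex immediates, bracketed memory operands) are assumptions rather than the paper's exact rules:

```python
import re

# Abbreviated register table; a real implementation would cover all
# x86/x64 general registers.
REG_CLASSES = {
    'reg8':  {'al', 'bl', 'cl', 'dl'},
    'reg16': {'ax', 'bx', 'cx', 'dx'},
    'reg32': {'eax', 'ebx', 'ecx', 'edx', 'esi', 'edi'},
    'reg64': {'rax', 'rbx', 'rcx', 'rdx', 'rsi', 'rdi'},
}

def normalize_token(token: str) -> str:
    """Apply the three normalization rules to one operand token."""
    for cls, regs in REG_CLASSES.items():
        if token in regs:
            return cls            # rule 2: rename register by its length
    if '[' in token:
        return 'ptr'              # rule 3: memory operand -> 'ptr'
    if re.fullmatch(r'-?(0x)?[0-9a-fA-F]+h?', token):
        return 'im'               # rule 1: numeric constant -> 'im'
    return token
```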

Step 3. Model Training.
DEEPBINDIFF trains a CBOW model that considers each token (opcode or operand) as a word, the normalized random walks on top of the ICFGs as sentences, and the instructions around each token as its context.
For example, step 3 in Figure 2 shows that the current token is cmp (shown in red), so one instruction before and one instruction after it (shown in green) in the random walk serve as the context. If the target instruction is at a block boundary (e.g., the first instruction in the block), then only one adjacent instruction is considered as its context.
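
A sketch of training such a token model with gensim's Word2Vec in CBOW mode (sg=0). The corpus below is a toy stand-in for the flattened, normalized random walks, and the hyperparameters are illustrative, not the paper's; a fixed token window only approximates the "one instruction before and after" context:

```python
from gensim.models import Word2Vec

# Toy stand-in for the flattened, normalized random walks; each walk is
# a list of tokens (opcodes and operands).
token_sequences = [
    ['mov', 'reg32', 'ptr', 'cmp', 'reg32', 'im', 'jne', 'im'],
    ['cmp', 'reg32', 'im', 'mov', 'reg32', 'reg32', 'ret'],
]

# sg=0 selects CBOW; window controls how many surrounding tokens
# form the context of the current token.
model = Word2Vec(sentences=token_sequences, vector_size=64, window=4,
                 min_count=1, sg=0, epochs=20)

cmp_vec = model.wv['cmp']   # the learned token embedding for opcode 'cmp'
```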

Step 4. Feature Vector Generation.
Since each basic block could contain multiple instructions, and each instruction in turn involves one opcode and potentially multiple operands, DEEPBINDIFF calculates the average of the operand embeddings, concatenates it with the opcode embedding to generate an instruction embedding, and then sums up the instruction embeddings within the block to form the block feature vector.

Instruction importance
Not all instructions matter equally. To capture this, DEEPBINDIFF adopts a weighting strategy that adjusts the weight of each opcode based on its importance, using a TF-IDF model. The calculated weight indicates how important an instruction is to the block that contains it, relative to all the blocks in the two input binaries.
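
A sketch of this weighting, treating each basic block as a document and all blocks from both binaries as the corpus. This is the standard TF-IDF formulation; the paper may smooth or normalize differently:

```python
import math
from collections import Counter

def opcode_tfidf(blocks):
    """blocks: one list of opcodes per basic block, drawn from BOTH
    input binaries. Returns one dict per block: opcode -> TF-IDF weight."""
    n = len(blocks)
    df = Counter()                    # number of blocks containing each opcode
    for blk in blocks:
        df.update(set(blk))
    weights = []
    for blk in blocks:
        tf = Counter(blk)
        weights.append({op: (cnt / len(blk)) * math.log(n / df[op])
                        for op, cnt in tf.items()})
    return weights
```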

The feature vector of a basic block is then:

$$FV_{blk} \;=\; \sum_{in_i \in blk} \Big( weight_{p_i} \cdot embed_{p_i} \;\big\Vert\; \frac{1}{k} \sum_{n=1}^{k} embed_{t_{i_n}} \Big)$$

where an instruction $in_i$ consists of one opcode $p_i$ and a set $Set_{t_i}$ of $k$ operands ($k$ may be 0); $embed_{p_i}$ is the embedding of the opcode, $weight_{p_i}$ is the TF-IDF weight of the opcode, $embed_{t_{i_n}}$ is the embedding of the $n$-th operand, and $\Vert$ denotes concatenation.
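
Putting the pieces together, a sketch of the block feature vector computation (the function and parameter names are hypothetical; `token_embed` comes from the CBOW model and `opcode_weight` from the TF-IDF step above):

```python
import numpy as np

def block_feature_vector(instructions, token_embed, opcode_weight):
    """Compute FV_blk as defined above.

    instructions  : list of (opcode, [operand tokens]) for one block
    token_embed   : dict token -> np.ndarray (from the CBOW model)
    opcode_weight : dict opcode -> TF-IDF weight for this block
    """
    dim = len(next(iter(token_embed.values())))
    fv = np.zeros(2 * dim)
    for opcode, operands in instructions:
        op_part = opcode_weight.get(opcode, 1.0) * token_embed[opcode]
        if operands:                       # average the operand embeddings
            t_part = np.mean([token_embed[t] for t in operands], axis=0)
        else:                              # k == 0: no operands
            t_part = np.zeros(dim)
        # Concatenate, then sum the instruction embeddings over the block.
        fv += np.concatenate([op_part, t_part])
    return fv
```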

Embedding Generation

1. TADW algorithm
Text-associated DeepWalk is an unsupervised graph embedding learning technique. As the name suggests, it is an improvement over the DeepWalk algorithm.

The DeepWalk algorithm is an online graph embedding learning technique that treats a set of short truncated random walks as the corpus of a language modeling problem, and the graph vertices as its vocabulary. The embeddings are then learned from random walks over the vertices of the graph; accordingly, vertices that share similar neighbors will have similar embeddings. DeepWalk excels at learning the contextual information from a graph. Nevertheless, it does not consider node features during the analysis.

Text-associated DeepWalk (TADW) improves on this by incorporating the features of vertices into the network representation learning process.
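
Concretely (following the original TADW formulation, with the basic block feature vectors serving as the text features), TADW factorizes the vertex similarity matrix $M$ by solving

$$\min_{W,H}\ \left\| M - W^{\top} H T \right\|_F^2 \;+\; \frac{\lambda}{2}\left( \|W\|_F^2 + \|H\|_F^2 \right)$$

where $T$ is the matrix of vertex (text) features; the embedding of a vertex is the concatenation of its columns in $W$ and $HT$.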

2. Graph Merging
There are two drawbacks to running TADW twice for the two ICFGs (one for each binary):
(1) First, it is less efficient to perform the matrix factorization twice.
(2) Second, generating the embeddings separately can miss some important indicators for similarity detection.

[Figure: graph merging example]
Ideally, node a should match node 1 (both have a reference to the string ‘hello’) and node d should match node 3 (both call fread).
In practice, however, the feature vectors of these basic blocks may not look very similar, since one basic block can contain multiple instructions while the call or the string reference is just one of them. Besides, the two pairs also have different contextual information (node ‘a’ has no incoming edge but ‘1’ does). As a result, TADW may not generate similar embeddings for the two pairs of nodes.

Graph Merging: the two ICFGs are merged, and TADW runs only once, on the merged graph.
(1) In particular, DEEPBINDIFF extracts the string references and detects external library calls and system calls.
(2) Then, it creates virtual nodes for the strings and library functions, and draws edges from the callsites to these virtual nodes.
Hence, the two graphs are merged into one on these terminal virtual nodes. Nodes a and 1 then share at least one common neighbor, which boosts the similarity between them; the neighbors of ‘a’ and ‘1’ likewise gain higher similarity because they share similar neighbors. Moreover, since the graphs are merged only on terminal nodes, the original graph structures stay unchanged.
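
A minimal sketch of this merging with networkx; block IDs are assumed to be disjoint across the two ICFGs, and `call_targets1`/`call_targets2` (mapping a callsite block to the string or external function it references) are hypothetical inputs extracted during pre-processing:

```python
import networkx as nx

def merge_icfgs(icfg1: nx.DiGraph, icfg2: nx.DiGraph,
                call_targets1: dict, call_targets2: dict) -> nx.DiGraph:
    """Merge two ICFGs on virtual nodes for shared strings and library calls.

    call_targets*: callsite block -> referenced string or external function,
    e.g. {'a': 'str:hello', 'd': 'lib:fread'}.
    """
    merged = nx.DiGraph()
    for g in (icfg1, icfg2):
        merged.add_nodes_from(g.nodes)
        merged.add_edges_from(g.edges)
    for targets in (call_targets1, call_targets2):
        for block, target in targets.items():
            virtual = f'virtual:{target}'    # one shared terminal node per target
            merged.add_edge(block, virtual)  # edge from callsite to virtual node
    return merged
```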

3. Basic Block Embeddings
With the merged graph, DEEPBINDIFF leverages the TADW algorithm to generate basic block embeddings. More specifically, it feeds the merged graph and the basic block feature vectors into TADW for multiple iterations of optimization.

Code Diffing

The goal is to find a basic block level matching solution that maximizes the similarity between the two input binaries.

Performing linear assignment on the basic block embeddings to produce an optimal matching has two major limitations:
(1) First, linear assignment can be inefficient, as binaries can contain an enormous number of blocks.
(2) Second, although the embeddings include some contextual information, linear assignment itself does not consider any graph information.
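
For concreteness, this rejected baseline would look like the following sketch using SciPy's linear assignment solver over a pairwise cost matrix (cosine similarity is an illustrative choice, not necessarily the paper's metric):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_assignment_match(emb1: np.ndarray, emb2: np.ndarray):
    """Baseline: one-to-one block matching by linear assignment.

    emb1, emb2: (n1, d) and (n2, d) arrays of block embeddings.
    Cubic in the number of blocks, and blind to the ICFG structure.
    """
    # Cost matrix: negated cosine similarity between all block pairs.
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    rows, cols = linear_sum_assignment(-(a @ b.T))
    return list(zip(rows.tolist(), cols.tolist()))
```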

A possible improvement is to conduct the linear assignment at two levels. Rather than matching basic blocks directly, one could first match functions by generating function level embeddings, and then further match the basic blocks within the matched functions using the basic block embeddings. This approach, however, can be severely thwarted by compiler optimizations that alter function boundaries, such as function inlining.

k-Hop Greedy Matching
The high-level idea is to benefit from the ICFG contextual information: starting from pairs of basic blocks that are already matched, new matches are found based on the similarity, calculated from the basic block embeddings, among the blocks within the k-hop neighborhoods of the matched pairs.
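
The following is a simplified sketch of this idea, not DeepBinDiff's exact algorithm: starting from seed pairs (e.g., blocks adjacent to matched virtual nodes), it repeatedly picks the most similar unmatched pair within the k-hop neighborhoods of an already matched pair.

```python
import heapq
import numpy as np
import networkx as nx

def khop_greedy_match(g1, g2, emb, seeds, k=4):
    """Greedy matching guided by k-hop neighborhoods (simplified sketch).

    g1, g2 : the two ICFGs; emb: dict node -> embedding;
    seeds  : initial matched pairs, e.g. blocks adjacent to virtual nodes.
    """
    def sim(a, b):
        va, vb = emb[a], emb[b]
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

    matched = list(seeds)
    used1 = {a for a, _ in seeds}
    used2 = {b for _, b in seeds}
    heap = [(-sim(a, b), a, b) for a, b in seeds]
    heapq.heapify(heap)
    u1, u2 = g1.to_undirected(as_view=True), g2.to_undirected(as_view=True)
    while heap:
        _, a, b = heapq.heappop(heap)
        # Unmatched blocks within k hops of each side of the matched pair.
        n1 = [n for n in nx.single_source_shortest_path_length(u1, a, cutoff=k)
              if n not in used1]
        n2 = [n for n in nx.single_source_shortest_path_length(u2, b, cutoff=k)
              if n not in used2]
        if not n1 or not n2:
            continue                      # neighborhood exhausted
        x, y = max(((u, v) for u in n1 for v in n2),
                   key=lambda p: sim(*p))
        matched.append((x, y))
        used1.add(x); used2.add(y)
        heapq.heappush(heap, (-sim(x, y), x, y))   # explore around new match
        heapq.heappush(heap, (-sim(a, b), a, b))   # revisit this pair later
    return matched
```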
