Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

The paper presents a novel approach for predicting procedure names in stripped executables.
Code and data are available on GitHub.

1. Introduction

The main challenge is to understand how the different “working parts” inside the executable are meant to interact to carry out the objective of the executable.

Problem definition: Given a nameless assembly procedure $\mathcal{X}$ residing in a stripped (containing no debug information) executable, our goal is to predict a likely and descriptive name $\mathcal{Y} = y_1, \ldots, y_m$, where $y_1, \ldots, y_m$ are the subtokens composing $\mathcal{Y}$. Thus, our goal is to model $P(\mathcal{Y} \mid \mathcal{X})$. For example, for the name $\mathcal{Y}$ = create_server_socket, the subtokens $y_1, \ldots, y_m$ that we aim to predict are create, server and socket, respectively.

The problem of predicting a meaningful name for a given procedure can be viewed as a translation task - translating from assembly code to natural language.
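The target side of this translation is a sequence of subtokens. The following is a minimal sketch of how a procedure name could be split into such a sequence; the function name `split_subtokens` and the exact splitting rule (underscores plus simple camelCase boundaries) are assumptions for illustration, not necessarily the rule used by the authors.

```python
import re

def split_subtokens(name: str) -> list[str]:
    """Split a procedure name into lowercase subtokens.

    Assumed rule: split on underscores and other non-word characters,
    then break camelCase chunks at case boundaries.
    """
    parts = []
    for chunk in re.split(r"[_\W]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]

print(split_subtokens("create_server_socket"))
```

With subtokens as the output vocabulary, a model can compose names it never saw as a whole during training.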

Challenges
Challenge 1: Little syntactic information and token coreference.
Take Fig. 1(a) as an example.
These instructions are composed of a mnemonic (e.g., mov), followed by a mix of register names and alphanumeric constants. These instructions lack any information regarding the variable types or names that the programmer had defined in the high-level source code.
This means that the presence of a register name in an instruction carries little information.
Constants are similarly ambiguous: the number 4 can be an offset of a stack variable (Line 7), an enumeration value used as an argument in a procedure call (Line 10), or a jump table index.

Challenge 2: Long procedure names.
Procedure names in compiled C code are often long, as they encode information that would be part of a typed function signature in a higher-level language (e.g., AccessCheckByTypeResultListAndAuditAlarmByHandleA in the Win32 API). Methods that attempt to directly predict a full label as a single word from a vocabulary will be inherently imprecise.

Our approach.
We present a novel representation for binary procedures specially crafted to allow a neural model to generate a descriptive name for a given stripped binary procedure. To construct this representation from a binary procedure we:
(1) Build a control-flow graph (CFG) from the disassembled binary procedure input.
(2) Reconstruct a call-site-like structure for each call instruction present in the disassembled code.
(3) Use pointer-aware slicing to augment these call sites by finding concrete values or approximating abstracted values.
(4) Transform the CFG into an augmented call sites graph.

OVERVIEW

Using calls to imported procedures.

After disassembling $P$’s code, the procedure is transformed into a sequence of assembly instructions, as shown in Fig. 1(a). The most informative pieces of information for understanding what the code does are calls to procedures whose names are known because they can be statically resolved and cannot be easily stripped. Resolving these names is possible because these called procedures reside in libraries that are dynamically linked to the executable, causing them to be imported (into the executable memory) as a part of the operating system (OS) loading process. Calls to such imported procedures are also called application program interface (API) calls, as they expose an interface to these libraries.
In order to pass arguments when making these API calls, the calling convention used by the OS defines that these argument values are placed into specific registers before the call instruction. The assignments to these registers are shown in Fig. 1(a), Lines 1-4 and 8-13.
It is important to note that, while anti-RE tools may try to obfuscate API names, the values of the arguments passed when calling these external procedures must remain intact.

Putting calls in the right order. After examining the API calls of the procedure, a human reverser will try to understand the order in which they are called at runtime.
Figure 1(b) shows all the call instructions in $P$’s disassembled code, in the semi-random order in which they were generated by the compiler. This order does not reflect any logical or chronological order.

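For example: (1) the call to socket, which is part of the setup, is interleaved with printf, which is used for error handling;
(2) the calls in this sequence are shuffled, e.g., connect appears before setsockopt.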
To order the API calls correctly, we construct a CFG for $P$. A CFG is a directed graph representing the control flow of the procedure, as determined by jump instructions. Figure 1(c) shows a CFG created from the disassembled code in Fig. 1(a). For readability only, the API calls are presented in the CFG nodes (instead of the full instruction sequence). By observing the CFG, we learn all possible runtime paths, thus approximating all potential call sequences.
A human reverse-engineer can follow the jumps in the CFG and figure out the order in which these calls will be made: $\pi$ = socket → setsockopt → connect.
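The difference between listing order and runtime order can be sketched in a few lines of Python. The listing below is a hypothetical, heavily simplified stand-in for real disassembly (it is not the code of Fig. 1), and the helper names `build_cfg`, `reachable` and `may_precede` are made up for this illustration. The CFG is built at instruction granularity rather than over basic blocks to keep the sketch short.

```python
from collections import deque

# Hypothetical listing: (address, mnemonic, operand).
LISTING = [
    (0, "call", "socket"),
    (1, "jmp", 4),            # unconditional jump
    (2, "call", "connect"),   # appears before setsockopt in the listing
    (3, "ret", None),
    (4, "call", "setsockopt"),
    (5, "jne", 2),            # conditional jump: taken -> 2, fall through -> 6
    (6, "call", "printf"),    # error handling
    (7, "ret", None),
]

def build_cfg(listing):
    """Return {address: [successor addresses]} as determined by jumps."""
    succ = {}
    for i, (addr, mnem, op) in enumerate(listing):
        nxt = listing[i + 1][0] if i + 1 < len(listing) else None
        if mnem == "ret":
            succ[addr] = []
        elif mnem == "jmp":
            succ[addr] = [op]
        elif mnem.startswith("j"):  # any other jump is treated as conditional
            succ[addr] = [op] + ([nxt] if nxt is not None else [])
        else:
            succ[addr] = [nxt] if nxt is not None else []
    return succ

def reachable(cfg, start):
    """Addresses reachable from `start` by following at least one edge."""
    seen, queue = set(), deque(cfg[start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(cfg[node])
    return seen

def may_precede(listing, cfg, a, b):
    """True if some path executes the call to `a` and later the call to `b`."""
    addr = {op: ad for ad, mnem, op in listing if mnem == "call"}
    return addr[b] in reachable(cfg, addr[a])

CFG = build_cfg(LISTING)
print(may_precede(LISTING, CFG, "setsockopt", "connect"))
print(may_precede(LISTING, CFG, "connect", "setsockopt"))
```

In this toy listing, connect is placed before setsockopt in memory, yet only the order setsockopt then connect is possible at runtime, which is the kind of fact the CFG makes explicit.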
Reconstructing call sites using pointer-aware slicing.
After detecting call sites, we wish to augment them with additional information regarding the source of each argument. This is useful because the source of each argument can provide a hint for the possible values that this argument can store. Moreover, as discussed above, these are essential to allow name prediction in the face of API name obfuscation.

Figure 2 depicts the process of creating our augmented call sites-based representation, focusing on the creation of the call site for the connect API call.
Fig. 2(a) shows the CFG of $P$ containing only the call instructions.
We enrich the API calls into a structure more similar to a call site in a higher-level language. To do so, we retrieve the number of arguments passed when calling an API by examining the debug symbols of the library from which the API is imported. Following the OS calling convention, we then map each argument to the register used to pass it.
By retrieving the three arguments of the connect API, we can generate its call site connect(rdi, rsi, rdx), as shown in Fig. 2(b). This reconstructed call site contains the API name (connect) and the registers used as arguments for this API call.
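The mapping from argument positions to registers can be sketched as follows. This sketch assumes the System V AMD64 calling convention, whose first integer or pointer arguments are passed in rdi, rsi, rdx, rcx, r8 and r9, which matches the connect(rdi, rsi, rdx) example. The `API_ARITY` table is hard-coded here for illustration, whereas the text obtains argument counts from library debug symbols; `reconstruct_call_site` is a hypothetical helper name.

```python
# Integer/pointer argument registers of the System V AMD64 calling convention.
SYSV_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

# Argument counts, hard-coded for this sketch (assumption: in practice they
# would be read from the debug symbols of the imported library).
API_ARITY = {"socket": 3, "setsockopt": 5, "connect": 3}

def reconstruct_call_site(api: str) -> str:
    """Return a call-site-like string such as 'connect(rdi, rsi, rdx)'."""
    n = API_ARITY[api]
    if n > len(SYSV_ARG_REGS):
        raise ValueError("stack-passed arguments are not modeled in this sketch")
    return f"{api}({', '.join(SYSV_ARG_REGS[:n])})"

print(reconstruct_call_site("connect"))
```

Floating-point and variadic arguments are deliberately left out; they follow additional rules that this sketch does not cover.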

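To obtain additional information from these call sites, we examine how the value of each argument is computed. To gather this information, we extract a static slice of each register at the call location in the procedure. A slice of a program at a specific location for a specific value is a subset of the instructions in the program that is necessary to create the value at this specific location.

In Fig. 2(a), the nodes marked ①②③ in purple make up the CFG path leading to the connect call instruction. In Fig. 2(b), each register of the connect call site is connected by an arrow to the slice of $P$ used to compute its value. These slices are extracted from the instructions on this path. Since some arguments are pointers, we perform a pointer-aware static program slicing.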
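A backward slice along a single path can be sketched as below. This is a deliberately simplified illustration and not the pointer-aware slicing described above: every instruction is modeled as a plain `mov dest, src` pair, memory operands are compared as opaque strings, and aliasing is ignored. The `PATH` contents and their order are hypothetical.

```python
def backward_slice(path, target):
    """Collect the moves on `path` that produce the final value of `target`.

    `path` is a list of (dest, src) pairs, oldest first, each modeling
    `mov dest, src`; a call's return value is modeled as ("rax", "call <api>").
    """
    wanted, result = target, []
    for dest, src in reversed(path):
        if dest == wanted:
            result.append((dest, src))
            wanted = src  # keep tracking where the source value came from
    return list(reversed(result))

# Hypothetical instructions on the path leading to `call connect`.
PATH = [
    ("[rbp-50h]", "rdi"),      # incoming argument saved on the stack
    ("rax", "call socket"),    # return value of socket
    ("[rbp-58h]", "rax"),
    ("rdi", "[rbp-50h]"),
    ("rsi", "rdi"),
    ("rax", "[rbp-58h]"),
    ("rdi", "rax"),
    ("rdx", "16"),
]

for reg in ("rdi", "rsi", "rdx"):
    print(reg, backward_slice(PATH, reg))
```

Walking backwards and switching the tracked location at each matching definition is what lets the slice for rsi skip the later redefinition of rdi.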
Augmenting call sites using concrete and abstract values.
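Several informative slices are shown in Fig. 2(b):

(1) In ④, rdi is assigned the return value of the previous call to socket: (i) following calling conventions, the return value of socket is placed in rax before returning from the call, (ii) the value of rax is assigned to a local variable on the stack at location rbp-58h, (iii) rax reads this value from the same stack location, and (iv) the value is assigned from rax to rdi.
(2) In ⑤, rsi gets its value from an argument passed into $P$: (i) coincidentally, rdi is used to pass this value to $P$, (ii) the value is placed in a stack variable at rbp-50h, (iii) the value is assigned from the stack to rdi, and (iv) the value is assigned from rdi to rsi.
(3) In ⑥, the constant value 16 is assigned directly to rdx. This means that we can determine the concrete value of this register, which is used as an argument of connect.
Note that not all the instructions in the slices of Fig. 2(b) appear above the call instruction in Fig. 2(a). This is caused by compiler optimizations and other placement constraints.
In slices ④ and ⑤, the concrete values of the registers used as arguments are unknown.
Using static analysis, we augment the reconstructed call sites by replacing the register names (which carry no meaning) with:
(i) the concrete value of the register, if it can be determined; or
(ii) a broader category, which we call the argument abstract value.
An abstract value is extracted by classifying the slice into one of the following classes: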

  • argument (ARG)
  • global value (GLOBAL)
  • unknown return value from calling a procedure (RET)
  • local variable stored on the stack (STK).

As shown in Fig. 2(c), performing this process on the connect call site yields connect(RET, ARG, 16).
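The replacement of register names by concrete or abstract values can be sketched as a small classifier over slices. The rules below are assumptions made for illustration: the class is decided from the origin of the oldest instruction in the slice, stack locations are recognized by an `[rbp` prefix, and globals by a `[rip` prefix. The helper names `abstract_value` and `augment` are hypothetical.

```python
ARG_REGS = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"}

def abstract_value(slice_):
    """Map a slice (list of (dest, src) moves, oldest first) to a label."""
    origin = slice_[0][1]  # where the value ultimately comes from
    if origin.isdigit():
        return origin      # concrete value
    if origin.startswith("call "):
        return "RET"       # unknown return value of a called procedure
    if origin in ARG_REGS:
        return "ARG"       # argument passed into the procedure
    if origin.startswith("[rbp"):
        return "STK"       # local variable stored on the stack
    if origin.startswith("[rip"):
        return "GLOBAL"    # global value
    raise ValueError(f"unclassified origin: {origin}")

def augment(api, arg_slices):
    """Build the augmented call site from one slice per argument."""
    return f"{api}({', '.join(abstract_value(s) for s in arg_slices)})"

# Slices following the description of ④, ⑤ and ⑥ in the text.
RDI_SLICE = [("rax", "call socket"), ("[rbp-58h]", "rax"),
             ("rax", "[rbp-58h]"), ("rdi", "rax")]
RSI_SLICE = [("[rbp-50h]", "rdi"), ("rdi", "[rbp-50h]"), ("rsi", "rdi")]
RDX_SLICE = [("rdx", "16")]

print(augment("connect", [RDI_SLICE, RSI_SLICE, RDX_SLICE]))
```

Classifying by origin rather than by the last instruction matters here: both rdi and rsi pass through stack variables, yet their values originate from a call and from an incoming argument, respectively.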

Predicting procedure names using augmented call site graph.
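Figure 2(c) shows how augmented call sites are used to create the augmented call sites graph. Each node represents an augmented call site. Call site nodes are connected by edges representing possible runtime sequences of $P$. This depicts how correctly ordered augmented call sites create a powerful representation for binary procedures. This is especially clear in comparison with the disassembled code (Fig. 1(a)).
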
Using this representation, we can train a GNN-based model on the graph itself. Alternatively, we can employ LSTM-based and Transformer-based models by extracting simple paths from the graph, to serve as sequences that can be fed into these models.
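Extracting simple paths for sequence models can be sketched with a short recursive generator. The graph below is hypothetical: only the label connect(RET, ARG, 16) comes from the text, the other node labels are made up for illustration, and `simple_paths` is an assumed helper name rather than the authors' implementation.

```python
def simple_paths(graph, node, path=()):
    """Yield maximal simple paths (no repeated node) starting at `node`."""
    path = path + (node,)
    nexts = [n for n in graph.get(node, []) if n not in path]
    if not nexts:
        yield path
    for n in nexts:
        yield from simple_paths(graph, n, path)

# Hypothetical augmented call sites graph: node -> successor nodes.
GRAPH = {
    "socket(2, 1, 0)": ["setsockopt(RET, 1, 2, STK, 4)"],
    "setsockopt(RET, 1, 2, STK, 4)": ["connect(RET, ARG, 16)", "printf(GLOBAL)"],
    "connect(RET, ARG, 16)": [],
    "printf(GLOBAL)": [],
}

PATHS = list(simple_paths(GRAPH, "socket(2, 1, 0)"))
for p in PATHS:
    print(" -> ".join(p))
```

Each yielded tuple is one candidate input sequence; skipping already visited nodes keeps the enumeration finite even when the CFG contains loops.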
Key aspects.

  • Using static analysis, augmented API call sites can be reconstructed from the shallow assembly code.
  • Analyzing pointer-aware slices for arguments allows call site augmentation by replacing register names with concrete or abstracted values.
  • By analyzing the CFG, we can learn augmented call sites in their approximated runtime order.
  • A CFG of augmented call sites is an efficient and informative representation of a binary procedure.
  • Various neural network architectures trained using this representation can accurately predict complex procedure names.