Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation

Article link: https://blog.csdn.net/weixin_43846270/article/details/109103725

Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation (Type IV)

four types of clones:
Type I: literally identical
Type II: syntactically equivalent
Type III: slightly modified
Type IV: semantically similar (assembly functions may appear syntactically different, but share similar functional logic in their source code)

Introduction

However, designing an effective search engine is difficult, due to varieties of compiler optimizations and obfuscation techniques that make logically similar assembly functions appear to be dramatically different.
It is challenging to identify these semantically similar, but structurally and syntactically different assembly functions as clones.
Two problems

P1: Existing state-of-the-art static approaches fail to consider the relationships among features.
e.g.,
A fclose libc function call is related to other file-related libc calls such as fopen.
A strcpy libc call can be replaced with memcpy.

To address this problem, we propose to incorporate lexical semantic relationships into the feature engineering process. Asm2Vec explores co-occurrence relationships among tokens and discovers rich lexical semantic relationships among them.

P2: The existing static approaches assume that features are equally important or require a mapping of equivalent assembly functions to learn the weights.

Inspired by recent development in representation learning, we propose to train a neural network model to read many assembly code data and let the model identify the best representation that distinguishes one function from the rest.

Problem Definition

[Figure: formal problem definition]

Overall Workflow

Step 1: Given a repository of assembly functions, we first build a neural network model for these functions.
Step 2: After the training phase, the model produces a vector representation for each repository function.
Step 3: Given a target function $f_t$ that was not trained with this model, we use the model to estimate its vector representation.
Step 4: We compare the vector of $f_t$ against the other vectors in the repository by using cosine similarity to retrieve the top-$k$ ranked candidates as results.

Training is a one-time effort, after which the representation of a query can be estimated efficiently. If a new assembly function is added to the repository, we follow the same procedure as in Step 3 to estimate its vector representation. The model can be retrained periodically to guarantee the vectors' quality.
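Step 4 of the workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 4-dimensional vectors, and the helper names `cosine_similarity` and `top_k_clones` are invented here, and real vectors would come from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention of this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def top_k_clones(target_vec, repository, k):
    """Rank repository functions by cosine similarity to the target vector."""
    scored = [(name, cosine_similarity(target_vec, vec))
              for name, vec in repository.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy repository: names and vectors are hypothetical.
repository = {
    "memcpy_O0": [0.9, 0.1, 0.0, 0.2],
    "memcpy_O3": [0.8, 0.2, 0.1, 0.3],
    "fopen_O0": [-0.1, 0.9, 0.4, 0.0],
}
target = [0.85, 0.15, 0.05, 0.25]  # estimated vector of f_t (Step 3)
results = top_k_clones(target, repository, k=2)
for name, score in results:
    print(name, round(score, 4))
```

A linear scan like this is enough for small repositories; for large ones an approximate nearest-neighbor index would be the usual replacement, which the notes above do not discuss.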

Assembly Code Representation Learning (based on the PV-DM model)

Preliminaries (PV-DM model)

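[Figure 4: the PV-DM model]
Given a text paragraph that contains multiple sentences, PV-DM applies a sliding window over each sentence. The sliding window starts at the beginning of the sentence and moves forward one word at each step.
e.g.,
In Figure 4, the sliding window has a size of 5.

In the first step, the sliding window contains the five words "the", "cat", "sat", "on", and "a".
The word in the middle, "sat", is treated as the target, and the surrounding words are treated as the context.
In the second step, the window moves forward one word and contains "cat", "sat", "on", "a", and "mat", where "on" is the target word.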
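The sliding-window procedure can be reproduced with a short sketch. The helper name `sliding_windows` is made up for this illustration, and positions are 0-indexed, so the target index runs from `k` to `len(sentence) - k - 1`.

```python
def sliding_windows(sentence, k):
    """Yield (target, context) pairs for a window of size 2k + 1."""
    for t in range(k, len(sentence) - k):
        # k words before the target plus k words after it
        context = sentence[t - k:t] + sentence[t + 1:t + k + 1]
        yield sentence[t], context

sentence = ["the", "cat", "sat", "on", "a", "mat"]
pairs = list(sliding_windows(sentence, k=2))  # window size 2 * 2 + 1 = 5
for target, context in pairs:
    print(target, context)
```

With `k=2` the six-word sentence yields exactly the two steps described above: first the target "sat", then the target "on".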

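At each step, the PV-DM model performs a multi-class prediction task. It maps the current paragraph to a vector based on the paragraph ID, and maps each word in the context to a vector based on the word ID. The model averages these vectors and predicts the target word from the vocabulary through softmax classification. The classification error is back-propagated to update these vectors.
Formally, a text corpus $T$ contains a list of paragraphs $p \in T$, each paragraph $p$ contains a list of sentences $s \in p$, and each sentence $s$ is a sequence of $|s|$ words $w_t \in s$.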

PV-DM maximizes the log probability:

(1) $\sum_{p}^{T}\sum_{s}^{p}\sum_{t=k}^{|s|-k}\log \mathbf{P}(w_t \mid p, w_{t-k}, \ldots, w_{t+k})$

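The sliding window size is $2k + 1$. The paragraph vector captures the information that is missing from the context for predicting the target, and it is interpreted as the topic.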

The Asm2Vec Model

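PV-DM is designed for text data that is laid out sequentially. However, assembly code carries richer syntax than plain text. It contains operations, operands, and control flow that are structurally different from plain text.
An assembly function can be represented as a control flow graph (CFG). We model the control flow graph as multiple sequences. Each sequence corresponds to a potential execution trace that contains linearly laid-out assembly instructions.
This stage corresponds to Step 1 and Step 2 of the workflow. Through these two steps, we train a representation model that produces a numeric vector for each repository function $f_s \in RP$.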

First, we map each function $f_s$ to a vector $\vec{\theta}_{f_s} \in \mathbb{R}^{2\times d}$.

$\vec{\theta}_{f_s}$ is the vector representation of function $f_s$ that is learned in training.
$d$ is a user-chosen parameter.

Similarly, we collect all the unique tokens in the repository $RP$, treating the operations and operands of assembly code as tokens. We map each token $t$ to a numeric vector $\vec{v}_t \in \mathbb{R}^{d}$ and another numeric vector $\vec{v}'_t \in \mathbb{R}^{2\times d}$.

$\vec{v}_t$ is the vector representation of token $t$. After training, it represents the token's lexical semantics. The vectors $\vec{v}_t$ are used to visualize the relationships among tokens.
$\vec{v}'_t$ is used for token prediction.

All $\vec{\theta}_{f_s}$ and $\vec{v}_t$ are initialized to small random values around 0.
All $\vec{v}'_t$ are initialized to 0.
For $f_s$ we use $2\times d$ dimensions, because we concatenate the vector of the operation and the vector of the operands to represent an instruction.
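The three lookup tables and their initialization can be sketched as follows. The notes only say "small random values around 0"; the uniform range of `0.5 / d` used here, the seed, and the helper name `init_vectors` are assumptions of this sketch.

```python
import random

def init_vectors(function_names, tokens, d, seed=0):
    """Create theta (2d per function), v (d per token), and v' (2d per token)."""
    rng = random.Random(seed)
    scale = 0.5 / d  # assumed meaning of "small"; not specified in the notes
    theta = {f: [rng.uniform(-scale, scale) for _ in range(2 * d)]
             for f in function_names}
    v = {t: [rng.uniform(-scale, scale) for _ in range(d)] for t in tokens}
    v_out = {t: [0.0] * (2 * d) for t in tokens}  # v' starts at exactly zero
    return theta, v, v_out

# Hypothetical function names and tokens.
theta, v, v_out = init_vectors(["f1", "f2"], ["mov", "push", "rbp", "rsp"], d=4)
print(len(theta["f1"]), len(v["mov"]), len(v_out["mov"]))
```

The dimensions line up with the text: a function vector and an output vector both have $2\times d$ entries, so either can be compared with the concatenation of an operation vector and an averaged operand vector.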

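We treat each repository function $f_s \in RP$ as multiple sequences $\mathcal{S}(f_s) = seq[1:i]$, where $seq_i$ is one of them. We assume that the order of the sequences is randomized.
A sequence is represented as a list of instructions $\mathcal{I}(seq_i) = in[1:j]$, where $in_j$ is one of them.
An instruction $in_j$ contains a list of operands $\mathcal{A}(in_j)$ and one operation $\mathcal{P}(in_j)$. Their concatenation is denoted as its list of tokens $\mathcal{T}(in_j) = \mathcal{P}(in_j) \,||\, \mathcal{A}(in_j)$, where $||$ denotes concatenation. Constant tokens are normalized to their hexadecimal form.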

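For each sequence $seq_i$ of function $f_s$, the neural network walks through the instructions from the beginning of the sequence.
We collect the current instruction $in_j$, its previous instruction $in_{j-1}$, and its next instruction $in_{j+1}$. We ignore instructions that are out of bounds.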
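A minimal sketch of the token split $\mathcal{T}(in_j) = \mathcal{P}(in_j) \,||\, \mathcal{A}(in_j)$ and of the neighbor collection. The helper names, the comma-separated instruction syntax, and the fourth instruction of the toy sequence are assumptions. "Ignoring out-of-bounds instructions" is read here as skipping positions that lack a previous or next instruction, which is one possible interpretation. Constant normalization is not shown.

```python
def tokenize(instruction):
    """Split 'mov rbp, rsp' into the operation 'mov' and operands ['rbp', 'rsp']."""
    operation, _, rest = instruction.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    return operation, operands

def training_triples(sequence):
    """Yield (previous, current, next) for each instruction with both neighbors."""
    for j in range(1, len(sequence) - 1):
        yield sequence[j - 1], sequence[j], sequence[j + 1]

# Toy sequence; only 'mov rbp, rsp' and 'push rbx' come from the worked example.
sequence = ["push rbp", "mov rbp, rsp", "push rbx", "sub rsp, 0x8"]
triples = list(training_triples(sequence))
for prev_in, cur_in, next_in in triples:
    print(tokenize(prev_in), tokenize(cur_in), tokenize(next_in))
```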

The proposed model tries to maximize the following log probability across the repository $RP$:

(2) $\sum_{f_s}^{RP}\sum_{seq_i}^{\mathcal{S}(f_s)}\sum_{in_j}^{\mathcal{I}(seq_i)}\sum_{t_c}^{\mathcal{T}(in_j)}\log \mathbf{P}(t_c \mid f_s, in_{j-1}, in_{j+1})$

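This objective maximizes the log probability of seeing a token $t_c$ in the current instruction, given the current assembly function $f_s$ and its neighbor instructions. The intuition is to use the current function's vector and the context provided by the neighbor instructions to predict the current instruction. The vectors provided by the neighbor instructions capture the lexical semantic relationships. The function's vector remembers what cannot be predicted from the given context. It models the instructions that distinguish the current function from the others.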

For a given function $f_s$, we first look up its vector representation $\vec{\theta}_{f_s}$ in the previously built dictionary. To model a neighbor instruction $in$ as $\mathcal{CT}(in) \in \mathbb{R}^{2\times d}$, we average the vectors of its operands $(\in \mathbb{R}^d)$ and concatenate the averaged vector with the vector of the operation $(\in \mathbb{R}^d)$.
This can be written as:
(3) $\mathcal{CT}(in) = \vec{v}_{\mathcal{P}(in)} \,||\, \frac{1}{|\mathcal{A}(in)|}\sum_{t_b}^{\mathcal{A}(in)} \vec{v}_{t_b}$

$\mathcal{P}(*)$ denotes an operation, and it is a single token.
By averaging $\vec{\theta}_{f_s}$ with $\mathcal{CT}(in_{j-1})$ and $\mathcal{CT}(in_{j+1})$, $\delta(in_j, f_s)$ models the joint memory of the neighbor instructions:

(4) $\delta(in_j, f_s) = \frac{1}{3}(\vec{\theta}_{f_s} + \mathcal{CT}(in_{j-1})+\mathcal{CT}(in_{j+1}))$

Example:
Consider a simple assembly function $f_s$ and one of its sequences, as shown in Figure 5.
[Figure 5: a simple assembly function and one of its sequences]
Take the third instruction as an example: $j = 3$ and $\mathcal{T}(in_3) = \{'push', 'rbx'\}$.
$\mathcal{A}(in_{3-1}) =\{'rbp', 'rsp'\}$ and $\mathcal{P}(in_{3-1}) = \{'mov'\}$.
We collect their respective vectors $\vec{v}_{rbp}$, $\vec{v}_{rsp}$, and $\vec{v}_{mov}$, and compute $\mathcal{CT}(in_{3-1}) = \vec{v}_{mov} \,||\, (\vec{v}_{rbp}+\vec{v}_{rsp})/2$.
Following the same procedure, we compute $\mathcal{CT}(in_{3+1})$.
Using Equation (4), we obtain $\delta(in_3, f_s)$.
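The example can be traced numerically with toy values. Everything concrete below is assumed for illustration: $d = 2$, the vector entries, the next instruction `sub rsp, 0x8`, and the helper names. Only the shapes and the arithmetic follow Equations (3) and (4).

```python
def context_vector(operation, operands, v, d):
    """CT(in): operation vector concatenated with the mean operand vector (Eq. 3)."""
    if operands:
        mean = [sum(v[t][i] for t in operands) / len(operands) for i in range(d)]
    else:
        mean = [0.0] * d  # assumption of this sketch for operand-free instructions
    return list(v[operation]) + mean

def joint_memory(theta_f, ct_prev, ct_next):
    """delta(in_j, f_s): element-wise average of the three 2d-vectors (Eq. 4)."""
    return [(a + b + c) / 3.0 for a, b, c in zip(theta_f, ct_prev, ct_next)]

d = 2
v = {  # toy token vectors, each in R^d
    "mov": [1.0, 0.0], "rbp": [0.0, 2.0], "rsp": [2.0, 0.0],
    "sub": [0.0, 1.0], "0x8": [4.0, 4.0],
}
theta_fs = [2.0, 2.0, 2.0, 0.0]  # toy function vector in R^(2d)

ct_prev = context_vector("mov", ["rbp", "rsp"], v, d)  # in_{3-1}: mov rbp, rsp
ct_next = context_vector("sub", ["rsp", "0x8"], v, d)  # in_{3+1}: hypothetical
delta_in3 = joint_memory(theta_fs, ct_prev, ct_next)
print(ct_prev, ct_next, delta_in3)
```

Here `ct_prev` is `[1.0, 0.0]` (the vector of `mov`) followed by `[1.0, 1.0]` (the mean of the `rbp` and `rsp` vectors), matching the hand computation of $\mathcal{CT}(in_{3-1})$.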
Given $\delta(in_j, f_s)$, the probability term in Equation (2) can be rewritten as:
(5) $\mathbf{P}(t_c \mid f_s, in_{j-1}, in_{j+1}) = \mathbf{P}(t_c \mid \delta(in_j, f_s))$

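Recall that we map each token to two vectors, $\vec{v}$ and $\vec{v}'$.
For each target token $t_c \in \mathcal{T}(in_j)$, which belongs to the current instruction, we look up its output vector $\vec{v}'_{t_c}$.
The probability in Equation (5) can be modeled as a softmax multi-class regression problem:

(6) $\mathbf{P}(t_c \mid \delta(in_j, f_s)) = \mathbf{P}(\vec{v}'_{t_c} \mid \delta(in_j, f_s)) = \frac{Uh((\vec{v}'_{t_c})^{T} \times \delta(in_j, f_s))}{\sum_{t_d}^{D} Uh((\vec{v}'_{t_d})^{T} \times \delta(in_j, f_s))}$

$D$ denotes the whole vocabulary constructed over the repository $RP$.
$Uh(\cdot)$ denotes a sigmoid function applied to each value of a vector.
For each pass of the softmax layer, the total number of parameters to be estimated is $(|D|+1)\times2\times d$.
The term $|D|$ is too large for softmax classification.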

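We use the $k$ negative sampling approach to approximate the log probability:

(A) $\log \mathbf{P}(t_c \mid \delta(in_j, f_s)) \approx \log Uh((\vec{v}'_{t_c})^{T} \times \delta(in_j, f_s)) + \sum_{i=1}^{k} \mathbb{E}_{t_d \sim P_n(t_c)}\big([\![t_d \neq t_c]\!] \log Uh(-(\vec{v}'_{t_d})^{T} \times \delta(in_j, f_s))\big)$

$[\![\cdot]\!]$ is an identity function: if the expression inside it evaluates to true, it outputs 1; otherwise it outputs 0.
For example, $[\![1 + 2 = 3]\!] = 1$ and $[\![1 + 1 = 3]\!] = 0$.

The negative sampling algorithm distinguishes the correct guess $t_c$ from $k$ randomly selected negative samples $\{t_d \mid t_d \neq t_c\}$ using $k + 1$ logistic regressions.

$\mathbb{E}_{t_d \sim P_n(t_c)}$ is a sampling function that samples a token $t_d$ from the vocabulary $D$ according to the noise distribution $P_n(t_c)$ constructed from $D$.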
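The negative-sampling approximation can be sketched as a sum of $k + 1$ logistic terms, one for the target and one per sampled negative. Several things here are assumptions: the noise distribution is taken to be proportional to token counts (the notes only say it is constructed from $D$), the expectation is replaced by the drawn samples, and all names and numbers are invented.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_negatives(token_counts, k, rng):
    """Draw k tokens from a count-proportional noise distribution (assumed)."""
    tokens = list(token_counts)
    weights = [token_counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=k)

def neg_sampling_log_prob(target, negatives, v_out, delta_vec):
    """log s(v'_c . delta) + sum over negatives of log s(-v'_d . delta)."""
    total = math.log(sigmoid(dot(v_out[target], delta_vec)))
    for t_d in negatives:
        if t_d != target:  # the indicator: a sample equal to the target adds 0
            total += math.log(sigmoid(-dot(v_out[t_d], delta_vec)))
    return total

# Toy setup: output vectors start at zero, as in the initialization described.
token_counts = {"push": 5, "mov": 9, "rbp": 4, "rsp": 6}
v_out = {t: [0.0, 0.0, 0.0, 0.0] for t in token_counts}
delta_vec = [1.0, 1.0, 2.0, 1.0]

negatives = sample_negatives(token_counts, k=3, rng=random.Random(0))
print(negatives, neg_sampling_log_prob("push", negatives, v_out, delta_vec))
```

The cost of one prediction now depends on $k$ rather than on $|D|$, which is the point of the approximation.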

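Let $J(\theta)$ denote the negative-sampling approximation of the log probability. Taking derivatives with respect to $\vec{\theta}_{f_s}$ and $\vec{v}'_t$, where $t$ ranges over the target $t_c$ and the sampled negatives $t_d$, gives the gradients:

(7) $\frac{\partial J(\theta)}{\partial \vec{\theta}_{f_s}} = \frac{1}{3}\sum_{t}\big([\![t = t_c]\!] - Uh((\vec{v}'_{t})^{T} \times \delta(in_j, f_s))\big) \times \vec{v}'_{t}$, $\quad \frac{\partial J(\theta)}{\partial \vec{v}'_{t}} = \big([\![t = t_c]\!] - Uh((\vec{v}'_{t})^{T} \times \delta(in_j, f_s))\big) \times \delta(in_j, f_s)$

Taking derivatives with respect to $\vec{v}_{\mathcal{P}(in_{j+1})}$ and $\{\vec{v}_{t_b} \mid t_b \in \mathcal{A}(in_{j+1})\}$ gives the gradients:

(8) $\frac{\partial J(\theta)}{\partial \vec{v}_{\mathcal{P}(in_{j+1})}} = \Big(\frac{\partial J(\theta)}{\partial \vec{\theta}_{f_s}}\Big)[0 : d-1]$, $\quad \frac{\partial J(\theta)}{\partial \vec{v}_{t_b}} = \frac{1}{|\mathcal{A}(in_{j+1})|} \times \Big(\frac{\partial J(\theta)}{\partial \vec{\theta}_{f_s}}\Big)[d : 2d-1]$

Afterwards, we use back-propagation to update the values of the involved vectors. Specifically, we update $\vec{\theta}_{f_s}$, all the involved $\vec{v}_t$, and all the involved $\vec{v}'_t$ according to their gradients, with a learning rate.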
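One update step can be sketched as follows. The gradient expressions in the code are derived here from the negative-sampling objective with $\delta$ defined as in Equation (4); they are a sketch under that assumption and may differ in notation from the original formulas. The negatives are assumed to be distinct and different from the target, and all numbers are toy values. Because the objective is maximized, the update is a gradient ascent step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_memory(theta, ct_prev, ct_next):
    return [(a + b + c) / 3.0 for a, b, c in zip(theta, ct_prev, ct_next)]

def objective(theta, ct_prev, ct_next, v_out, target, negatives):
    """Negative-sampling log probability for one target token."""
    d_vec = joint_memory(theta, ct_prev, ct_next)
    total = math.log(sigmoid(dot(v_out[target], d_vec)))
    for t in negatives:
        total += math.log(sigmoid(-dot(v_out[t], d_vec)))
    return total

def gradients(theta, ct_prev, ct_next, v_out, target, negatives):
    """Gradients of the objective w.r.t. theta and each involved v'_t."""
    d_vec = joint_memory(theta, ct_prev, ct_next)
    grad_theta = [0.0] * len(theta)
    grad_out = {}
    for t in [target] + list(negatives):
        label = 1.0 if t == target else 0.0
        err = label - sigmoid(dot(v_out[t], d_vec))
        grad_out[t] = [err * x for x in d_vec]
        # the factor 1/3 comes from the averaging inside delta
        grad_theta = [g + err * w / 3.0 for g, w in zip(grad_theta, v_out[t])]
    return grad_theta, grad_out

def ascent_step(vec, grad, lr):
    return [x + lr * g for x, g in zip(vec, grad)]

d = 2
theta = [0.1, -0.2, 0.3, 0.4]
ct_prev = [0.5, 0.0, -0.1, 0.2]
ct_next = [0.0, 0.3, 0.1, -0.4]
v_out = {"push": [0.2, 0.1, -0.3, 0.05], "mov": [-0.1, 0.4, 0.2, 0.0],
         "rsp": [0.3, -0.2, 0.1, 0.1]}
target, negatives = "push", ["mov", "rsp"]

grad_theta, grad_out = gradients(theta, ct_prev, ct_next, v_out, target, negatives)
grad_op_next = grad_theta[:d]  # gradient for the operation vector of in_{j+1}
n_operands_next = 2            # hypothetical operand count of in_{j+1}
grad_operand_next = [g / n_operands_next for g in grad_theta[d:]]

before = objective(theta, ct_prev, ct_next, v_out, target, negatives)
theta_new = ascent_step(theta, grad_theta, lr=0.1)
after = objective(theta_new, ct_prev, ct_next, v_out, target, negatives)
print(before, after)
```

The same `ascent_step` would be applied to every involved $\vec{v}'_t$ using `grad_out[t]`, and to the neighbor token vectors using the slices of `grad_theta`.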

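Example:
Continuing the previous example, the target token is $'push'$. We compute $\mathbf{P}(\vec{v}'_{push} \mid \delta(in_j, f_s))$ using negative sampling (Equation A). We then compute the gradients using Equations 7 and 8. Finally, we update all the vectors involved in these two examples according to their respective gradients, with a learning rate.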