Contribution
- Proposes a graph-simplified code representation to address long-range dependencies between nodes
- Proposes an enhanced graph neural network learning method
- Develops a system, AMPLE, which achieves state-of-the-art performance on 3 datasets
  https://github.com/AMPLE001/AMPLE
Framework
AMPLE consists of two main stages: graph simplification of the code representation, and enhanced GNN learning for feature extraction
Source Code ->
Graph Simplification ( Graph Generating -> TGS -> VGS ) ->
( word2vec -> ) Enhanced Graph Representation Learning ( Edge-Aware GCN module -> Kernel-scaled representation module ) ->
( Binary Classification (2-Layer FC -> Softmax) )
Graph Simplification
Graph Generating
The paper uses the joern tool to generate the initial AST, PDG, and DFG graphs, then processes them in the same way as Devign to obtain an AST-based composite graph
Type-based Graph Simplification
- Definition: delete nodes from the composite graph according to their node type. Adjacent nodes are removed following the P-C simplification rules, derived from parsing principles and manual inspection
- P-C simplification rules: a child node is deleted based on the type of its parent (P) node and child (C) node; the parent inherits the child's edges, since the child's information is already reflected in the parent and its successor nodes
Rule | P-Type | C-Type
---|---|---
1 | Expression Statement | Assignment Expression / Unary Expression / Call Expression / Post Inc Dec Operation Expression
2 | Identifier Declare Expression | Identifier Declaration
3 | Condition | *
4 | ForInit | *
5 | Call Expression | Argument List
6 | Argument | *
7 | Callee | Identifier
Variable-based Graph Simplification
- Definition: merge leaf nodes in the composite graph that refer to the same variable, without changing the hierarchy
- Intuitively: traverse the leaf nodes along the NCS edges; when the same variable appears repeatedly, keep only its first leaf node and redirect the edges of the later leaves to it
- A leaf node may then have multiple parent nodes, which exploits semantic information across statements
Algorithm 1: Graph Simplification

```
Input:  Original Code Structure Graph: Graph_Original
Output: Simplified Code Structure Graph: Graph_GS

Function GS:
    // The procedure of TGS
    Graph_TGS ← Graph_Original
    for each AST T ∈ Graph_TGS do
        // Do breadth-first traversal
        Stack ← ∅
        Push T.root into Stack
        while Stack ≠ ∅ do
            u ← Pop Stack
            Children ← get all the child nodes of u
            for v ∈ Children do
                if (u.Type, v.Type) conform to the rules in Table 1 then
                    Delete edge (u, v) and v from Graph_TGS
                    Add edges between u and all child nodes of v
                    Push all child nodes of v into Stack
                else
                    Push v into Stack
                end if
            end for
        end while
    end for
    // The procedure of VGS
    Graph_GS ← Graph_TGS
    leaf_nodes ← get all the leaf nodes of AST in Graph_GS
    same_variable ← get all the node groups with the same variable in leaf_nodes
    for each group ∈ same_variable do
        Merge all the nodes in the group into one variable node
    end for
    return Graph_GS
```
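The two passes of Algorithm 1 can be sketched in plain Python. The `Node` class and the trimmed-down rule set below are illustrative stand-ins for the joern-generated structures and the full Table 1, not the paper's actual implementation.

```python
# Sketch of Algorithm 1 (TGS + VGS) on a toy AST.
class Node:
    def __init__(self, ntype, var=None):
        self.type = ntype      # node type, e.g. "Condition"
        self.var = var         # variable name for leaf nodes, else None
        self.children = []

# A subset of the P-C rules from Table 1: delete child v when
# (parent type, child type) matches; "*" matches any child type.
RULES = {
    ("Expression Statement", "Call Expression"),
    ("Condition", "*"),
    ("Call Expression", "Argument List"),
}

def matches(p_type, c_type):
    return (p_type, c_type) in RULES or (p_type, "*") in RULES

def tgs(root):
    """Type-based simplification: delete matching children; the parent
    inherits the deleted child's children (i.e. its outgoing edges)."""
    stack = [root]
    while stack:
        u = stack.pop()
        kept = []
        for v in u.children:
            if matches(u.type, v.type):
                kept.extend(v.children)    # u inherits v's children; (u, v) and v are dropped
                stack.extend(v.children)
            else:
                kept.append(v)
                stack.append(v)
        u.children = kept
    return root

def vgs(root):
    """Variable-based simplification: merge leaves referring to the
    same variable, keeping the first occurrence."""
    first = {}                             # variable name -> first leaf node seen
    def walk(u):
        for i, v in enumerate(u.children):
            if not v.children and v.var is not None:   # leaf holding a variable
                if v.var in first:
                    u.children[i] = first[v.var]       # redirect edge to first occurrence
                else:
                    first[v.var] = v
            else:
                walk(v)
    walk(root)
    return root
```

For example, applying `tgs` then `vgs` to an `Expression Statement → Call Expression → Argument List → x, x` chain collapses the call node into its parent and merges the two `x` leaves into one node with two incoming edges.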
Enhanced Graph Representation Learning
Node -> vector
- Implemented with word2vec; $d$ denotes the embedding dimension
EA-GCN: Edge-Aware Graph Convolutional Network Module
- EA-GCN first computes node vectors by weighting edges of different types, then strengthens feature extraction with a multi-head attention mechanism
- Definition: after the previous stages, the module receives a graph $G(\mathcal{V},\mathcal{E},\mathcal{R})$, where $v_i\in\mathcal V$ is a node with initial vector state $h_i^0\in\mathbb R^{d}$, and $(v_i,\beta,v_j)\in\mathcal E$ denotes a directed edge of type $\beta\in\mathcal R$ from node $v_i$ to node $v_j$
- For an edge of type $\beta$ at node $v_i\in\mathcal V$, the weight matrix in network layer $l$ is defined as $W_\beta^l=a_\beta^l V^l$, where $a_\beta$ is a learnable weight
- The weight matrix $W_\beta^l$ is used to update the node vectors:
$$\widetilde{h_i^l}=\sigma\Big(\sum_{\beta\in\mathcal R}\sum_{j\in\mathcal N_i^\beta}\frac{1}{c_{i,\beta}}\,W_\beta^l h_j^{l-1}+W_0^l h_i^{l-1}\Big)$$
- Multi-head attention is used to better extract edge information; per head $k$, the attention score takes the standard scaled dot-product form:
$$w_{i\rightarrow j}^{k}=\operatorname{softmax}_j\!\left(\frac{\big(W_Q^{k}\,\widetilde{h_i^l}\big)^{\top}W_K^{k}\,\widetilde{h_j^l}}{\sqrt{d_k}}\right)$$
- The heads are aggregated as:
$$A_i^{l}=\operatorname{CONCAT}_{k=1}^{K}\Big(\sum_{j\in\mathcal N_i}w_{i\rightarrow j}^{k}\,W_V^{k}\,\widetilde{h_j^l}\Big)$$
where $W_Q^k,\ W_K^k,\ W_V^k$ are per-head projection matrices
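The edge-aware update above can be sketched in numpy. The edge-list format, the ReLU nonlinearity standing in for $\sigma$, and the function name are illustrative assumptions; the type-specific weight $W_\beta = a_\beta V$ is shared as in the text.

```python
import numpy as np

def ea_gcn_layer(h, edges, a_beta, V, W0, n_types):
    """One edge-aware update, sketching
    h~_i = sigma( sum_beta sum_{j in N_i^beta} (1/c_{i,beta}) W_beta h_j + W0 h_i ).
    h: (n, d) node states; edges: list of (i, beta, j) directed typed edges;
    a_beta: (n_types,) learnable scalars; V, W0: (d, d) matrices."""
    n, d = h.shape
    out = h @ W0.T                       # self-connection term W0 h_i
    # count neighbours per (node, type) for the 1/c_{i,beta} normalisation
    counts = np.zeros((n, n_types))
    for i, b, j in edges:
        counts[i, b] += 1
    for i, b, j in edges:
        W_b = a_beta[b] * V              # type-specific weight W_beta = a_beta * V
        out[i] += (W_b @ h[j]) / counts[i, b]
    return np.maximum(out, 0.0)          # sigma = ReLU in this sketch
```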
Kernel-scaled Representation Module
- Captures semantic information between distant nodes. A large kernel and a small kernel are computed in parallel, attending to distant and nearby nodes respectively
- Given the vector matrix $H_i^l$ learned by EA-GCN and the kernel sizes $N,\ M\ (M<N)$, the layer output is:
$$K_{out}=BN\big(H_i^l * W^L,\ \mu^L\big)+BN\big(H_i^l * W^S,\ \nu^S\big)$$
- where $BN$ is a batch normalization layer, $\mu^L,\ \nu^S$ are the BN layers for the large and small kernels respectively, and $W^L\in\mathbb R^{C_{out}\times C_{in}\times N},\ W^S\in\mathbb R^{C_{out}\times C_{in}\times M}$ are the large and small convolution kernels
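The parallel large/small-kernel computation can be sketched as 1-D convolutions over the node dimension. The 'same' padding and the plain (non-learned) batch normalisation below are simplifying assumptions.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution over the node axis.
    x: (n, c_in) node features; w: (c_out, c_in, k) kernel."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        # sum over input channels i and kernel taps k
        out[t] = np.einsum("oik,ki->o", w, xp[t:t + k])
    return out

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation, standing in for the BN layers mu^L / nu^S."""
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def kernel_scaled(h, w_large, w_small):
    """K_out = BN(h * W^L) + BN(h * W^S): the large kernel covers distant
    nodes, the small kernel nearby ones."""
    return batch_norm(conv1d(h, w_large)) + batch_norm(conv1d(h, w_small))
```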
Binary Classification
- 2-Layer FC + Softmax
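The head can be sketched as follows; the layer widths and the ReLU hidden activation are assumptions (the text only specifies 2 FC layers plus softmax).

```python
import numpy as np

def classify(g, W1, b1, W2, b2):
    """2-layer FC + softmax head: graph-level vector g -> (P(vulnerable), P(clean))."""
    z = np.maximum(g @ W1 + b1, 0.0)       # hidden FC layer with ReLU (assumed)
    logits = z @ W2 + b2                   # output FC layer: 2 logits for the binary task
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()
```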
Experiment
- Effectiveness: AMPLE achieves state-of-the-art results
- Effectiveness of graph simplification: feeding the Original, TGS-only, VGS-only, and full GS graphs to the same network shows GS performs best
- Degree of simplification: comparing node/edge reduction rates and node-distance reduction rates shows both TGS and VGS are effective
- Effectiveness of EA-GCN and Kernel-scaled: comparing the full AMPLE against variants with only EA-GCN or only the kernel-scaled module shows the full AMPLE performs best
- Replaceability of EA-GCN: substituting GCN, R-GCN, or GGNN for EA-GCN shows EA-GCN performs best
- Replaceability of Kernel-scaled: substituting a Dense layer shows the kernel-scaled module performs best
- The paper also runs ablations over hyperparameters such as the large/small kernel sizes, the number of EA-GCN layers, and the number of attention heads
Explanation
- GS keeps the data compact and shortens semantic distances, so the network learns features faster and more effectively
- EA-GCN extracts information from the composite graph built from multiple heterogeneous graphs
- The kernel-scaled module attends to global information
- The multi-head attention mechanism ensures AMPLE puts higher weight on vulnerable statements
Limitation
- Few datasets: only 3 common datasets are used (FFMPeg+Qemu, Reveal, Fan)
- Limited to C/C++