Contribution
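- Proposes a new framework, SySeVR (Syntax-based, Semantics-based, and Vector Representations)
  - SyVCs: Syntax-based Vulnerability Candidates
  - SeVCs: Semantics-based Vulnerability Candidates
  - SeVCs extend SyVCs by taking data flow and control flow into account
  - Algorithms are designed to obtain SyVCs and SeVCs automatically
- Releases a dataset integrated from NVD and SARD
  - It covers 126 types of vulnerabilities: https://github.com/SySeVR/SySeVR
- Allows multiple kinds of deep neural networks to run within the framework; BGRU performs best
- Takes control flow into account, which brings in more semantic information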
Basic Idea
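The idea comes from region proposals (bounding boxes) in CNN-based image processing. Only some lines of a program are vulnerable, so the whole program is divided into fragments whose statements are semantically related (through data dependency and control dependency), and each fragment is checked for vulnerability.
This gives the following pipeline: extract syntax fragments -> convert to semantic fragments -> vectorize -> deep learning network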
Dataset
- NVD + SARD
- Ground-truth labels are obtained in the same way as in VulDeePecker
- Metrics: in addition to FNR, FPR, P, and F1, the paper adds accuracy, $A=\frac{TP+TN}{TP+FP+TN+FN}$, which measures how accurate the model's predictions are, and the Matthews correlation coefficient, $MCC=\frac{TP \times TN-FP \times FN}{\sqrt{(TP+FP)(TP+FN)(FP+TN)(FN+TN)}}$, which measures how well the predictions agree with the ground-truth labels
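The metrics listed above can all be computed from the four confusion-matrix counts. The following is a minimal sketch in Python; the function name `detection_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this note, and FPR, FNR, P, and F1 use their standard definitions.

```python
import math


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute FPR, FNR, A, P, F1 and MCC from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Assumed convention: a zero denominator yields 0.0 instead of an error.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # recall equals 1 - FNR
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (fn + tn))
    return {
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, tp + fn),
        "A": ratio(tp + tn, tp + fp + tn + fn),
        "P": precision,
        "F1": ratio(2 * precision * recall, precision + recall),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(detection_metrics(tp=80, fp=10, tn=90, fn=20))
```

MCC uses all four counts, so it stays informative when vulnerable and non-vulnerable samples are imbalanced, which accuracy alone does not guarantee.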
Extract SyVC
Definition
- Vulnerabilities have their own syntax characteristics, defined as $H=\{h_k\}, 1 \leq k \leq \beta$, where $\beta$ is the total number of vulnerability syntax characteristics
  If the syntax of a semantically related part of a program matches some characteristic in $H$, that part is regarded as a SyVC
- A program $P$ is a set of functions $f_1,\ldots,f_\eta$, written $P=\{f_1,\ldots,f_\eta\}$
- A function $f_i$ is a set of statements $s_{i,1},\ldots,s_{i,m_i}$, written $f_i=\{s_{i,1},\ldots,s_{i,m_i}\}, 1 \leq i \leq \eta$
- A statement $s_{i,j}$ is a set of tokens $t_{i,j,1},\ldots,t_{i,j,\omega_{i,j}}$, written $s_{i,j}=\{t_{i,j,1},\ldots,t_{i,j,\omega_{i,j}}\}, 1 \leq i \leq \eta, 1 \leq j \leq m_i$
  A SyVC may correspond to a single token or to a combination of several tokens. In the program's AST these are the leaf nodes and the internal nodes
- A code element $e_{i,j,z}$ consists of one or more consecutive tokens of statement $s_{i,j}$, written $e_{i,j,z}=(t_{i,j,\mu},\ldots,t_{i,j,\nu}), 1 \leq \mu \leq \nu \leq \omega_{i,j}$. If a code element $e_{i,j,z}$ matches some characteristic in the set $H$, it is called a SyVC (Syntax-based Vulnerability Candidate)
Algorithm 1: Extracting SyVCs from a program

$$
\small
\text{Input: A program } P = \{ f_1, \ldots, f_{\eta} \}; \text{ a set } H = \{ h_k \mid 1 \leq k \leq \beta \} \text{ of vulnerability syntax characteristics} \\
\text{Output: A set } Y \text{ of SyVCs} \\
1: Y = \{\}; \\
2: \text{for each function } f_i \in P \text{ do} \\
3: \qquad\text{Generate an abstract syntax tree } T_i \text{ for } f_i; \\
4: \qquad\text{for each code element } e_{i,j,z} \text{ in } T_i \text{ do} \\
5: \qquad\qquad\text{for each } h_k \in H \text{ do} \\
6: \qquad\qquad\qquad\text{if } e_{i,j,z} \text{ matches } h_k \text{ then} \\
7: \qquad\qquad\qquad\qquad Y = Y \cup \{ e_{i,j,z} \}; \\
8: \qquad\qquad\qquad\text{end if} \\
9: \qquad\qquad\text{end for} \\
10: \qquad\text{end for} \\
11: \text{end for} \\
12: \text{return } Y \text{, the set of SyVCs}
$$
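The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `CodeElement` class, its field names, and the idea that each function's AST has already been flattened into a list of code elements are all hypothetical, and each characteristic $h_k$ is modelled as a plain predicate function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CodeElement:
    """A code element e_{i,j,z}; the fields are hypothetical stand-ins for AST data."""
    function: str   # name of the enclosing function f_i
    node_type: str  # AST node type of the element
    code: str       # source text of the element


# A syntax characteristic h_k is modelled as a predicate over code elements.
Matcher = Callable[[CodeElement], bool]


def extract_syvcs(program: Dict[str, List[CodeElement]],
                  H: List[Matcher]) -> List[CodeElement]:
    """Return the code elements that match at least one characteristic in H."""
    Y: List[CodeElement] = []
    for _name, elements in program.items():   # for each function f_i in P
        for e in elements:                     # for each code element in T_i
            # Y is a set in Algorithm 1; a list without duplicates keeps order stable.
            if any(h(e) for h in H) and e not in Y:
                Y.append(e)
    return Y


if __name__ == "__main__":
    program = {
        "main": [
            CodeElement("main", "callee", "memset"),
            CodeElement("main", "expression", "i = 0"),
        ],
    }
    H = [lambda e: e.node_type == "callee" and e.code == "memset"]
    print(extract_syvcs(program, H))
```

In a real setup the list of elements per function would come from walking the AST produced by a parser rather than from hand-written data.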
H set
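Vulnerability syntax characteristics are extracted from the dataset with the Checkmarx tool; manual inspection then yields the following four characteristics:

Name (abbreviation) | Description | Corresponding vulnerabilities (count) |
---|---|---|
Library/API Function Call (FC) | 811 library/API function calls | 106 |
Array Usage (AU) | Element access, address arithmetic | 87 |
Pointer Usage (PU) | Pointer arithmetic, referencing, passing as a function argument | 103 |
Arithmetic Expression (AE) | Overflow | 45 |

Note: one vulnerability may correspond to several syntax characteristics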
AST
The paper uses the Joern tool to generate the AST of each function
Match
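Matching is done by checking conditions derived from the syntax characteristics in the set H:

Characteristic | Code | Code element $e_{i,j,z}$ | Matching condition 1 | Matching condition 2 |
---|---|---|---|---|
FC | memset(buf,'\0') | memset | It is a callee node in the AST | The called function is one of the 811 functions collected for FC |
AU | source[99]='a' | source | It is an identifier declaration node in the AST | The declaration contains the characters '[' and ']' |
PU | data[99]='\0' | data | It is an identifier declaration node in the AST | The declaration contains the character '*' |
AE | data=databuf-8 | data=databuf-8 | It is an expression node in the AST | The node contains the character '=' and has one or more identifiers on the right-hand side of '=' |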
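The four matching rules in the table can be written as small predicates. The sketch below is an assumption-laden illustration, not Joern's API: `AstNode`, the node-type strings, and the three-entry `RISKY_CALLS` set (a placeholder for the 811 library/API functions) are hypothetical. The AE rule requires at least one identifier on the right-hand side so that the table's own example `data=databuf-8` matches.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AstNode:
    node_type: str  # hypothetical labels: "callee", "identifier_declaration", "expression"
    code: str       # source text covered by the node


# Placeholder for the 811 library/API function calls mentioned in the notes.
RISKY_CALLS = {"memset", "strcpy", "memcpy"}

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def is_fc(n: AstNode) -> bool:
    return n.node_type == "callee" and n.code in RISKY_CALLS


def is_au(n: AstNode) -> bool:
    return n.node_type == "identifier_declaration" and "[" in n.code and "]" in n.code


def is_pu(n: AstNode) -> bool:
    return n.node_type == "identifier_declaration" and "*" in n.code


def is_ae(n: AstNode, min_rhs_identifiers: int = 1) -> bool:
    # Simplification: any '=' counts, so comparisons such as '==' are not filtered out.
    if n.node_type != "expression" or "=" not in n.code:
        return False
    rhs = n.code.split("=", 1)[1]
    return len(IDENTIFIER.findall(rhs)) >= min_rhs_identifiers


def classify(n: AstNode) -> List[str]:
    """Return the abbreviations of every characteristic the node matches."""
    rules = [("FC", is_fc), ("AU", is_au), ("PU", is_pu), ("AE", is_ae)]
    return [name for name, rule in rules if rule(n)]


if __name__ == "__main__":
    for node in [AstNode("callee", "memset"),
                 AstNode("identifier_declaration", "char source[100]"),
                 AstNode("identifier_declaration", "char *data"),
                 AstNode("expression", "data=databuf-8")]:
        print(node.code, classify(node))
```

A node may match more than one rule, which mirrors the note that one vulnerability can correspond to several syntax characteristics.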
Transform SyVCs to SeVCs
Definition
A SyVC is cut out of the AST, whereas a SeVC is cut out of the PDG
- CFG (Control Flow Graph): for a function $f_i$ of a program $P=\{f_1,\ldots,f_\eta\}$, the CFG is a graph $G_i=(V_i,E_i)$, where each node $n_{i,j}$ in $V_i=\{n_{i,1},\ldots,n_{i,c_i}\}$ represents a statement or a control predicate, and each directed edge $\epsilon_{i,j}$ in $E_i=\{\epsilon_{i,1},\ldots,\epsilon_{i,d_i}\}$ represents a possible flow of control between two nodes
- Data dependency: consider a program $P=\{f_1,\ldots,f_\eta\}$ and two nodes $n_{i,j}, n_{i,\gamma}$ with $1 \leq j,\gamma \leq c_i$ and $j \neq \gamma$ in the CFG $G_i=(V_i,E_i)$ of function $f_i$. If there is a path from $n_{i,\gamma}$ to $n_{i,j}$ and a value computed at $n_{i,\gamma}$ is used at $n_{i,j}$, then $n_{i,j}$ is data dependent on $n_{i,\gamma}$
- Control dependency: consider a program $P=\{f_1,\ldots,f_\eta\}$ and two nodes $n_{i,j}, n_{i,\gamma}$ with $1 \leq j,\gamma \leq c_i$ and $j \neq \gamma$ in the CFG $G_i=(V_i,E_i)$ of function $f_i$. If every path from $n_{i,\gamma}$ to the end of the program passes through $n_{i,j}$, then $n_{i,j}$ post-dominates $n_{i,\gamma}$. If there is a path $p$ that starts at $n_{i,\gamma}$ and ends at $n_{i,j}$ such that (i) $n_{i,j}$ post-dominates every node on $p$ other than the two endpoints and (ii) $n_{i,j}$ does not post-dominate $n_{i,\gamma}$, then $n_{i,j}$ is control dependent on $n_{i,\gamma}$
- PDG (Program Dependency Graph): for a function $f_i$ of a program $P=\{f_1,\ldots,f_\eta\}$, the PDG is a graph $G_i'=(V_i,E_i')$, where $V_i=\{n_{i,1},\ldots,n_{i,c_i}\}$ is the same node set as in the CFG (each node $n_{i,j}$ represents a statement or a control predicate), and each directed edge $\epsilon_{i,j}'$ in $E_i'=\{\epsilon_{i,1}',\ldots,\epsilon_{i,d_i'}'\}$ represents a data dependency or a control dependency between two nodes
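Post-dominance and control dependency can be checked mechanically on a small CFG. The sketch below assumes a CFG given as an adjacency dictionary with a single exit node that every node can reach; the node names are made up. It computes post-dominator sets by fixed-point iteration and then applies the commonly used successor-based formulation of control dependency: $y$ is control dependent on $x$ when $y$ post-dominates some successor of $x$ but does not post-dominate $x$ itself.

```python
from typing import Dict, List, Set, Tuple


def post_dominators(cfg: Dict[str, List[str]], exit_node: str) -> Dict[str, Set[str]]:
    """Map each node n to the set of nodes that post-dominate n (including n itself)."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}   # start from "everything" and shrink
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succ_sets = [pdom[s] for s in cfg[n]]
            common = set.intersection(*succ_sets) if succ_sets else set()
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(cfg: Dict[str, List[str]], exit_node: str) -> Set[Tuple[str, str]]:
    """Return pairs (y, x) meaning 'y is control dependent on x'."""
    pdom = post_dominators(cfg, exit_node)
    deps = set()
    for x, successors in cfg.items():
        for s in successors:
            for y in pdom[s]:
                # y post-dominates a successor of x but does not post-dominate x.
                if y != x and y not in pdom[x]:
                    deps.add((y, x))
    return deps


if __name__ == "__main__":
    # if (c) { then; } join; exit
    cfg = {"if": ["then", "join"], "then": ["join"], "join": ["exit"], "exit": []}
    print(control_dependences(cfg, "exit"))
```

In this example only the `then` node depends on the `if` predicate, because `join` is executed on every path regardless of the branch taken.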
Program slicing is again used to extract vulnerability-related code. All definitions below build on the program $P$, function $f_i$, PDG $G_i'$, and SyVC $e_{i,j,z}$ introduced in the definitions above
- Forward slice: $fs_{i,j,z}=\{n_{i,x_1},\ldots,n_{i,x_{\mu_i}}\} \subseteq V_i$, where $1 \leq x_1 \leq x_p \leq x_{\mu_i} \leq c_i$, and every node in the slice lies on a path in $G_i'$ that starts at $e_{i,j,z}$
- Interprocedural forward slice: $fs_{i,j,z}'$ is the version of $fs_{i,j,z}$ that may cross PDG boundaries, that is, (i) its nodes may belong to one or more PDGs, and (ii) every node can be reached from $e_{i,j,z}$, possibly through a chain of function calls
- Backward slice: $bs_{i,j,z}=\{n_{i,y_1},\ldots,n_{i,y_{\nu_i}}\} \subseteq V_i$, where $1 \leq y_1 \leq y_p \leq y_{\nu_i} \leq c_i$, and every node in the slice lies on a path in $G_i'$ that ends at $e_{i,j,z}$
- Interprocedural backward slice: $bs_{i,j,z}'$ is the version of $bs_{i,j,z}$ that may cross PDG boundaries, that is, (i) its nodes may belong to one or more PDGs, and (ii) $e_{i,j,z}$ can be reached from every node, possibly through a chain of function calls
- Program slice: the slices $fs_{i,j,z}'$ and $bs_{i,j,z}'$ of a SyVC $e_{i,j,z}$ are merged, preserving order and removing duplicates, to give the program slice $ps_{i,j,z}$

Replacing the nodes of a SyVC's program slice with the corresponding source code yields the SeVC
- SeVC: a SeVC is the source-code form of the program slice of a SyVC. Formally, for a program $P$ and a SyVC $e_{i,j,z}$, the corresponding SeVC is defined as $\delta_{i,j,z}=\{s_{a_1,b_1},\ldots,s_{a_{\nu_{i,j,z}},b_{\nu_{i,j,z}}}\}$, where each statement $s_{a_p,b_q}$, $1 \leq p,q \leq \nu_{i,j,z}$, has a data dependency or a control dependency with the SyVC $e_{i,j,z}$
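Within a single function, the slicing definitions above reduce to graph reachability over the PDG. The sketch below is a simplified, intraprocedural illustration: nodes are identified by source line numbers, PDG edges are given as hypothetical `(source, target)` pairs, and the function names are invented for this note.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable(graph: Dict[int, List[int]], start: int) -> Set[int]:
    """Breadth-first search: all nodes reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen


def program_slice(pdg_edges: Iterable[Tuple[int, int]], syvc_node: int) -> List[int]:
    """Merge the forward and backward slices of a SyVC, ordered by line number."""
    forward: Dict[int, List[int]] = {}
    backward: Dict[int, List[int]] = {}
    for src, dst in pdg_edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    fs = reachable(forward, syvc_node)    # nodes that depend on the SyVC
    bs = reachable(backward, syvc_node)   # nodes the SyVC depends on
    return sorted(fs | bs)                # union removes duplicates


def to_sevc(source: Dict[int, str], slice_nodes: List[int]) -> List[str]:
    """Replace slice nodes by their source statements."""
    return [source[n] for n in slice_nodes]


if __name__ == "__main__":
    source = {
        1: "char buf[10];",
        2: "int n = read_len();",
        3: "if (n > 0)",
        4: "memset(buf, 0, n);",   # the SyVC (an FC match)
        5: "log_done();",
        6: "use(buf);",
    }
    # Hypothetical dependency edges: data (1->4, 2->3, 2->4, 4->6), control (3->4).
    edges = [(1, 4), (2, 3), (2, 4), (3, 4), (4, 6)]
    for line in to_sevc(source, program_slice(edges, 4)):
        print(line)
```

Line 5 has no dependency relation with the SyVC and is therefore left out of the SeVC; the interprocedural versions would additionally follow call edges between PDGs.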
Algorithm 2: Transforming SyVCs to SeVCs in program P

$$
\small
\text{Input: A program } P = \{f_1, \ldots, f_\eta\}; \text{ a set } Y \text{ of SyVCs generated by Algorithm 1} \\
\text{Output: The set of SeVCs} \\
1: C \gets \{\}; \\
2: \text{for each } f_i \in P \text{ do} \\
3: \quad \text{Generate a PDG } G_i' = (V_i, E_i') \text{ for } f_i; \\
4: \text{end for} \\
5: \text{for each } e_{i,j,z} \in Y \text{ in } G_i' \text{ do} \\
6: \quad \text{Generate forward slice } fs_{i,j,z} \text{ and backward slice } bs_{i,j,z} \text{ of } e_{i,j,z}; \\
7: \quad \text{Generate interprocedural forward slice } fs_{i,j,z}' \text{ by interconnecting } fs_{i,j,z} \\
\quad\quad\ \text{ and the forward slices from the functions called by } f_i; \\
8: \quad \text{Generate interprocedural backward slice } bs_{i,j,z}' \text{ by interconnecting } bs_{i,j,z} \\
\quad\quad\ \text{ and the backward slices from both the functions called by } f_i \text{ and the functions calling } f_i; \\
9: \quad \text{Generate program slice } ps_{i,j,z} \text{ by connecting } fs_{i,j,z}' \text{ and } bs_{i,j,z}' \text{ at } e_{i,j,z}; \\
10: \quad \text{for each statement } s_{i,j} \in f_i \text{ appearing in } ps_{i,j,z} \text{ as a node do} \\
11: \quad\quad \delta_{i,j,z} \gets \delta_{i,j,z} \cup \{s_{i,j}\}, \text{ according to the order of the appearance of } s_{i,j} \text{ in } f_i; \\
12: \quad \text{end for} \\
13: \quad \text{for two statements } s_{i,j} \in f_i \text{ and } s_{a_p,b_q} \in f_{a_p}\ (i \neq a_p) \text{ appearing in } ps_{i,j,z} \text{ as nodes do} \\
14: \quad\quad \text{if } f_i \text{ calls } f_{a_p} \text{ then} \\
15: \quad\quad\quad \delta_{i,j,z} \gets \delta_{i,j,z} \cup \{s_{i,j}, s_{a_p,b_q}\}, \text{ where } s_{i,j} < s_{a_p,b_q}; \\
16: \quad\quad \text{else} \\
17: \quad\quad\quad \delta_{i,j,z} \gets \delta_{i,j,z} \cup \{s_{i,j}, s_{a_p,b_q}\}, \text{ where } s_{i,j} > s_{a_p,b_q}; \\
18: \quad\quad \text{end if} \\
19: \quad \text{end for} \\
20: \quad C \gets C \cup \{\delta_{i,j,z}\}; \\
21: \text{end for} \\
22: \text{return } C \text{, the set of SeVCs}
$$
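Steps 10 to 19 of Algorithm 2 fix the order of statements in a SeVC: statements of the same function keep their order of appearance, and statements of a calling function come before those of the function it calls. One way to sketch that ordering in Python, assuming slice nodes are `(function, line)` pairs and the call relation is acyclic, uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The function name `assemble_sevc` and the sample data are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Tuple

Node = Tuple[str, int]  # (function name, line number inside that function)


def assemble_sevc(slice_nodes: Iterable[Node],
                  calls: Iterable[Tuple[str, str]],
                  source: Dict[Node, str]) -> List[str]:
    """Order slice statements: callers before callees, then by line within a function."""
    nodes = set(slice_nodes)
    sorter = TopologicalSorter()
    for function, _line in nodes:
        sorter.add(function)               # make sure isolated functions are ranked too
    for caller, callee in calls:
        sorter.add(callee, caller)         # caller must be ordered before callee
    # static_order() raises graphlib.CycleError for recursive call chains.
    rank = {fn: k for k, fn in enumerate(sorter.static_order())}
    ordered = sorted(nodes, key=lambda n: (rank[n[0]], n[1]))
    return [source[n] for n in ordered]


if __name__ == "__main__":
    source = {
        ("main", 3): "char buf[8];",
        ("main", 5): "helper(buf);",
        ("helper", 2): "memset(p, 0, 16);",
    }
    slice_nodes = [("helper", 2), ("main", 5), ("main", 3)]
    print(assemble_sevc(slice_nodes, [("main", "helper")], source))
```

Recursive or mutually recursive functions would need an explicit tie-breaking rule, which this sketch does not attempt.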
Encoding SeVCs
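- Function and variable names are rewritten as symbolic names to address the out-of-vocabulary (OOV) problem
- Lexical analysis yields tokens, which are vectorized with word2vec
- Every vector is brought to a fixed length $\theta$:
  - Shorter than $\theta$: pad zeros at the end
  - Longer than $\theta$: truncate with one of the following strategies so that the SyVC stays as close to the middle of the vector as possible
    - If the sub-vector up to the SyVC is shorter than $\frac{\theta}{2}$, delete the excess elements to the right of the SyVC
    - If the sub-vector from the SyVC to the end is shorter than $\frac{\theta}{2}$, delete the excess elements to the left of the SyVC
    - Otherwise, keep $\lfloor \frac{\theta-1}{2} \rfloor$ elements to the left of the SyVC and $\lceil \frac{\theta-1}{2} \rceil$ elements to the right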
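The fixed-length rule above can be made concrete with a short function. The sketch below makes several simplifying assumptions: lengths are counted in elements of a Python list, the SyVC occupies exactly one element at index `syvc_pos`, and "the sub-vector up to the SyVC" is read as the elements strictly before it. The name `fit_to_length` is invented for this note.

```python
from typing import List


def fit_to_length(elements: List[int], syvc_pos: int, theta: int, pad: int = 0) -> List[int]:
    """Pad or truncate to length theta, keeping the SyVC near the middle."""
    n = len(elements)
    if n <= theta:
        return list(elements) + [pad] * (theta - n)   # pad zeros at the end
    left = syvc_pos              # number of elements before the SyVC
    right = n - syvc_pos - 1     # number of elements after the SyVC
    if left < theta / 2:
        return elements[:theta]            # drop the rightmost excess
    if right < theta / 2:
        return elements[n - theta:]        # drop the leftmost excess
    keep_left = (theta - 1) // 2           # floor((theta - 1) / 2)
    keep_right = theta - 1 - keep_left     # ceil((theta - 1) / 2)
    return elements[syvc_pos - keep_left: syvc_pos + keep_right + 1]


if __name__ == "__main__":
    data = list(range(10))
    print(fit_to_length(data, syvc_pos=5, theta=4))
```

In every branch the result has exactly `theta` elements and still contains the SyVC element.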
Algorithm 3:

$$
\small
\text{Input: A set } Y \text{ of SyVCs generated by Algorithm 1;} \\
\qquad\ \ \ \ \text{a set } C \text{ of SeVCs corresponding to } Y \text{ and generated by Algorithm 2;} \\
\qquad\ \ \ \ \text{a threshold } \theta \\
\text{Output: The set of vectors corresponding to SeVCs} \\
1: R = \{\}; \\
2: \text{for each } \delta_{i,j,z} \in C \text{ (corresponding to } e_{i,j,z} \in Y \text{) do} \\
3: \qquad\text{Remove non-ASCII characters in } \delta_{i,j,z}; \\
4: \qquad\text{Map variable names in } \delta_{i,j,z} \text{ to symbolic names;} \\
5: \qquad\text{Map function names in } \delta_{i,j,z} \text{ to symbolic names;} \\
6: \text{end for} \\
7: \text{for each } \delta_{i,j,z} \in C \text{ (corresponding to } e_{i,j,z} \in Y \text{) do} \\
8: \qquad R_{i,j,z} = \{\}; \\
9: \qquad\text{Divide } \delta_{i,j,z} \text{ into a set of symbols } S; \\
10: \qquad\text{for each } \alpha \in S \text{ in order do} \\
11: \qquad\qquad\text{Transform } \alpha \text{ to a fixed-length vector } v(\alpha); \\
12: \qquad\qquad R_{i,j,z} = R_{i,j,z} || v(\alpha), \text{ where } || \text{ means concatenation;} \\
13: \qquad\text{end for} \\
14: \qquad\text{if } R_{i,j,z} \text{ is shorter than } \theta \text{ then} \\
15: \qquad\qquad\text{Zeros are padded to the end of } R_{i,j,z}; \\
16: \qquad\text{else if the sub-vector (of } \delta_{i,j,z} \text{) up to the position of the SyVC } e_{i,j,z} \text{ is shorter than } \frac{\theta}{2} \text{ then} \\
17: \qquad\qquad\text{Delete the rightmost portion of } R_{i,j,z} \text{ to make the resulting vector of length } \theta; \\
18: \qquad\text{else if the sub-vector (of } \delta_{i,j,z} \text{) next to the position of the SyVC } e_{i,j,z} \text{ is shorter than } \frac{\theta}{2} \text{ then} \\
19: \qquad\qquad\text{Delete the leftmost portion of } R_{i,j,z} \text{ to make the resulting vector of length } \theta; \\
20: \qquad\text{else} \\
21: \qquad\qquad\text{Keep the sub-vector (of } \delta_{i,j,z} \text{) immediately left to the position of the SyVC of length } \lfloor \frac{\theta-1}{2} \rfloor \text{,} \\
\qquad\qquad\qquad\text{the sub-vector corresponding to the SyVC, and the sub-vector immediately right to the position} \\
\qquad\qquad\qquad\text{of the SyVC of length } \lceil \frac{\theta-1}{2} \rceil; \quad \text{\{the resulting vector has length } \theta \text{\}} \\
22: \qquad\text{end if} \\
23: \qquad R = R \cup \{R_{i,j,z}\}; \\
24: \text{end for} \\
25: \text{return } R; \quad \text{\{the set of vectors corresponding to SeVCs\}}
$$
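Steps 3 to 5 and step 9 of Algorithm 3 (cleaning, symbolic renaming, and splitting into symbols) can be sketched as follows. This is a rough illustration under stated assumptions: the regular-expression tokenizer is much simpler than a real C lexer, the symbolic names `VAR1`, `FUN1`, ... are made up, and the small `C_KEYWORDS` and `LIBRARY_FUNCS` sets are placeholders for names that are left unchanged. The word2vec step is not shown.

```python
import re
from typing import Dict, List

# Placeholder sets: names in these sets are kept as they are (assumption).
C_KEYWORDS = {"char", "int", "if", "else", "return", "for", "while", "void", "sizeof"}
LIBRARY_FUNCS = {"memset", "strcpy", "memcpy"}

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")     # identifiers, numbers, single symbols
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def symbolize(statements: List[str]) -> List[List[str]]:
    """Tokenize each statement and map user-defined names to symbolic names."""
    var_map: Dict[str, str] = {}
    fun_map: Dict[str, str] = {}
    result = []
    for stmt in statements:
        stmt = stmt.encode("ascii", "ignore").decode("ascii")   # drop non-ASCII characters
        tokens = TOKEN.findall(stmt)
        renamed = []
        for k, tok in enumerate(tokens):
            if (IDENTIFIER.fullmatch(tok) is None
                    or tok in C_KEYWORDS or tok in LIBRARY_FUNCS):
                renamed.append(tok)
            elif k + 1 < len(tokens) and tokens[k + 1] == "(":
                # An identifier followed by '(' is treated as a function name.
                renamed.append(fun_map.setdefault(tok, f"FUN{len(fun_map) + 1}"))
            else:
                renamed.append(var_map.setdefault(tok, f"VAR{len(var_map) + 1}"))
        result.append(renamed)
    return result


if __name__ == "__main__":
    for tokens in symbolize(["char buf[10];", "fill(buf, n);", "memset(buf, 0, n);"]):
        print(tokens)
```

The same name always maps to the same symbol within one SeVC, so dependencies between statements remain visible after renaming.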
Experiment Result
The training set and the test set are split 4:1, with 5-fold cross-validation
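One reading of this setup is that the samples are divided into five folds and each round trains on four folds and tests on the remaining one, which gives the 4:1 ratio. A minimal sketch of such a split in plain Python follows; the function name `k_fold_splits`, the shuffling, and the fixed seed are assumptions of this note, not details taken from the paper.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_splits(10):
        print(len(train), len(test))
```

Every sample is used for testing exactly once across the five rounds.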
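- SySeVR-BLSTM can detect vulnerabilities related to function calls, array usage, pointer usage, and arithmetic expressions. For vulnerabilities related to library/API function calls, it lowers FPR by 3.4% and FNR by 5.0% compared with VulDeePecker
- SySeVR-BGRU performs best among the deep neural networks considered, but its FNR is consistently higher than its FPR
- Using a distributed representation (such as word2vec) to capture context information is important for SySeVR. In particular, representations centered on token frequency are not sufficient
- If a syntactic element (e.g., a token) appears much more frequently in vulnerable (resp. non-vulnerable) SeVCs than in non-vulnerable (resp. vulnerable) SeVCs, it may cause false positives (resp. false negatives); this means the frequency of occurrence of syntactic elements matters
- A model that incorporates more semantic information (i.e., control dependency and data dependency) achieves higher vulnerability detection capability
- The SySeVR-enabled BGRU is much more effective than state-of-the-art vulnerability detection methods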
Limitation
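- Limited to C/C++
- Only four vulnerability syntax characteristics are extracted; more are needed
- The algorithms that generate SeVCs and SyVCs could be improved to capture more semantic information
- A single model is currently used to detect all vulnerability types; could the best-suited network be chosen for each type instead?
- Vulnerabilities are detected at the slice level (i.e., multiple lines of code that are semantically related to each other); this could be refined to pinpoint the lines of code that contain the vulnerability
- Automatic labeling of the dataset's ground-truth labels
- Interpretability?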