Contribution
- Proposes ReGVD, a cross-PL (programming-language-agnostic) vulnerability detection model
Problem Definition
- A sample in the dataset is $\{(c_i,\ y_i)\ |\ c_i \in \mathcal C,\ y_i \in \mathcal Y\},\ i \in \{1,\ 2,\ \dots,\ n\}$, where $\mathcal C$ is the set of function instances and $\mathcal Y = \{0,\ 1\}^n$ indicates whether each function is vulnerable (1 = vulnerable, 0 = not)
- Each function instance $c_i$ passes through a graph embedding layer to obtain a graph $g_i(V,\ X,\ A) \in \mathcal G$, where $V$ is the node set with $m$ elements, $X \in \mathbb R^{m\times d}$ is the node feature matrix with feature dimension $d$, and $A \in \{0,\ 1\}^{m\times m}$ is the adjacency matrix
  e.g., an entry $e_{s,t}^{p} = 1$ in the adjacency matrix $A$ indicates a directed edge of type $p$ from node $v_s$ to node $v_t$
- The model's goal is to learn a mapping function $f:\mathcal G \to \mathcal Y$, learned by minimizing the loss $\min \sum_{i=1}^{n}\mathcal L(f(g_i(V,\ X,\ A),\ c_i\ |\ y_i))\ +\ \lambda\|\theta\|_2^2$, where $\theta$ denotes the model parameters and $\lambda\|\theta\|_2^2$ is an L2 regularization term
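A minimal PyTorch sketch of this objective, using cross-entropy as $\mathcal L$ plus an explicit L2 penalty on $\theta$ (the linear placeholder model, dummy tensors, and the value of `lam` are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn as nn

# Placeholder classifier standing in for f(g_i(V, X, A)).
model = nn.Linear(768, 2)
criterion = nn.CrossEntropyLoss()

e_g = torch.randn(4, 768)        # dummy graph-level vectors for 4 functions
y = torch.tensor([0, 1, 1, 0])   # vulnerability labels y_i

lam = 1e-5                       # regularization weight lambda (assumed value)
l2 = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(model(e_g), y) + lam * l2   # L(f(...), y_i) + lambda*||theta||^2
loss.backward()
```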
Graph Construction
Overview
Following the two prior methods cited below, the code is treated as a flat token sequence and converted into two kinds of graphs (unique-token-focused and index-focused); unlike the original methods, self-loop edges are removed
Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2019. Text Level Graph Neural Network for Text Classification. In EMNLP-IJCNLP.
Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks. In ACL. 334–339.
Unique-Token-Focused Construction
Each unique token in the function is taken as a node, and a sliding window of a given size determines the adjacency matrix (both constructions are sketched in code after the index-focused variant below):
$$\mathbf A_{v,u}= \begin{cases} 1 & \text{tokens } v, u \text{ appear together in a sliding window},\ v\ne u; \\ 0 & \text{otherwise.} \end{cases}$$
Index-Focused Construction
Every token is taken as a node, with no deduplication, and a sliding window of a given size determines the adjacency matrix:
$$\mathbf A_{i,j}= \begin{cases} 1 & \text{tokens } i, j \text{ appear together in a sliding window},\ i\ne j; \\ 0 & \text{otherwise.} \end{cases}$$
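A minimal NumPy sketch covering both constructions (the helper `build_graph` and its defaults are illustrative, not the paper's code):

```python
import numpy as np

def build_graph(tokens, window_size=3, unique=True):
    """Binary adjacency matrix from a flat token sequence.

    unique=True  -> unique-token-focused: one node per distinct token.
    unique=False -> index-focused: one node per token occurrence.
    Self-loops are excluded, matching ReGVD's removal of self-loop edges.
    """
    if unique:
        vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
        ids = [vocab[tok] for tok in tokens]
    else:
        ids = list(range(len(tokens)))
    m = len(set(ids))
    A = np.zeros((m, m), dtype=np.int8)
    for start in range(max(1, len(ids) - window_size + 1)):
        window = ids[start:start + window_size]
        for a in window:
            for b in window:
                if a != b:  # no self-loops
                    A[a, b] = 1
    return A
```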
Node Feature Initialization
Each node is encoded with the token embedding layer of a pre-trained PL model (e.g., CodeBERT)
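A sketch of this initialization using the Hugging Face `transformers` API; the helper `init_node_features` is hypothetical, only the embedding-lookup idea is from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base")
embedding_layer = codebert.get_input_embeddings()  # token-embedding lookup table

def init_node_features(node_tokens):
    """Look up each node's token embedding (no contextual encoding)."""
    ids = tokenizer.convert_tokens_to_ids(node_tokens)
    with torch.no_grad():
        return embedding_layer(torch.tensor(ids))  # shape: (m, 768)
```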
Feature Extraction
Overview
Code embedding -> residual GCN / residual GGNN -> readout layer (fusing sum pooling and max pooling) -> FC + softmax classification
GNN
Given a graph $g(V, X, A)$, a GNN layer is defined as: $\mathbf H^{k+1}=\mathrm{GNN}(\mathbf A,\ \mathbf H^k),\ \mathbf H^0=X$
GCN
$$\overline h_v^{k+1}=\phi\Big(\sum_{u \in N_v} a_{v,u}\mathbf W^k\overline h_u^k\Big),\ \forall\ v \in \mathcal V$$
where $\phi$ is a non-linear activation function (e.g., ReLU), $\mathbf W^k$ is a weight matrix, and $a_{v,u}$ is an entry of the Laplacian re-normalized adjacency matrix $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$, in which $D$ is the diagonal node degree matrix of the adjacency matrix $A$
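A minimal PyTorch sketch of one such layer under this formulation (the class name `GCNLayer` is illustrative):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = phi(A_hat @ H @ W), A_hat = D^{-1/2} A D^{-1/2}."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, A, H):
        deg = A.sum(dim=-1).clamp(min=1)            # node degrees
        d_inv_sqrt = deg.pow(-0.5)
        A_hat = d_inv_sqrt.unsqueeze(-1) * A * d_inv_sqrt.unsqueeze(-2)
        return self.act(A_hat @ self.weight(H))
```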
GGNN
$$a_{v}^{k+1}=\sum_{u\in N_v}a_{v,u}\overline h_u^k$$
$$z_v^{k+1}=\sigma(W^z a_v^{k+1} + U^z h_v^k)$$
$$r_v^{k+1}=\sigma(W^r a_v^{k+1} + U^r h_v^k)$$
$$\widetilde h_v^{k+1}=\phi(W^o a_v^{k+1} + U^o(r_v^{k+1}\odot h_v^k))$$
$$h_v^{k+1}=(1-z_v^{k+1})\odot h_v^{k} + z_v^{k+1}\odot \widetilde h_v^{k+1}$$
where $\sigma$ is the sigmoid function, $a_{v,u}$ is again an adjacency-matrix entry, and $z$ and $r$ are the GRU-style update and reset gates
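A matching PyTorch sketch of this GRU-style update (class name `GGNNLayer` is illustrative; `tanh` for $\phi$ is an assumption):

```python
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    """GRU-style GGNN update over an adjacency matrix A."""
    def __init__(self, dim):
        super().__init__()
        self.W_z, self.U_z = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.W_r, self.U_r = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.W_o, self.U_o = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, A, H):
        a = A @ H                                     # aggregate neighbors
        z = torch.sigmoid(self.W_z(a) + self.U_z(H))  # update gate
        r = torch.sigmoid(self.W_r(a) + self.U_r(H))  # reset gate
        h_tilde = torch.tanh(self.W_o(a) + self.U_o(r * H))
        return (1 - z) * H + z * h_tilde
```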
Residual Connection
$$H^{k+1}=H^{k}+\mathrm{GNN}(A,\ H^k)$$
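A small sketch of stacking layers with this residual connection, reusing the illustrative `GCNLayer` defined above:

```python
import torch.nn as nn

class ResidualGNN(nn.Module):
    """Stack GNN layers, adding each layer's input to its output."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, A, H):
        for gnn in self.layers:
            H = H + gnn(A, H)   # H^{k+1} = H^k + GNN(A, H^k)
        return H

# e.g. ResidualGNN([GCNLayer(768, 768), GCNLayer(768, 768)])
```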
Readout
Following findings in prior work, sum pooling performs better for graph classification, so ReGVD uses sum pooling; since max pooling can also extract salient information from the graph, ReGVD mixes the two pooling mechanisms
$$e_v=\sigma(w^\intercal h^K_v + b)\odot\phi(\mathbf W h_v^K+\overline b)$$
$$e_g=\mathrm{MIX}\Big(\sum_{v\in\mathcal V}e_v,\ \mathrm{MaxPool}\{e_v\}_{v\in \mathcal V}\Big)$$
where $e_v$ is the final vector of node $v$, $e_g$ is the final vector representation of the function, $\sigma(w^\intercal h^K_v + b)$ acts as a soft attention mechanism, and the MIX function is any one of $\{\mathrm{SUM},\ \mathrm{MUL},\ \mathrm{CONCAT}\}$:
$$\mathrm{SUM}:\ e_g=\sum_{v\in\mathcal V}e_v+\mathrm{MaxPool}\{e_v\}_{v\in \mathcal V}$$
$$\mathrm{MUL}:\ e_g=\sum_{v\in\mathcal V}e_v\odot\mathrm{MaxPool}\{e_v\}_{v\in \mathcal V}$$
$$\mathrm{CONCAT}:\ e_g=\Big[\sum_{v\in\mathcal V}e_v\ \Big\|\ \mathrm{MaxPool}\{e_v\}_{v\in \mathcal V}\Big]$$
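A sketch of this readout with all three MIX options (class name `Readout` and the `tanh` choice for $\phi$ are assumptions):

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """Soft-attention readout mixing sum pooling and max pooling."""
    def __init__(self, dim, mix="concat"):
        super().__init__()
        self.att = nn.Linear(dim, 1)     # w^T h + b  (soft attention)
        self.proj = nn.Linear(dim, dim)  # W h + b_bar
        self.mix = mix

    def forward(self, H):                # H: (m, dim) final node states h^K
        e = torch.sigmoid(self.att(H)) * torch.tanh(self.proj(H))
        s, mx = e.sum(dim=0), e.max(dim=0).values
        if self.mix == "sum":
            return s + mx
        if self.mix == "mul":
            return s * mx
        return torch.cat([s, mx], dim=-1)   # "concat"
```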
Classification
$$\widehat{y_g}=\mathrm{softmax}(W_1 e_g + b_1)$$
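Putting the pieces together, a sketch of the full pipeline from the Overview (every helper name comes from the illustrative sketches above, not from the official ReGVD implementation):

```python
import torch
import torch.nn as nn

tokens = ["int", "main", "(", ")", "{", "return", "0", ";", "}"]
A = torch.tensor(build_graph(tokens, window_size=3, unique=True),
                 dtype=torch.float)
H0 = init_node_features(list(dict.fromkeys(tokens)))   # unique tokens, (m, 768)

gnn = ResidualGNN([GCNLayer(768, 768), GCNLayer(768, 768)])
readout = Readout(768, mix="concat")
classifier = nn.Linear(768 * 2, 2)      # 2*dim because MIX = CONCAT

e_g = readout(gnn(A, H0))
y_hat = torch.softmax(classifier(e_g), dim=-1)   # predicted class probabilities
```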
Result
- SOTA: ReGVD achieves state-of-the-art performance on vulnerability detection