The title may still be a bit confusing, so to be concrete: I want to reproduce what the following paper does:
Kim, Seulbae, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. "Vuddy: A scalable approach for vulnerable code clone discovery." In 2017 IEEE Symposium on Security and Privacy (SP), pp. 595-614. IEEE, 2017.
How does this paper do normalization and tokenization? The following figure from the paper makes it clear:
Looking closely, it is not as simple as it might seem. Here is the code snippet from that figure:
void avg (float arr[], int len){
    static float sum = 0;
    unsigned int i;
    for (i = 0; i < len; i++)
        sum += arr[i];
    printf("%f %d",sum/len,validate(sum));
}
If we run this snippet through srcML, we get:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" xmlns:cpp="http://www.srcML.org/srcML/cpp" revision="0.9.5" language="C" filename="VuddyTest.c"><function><type><name>void</name></type> <name>avg</name> <parameter_list>(<parameter><decl><type><name>float</name></type> <name><name>arr</name><index>[]</index></name></decl></parameter>, <parameter><decl><type><name>int</name></type> <name>len</name></decl></parameter>)</parameter_list><block>{
<decl_stmt><decl><specifier>static</specifier> <type><name>float</name></type> <name>sum</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</decl_stmt>
<decl_stmt><decl><type><name>unsigned</name> <name>int</name></type> <name>i</name></decl>;</decl_stmt>
<for>for <control>(<init><expr><name>i</name> <operator>=</operator> <literal type="number">0</literal></expr>;</init> <condition><expr><name>i</name> <operator><</operator> <name>len</name></expr>;</condition> <incr><expr><name>i</name><operator>++</operator></expr></incr>)</control>
<block type="pseudo"><expr_stmt><expr><name>sum</name> <operator>+=</operator> <name><name>arr</name><index>[<expr><name>i</name></expr>]</index></name></expr>;</expr_stmt></block></for>
<expr_stmt><expr><call><name>printf</name><argument_list>(<argument><expr><literal type="string">"%f %d"</literal></expr></argument>,<argument><expr><name>sum</name><operator>/</operator><name>len</name></expr></argument>,<argument><expr><call><name>validate</name><argument_list>(<argument><expr><name>sum</name></expr></argument>)</argument_list></call></expr></argument>)</argument_list></call></expr>;</expr_stmt>
}</block></function>
</unit>
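As a quick sanity check that the parameter names really are recoverable from this XML, here is a minimal sketch. It uses Python's standard-library ElementTree rather than any srcML tooling, and the embedded XML string is a trimmed-down copy of the output above (signature only), so it is self-contained:

```python
import xml.etree.ElementTree as ET

SRC = 'http://www.srcML.org/srcML/src'
# Trimmed version of the srcML output above: just the function signature.
SRCML = ('<unit xmlns="http://www.srcML.org/srcML/src"><function>'
         '<type><name>void</name></type> <name>avg</name> <parameter_list>('
         '<parameter><decl><type><name>float</name></type> '
         '<name><name>arr</name><index>[]</index></name></decl></parameter>, '
         '<parameter><decl><type><name>int</name></type> <name>len</name>'
         '</decl></parameter>)</parameter_list></function></unit>')

root = ET.fromstring(SRCML)
params = []
for param in root.iter(f'{{{SRC}}}parameter'):
    name = param.find(f'{{{SRC}}}decl/{{{SRC}}}name')
    # Array parameters nest the identifier one level deeper:
    # <name><name>arr</name><index>[]</index></name>
    inner = name.find(f'{{{SRC}}}name')
    params.append(inner.text if inner is not None else name.text)

print(params)  # → ['arr', 'len']
```

The same nested-`<name>` quirk shows up again in the lxml-based script later in this post.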
Strictly speaking, this XML already contains all the information we need; see my earlier blog post on the topic. But I have always found querying the XML cumbersome, and in many cases it is hard to preserve the natural token order (perhaps I just don't know XPath well enough). That is where Understand's token-output feature comes in handy. We can process the source with the following Perl script (mine works at the File level; handling a Function is analogous):
use Understand;
use Cwd;

$current_dir = getcwd;
$current_dir =~ s/\//\\/g;
$current_dir = $current_dir . "\\" . $ARGV[0] . "\\";
$dir_length = length($current_dir);
# Build and analyze an Understand database for the project directory.
system("und -quiet create -db temp.udb -languages c++ add " . $ARGV[0] . " analyze");
$project_name = $ARGV[0];
($db, $status) = Understand::open("temp.udb");
die "Error status: ", $status, "\n" if $status;
@file_list = $db->ents("c file");
foreach $file (@file_list) {
    $loc = $file->metric("CountLine");
    if ($loc == 0) {
        next;
    }
    $file_name = $file->longname();
    $file_name_length = length($file_name);
    $file_full_name = substr($file_name, $dir_length, $file_name_length - $dir_length);
    $file_full_name =~ s/\\/+/g;
    $output_file2 = ">" . $ARGV[0] . "/" . $file_full_name . "-tokens";
    open(OUTFILE2, $output_file2);
    $lexer = $file->lexer();
    @lexemes = $lexer->lexemes();
    foreach $lexeme (@lexemes) {
        $token = $lexeme->token();
        if ($token eq "Comment") {
            next;
        }
        # Dump every non-whitespace token as text|&|kind|&|line.
        if ((not $token eq "Newline") && (not $token eq "Whitespace")) {
            print OUTFILE2 $lexeme->text() . "|&|" . $token . "|&|" . $lexeme->line_begin() . "\n";
        }
    }
    close(OUTFILE2);
}
Running it yields output like this:
void|&|Keyword|&|1
avg|&|Identifier|&|1
(|&|Punctuation|&|1
float|&|Keyword|&|1
arr|&|Identifier|&|1
[|&|Operator|&|1
]|&|Operator|&|1
,|&|Operator|&|1
int|&|Keyword|&|1
len|&|Identifier|&|1
)|&|Punctuation|&|1
{|&|Punctuation|&|1
static|&|Keyword|&|2
float|&|Keyword|&|2
sum|&|Identifier|&|2
=|&|Operator|&|2
0|&|Literal|&|2
;|&|Punctuation|&|2
unsigned|&|Keyword|&|3
int|&|Keyword|&|3
i|&|Identifier|&|3
;|&|Punctuation|&|3
for|&|Keyword|&|4
(|&|Punctuation|&|4
i|&|Identifier|&|4
=|&|Operator|&|4
0|&|Literal|&|4
;|&|Punctuation|&|4
i|&|Identifier|&|4
<|&|Operator|&|4
len|&|Identifier|&|4
;|&|Punctuation|&|4
i|&|Identifier|&|4
++|&|Operator|&|4
)|&|Punctuation|&|4
sum|&|Identifier|&|5
+=|&|Operator|&|5
arr|&|Identifier|&|5
[|&|Operator|&|5
i|&|Identifier|&|5
]|&|Operator|&|5
;|&|Punctuation|&|5
printf|&|Identifier|&|6
(|&|Punctuation|&|6
"%f %d"|&|String|&|6
,|&|Operator|&|6
sum|&|Identifier|&|6
/|&|Operator|&|6
len|&|Identifier|&|6
,|&|Operator|&|6
validate|&|Identifier|&|6
(|&|Punctuation|&|6
sum|&|Identifier|&|6
)|&|Punctuation|&|6
)|&|Punctuation|&|6
;|&|Punctuation|&|6
}|&|Punctuation|&|7
|&|EOF|&|7
Honestly, this output is already fairly clear, but it still falls short of what we need. For example, sum and len are both tagged Identifier, yet one is actually a local variable (LVAR) and the other a formal parameter (FPARAM). The simplest idea, then, is to combine the two tools: take Understand's output as the base token stream, and look each token up in srcML's output:
import subprocess
from lxml import etree

# A fuller list of basic types; overridden right below, because VUDDY's
# example does not normalize the more complex forms (e.g. "unsigned int").
basic_type_list = ['char', 'unsigned char', 'signed char', 'int', 'unsigned int',
                   'signed int', 'short int', 'unsigned short int',
                   'signed short int', 'long int', 'signed long int',
                   'unsigned long int', 'float', 'double', 'long double', 'wchar_t']
basic_type_list = ['char', 'int', 'short int', 'long int',
                   'float', 'double', 'long double', 'wchar_t']

NS = {'x': 'http://www.srcML.org/srcML/src'}
file_name = "VuddyTest.c"
output = subprocess.run(['srcml', file_name], capture_output=True, check=False)
root = etree.fromstring(output.stdout)

# Read the token stream produced by the Understand/Perl script above.
tokens_file = open(file_name + '-tokens', 'r')
lines = tokens_file.readlines()
token_seq_list = []
kind_list = []
line_list = []
for each_line in lines:
    records = each_line.strip().split('|&|')
    if len(records) != 3:
        continue
    if records[0] == '':  # e.g. the empty EOF record
        continue
    token_seq_list.append(records[0])
    kind_list.append(records[1])
    line_list.append(records[2])

funcs = root.xpath('//x:function | //x:constructor | //x:destructor', namespaces=NS)
if not funcs:
    funcs = root.xpath('//x:macro', namespaces=NS)
func = funcs[0]

# First pass: normalize parameters only, working purely on srcML's text nodes.
func_body = func.xpath('./x:block', namespaces=NS)[0]
content = func_body.xpath('.//text()')
content = [str(v).strip() for v in content]
content = list(filter(None, content))
parameter_list = []
for parameter in func.xpath('./x:parameter_list/x:parameter', namespaces=NS):
    for name in parameter.xpath('./x:decl/x:name/text()', namespaces=NS):
        parameter_list.append(name)
    # Array parameters (e.g. arr[]) nest the identifier one level deeper.
    for name in parameter.xpath('./x:decl/x:name/x:name/text()', namespaces=NS):
        parameter_list.append(name)
print(parameter_list)
normalized_list = []
for each_token in content:
    if each_token in parameter_list:
        normalized_list.append('FPARAM')
    else:
        normalized_list.append(each_token)
if normalized_list[0] == '{' and normalized_list[-1] == '}':
    normalized_list.pop()
    del normalized_list[0]
print(normalized_list)

# Second pass: walk Understand's token stream, starting after the opening brace.
start = token_seq_list.index('{') + 1
token_seq_list = token_seq_list[start:]
kind_list = kind_list[start:]

# token_count_dict[i] records which occurrence (1-based) of its token text
# position i is; this lets us match the i-th stream token against the
# corresponding text node in srcML's tree.
token_count = {}
token_count_dict = {}
for index, each_token in enumerate(token_seq_list):
    token_count[each_token] = token_count.get(each_token, 0) + 1
    token_count_dict[index] = token_count[each_token]

final_out_put_list = []
skip = 0  # upcoming tokens already consumed by a multi-word type name
for i, token_value in enumerate(token_seq_list):
    if skip > 0:
        skip -= 1
        continue
    kind = kind_list[i]
    this_token_count = token_count_dict[i]
    if token_value in parameter_list:
        final_out_put_list.append('FPARAM')
        continue
    if token_value == 'NULL':
        final_out_put_list.append('NULL')
        continue
    if kind == 'Keyword':
        if token_value == 'static':
            # Try the one-, two- and three-word type that follows "static".
            next_token = token_seq_list[i + 1]
            if next_token in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 1
                continue
            this_string = next_token + ' ' + token_seq_list[i + 2]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 2
                continue
            this_string = this_string + ' ' + token_seq_list[i + 3]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 3
                continue
        else:
            # Try one-, two- and three-word type names starting here.
            if token_value in basic_type_list:
                final_out_put_list.append('DTYPE')
                continue
            this_string = token_value + ' ' + token_seq_list[i + 1]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 1
                continue
            this_string = this_string + ' ' + token_seq_list[i + 2]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 2
                continue
        # Not a basic type: keep the keyword itself (e.g. "for", "unsigned").
        final_out_put_list.append(token_value)
        continue
    if kind == 'Identifier':
        matched = False
        # Find the occurrence of this identifier in srcML's tree that
        # corresponds to position i in Understand's stream.
        for j, identifier in enumerate(func.xpath('.//*[text()="%s"]' % token_value)):
            if j + 1 == this_token_count:
                type_node = identifier.xpath('../../x:type/x:name', namespaces=NS)
                if type_node and type_node[0].xpath('./text()')[0] == token_value:
                    matched = True
                    final_out_put_list.append('DTYPE')
                    continue
                call_node = identifier.xpath('../../x:call/x:name', namespaces=NS)
                if call_node:
                    temp = call_node[0].xpath('./text()')
                    if temp and temp[0] == token_value:
                        matched = True
                        final_out_put_list.append('FUNCCALL')
                        continue
        if not matched:
            final_out_put_list.append('LVAR')
    else:
        final_out_put_list.append(token_value)
print(final_out_put_list)
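One detail worth calling out is the occurrence bookkeeping (token_count_dict): the XPath query `.//*[text()="…"]` returns every element whose text equals the token, so the script matches the k-th occurrence of a token in Understand's stream to the k-th text match in srcML's tree. The counter in isolation looks like this (a simplified sketch, not the script itself):

```python
def occurrence_index(tokens):
    """Map each position to the 1-based count of how often its token has appeared so far."""
    seen = {}
    out = []
    for t in tokens:
        seen[t] = seen.get(t, 0) + 1
        out.append(seen[t])
    return out

print(occurrence_index(['i', '=', '0', ';', 'i', '<', 'len']))
# → [1, 1, 1, 1, 2, 1, 1]
```

The second `i` gets index 2, so it is matched against the second `<name>i</name>` in the tree. This assumes both tools see the tokens in the same order, which holds here since both walk the source top to bottom.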
Admittedly, this code is more complicated than strictly necessary. You can see I defined basic_type_list twice: I originally planned for the more general case, but as the screenshot above shows, Vuddy does not handle it (for example, unsigned is left un-normalized), so the fuller list ends up unused. Still, the output is correct (tokens_file is the token-stream file we generated with Understand). Running the code, the two printed lists are:
['static', 'float', 'sum', '=', '0', ';', 'unsigned', 'int', 'i', ';', 'for', '(', 'i', '=', '0', ';', 'i', '<', 'FPARAM', ';', 'i', '++', ')', 'sum', '+=', 'FPARAM', '[', 'i', ']', ';', 'printf', '(', '"%f %d"', ',', 'sum', '/', 'FPARAM', ',', 'validate', '(', 'sum', ')', ')', ';']
['DTYPE', 'LVAR', '=', '0', ';', 'unsigned', 'DTYPE', 'LVAR', ';', 'for', '(', 'LVAR', '=', '0', ';', 'LVAR', '<', 'FPARAM', ';', 'LVAR', '++', ')', 'LVAR', '+=', 'FPARAM', '[', 'LVAR', ']', ';', 'FUNCCALL', '(', '"%f %d"', ',', 'LVAR', '/', 'FPARAM', ',', 'FUNCCALL', '(', 'LVAR', ')', ')', ';', '}']
As you can see, the second list matches the paper's result exactly. The code itself is simple enough that I won't walk through it line by line. That's all for this short write-up.
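One closing note: in VUDDY the normalized token sequence is not the end product. As I read the paper, the normalized function body is stripped of whitespace, lowercased, and fingerprinted as a (length, MD5) 2-tuple, and fingerprints are grouped by length to make lookup scalable. A rough sketch of that last step (the exact join/lowercase details here are my assumption, not taken from the paper's code):

```python
import hashlib

def fingerprint(normalized_tokens):
    # Concatenate the normalized body, drop case, and hash it.
    body = ''.join(normalized_tokens).lower()
    return (len(body), hashlib.md5(body.encode('utf-8')).hexdigest())

length, digest = fingerprint(['DTYPE', 'LVAR', '=', '0', ';'])
print(length, digest)  # length is 12; digest is a 32-character hex string
```

Two functions are then reported as clones when their fingerprints collide, so detection is a dictionary lookup rather than a pairwise comparison.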