The title may still be a bit confusing, so to be concrete: I want to reproduce what the following paper does:
Kim, Seulbae, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. "Vuddy: A scalable approach for vulnerable code clone discovery." In 2017 IEEE Symposium on Security and Privacy (SP), pp. 595-614. IEEE, 2017.
How does this paper do normalization and tokenization? The following figure from the paper makes it clear:
Looking closely, it is not as simple as it might seem. Here is the code snippet from that figure:
void avg (float arr[], int len){
    static float sum = 0;
    unsigned int i;
    for (i = 0; i < len; i++)
        sum += arr[i];
    printf("%f %d",sum/len,validate(sum));
}
If we run this snippet through srcML, we get:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" xmlns:cpp="http://www.srcML.org/srcML/cpp" revision="0.9.5" language="C" filename="VuddyTest.c"><function><type><name>void</name></type> <name>avg</name> <parameter_list>(<parameter><decl><type><name>float</name></type> <name><name>arr</name><index>[]</index></name></decl></parameter>, <parameter><decl><type><name>int</name></type> <name>len</name></decl></parameter>)</parameter_list><block>{
<decl_stmt><decl><specifier>static</specifier> <type><name>float</name></type> <name>sum</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</decl_stmt>
<decl_stmt><decl><type><name>unsigned</name> <name>int</name></type> <name>i</name></decl>;</decl_stmt>
<for>for <control>(<init><expr><name>i</name> <operator>=</operator> <literal type="number">0</literal></expr>;</init> <condition><expr><name>i</name> <operator><</operator> <name>len</name></expr>;</condition> <incr><expr><name>i</name><operator>++</operator></expr></incr>)</control>
<block type="pseudo"><expr_stmt><expr><name>sum</name> <operator>+=</operator> <name><name>arr</name><index>[<expr><name>i</name></expr>]</index></name></expr>;</expr_stmt></block></for>
<expr_stmt><expr><call><name>printf</name><argument_list>(<argument><expr><literal type="string">"%f %d"</literal></expr></argument>,<argument><expr><name>sum</name><operator>/</operator><name>len</name></expr></argument>,<argument><expr><call><name>validate</name><argument_list>(<argument><expr><name>sum</name></expr></argument>)</argument_list></call></expr></argument>)</argument_list></call></expr>;</expr_stmt>
}</block></function>
</unit>
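As a quick sanity check that the parameter names really are recoverable from this XML, here is a minimal sketch. It uses Python's standard-library ElementTree rather than any srcML tooling, and the embedded XML string is a trimmed-down copy of the output above (signature only), so it is self-contained:

```python
import xml.etree.ElementTree as ET

SRC = 'http://www.srcML.org/srcML/src'
# Trimmed version of the srcML output above: just the function signature.
SRCML = ('<unit xmlns="http://www.srcML.org/srcML/src"><function>'
         '<type><name>void</name></type> <name>avg</name> <parameter_list>('
         '<parameter><decl><type><name>float</name></type> '
         '<name><name>arr</name><index>[]</index></name></decl></parameter>, '
         '<parameter><decl><type><name>int</name></type> <name>len</name>'
         '</decl></parameter>)</parameter_list></function></unit>')

root = ET.fromstring(SRCML)
params = []
for param in root.iter(f'{{{SRC}}}parameter'):
    name = param.find(f'{{{SRC}}}decl/{{{SRC}}}name')
    # Array parameters nest the identifier one level deeper:
    # <name><name>arr</name><index>[]</index></name>
    inner = name.find(f'{{{SRC}}}name')
    params.append(inner.text if inner is not None else name.text)

print(params)  # → ['arr', 'len']
```

The same nested-`<name>` quirk shows up again in the lxml-based script later in this post.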
Strictly speaking, this XML already contains all the information we need; see my earlier blog post on the topic. But I have always found querying the XML cumbersome, and in many cases it is hard to preserve the natural token order (perhaps I just don't know XPath well enough). That is where Understand's token-output feature comes in handy. We can process the source with the following Perl script (mine works at the File level; handling a Function is analogous):
use Understand;
use Cwd;

$current_dir = getcwd;
$current_dir =~ s/\//\\/g;
$current_dir = $current_dir . "\\" . $ARGV[0] . "\\";
$dir_length = length($current_dir);
# Build and analyze an Understand database for the project directory.
system("und -quiet create -db temp.udb -languages c++ add " . $ARGV[0] . " analyze");
$project_name = $ARGV[0];
($db, $status) = Understand::open("temp.udb");
die "Error status: ", $status, "\n" if $status;
@file_list = $db->ents("c file");
foreach $file (@file_list) {
    $loc = $file->metric("CountLine");
    if ($loc == 0) {
        next;
    }
    $file_name = $file->longname();
    $file_name_length = length($file_name);
    $file_full_name = substr($file_name, $dir_length, $file_name_length - $dir_length);
    $file_full_name =~ s/\\/+/g;
    $output_file2 = ">" . $ARGV[0] . "/" . $file_full_name . "-tokens";
    open(OUTFILE2, $output_file2);
    $lexer = $file->lexer();
    @lexemes = $lexer->lexemes();
    foreach $lexeme (@lexemes) {
        $token = $lexeme->token();
        if ($token eq "Comment") {
            next;
        }
        # Dump every non-whitespace token as text|&|kind|&|line.
        if ((not $token eq "Newline") && (not $token eq "Whitespace")) {
            print OUTFILE2 $lexeme->text() . "|&|" . $token . "|&|" . $lexeme->line_begin() . "\n";
        }
    }
    close(OUTFILE2);
}
Running it yields output like this:
void|&|Keyword|&|1
avg|&|Identifier|&|1
(|&|Punctuation|&|1
float|&|Keyword|&|1
arr|&|Identifier|&|1
[|&|Operator|&|1
]|&|Operator|&|1
,|&|Operator|&|1
int|&|Keyword|&|1
len|&|Identifier|&|1
)|&|Punctuation|&|1
{|&|Punctuation|&|1
static|&|Keyword|&|2
float|&|Keyword|&|2
sum|&|Identifier|&|2
=|&|Operator|&|2
0|&|Literal|&|2
;|&|Punctuation|&|2
unsigned|&|Keyword|&|3
int|&|Keyword|&|3
i|&|Identifier|&|3
;|&|Punctuation|&|3
for|&|Keyword|&|4
(|&|Punctuation|&|4
i|&|Identifier|&|4
=|&|Operator|&|4
0|&|Literal|&|4
;|&|Punctuation|&|4
i|&|Identifier|&|4
<|&|Operator|&|4
len|&|Identifier|&|4
;|&|Punctuation|&|4
i|&|Identifier|&|4
++|&|Operator|&|4
)|&|Punctuation|&|4
sum|&|Identifier|&|5
+=|&|Operator|&|5
arr|&|Identifier|&|5
[|&|Operator|&|5
i|&|Identifier|&|5
]|&|Operator|&|5
;|&|Punctuation|&|5
printf|&|Identifier|&|6
(|&|Punctuation|&|6
"%f %d"|&|String|&|6
,|&|Operator|&|6
sum|&|Identifier|&|6
/|&|Operator|&|6
len|&|Identifier|&|6
,|&|Operator|&|6
validate|&|Identifier|&|6
(|&|Punctuation|&|6
sum|&|Identifier|&|6
)|&|Punctuation|&|6
)|&|Punctuation|&|6
;|&|Punctuation|&|6
}|&|Punctuation|&|7
|&|EOF|&|7
Honestly, this output is already fairly clear, but it still falls short of what we need. For example, sum and len are both tagged Identifier, yet one is actually a local variable (LVAR) and the other a formal parameter (FPARAM). The simplest idea, then, is to combine the two tools: take Understand's output as the base token stream, and look each token up in srcML's output:
import subprocess
from lxml import etree

# A fuller list of basic types; overridden right below, because VUDDY's
# example does not normalize the more complex forms (e.g. "unsigned int").
basic_type_list = ['char', 'unsigned char', 'signed char', 'int', 'unsigned int',
                   'signed int', 'short int', 'unsigned short int',
                   'signed short int', 'long int', 'signed long int',
                   'unsigned long int', 'float', 'double', 'long double', 'wchar_t']
basic_type_list = ['char', 'int', 'short int', 'long int',
                   'float', 'double', 'long double', 'wchar_t']

NS = {'x': 'http://www.srcML.org/srcML/src'}
file_name = "VuddyTest.c"
output = subprocess.run(['srcml', file_name], capture_output=True, check=False)
root = etree.fromstring(output.stdout)

# Read the token stream produced by the Understand/Perl script above.
tokens_file = open(file_name + '-tokens', 'r')
lines = tokens_file.readlines()
token_seq_list = []
kind_list = []
line_list = []
for each_line in lines:
    records = each_line.strip().split('|&|')
    if len(records) != 3:
        continue
    if records[0] == '':  # e.g. the empty EOF record
        continue
    token_seq_list.append(records[0])
    kind_list.append(records[1])
    line_list.append(records[2])

funcs = root.xpath('//x:function | //x:constructor | //x:destructor', namespaces=NS)
if not funcs:
    funcs = root.xpath('//x:macro', namespaces=NS)
func = funcs[0]

# First pass: normalize parameters only, working purely on srcML's text nodes.
func_body = func.xpath('./x:block', namespaces=NS)[0]
content = func_body.xpath('.//text()')
content = [str(v).strip() for v in content]
content = list(filter(None, content))
parameter_list = []
for parameter in func.xpath('./x:parameter_list/x:parameter', namespaces=NS):
    for name in parameter.xpath('./x:decl/x:name/text()', namespaces=NS):
        parameter_list.append(name)
    # Array parameters (e.g. arr[]) nest the identifier one level deeper.
    for name in parameter.xpath('./x:decl/x:name/x:name/text()', namespaces=NS):
        parameter_list.append(name)
print(parameter_list)
normalized_list = []
for each_token in content:
    if each_token in parameter_list:
        normalized_list.append('FPARAM')
    else:
        normalized_list.append(each_token)
if normalized_list[0] == '{' and normalized_list[-1] == '}':
    normalized_list.pop()
    del normalized_list[0]
print(normalized_list)

# Second pass: walk Understand's token stream, starting after the opening brace.
start = token_seq_list.index('{') + 1
token_seq_list = token_seq_list[start:]
kind_list = kind_list[start:]

# token_count_dict[i] records which occurrence (1-based) of its token text
# position i is; this lets us match the i-th stream token against the
# corresponding text node in srcML's tree.
token_count = {}
token_count_dict = {}
for index, each_token in enumerate(token_seq_list):
    token_count[each_token] = token_count.get(each_token, 0) + 1
    token_count_dict[index] = token_count[each_token]

final_out_put_list = []
skip = 0  # upcoming tokens already consumed by a multi-word type name
for i, token_value in enumerate(token_seq_list):
    if skip > 0:
        skip -= 1
        continue
    kind = kind_list[i]
    this_token_count = token_count_dict[i]
    if token_value in parameter_list:
        final_out_put_list.append('FPARAM')
        continue
    if token_value == 'NULL':
        final_out_put_list.append('NULL')
        continue
    if kind == 'Keyword':
        if token_value == 'static':
            # Try the one-, two- and three-word type that follows "static".
            next_token = token_seq_list[i + 1]
            if next_token in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 1
                continue
            this_string = next_token + ' ' + token_seq_list[i + 2]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 2
                continue
            this_string = this_string + ' ' + token_seq_list[i + 3]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 3
                continue
        else:
            # Try one-, two- and three-word type names starting here.
            if token_value in basic_type_list:
                final_out_put_list.append('DTYPE')
                continue
            this_string = token_value + ' ' + token_seq_list[i + 1]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 1
                continue
            this_string = this_string + ' ' + token_seq_list[i + 2]
            if this_string in basic_type_list:
                final_out_put_list.append('DTYPE')
                skip = 2
                continue
        # Not a basic type: keep the keyword itself (e.g. "for", "unsigned").
        final_out_put_list.append(token_value)
        continue
    if kind == 'Identifier':
        matched = False
        # Find the occurrence of this identifier in srcML's tree that
        # corresponds to position i in Understand's stream.
        for j, identifier in enumerate(func.xpath('.//*[text()="%s"]' % token_value)):
            if j + 1 == this_token_count:
                type_node = identifier.xpath('../../x:type/x:name', namespaces=NS)
                if type_node and type_node[0].xpath('./text()')[0] == token_value:
                    matched = True
                    final_out_put_list.append('DTYPE')
                    continue
                call_node = identifier.xpath('../../x:call/x:name', namespaces=NS)
                if call_node:
                    temp = call_node[0].xpath('./text()')
                    if temp and temp[0] == token_value:
                        matched = True
                        final_out_put_list.append('FUNCCALL')
                        continue
        if not matched:
            final_out_put_list.append('LVAR')
    else:
        final_out_put_list.append(token_value)
print(final_out_put_list)
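One detail worth calling out is the occurrence bookkeeping (token_count_dict): the XPath query `.//*[text()="…"]` returns every element whose text equals the token, so the script matches the k-th occurrence of a token in Understand's stream to the k-th text match in srcML's tree. The counter in isolation looks like this (a simplified sketch, not the script itself):

```python
def occurrence_index(tokens):
    """Map each position to the 1-based count of how often its token has appeared so far."""
    seen = {}
    out = []
    for t in tokens:
        seen[t] = seen.get(t, 0) + 1
        out.append(seen[t])
    return out

print(occurrence_index(['i', '=', '0', ';', 'i', '<', 'len']))
# → [1, 1, 1, 1, 2, 1, 1]
```

The second `i` gets index 2, so it is matched against the second `<name>i</name>` in the tree. This assumes both tools see the tokens in the same order, which holds here since both walk the source top to bottom.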
Admittedly, this code is more complicated than strictly necessary. You can see I defined basic_type_list twice: I originally planned for the more general case, but as the screenshot above shows, Vuddy does not handle it (for example, unsigned is left un-normalized), so the fuller list ends up unused. Still, the output is correct (tokens_file is the token-stream file we generated with Understand). Running the code, the two printed lists are:
['static', 'float', 'sum', '=', '0', ';', 'unsigned', 'int', 'i', ';', 'for', '(', 'i', '=', '0', ';', 'i', '<', 'FPARAM', ';', 'i', '++', ')', 'sum', '+=', 'FPARAM', '[', 'i', ']', ';', 'printf', '(', '"%f %d"', ',', 'sum', '/', 'FPARAM', ',', 'validate', '(', 'sum', ')', ')', ';']
['DTYPE', 'LVAR', '=', '0', ';', 'unsigned', 'DTYPE', 'LVAR', ';', 'for', '(', 'LVAR', '=', '0', ';', 'LVAR', '<', 'FPARAM', ';', 'LVAR', '++', ')', 'LVAR', '+=', 'FPARAM', '[', 'LVAR', ']', ';', 'FUNCCALL', '(', '"%f %d"', ',', 'LVAR', '/', 'FPARAM', ',', 'FUNCCALL', '(', 'LVAR', ')', ')', ';', '}']
As you can see, the second list matches the paper's result exactly. The code itself is simple enough that I won't walk through it line by line. That's all for this short write-up.
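One closing note: in VUDDY the normalized token sequence is not the end product. As I read the paper, the normalized function body is stripped of whitespace, lowercased, and fingerprinted as a (length, MD5) 2-tuple, and fingerprints are grouped by length to make lookup scalable. A rough sketch of that last step (the exact join/lowercase details here are my assumption, not taken from the paper's code):

```python
import hashlib

def fingerprint(normalized_tokens):
    # Concatenate the normalized body, drop case, and hash it.
    body = ''.join(normalized_tokens).lower()
    return (len(body), hashlib.md5(body.encode('utf-8')).hexdigest())

length, digest = fingerprint(['DTYPE', 'LVAR', '=', '0', ';'])
print(length, digest)  # length is 12; digest is a 32-character hex string
```

Two functions are then reported as clones when their fingerprints collide, so detection is a dictionary lookup rather than a pairwise comparison.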