How to use srcML to replace (normalize) function parameters in C++ and Java source code

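This post shows how to parse Java source code with the srcML tool and use its XML representation to locate identifiers precisely, for example replacing the variable 'a' with 'FPARAM'. XPath matching is used to replace specific identifiers in the source, and the approach applies to the languages srcML can parse, such as C++ and Java.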

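This is another need that came up in real research work. It looks simple, but it does take some thought. Once you have the method down, you can also use it to replace any token in the source that satisfies a given condition.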

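This post follows up on the previous one: https://blog.csdn.net/qysh123/article/details/110849387, with a small improvement. In the previous post, because the srcML output carries an XML namespace, I fell back on fuzzy matching. Here, following another author's approach: https://blog.csdn.net/weixin_45069542/article/details/90229654, we resolve the namespace directly. As before, the example is the following Java code (assume it is saved as Example1.java):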

  public static long toLong(byte[] a) {
    long x = 0;
    for (int i = 0; i < 8; i++) {
      int j = (7 - i) << 3;
      x |= ((0xFFL << j) & ((long) a[i] << j));
    }
    return x;//test
  }

For this code we want to replace a with FPARAM. How should we do that? If we use javalang, as described in the previous post:

import javalang

# Read the source file and print every token produced by javalang's tokenizer.
with open("Example1.java", 'r') as this_file:
    file_content = this_file.read()
tokens = list(javalang.tokenizer.tokenize(file_content))
for each_token in tokens:
    print(each_token)

we get the following result:

Modifier "public" line 1, position 3
Modifier "static" line 1, position 10
BasicType "long" line 1, position 17
Identifier "toLong" line 1, position 22
Separator "(" line 1, position 28
BasicType "byte" line 1, position 29
Separator "[" line 1, position 33
Separator "]" line 1, position 34
Identifier "a" line 1, position 36
Separator ")" line 1, position 37
Separator "{" line 1, position 39
BasicType "long" line 2, position 5
Identifier "x" line 2, position 10
Operator "=" line 2, position 12
DecimalInteger "0" line 2, position 14
Separator ";" line 2, position 15
Keyword "for" line 3, position 5
Separator "(" line 3, position 9
BasicType "int" line 3, position 10
Identifier "i" line 3, position 14
Operator "=" line 3, position 16
DecimalInteger "0" line 3, position 18
Separator ";" line 3, position 19
Identifier "i" line 3, position 21
Operator "<" line 3, position 23
DecimalInteger "8" line 3, position 25
Separator ";" line 3, position 26
Identifier "i" line 3, position 28
Operator "++" line 3, position 29
Separator ")" line 3, position 31
Separator "{" line 3, position 33
BasicType "int" line 4, position 7
Identifier "j" line 4, position 11
Operator "=" line 4, position 13
Separator "(" line 4, position 15
DecimalInteger "7" line 4, position 16
Operator "-" line 4, position 18
Identifier "i" line 4, position 20
Separator ")" line 4, position 21
Operator "<<" line 4, position 23
DecimalInteger "3" line 4, position 26
Separator ";" line 4, position 27
Identifier "x" line 5, position 7
Operator "|=" line 5, position 9
Separator "(" line 5, position 12
Separator "(" line 5, position 13
HexInteger "0xFFL" line 5, position 14
Operator "<<" line 5, position 20
Identifier "j" line 5, position 23
Separator ")" line 5, position 24
Operator "&" line 5, position 26
Separator "(" line 5, position 28
Separator "(" line 5, position 29
BasicType "long" line 5, position 30
Separator ")" line 5, position 34
Identifier "a" line 5, position 36
Separator "[" line 5, position 37
Identifier "i" line 5, position 38
Separator "]" line 5, position 39
Operator "<<" line 5, position 41
Identifier "j" line 5, position 44
Separator ")" line 5, position 45
Separator ")" line 5, position 46
Separator ";" line 5, position 47
Separator "}" line 6, position 5
Keyword "return" line 7, position 5
Identifier "x" line 7, position 12
Separator ";" line 7, position 13
Separator "}" line 8, position 3

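As you can see, a and x are both of type Identifier, so at this point the analysis program has no way of telling which one is the method parameter (incidentally, Understand produces output at the same granularity).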

So in this case we still need srcML.

Running it directly on the command line:

srcml.exe Example1.java > test.xml

produces the following XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" revision="0.9.5" language="Java" filename="Example1.java">  <function><specifier>public</specifier> <specifier>static</specifier> <type><name>long</name></type> <name>toLong</name><parameter_list>(<parameter><decl><type><name><name>byte</name><index>[]</index></name></type> <name>a</name></decl></parameter>)</parameter_list> <block>{
    <decl_stmt><decl><type><name>long</name></type> <name>x</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</decl_stmt>
    <for>for <control>(<init><decl><type><name>int</name></type> <name>i</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</init> <condition><expr><name>i</name> <operator>&lt;</operator> <literal type="number">8</literal></expr>;</condition> <incr><expr><name>i</name><operator>++</operator></expr></incr>)</control> <block>{
      <decl_stmt><decl><type><name>int</name></type> <name>j</name> <init>= <expr><operator>(</operator><literal type="number">7</literal> <operator>-</operator> <name>i</name><operator>)</operator> <operator>&lt;&lt;</operator> <literal type="number">3</literal></expr></init></decl>;</decl_stmt>
      <expr_stmt><expr><name>x</name> <operator>|=</operator> <operator>(</operator><operator>(</operator><literal type="number">0xFFL</literal> <operator>&lt;&lt;</operator> <name>j</name><operator>)</operator> <operator>&amp;</operator> <operator>(</operator><operator>(</operator><name>long</name><operator>)</operator> <name><name>a</name><index>[<expr><name>i</name></expr>]</index></name> <operator>&lt;&lt;</operator> <name>j</name><operator>)</operator><operator>)</operator></expr>;</expr_stmt>
    }</block></for>
    <return>return <expr><name>x</name></expr>;</return><comment type="line">//test</comment>
  }</block></function></unit>

Now we need to look carefully at how to locate a with XPath. The way to inspect the structure is much the same as what I did earlier with scrapy: https://blog.csdn.net/qysh123/article/details/106655644

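Looking at the XML, the path function/parameter_list/parameter/decl/name (starting from the root unit element) leads straight to a. Because every element lives in the srcML namespace declared on unit, matching this path exactly requires a namespace-aware XPath query. For details, see: https://blog.csdn.net/weixin_45069542/article/details/90229654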
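
To see why the namespace matters, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree instead of lxml (my substitution, so that it runs without extra packages) on a hand-trimmed fragment of the XML above. The prefix x and the variable names are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the srcML output shown above (assumption: only the
# parts needed to illustrate the lookup are kept).
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="Java">'
    '<function><type><name>long</name></type> <name>toLong</name>'
    '<parameter_list>(<parameter><decl><type><name>byte</name></type> '
    '<name>a</name></decl></parameter>)</parameter_list>'
    '<block>{}</block></function></unit>'
)

NS = {'x': 'http://www.srcML.org/srcML/src'}  # the prefix 'x' is arbitrary
PARAM_PATH = 'function/parameter_list/parameter/decl/name'

root = ET.fromstring(SRCML)

# Without the namespace, the path matches nothing.
plain_hits = root.findall(PARAM_PATH)

# With every step qualified by the prefix, the parameter name is found.
qualified_path = '/'.join('x:' + step for step in PARAM_PATH.split('/'))
ns_hits = [el.text for el in root.findall(qualified_path, NS)]

print(plain_hits)  # []
print(ns_hits)     # ['a']
```

The unqualified path fails silently rather than raising an error, which is why a missing namespace is easy to overlook.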

Here is the full script. It turns out to be quite short:

import subprocess
from lxml import etree

# srcML puts every element in this namespace, so each XPath step needs the prefix.
NS = {'x': 'http://www.srcML.org/srcML/src'}

# Run srcML and parse the XML it writes to stdout; fail loudly if srcML fails.
output = subprocess.run(['srcml.exe', 'Example1.java'], capture_output=True, check=True)
root = etree.fromstring(output.stdout)

for func in root.xpath('//x:function', namespaces=NS):
    # Collect every non-empty text node of the function as a flat token list.
    content = [str(v).strip() for v in func.xpath('.//text()')]
    content = list(filter(None, content))
    print(content)

    # Names declared in the function's parameter list.
    parameter_list = []
    for parameter in func.xpath('./x:parameter_list/x:parameter', namespaces=NS):
        for name in parameter.xpath('./x:decl/x:name/text()', namespaces=NS):
            parameter_list.append(str(name))

    # Replace each token that matches a parameter name with FPARAM.
    normalized_list = []
    for each_token in content:
        if each_token in parameter_list:
            normalized_list.append('FPARAM')
        else:
            normalized_list.append(each_token)
    print(normalized_list)

The two print calls output, respectively:

['public', 'static', 'long', 'toLong', '(', 'byte', '[]', 'a', ')', '{', 'long', 'x', '=', '0', ';', 'for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '8', ';', 'i', '++', ')', '{', 'int', 'j', '=', '(', '7', '-', 'i', ')', '<<', '3', ';', 'x', '|=', '(', '(', '0xFFL', '<<', 'j', ')', '&', '(', '(', 'long', ')', 'a', '[', 'i', ']', '<<', 'j', ')', ')', ';', '}', 'return', 'x', ';', '//test', '}']
['public', 'static', 'long', 'toLong', '(', 'byte', '[]', 'FPARAM', ')', '{', 'long', 'x', '=', '0', ';', 'for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '8', ';', 'i', '++', ')', '{', 'int', 'j', '=', '(', '7', '-', 'i', ')', '<<', '3', ';', 'x', '|=', '(', '(', '0xFFL', '<<', 'j', ')', '&', '(', '(', 'long', ')', 'FPARAM', '[', 'i', ']', '<<', 'j', ')', ')', ';', '}', 'return', 'x', ';', '//test', '}']

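Honestly, this is quite convenient. With this approach we can match tokens that satisfy essentially any condition in source code of any language srcML can parse (C, C++, C#, and Java).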
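
The script above works on a flat token list, so the original layout is lost and any token that merely shares a parameter's spelling is replaced too. One possible variation is to rewrite the matching name elements inside the tree and then join all text nodes, which gives back the source text with its formatting intact. This is only a sketch under assumptions: it uses the standard-library xml.etree.ElementTree instead of lxml, a hand-trimmed srcML fragment rather than real srcML output, and a hypothetical helper called normalize_params.

```python
import xml.etree.ElementTree as ET

SRC_NS = 'http://www.srcML.org/srcML/src'
NS = {'x': SRC_NS}
NAME_TAG = '{%s}name' % SRC_NS
FUNCTION_TAG = '{%s}function' % SRC_NS

def normalize_params(root, placeholder='FPARAM'):
    """Rename parameter uses inside each function and return the source text.

    Hypothetical helper for illustration; it is not part of srcML or lxml.
    """
    for func in root.iter(FUNCTION_TAG):
        # Leaf <name> elements directly under each parameter's <decl>.
        params = {
            el.text
            for el in func.findall('x:parameter_list/x:parameter/x:decl/x:name', NS)
            if len(el) == 0  # skip composite names that wrap other elements
        }
        for name in func.iter(NAME_TAG):
            # Only leaf <name> elements carry an identifier as their text.
            if len(name) == 0 and name.text in params:
                name.text = placeholder
    # Joining all text nodes reproduces the source, whitespace included.
    return ''.join(root.itertext())

# Hand-written fragment in the style of srcML output (assumption).
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="Java">'
    '<function><type><name>long</name></type> <name>first</name>'
    '<parameter_list>(<parameter><decl><type><name><name>byte</name>'
    '<index>[]</index></name></type> <name>a</name></decl></parameter>)'
    '</parameter_list> <block>{ <return>return <expr><name><name>a</name>'
    '<index>[<expr><literal type="number">0</literal></expr>]</index></name>'
    '</expr>;</return> }</block></function></unit>'
)

result = normalize_params(ET.fromstring(SRCML))
print(result)  # long first(byte[] FPARAM) { return FPARAM[0]; }
```

Restricting the rename to name elements keeps a literal or comment that happens to read a from being touched. It still does not handle a local variable that shadows a parameter, so treat it as a starting point rather than a complete scope analysis.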
