How to Extract the Start and End Lines of Java Methods in BigCloneBench with the Understand Perl API

蛐蛐蛐

Article link: https://blog.csdn.net/qysh123/article/details/112607363

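(Update, January 28, 2021: after testing, I found that the start and end lines obtained with the method below are still unreliable, so please do not follow it. What I do now is use the FUNCTIONS table mentioned in the previous post: it has STARTLINE and ENDLINE columns, and we simply slice the source file according to those two values.)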
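Slicing a source file by STARTLINE and ENDLINE could look roughly like the sketch below. It assumes both values are 1-indexed and inclusive; the helper name extract_lines and the command-line arguments are hypothetical, and in practice the three values would come from one row of the FUNCTIONS table.

```perl
use strict;
use warnings;

# Return lines $start..$end (1-indexed, inclusive) of the file at $path.
# Assumption: STARTLINE and ENDLINE use this same convention.
sub extract_lines {
    my ($path, $start, $end) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my @kept;
    while (my $line = <$in>) {
        next if $. < $start;    # $. holds the current input line number
        last if $. > $end;
        push @kept, $line;
    }
    close($in);
    return join('', @kept);
}

# Hypothetical usage: perl slice.pl 145.java STARTLINE ENDLINE
my ($path, $start, $end) = @ARGV;
print extract_lines($path, $start, $end) if defined $end;
```

Reading line by line avoids loading a large file into memory, and stopping at ENDLINE means the rest of the file is never read.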

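The title is rather long. Continuing from the previous post: as mentioned there, the source code in IJaDataset consists of Java files, but the records in the database are at method level. So we need to extract each method from the Java source files and determine each method's "STARTLINE" and "ENDLINE". I thought this would be easy with Understand, but it took several attempts.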

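The simplest idea is to check whether each method's entity (ent in the Perl API) has such an attribute. Unfortunately, it does not. Reading the API more carefully, I found this: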

http://documentation.scitools.com/html/perl/index.html#_lexeme__line_begin__

which gives the line on which each token sits. So could we collect the tokens of a method and take the minimum and maximum line numbers? Unfortunately, with:

foreach my $method ($db->ents("Method")) {
    my $lexer   = $method->lexer();
    my @lexemes = $lexer->lexemes();
}

we do not get the lexemes that belong to each method, because the documentation says that $ent->lexer returns:

  Returns a lexer object for the specified file entity. The original
  source file must be readable and unchanged since the last database parse.
  If called in an array context, returns ($lexer, $status).

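In other words, whatever the granularity of $ent, we get all the lexemes of the file that contains it. That is really counterintuitive, so I had to try something else. Eventually I noticed: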

$lexeme->ent()

  Returns the entity associated with the lexeme, or undef if none.

So we can get the ent of each lexeme. If a lexeme corresponds to a method (or a class), we can read off the start line of that method (or class). Since its total line count is also easy to get, ENDLINE follows as well.
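The arithmetic behind that last step can be sketched as follows. This assumes the line count covers every physical line from the start line through the closing brace, which is an assumption about the metric rather than something verified here; the helper name end_line and the numbers are hypothetical.

```perl
use strict;
use warnings;

# Both endpoints are included, so end = start + count - 1.
# Assumption: $count_line spans the start line through the last line of the body.
sub end_line {
    my ($start_line, $count_line) = @_;
    die "line count must be at least 1\n" if $count_line < 1;
    return $start_line + $count_line - 1;
}

# Hypothetical numbers, not taken from 145.java:
# a 10-line method whose first line is 25 ends on line 34.
print end_line(25, 10), "\n";
```

If the start line is taken from the identifier token, any modifiers or annotations on earlier lines would shift the result, which is one way this estimate can disagree with a recorded ENDLINE.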

We again use /bcb_reduced/2/selected/145.java as the example and write a simple Perl script:

use strict;
use warnings;
use Understand;

# Build a temporary Understand database from the directory given as the first argument.
system( "und -quiet create -db temp3.udb -languages java add " . $ARGV[0] . " analyze" );
my ( $db, $status ) = Understand::open("temp3.udb");
die "Error status: ", $status, "\n" if $status;

open( my $out, '>', "METHOD_START_LINE.csv" )
    or die "Cannot open METHOD_START_LINE.csv: $!";

foreach my $file ( $db->ents("file") ) {
    my $lexer = $file->lexer() or next;    # skip files that cannot be lexed
    foreach my $lexeme ( $lexer->lexemes() ) {
        # Look only at identifier tokens that are linked to an entity.
        next unless $lexeme->token() eq "Identifier";
        my $entity = $lexeme->ent() or next;
        # Skip entities that have no CountLine metric.
        my $loc = $entity->metric("CountLine") or next;
        print $entity->name(), "\n";
        print $loc, "\n";
        # One CSV row: file name, entity long name, line of the identifier.
        print $out join( ",", $file->name(), $entity->longname(), $lexeme->line_begin() ), "\n";
    }
}

close($out);
$db->close();

Save it as BlogExample.pl, put 145.java in the Example directory, and run:

uperl BlogExample.pl Example

This produces METHOD_START_LINE.csv, with the following content:

145.java,edu.harvard.iq.safe.saasystem.util,1
145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader,15
145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader,17
145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader.read,25

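Comparing this with the 145.java file in the previous post, we can see that this approach extracts the start line numbers of the classes and methods in the source (the package on line 1 also passes the filter). That is, filtering by Identifier, then ent(), then CountLine picks out the lexemes related to classes and methods, which can then be matched against the data in BigCloneBench.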
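Matching these rows against BigCloneBench could be sketched as a lookup keyed by file name and start line. The rows are the four shown above; the helper name index_by_start_line and the looked-up record (145.java, STARTLINE 25) are hypothetical, and whether STARTLINE in BigCloneBench always equals the line of the method name is an assumption.

```perl
use strict;
use warnings;

# Build a lookup from "file,start_line" to the entity's long name.
# Each row has the form written by the script: file,longname,start_line.
sub index_by_start_line {
    my (@rows) = @_;
    my %index;
    for my $row (@rows) {
        chomp( my $copy = $row );
        my ( $file, $longname, $start ) = split /,/, $copy;
        $index{"$file,$start"} = $longname;
    }
    return \%index;
}

# The four rows of METHOD_START_LINE.csv shown above.
my @rows = (
    "145.java,edu.harvard.iq.safe.saasystem.util,1",
    "145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader,15",
    "145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader,17",
    "145.java,edu.harvard.iq.safe.saasystem.util.PLNConfigurationFileReader.read,25",
);
my $index = index_by_start_line(@rows);

# Hypothetical BigCloneBench record: file 145.java with STARTLINE 25.
my $hit = $index->{"145.java,25"};
print defined $hit ? "$hit\n" : "no match\n";
```

A record whose start line has no entry in the lookup is reported as "no match", which makes disagreements between the two sources easy to spot.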