代码克隆检测（Code Clone Detection）数据集BigCloneBench最新版的使用方法

最新推荐文章于 2024-11-30 18:29:29 发布

蛐蛐蛐

最新推荐文章于 2024-11-30 18:29:29 发布

阅读量7.9k

点赞数 17

分类专栏：科研工具深度学习论文点评

本文链接：https://blog.csdn.net/qysh123/article/details/112354932

版权

科研工具同时被 3 个专栏收录

137 篇文章

订阅专栏

深度学习

65 篇文章

订阅专栏

论文点评

38 篇文章

订阅专栏

本文详细讲述了在软件工程领域中如何使用BigCloneBench数据集进行代码克隆检测，过程中遇到的工具复杂性、文档混乱以及数据理解难题，强调了学术界资源在易用性和实用性上的不足。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

代码克隆检测（Code Clone Detection）是软件工程领域的一个重要方向，每年都有很多论文，其中很多论文都用到了BigCloneBench，这里简单总结一下这个数据集的使用，不得不吐槽一点：学术界的数据集和工具，易用性都太差了，发再多paper有啥用呢？只是在自己小圈子里自娱自乐，完全没有实际的impact和意义。

如果我们在google搜索bigclonebench，可以看到排名第一的就是一个GitHub上的repo：https://github.com/clonebench/BigCloneBench

这里：https://github.com/clonebench/BigCloneBench#bigclonebench-version-2-use-this-version

写得很清楚，建议大家使用Version 2，并且告诉我们说：

The latest BigCloneBench is distributed with BigCloneEval here. It is significantly larger than the ERA version, and is distributed as a h2database file with a much simpler schema. If you need the full BigCloneBench, with all the validation artifacts, please contact us and we can arrange to send you a copy (it is quite large).

IJaDataset with the expanded files is available here: IJaDataset 2.0 + BigCloneBench Samples.

BigCloneBench full database (postgresql) is available here: BigCloneBench_Postgresql.

吐槽一下啊，到底是使用BigCloneEval呢，还是使用下面这个postgresql的文件呢，后面我们会看到，这两者也是不一样的。不得不说，很多Repo的ReadME的可读性都太差了。不过既然说了最新版的是连同BigCloneEval发布的，所以我们暂时下这个版本：

https://github.com/jeffsvajlenko/BigCloneEval

我刚看到这个项目也是一脸懵逼，一上来介绍：

BigCloneEval is a framework for performing clone detection tool evaluation experiments with the BigCloneBench clone benchmark.

然后就开始讲怎么安装了。呵呵呵，如果不了解的人看到这里会莫名其妙，为啥我要用一个framework来做evaluation呢？相信很多朋友和我的目的一样，都是需要把BigCloneBench下载下来训练及评价自己的模型，仅此而已，你不是应该先讲数据集怎么用，然后如果别人关心的话，再介绍自己的工具吗？况且也没介绍清楚。耐着性子往下看，看到Step 2，Step 3分别需要下载最新的BigCloneBench和IJaDataset，后面这个又是什么鬼？

https://www.dropbox.com/s/z2k8r78l63r68os/BigCloneBench_BCEvalVersion.tar.gz?dl=0

https://www.dropbox.com/s/xdio16u396imz29/IJaDataset_BCEvalVersion.tar.gz?dl=0

耐着性子把这些步骤做完，看到后面：

To evaluate the recall of a clone detection tool you must complete the following steps:

Register the clone detection tool with BigCloneEval.
Detect clones in IJaDataset using the clone detection tool.
Import the detected clones into BigCloneEval.
Configure and execution the evaluation experiment.

These steps are performed using BigCloneEval's commands, which are located in the command/ directory as scripts. These scripts must be executed from within the command directory (the command directory must be the working directory).

大概明白了，这个BigCloneEval是一个evaluation的framework，我们如果想使用这个framework评价自己的工具或者模型，需要注册（Register）自己的clone detection tool，但是这里完全不讲怎么注册自己的工具，当然，你可以认为这里给出了例子：

./registerTool -n "NiCad" -d "Default configuration"

但是，如果对这个不了解，鬼知道怎么注册自己的工具，鬼知道这个NiCad是什么东西？

呵呵呵，还是那句话，这些ReadME写得都太垃圾了。但是我们需要使用人家的数据集，只好从自己的角度来猜猜是怎么回事了。按照前面说的，Step 2中下载的压缩包BigCloneBench_BCEvalVersion.tar.gz，解压以后是bcb.h2.db和bcb.trace.db，按照Step 2中的提示，

Extract the contents of BigCloneBench (BigCloneBench_BCEvalVersion.tar.gz) into the 'bigclonebenchdb' directory of the BigCloneEval distribution.

可以预期，这两个（bcb.h2.db和bcb.trace.db）是这个版本的BigCloneBench的h2数据库文件。另外Step 3中，IJaDataset_BCEvalVersion.tar.gz解压以后是一个名为“bcb_reduced”的文件夹，需要：

Extract the contents of IJaDataset (IJaDataset_BCEvalVersion.tar.gz) into the 'ijadataset' directory of the BigCloneEval distribution.

所以这两步就是把相关的数据放到BigCloneEval的对应文件夹下，可以猜想这两部分就是主要的数据了。那么我们要使用BigCloneBench，只要关注这两步中的文件即可。

所以，第一步先用h2把这个数据库文件打开来看看，在这个页面：https://www.h2database.com/html/main.html

中下载All Platforms (zip, 8 MB)

发现解压之后是传统的Java软件目录，在bin目录下有JAR包，我们运行：

java -jar h2-1.4.200.jar

即可以通过浏览器打开数据库页面了。那么问题又来了，我们怎么把bcb.h2.db这样的文件导入数据库呢，搜索了一下，看到这里有人介绍：

https://stackoverflow.com/questions/46000726/tool-to-view-test-h2-db-h2-database

需要指定一下db文件相对于主文件夹的位置，ubuntu系统中就是/home/%用户名%/，简单起见，我就把上面的bcb.h2.db放到主文件夹下了，然后启动h2，在启动后的web界面中的JDBC URL填写：

jdbc:h2:~/bcb

点击connect后就可以看到这个数据库中的表了。例如我们执行：

SELECT * FROM CLONES;

就可以看到类似下面的数据记录：

这个嘛，应该就是表示clone关系了吧，但具体哪一列表示clone呢，尝试一下：

SELECT count(*) FROM CLONES where functionality_id=4;

结果是：4664949，如果我们执行：

SELECT count(*) FROM CLONES where functionality_id=3;

结果是：891777，如果执行：

SELECT count(*) FROM CLONES where functionality_id=2;

结果是：409962，而id=1的时候呢，结果是0，感觉这个比例和我看过的论文里的比例差不多，所以盲猜这个functionality_id表示的就是Type-1到Type-4的Clone。

(2021年1月28日更新：简单分析了之后，发现这个functionality_id并不是表示Type 1到4的Clone，而是SYNTACTIC_TYPE表示Clone类型，例如我们运行：

SELECT count(*) FROM CLONES where SYNTACTIC_TYPE=1

得到结果是48116,

SYNTACTIC_TYPE=2的时候是4234，SYNTACTIC_TYPE=3的时候是8531803，SYNTACTIC_TYPE=4的时候是0，所有的clone pair是8584153，sigh，看来之前的盲猜果然猜错了啊）

姑且这么认为吧（使用大部分数据集的时候基本靠猜，我也是醉了，学术界就是这么的低效）。相信function_id这些对应的就是表FUNCTION里的内容了。我们运行：

SELECT * FROM FUNCTIONS;

得到下面这张表：

这个很明显就是function信息了，如果我们先运行：

SELECT count(*) FROM FUNCTIONS;

得到的结果是：22285855，有两千多万个function？我表示很怀疑啊。那怎么将这些function和源码对应起来呢，就想到了bcb_reduced，在ubuntu中查看其属性，发现有：55,671 项，共 607.3 MB，我就说这个数据集肯定没有那么多function啊。简单浏览了一下这个文件夹，我们以/bcb_reduced/2/selected/145.java（这个目录下的第一个文件为例）：

SELECT * FROM FUNCTIONS where name='145.java';

可以得到下面的结果：

而我们看一下145.java的内容：

package edu.harvard.iq.safe.saasystem.util;

import java.util.*;
import java.util.logging.*;
import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.net.*;
import java.util.concurrent.*;

/**
 *
 * @author asone
 */
public class PLNConfigurationFileReader {

    static final Logger logger = Logger.getLogger(PLNConfigurationFileReader.class.getName());

    /**
     *
     * @param url
     * @param propertyName
     * @return
     */
    public static CopyOnWriteArraySet<String> read(URL url, String propertyName) {
        if ((propertyName == null) || (propertyName.equals(""))) {
            return null;
        }
        CopyOnWriteArraySet<String> peers = new CopyOnWriteArraySet<String>();
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        xmlif.setProperty("javax.xml.stream.isCoalescing", java.lang.Boolean.TRUE);
        xmlif.setProperty("javax.xml.stream.isNamespaceAware", java.lang.Boolean.TRUE);
        XMLStreamReader xmlr = null;
        BufferedInputStream stream = null;
        long starttime = System.currentTimeMillis();
        logger.info("Starting to parse the remote config xml[" + url + "]");
        int elementCount = 0;
        int topPropertyCounter = 0;
        int propertyTagLevel = 0;
        try {
            stream = new BufferedInputStream(url.openStream());
            xmlr = xmlif.createXMLStreamReader(stream, "utf8");
            int eventType = xmlr.getEventType();
            String curElement = "";
            String targetTagName = "property";
            String peerListAttrName = propertyName;
            boolean sentinel = false;
            boolean valueline = false;
            while (xmlr.hasNext()) {
                eventType = xmlr.next();
                switch(eventType) {
                    case XMLEvent.START_ELEMENT:
                        curElement = xmlr.getLocalName();
                        if (curElement.equals("property")) {
                            topPropertyCounter++;
                            propertyTagLevel++;
                            int count = xmlr.getAttributeCount();
                            if (count > 0) {
                                for (int i = 0; i < count; i++) {
                                    if (xmlr.getAttributeValue(i).equals(peerListAttrName)) {
                                        sentinel = true;
                                    }
                                }
                            }
                        }
                        if (sentinel && curElement.equals("value")) {
                            valueline = true;
                            String ipAd = xmlr.getElementText();
                            peers.add(ipAd);
                        }
                        break;
                    case XMLEvent.CHARACTERS:
                        break;
                    case XMLEvent.ATTRIBUTE:
                        if (curElement.equals(targetTagName)) {
                        }
                        break;
                    case XMLEvent.END_ELEMENT:
                        if (xmlr.getLocalName().equals("property")) {
                            if (sentinel) {
                                sentinel = false;
                                valueline = false;
                            }
                            elementCount++;
                            propertyTagLevel--;
                        } else {
                        }
                        break;
                    case XMLEvent.END_DOCUMENT:
                }
            }
        } catch (MalformedURLException ue) {
            logger.log(Level.WARNING, "specified url was not correct", ue);
        } catch (IOException ex) {
            logger.log(Level.WARNING, "remote config file cannot be read due IO problems", ex);
        } catch (XMLStreamException ex) {
            logger.log(Level.WARNING, "remote config file (" + url + ") cannot be read due to unexpected processing conditions", ex);
        } finally {
            if (xmlr != null) {
                logger.info("closing the xml stream reader");
                try {
                    xmlr.close();
                } catch (XMLStreamException ex) {
                }
            }
            if (stream != null) {
                logger.info("closing the stream");
                try {
                    stream.close();
                } catch (IOException ex) {
                }
            }
        }
        return peers;
    }
}

明显只能和最后一个记录对应起来啊（StartLine和EndLine分别是25和115），所以FUNCTION这个表里面，我们只需要关心Type为Selected这些记录即可。剩下的工作就是，需要把bcb_reduced这个文件夹中的源码都处理一下，提取出function，我记得之前也看到一篇博客说这个事情：https://www.cnblogs.com/xiaojieshisilang/p/10219344.html

呵呵呵呵呵，从恢复数据，到理解/处理数据，都得靠猜和自己摸索，再吐槽一遍：学术界的大部分工具和数据集都太垃圾了！！！

另外，由于我有Uderstand，所以从Java代码中提取function并确定其起始行号，是非常简单的事情，需要写一些代码，就能使用这个所谓的数据集了。相信绝对不是我一个人有这些感想，真的很无语。也希望我这个探索过程能帮助到一些有需求的朋友。

最后发两个链接，说明并不是我一个人有这样的疑问：

https://bbs.csdn.net/topics/396147037

https://github.com/jeffsvajlenko/BigCloneEval/issues/26