B2SFinder

桃子小迷妹

Original post: https://blog.csdn.net/weixin_43846270/article/details/106803659

B2SFinder: Detecting Open-Source Software Reuse in COTS Software

B2S: binary to source
COTS: commercial off-the-shelf
OSS: open-source software

Two challenges of binary similarity detection

First, fully automatic compilation of all the candidate OSS projects is nontrivial and usually requires manual work to find compiler flags that let each project build successfully.
Second, any binary similarity analysis can be expensive.

Three problems addressed by the paper

How can we select as many code features as possible while ensuring that all the selected features are traceable in compiled binaries?
How can we precisely compute the matching scores with respect to different code features and their feature instances?
How can we exploit the code structures of OSS projects to improve reuse identification?

Design

A. Selecting Code Features
Two criteria for selecting a code feature

First, a code feature must be present simultaneously in both binary and source code.
Second, a code feature in source code should not be changed drastically during the compilation.

B. Matching Code Features

Exact Matching for String-Typed Features: a source string matches only if an identical string appears in the binary.
Search-Based Matching for Integer-Typed Features: encode each feature as a bitstream according to the width of its data type and search directly for that bitstream in the .data and .rdata segments of a binary file.
Semantics-Based Matching for Control-Flow-Typed Features:
A switch/case feature in the source code of an OSS project: represent it as an unordered list of case label sets with a default branch appended. (The default branch can be considered to match any set of case labels in the binary of a COTS software application.)
Note on the switch/case default: default is usually placed last and covers every case not handled by the labels above it.

An if/else feature: calculate the longest ordered common subsequence for each pair of if/else sequences, one from source code and one from binary code, and then determine their equivalence according to the length of their common subsequence.
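
The search-based and if/else matching steps described above can be sketched roughly as follows. This is only a minimal illustration, not the tool's actual code: the function names, the little-endian byte order, and the `threshold` ratio are all assumptions made for the example.

```python
def encode_int(value: int, width: int) -> bytes:
    """Encode one integer constant as `width` bytes.

    Little-endian order is an assumption (typical for x86 binaries).
    """
    return value.to_bytes(width, byteorder="little", signed=value < 0)


def find_int_array(segment: bytes, values: list[int], width: int) -> int:
    """Return the offset of the encoded constant array in a data segment, or -1."""
    pattern = b"".join(encode_int(v, width) for v in values)
    return segment.find(pattern)


def lcs_length(a: list[int], b: list[int]) -> int:
    """Length of the longest common subsequence of two constant sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def if_else_match(src_seq: list[int], bin_seq: list[int], threshold: float = 0.8) -> bool:
    """Treat two if/else sequences as equivalent when the common subsequence
    covers at least `threshold` of the source sequence (hypothetical cutoff)."""
    if not src_seq:
        return False
    return lcs_length(src_seq, bin_seq) / len(src_seq) >= threshold
```

For example, `find_int_array(segment, [1, 256, 65535], 4)` looks for three consecutive 32-bit constants in the raw bytes of a data segment, and `lcs_length([1, 2, 3, 4, 5], [2, 9, 3, 5])` compares the constants tested by two if/else chains in order.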

C. Determining the Importance-Weights of Feature Instances
To differentiate feature instances in terms of their contribution to feature matching, each feature instance is assigned a specificity weight and a frequency weight.
Frequency weight: S-IDF (a variant of TF-IDF).
Specificity weight:

A string-typed feature: the number of its substrings, including URLs and copyright information (among others).
An integer-typed feature: the entropy of its bitstream.
A control-flow-typed feature: the length of its constant sequence.
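
As a rough illustration of two of the specificity weights listed above, the sketch below computes a Shannon entropy for an encoded integer feature and a length for a control-flow feature. These notes do not say at what granularity the entropy is measured, so computing it over bytes is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter


def bitstream_entropy(stream: bytes) -> float:
    """Shannon entropy (bits per byte) of an encoded integer feature.

    Measuring entropy over bytes rather than individual bits is an
    assumption made for this sketch.
    """
    if not stream:
        return 0.0
    total = len(stream)
    counts = Counter(stream)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def control_flow_specificity(constants: list[int]) -> int:
    """Specificity of a control-flow feature: the length of its constant sequence."""
    return len(constants)
```

The intuition is that an array of all zeros has entropy 0 and says little about where it came from, while a table of varied constants scores higher and is less likely to match by accident.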

D. Computing Matching Scores

E. Identifying Reuse Types

Identifying Partial Reuses: recognize independent libraries by building a compilation dependency layered graph (CDLG), and taking a single library instead of the whole OSS project as a code matching unit.
Identifying Pseudo Propagated Reuses: recognize the inclusive relationships among the candidate OSS projects by comparing the matched feature instances in these candidate OSS projects.
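
One way to picture the inclusion check for pseudo propagated reuses is the sketch below. It assumes each candidate project is represented by the set of its matched feature instances and flags a candidate whose set is strictly contained in another candidate's set. The strict-subset rule, the function name, and the project names are assumptions for illustration; the paper's exact criterion is not reproduced in these notes.

```python
def find_pseudo_propagated(matched: dict[str, set[str]]) -> dict[str, str]:
    """Map each candidate whose matched feature instances are strictly
    contained in another candidate's matches to that containing candidate."""
    contained = {}
    for name_a, feats_a in matched.items():
        for name_b, feats_b in matched.items():
            # `<` on sets is the strict-subset test.
            if name_a != name_b and feats_a and feats_a < feats_b:
                contained[name_a] = name_b
                break
    return contained


# Hypothetical candidates: app_x only matches through features it shares with lib_a.
candidates = {
    "lib_a": {"s1", "s2", "s3"},
    "app_x": {"s1", "s2"},
    "lib_b": {"z1"},
}
flagged = find_pseudo_propagated(candidates)
```

Here `app_x` is flagged because everything it matched is already explained by `lib_a`, while `lib_b` is kept as an independent match.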

Implementation

A. Architecture
B. Feature Extraction from Large-Scale Source Projects
First, we search for files like CMakeLists.txt and autogen.sh in the OSS projects to automatically detect the Makefiles used. Second, we parse these Makefiles to obtain the gcc or libtool commands used, without actually compiling an OSS project entirely. Finally, we locate the include paths and macros used from the arguments provided to the compiler flags -I and -D.
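
The last step, pulling include paths and macros out of a recorded compiler command, might look like the sketch below. The function name and the sample command are hypothetical; only the handling of `-I` and `-D` (attached or as a separate argument) reflects standard gcc usage.

```python
import shlex


def extract_includes_and_macros(command: str) -> tuple[list[str], list[str]]:
    """Collect the arguments of -I and -D from one gcc/libtool command line."""
    tokens = shlex.split(command)
    includes, macros = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        for flag, out in (("-I", includes), ("-D", macros)):
            if tok == flag and i + 1 < len(tokens):
                # Separate form: "-I path" or "-D NAME=VALUE"
                out.append(tokens[i + 1])
                i += 1
                break
            if tok.startswith(flag) and len(tok) > len(flag):
                # Attached form: "-Ipath" or "-DNAME=VALUE"
                out.append(tok[len(flag):])
                break
        i += 1
    return includes, macros


# Hypothetical command as it might appear in a parsed Makefile.
incs, defs = extract_includes_and_macros(
    "gcc -DHAVE_CONFIG_H -I. -I ../include -D NDEBUG=1 -O2 -c foo.c -o foo.o"
)
```

With these two lists, a preprocessor-aware extractor can resolve headers and conditional compilation without building the project.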
C. Storage Model for Large-Scale Code Features
To reduce the time complexity of feature matching, we use two data structures: an inverted index for string-typed features and a trie (prefix tree) for the two integer-typed features.
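
A minimal sketch of the two structures, assuming string features are keyed verbatim and integer features are stored as their encoded byte sequences. The class, function, and project names are hypothetical, and the real storage layout may differ.

```python
from collections import defaultdict


def build_inverted_index(project_strings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each string feature to the set of projects that contain it."""
    index = defaultdict(set)
    for project, strings in project_strings.items():
        for s in strings:
            index[s].add(project)
    return index


class ByteTrie:
    """Prefix tree over encoded integer features, keyed byte by byte."""

    def __init__(self):
        self.children = {}
        self.owners = set()

    def insert(self, key: bytes, owner: str) -> None:
        node = self
        for b in key:
            child = node.children.get(b)
            if child is None:
                child = node.children[b] = ByteTrie()
            node = child
        node.owners.add(owner)

    def match_prefixes(self, data: bytes, start: int = 0) -> set[str]:
        """Owners of every stored key that is a prefix of data[start:]."""
        node, found = self, set()
        for i in range(start, len(data)):
            node = node.children.get(data[i])
            if node is None:
                break
            found |= node.owners
        return found


index = build_inverted_index({"proj_a": {"deflate", "inflate"}, "proj_b": {"deflate"}})
trie = ByteTrie()
trie.insert(b"\x01\x02", "proj_a")
trie.insert(b"\x01\x02\x03", "proj_b")
```

With the inverted index, a string found in the binary is resolved to its candidate projects in one lookup. With the trie, a single walk from a given offset in a data segment reports every stored integer feature that starts there, rather than testing each feature separately.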