B2SFinder

B2SFinder: Detecting Open-Source Software Reuse in COTS Software

B2S: binary to source
COTS: commercial off-the-shelf
OSS: open-source software

Two challenges of binary similarity detection

  1. First, the fully-automatic compilation for all the candidate OSS projects is nontrivial
    and usually requires manual work to find appropriate compiler flags to enable their successful compilation.
  2. Second, any binary similarity analysis can be expensive.

Three problems addressed by the paper

  1. How can we select as many code features as possible while ensuring that all the selected features are traceable in compiled binaries?
  2. How can we precisely compute the matching scores with respect to different code features and their feature instances?
  3. How can we exploit the code structures of OSS projects to improve reuse identification?

Figure: The workflow of B2SFinder

Design

A. Selecting Code Features
Two criteria for selecting a code feature

  1. First, a code feature must be present simultaneously in both binary and source code.
  2. Second, a code feature in source code should not be changed drastically during the compilation.

B. Matching Code Features
Exact Matching for String-Typed Features: a source string matches only if an identical string appears in the binary.
Search-Based Matching for Integer-Typed Features: encode each feature instance as a bitstream according to the width of its data type, then search directly for that bitstream in the .data and .rdata segments of the binary file.
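The search-based matching step can be illustrated with a minimal sketch. It assumes unsigned values, little-endian layout, and a segment already loaded as raw bytes; the helper names `encode_int_array` and `find_in_segment` are hypothetical and not taken from the paper.

```python
import struct

# Map element width in bytes to the struct format code (unsigned, an assumption).
_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}

def encode_int_array(values, width_bytes, little_endian=True):
    """Pack integers into the byte layout a constant array would have in a binary."""
    prefix = "<" if little_endian else ">"
    return struct.pack(f"{prefix}{len(values)}{_CODES[width_bytes]}", *values)

def find_in_segment(segment, values, width_bytes):
    """Return the byte offset of the encoded array in the segment, or -1 if absent."""
    return segment.find(encode_int_array(values, width_bytes))

# Toy stand-in for the contents of a .rdata segment.
table = [0x67452301, 0xEFCDAB89, 0x98BADCFE]
segment = b"\x00" * 8 + encode_int_array(table, 4) + b"\xff" * 4
offset = find_in_segment(segment, table, 4)
```

A real tool would first read the .data and .rdata segments out of the binary file; that step is omitted here.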
Semantics-Based Matching for Control-Flow-Typed Features:
A switch/case feature in the source code of an OSS project: represent it as an unordered list of case label sets with a default branch appended. The default branch can be considered to match any set of case labels in the binary of a COTS software application.
Note on the switch/case default: the default branch is usually written last and handles every case not covered by the labels above it.

An if/else feature: calculate the longest ordered common subsequence for each pair of if/else sequences, one from the source code and one from the binary code, and then determine their equivalence from the length of that common subsequence.
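The if/else comparison can be sketched as a standard longest-common-subsequence computation. The sequences here are the constants compared in an if/else chain; the ratio-based decision rule and the `threshold` value are assumptions for illustration, since these notes do not record how the paper turns the length into an equivalence decision.

```python
def lcs_length(a, b):
    """Length of the longest ordered common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def if_else_match(src_seq, bin_seq, threshold=0.8):
    """Hypothetical rule: match when the LCS covers enough of the source sequence."""
    if not src_seq:
        return False
    return lcs_length(src_seq, bin_seq) / len(src_seq) >= threshold

src = [1, 5, 9, 12]   # constants from an if/else chain in source code
bin_ = [1, 9, 12, 20] # constants recovered from the binary
common = lcs_length(src, bin_)
```

Using an ordered subsequence rather than a set intersection keeps the order of the comparisons significant, which is what distinguishes an if/else chain from a switch/case feature.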

C. Determining the Importance-Weights of Feature Instances
To differentiate feature instances in terms of their contribution to feature matching, each feature instance is assigned a specificity weight and a frequency weight.
Frequency weight: S-IDF (a variant of TF-IDF).
Specificity weight:

  1. A string-typed feature: the number of its substrings, including URLs and copyright information (among others).
  2. An integer-typed feature: the entropy of its bitstream.
  3. A control-flow-typed feature: the length of its constant sequence.
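As an illustration of the entropy-based specificity weight for integer-typed features, here is a minimal sketch that computes Shannon entropy over the bytes of an encoded feature instance. Working at byte granularity and using the raw entropy value as the weight are assumptions; the notes only state that the entropy of the bitstream is used.

```python
import math
from collections import Counter

def byte_entropy(data):
    """Shannon entropy (bits per byte) of a byte string; 0.0 for empty input."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

low = byte_entropy(b"\x00" * 8)          # constant array: carries little information
high = byte_entropy(bytes([0, 1, 2, 3])) # four distinct bytes, uniformly distributed
```

The intuition is that an array of zeros can appear in almost any program, while a high-entropy table is far more likely to identify one specific project.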

D. Computing Matching Scores

E. Identifying Reuse Types

  1. Identifying Partial Reuses: recognize independent libraries by building a compilation dependency layered graph (CDLG), and taking a single library instead of the whole OSS project as a code matching unit.
  2. Identifying Pseudo Propagated Reuses: recognize the inclusive relationships among the candidate OSS projects by comparing the matched feature instances in these candidate OSS projects.

Implementation

A. Architecture
B. Feature Extraction from Large-Scale Source Projects

First, we search for files such as CMakeLists.txt and autogen.sh in the OSS projects to automatically detect the Makefiles used. Second, we parse these Makefiles to obtain the gcc or libtool commands used, without actually compiling the entire OSS project. Finally, we locate the include paths and macros from the arguments passed to the compiler flags -I and -D.
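The last step, pulling include paths and macros out of a recovered compiler command, might look like the sketch below. It assumes the command is already available as a single shell-style string; the function name `extract_includes_and_macros` is hypothetical, and only the attached (`-Ipath`) and separated (`-I path`) flag forms are handled.

```python
import shlex

def extract_includes_and_macros(command):
    """Collect the arguments of -I and -D from one gcc-style command line."""
    tokens = shlex.split(command)
    includes, macros = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        for flag, out in (("-I", includes), ("-D", macros)):
            if tok == flag and i + 1 < len(tokens):
                # Separated form: the argument is the next token.
                out.append(tokens[i + 1])
                i += 1
                break
            if tok.startswith(flag) and len(tok) > 2:
                # Attached form: the argument follows the flag directly.
                out.append(tok[2:])
                break
        i += 1
    return includes, macros

cmd = "gcc -DHAVE_CONFIG_H -I. -I ../include -DNDEBUG=1 -O2 -c foo.c"
includes, macros = extract_includes_and_macros(cmd)
```

With these two lists, a preprocessor or parser can resolve headers and conditional code in a source file without building the whole project.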
C. Storage Model for Large-Scale Code Features
To reduce the time complexity of feature matching, we use two data structures: an inverted index for string-typed features and a trie (prefix tree) for the two integer-typed features.
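A minimal sketch of the two structures follows. The project names, the `build_inverted_index` helper, and the `IntTrie` class are illustrative assumptions, not the paper's actual storage layout.

```python
from collections import defaultdict

def build_inverted_index(project_strings):
    """Map each string feature to the set of projects that contain it."""
    index = defaultdict(set)
    for project, strings in project_strings.items():
        for s in strings:
            index[s].add(project)
    return index

class IntTrie:
    """Prefix tree keyed on integer sequences; terminal nodes record owning projects."""
    def __init__(self):
        self.children = {}
        self.projects = set()

    def insert(self, seq, project):
        node = self
        for v in seq:
            node = node.children.setdefault(v, IntTrie())
        node.projects.add(project)

    def lookup(self, seq):
        node = self
        for v in seq:
            node = node.children.get(v)
            if node is None:
                return set()
        return set(node.projects)

index = build_inverted_index({
    "zlib": ["inflate 1.2.11", "incorrect header check"],
    "libpng": ["incorrect header check", "IHDR"],
})
trie = IntTrie()
trie.insert([0, 1, 3, 7], "zlib")
trie.insert([0, 1, 4], "libpng")
```

The inverted index turns "which projects contain this string?" into a single dictionary lookup, and the trie lets integer sequences that share a prefix share storage and traversal work.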
