B2SFinder: Detecting Open-Source Software Reuse in COTS Software
B2S: binary to source
COTS: commercial off-the-shelf
OSS: open-source software
Two challenges of binary similarity detection
- First, fully automatic compilation of all the candidate OSS projects is nontrivial and usually requires manual work to find the compiler flags that make each project build successfully.
- Second, any binary similarity analysis can be expensive.
Three problems addressed by the paper
- How can we select as many code features as possible while ensuring that all the selected features are traceable in compiled binaries?
- How can we precisely compute the matching scores with respect to different code features and their feature instances?
- How can we exploit the code structures of OSS projects to improve reuse identification?
Design
A. Selecting Code Features
Two criteria for selecting a code feature
- First, a code feature must be present simultaneously in both binary and source code.
- Second, a code feature in source code should not be changed drastically during the compilation.
B. Matching Code Features
Exact Matching for String-Typed Features: a source string and a binary string match only if they are identical.
Search-Based Matching for Integer-Typed Features: encode each feature as a bitstream according to the width of its data type and search directly for the bitstream in the .data and .rdata segments of a binary file.
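The search-based matching above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the little-endian default, and the assumption that the segment contents are already available as a `bytes` object are all hypothetical.

```python
import struct

def encode_int_array(values, width_bytes, little_endian=True):
    """Pack integers into one contiguous byte string with a fixed element width."""
    fmt_char = {1: "B", 2: "H", 4: "I", 8: "Q"}[width_bytes]
    prefix = "<" if little_endian else ">"
    # Mask each value so negative constants are stored in two's complement.
    mask = (1 << (8 * width_bytes)) - 1
    return struct.pack(f"{prefix}{len(values)}{fmt_char}",
                       *(v & mask for v in values))

def find_in_segment(segment_bytes, values, width_bytes):
    """Return the byte offset of the encoded array in the segment, or -1."""
    return segment_bytes.find(encode_int_array(values, width_bytes))

# Hypothetical segment contents: two padding bytes, the array, one trailing byte.
segment = b"\x00\x00" + encode_int_array([1, 2, 3], 4) + b"\xff"
offset = find_in_segment(segment, [1, 2, 3], 4)
```

Because the element width changes the byte pattern, the same source array may need to be tried at each width its declared type could compile to.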
Semantics-Based Matching for Control-Flow-Typed Features:
a switch/case feature in the source code of an OSS project: represent it as an unordered list of case label sets with a default branch appended. (The default branch can be considered to match any set of case labels in the binary of a COTS software application.)
switch/case default: the default branch usually comes last and handles every case not covered by the preceding labels.
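One way to read the switch/case rule is sketched below. It is an interpretation under stated assumptions, not the paper's algorithm: each branch is a collection of case labels, the source default branch is marked with a hypothetical `DEFAULT` sentinel, and a binary label set left unmatched is accepted only when the source has a default branch.

```python
DEFAULT = None  # hypothetical marker for the source-side default branch

def switch_matches(source_branches, binary_branches):
    """Compare two switch statements as unordered lists of case label sets."""
    remaining = [frozenset(b) for b in binary_branches]
    has_default = any(b is DEFAULT for b in source_branches)
    for branch in source_branches:
        if branch is DEFAULT:
            continue
        key = frozenset(branch)
        if key not in remaining:
            return False          # a source label set has no binary counterpart
        remaining.remove(key)     # each binary set may be consumed only once
    # Leftover binary sets can only be absorbed by the default branch.
    return not remaining or has_default

same = switch_matches([[1, 2], [3], DEFAULT], [[3], [2, 1]])
```

Using `frozenset` makes the comparison insensitive to label order within a branch, and scanning `remaining` makes it insensitive to branch order.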
an if/else feature: calculate the longest ordered common subsequence for each pair of if/else sequences, one from source code and one from binary code, and then decide whether they are equivalent according to the length of their common subsequence.
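The if/else comparison can be illustrated with a standard longest-common-subsequence routine. The sketch assumes each if/else feature is a list of the constants compared in the chain; the `min_ratio` threshold and the choice to normalize by the source length are hypothetical, since these notes do not record how the paper turns the length into an equivalence decision.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def if_else_equivalent(source_seq, binary_seq, min_ratio=0.8):
    """Treat the sequences as equivalent if enough of the source is preserved."""
    if not source_seq:
        return False
    return lcs_length(source_seq, binary_seq) / len(source_seq) >= min_ratio

common = lcs_length([10, 20, 30, 40], [10, 30, 40, 50])
```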
C. Determining the Importance-Weights of Feature Instances
To differentiate feature instances in terms of their contribution to feature matching, each feature instance is assigned a specificity weight and a frequency weight.
frequency weight: S-IDF (a variant of TF-IDF)
specificity weight:
- a string-typed feature: the number of its substrings, including URLs and copyright information (among others)
- an integer-typed feature: the entropy of its bitstream
- a control-flow-typed feature: the length of its constant sequence
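As an illustration of the entropy-based weight for integer-typed features, the sketch below computes Shannon entropy over the byte values of an encoded array. Measuring entropy per byte is an assumption, since these notes do not state the granularity the paper uses, and `byte_entropy` is a hypothetical name.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0 for empty or constant input."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = byte_entropy(b"\x00" * 8)          # constant data carries no information
high = byte_entropy(bytes([0, 1, 2, 3])) # four equally likely byte values
```

Under this measure an array of all zeros gets a low weight, which reflects that a match on it says little about which project it came from.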
D. Computing Matching Scores
E. Identifying Reuse Types
- Identifying Partial Reuses: recognize independent libraries by building a compilation dependency layered graph (CDLG), and taking a single library instead of the whole OSS project as a code matching unit.
- Identifying Pseudo Propagated Reuses: recognize the inclusive relationships among the candidate OSS projects by comparing the matched feature instances in these candidate OSS projects.
Implementation
A. Architecture
B. Feature Extraction from Large-Scale Source Projects
First, we search for files like CMakeLists.txt and autogen.sh in the OSS projects to automatically detect the Makefiles used. Second, we parse these Makefiles to obtain the gcc or libtool commands used, without actually compiling the whole OSS project. Finally, we locate the include paths and macros from the arguments passed to the compiler flags -I and -D.
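The last step, pulling include paths and macros out of a compiler command, might look like the sketch below. It assumes the command line has already been recovered as a single string; the function name and the sample command are hypothetical. Both the attached form (`-Ipath`) and the separated form (`-I path`) are handled.

```python
import shlex

def extract_includes_and_macros(command):
    """Collect the arguments of -I and -D from one compiler command line."""
    tokens = shlex.split(command)
    includes, macros = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        for flag, out in (("-I", includes), ("-D", macros)):
            if tok == flag and i + 1 < len(tokens):
                out.append(tokens[i + 1])  # separated form: "-I path"
                i += 1                     # skip the consumed argument
                break
            if tok.startswith(flag) and len(tok) > 2:
                out.append(tok[2:])        # attached form: "-Ipath"
                break
        i += 1
    return includes, macros

# Hypothetical command line for illustration only.
cmd = 'gcc -DHAVE_CONFIG_H -I. -I ../include -D "VERSION=1.2" -O2 -c foo.c -o foo.o'
includes, macros = extract_includes_and_macros(cmd)
```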
C. Storage Model for Large-Scale Code Features
To reduce the time complexity of feature matching, we use two data structures: an inverted index for string-typed features and a trie (prefix tree) for the two integer-typed features.
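A minimal sketch of the two structures follows, assuming string features map to the set of projects that contain them and integer features are stored as sequences of integers. The class and function names and the project identifiers (`projA`, `projB`) are hypothetical; the paper's actual storage layout is not described in these notes.

```python
from collections import defaultdict

def build_inverted_index(project_strings):
    """Map each string feature to the set of projects that contain it."""
    index = defaultdict(set)
    for project, strings in project_strings.items():
        for s in strings:
            index[s].add(project)
    return index

class IntTrie:
    """Prefix tree keyed on the successive elements of an integer sequence."""
    def __init__(self):
        self.children = {}
        self.owners = set()   # projects whose feature ends at this node

    def insert(self, sequence, owner):
        node = self
        for value in sequence:
            node = node.children.setdefault(value, IntTrie())
        node.owners.add(owner)

    def lookup(self, sequence):
        node = self
        for value in sequence:
            node = node.children.get(value)
            if node is None:
                return set()
        return node.owners

index = build_inverted_index({"projA": ["deflate", "inflate"], "projB": ["deflate"]})
trie = IntTrie()
trie.insert([1, 2, 3], "projA")
```

With the inverted index, a string extracted from a binary resolves to its candidate projects in one dictionary lookup, and the trie lets integer sequences that share a prefix share storage and lookup work.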