General guidelines
example_index sequentially labels the examples in the canonical versions of the two datasets. The indices correspond to rows in these files:
- MSR: MSR_data_cleaned.csv, https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing
- Devign: function.json, https://drive.google.com/uc?id=1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF
For example, the 1st item in function.json will have example_index == 0, the 50th item has example_index == 49, …
Likewise, the 1st row in MSR_data_cleaned.csv will have example_index == 0, the 50th row has example_index == 49, …
For the combined dataset, both dataset and example_index are needed to uniquely identify an example.
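The composite key can be illustrated with a minimal sketch. It assumes the combined data is a CSV with dataset and example_index columns, as described in this README; the function name index_by_key and the sample rows are made up for illustration.

```python
import csv
import io


def index_by_key(rows):
    """Map (dataset, example_index) -> row for rows of a combined CSV.

    Raises ValueError if the composite key is not unique.
    """
    index = {}
    for row in rows:
        key = (row["dataset"], int(row["example_index"]))
        if key in index:
            raise ValueError(f"duplicate key: {key}")
        index[key] = row
    return index


# Tiny in-memory stand-in for a combined CSV (values are made up).
sample = io.StringIO(
    "dataset,example_index,project\n"
    "MSR,0,projA\n"
    "Devign,0,projB\n"
    "MSR,49,projA\n"
)
lookup = index_by_key(csv.DictReader(sample))
```

Here example_index 0 appears twice, once per source dataset, so only the pair (dataset, example_index) distinguishes the two rows.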
DEPRECATED: For MSR, include both before and after versions of code for each example_index in the same dataset partition.
We are now excluding all after examples from MSR.
dataset_size
Meaning of filenames:
- imbalanced_XXX.csv: proportion XXX sampled from the imbalanced version of the combined dataset (Devign + MSR). imbalanced_1.0.csv contains the entire combined dataset.
- balanced_XXX.csv: balanced version of imbalanced_XXX.csv (excludes rows from MSR where vul == 0).
Label proportions:
target  imbalanced  balanced
0.0     0.897009    0.524416
1.0     0.102991    0.475584
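As a rough sketch of how a balanced file relates to its imbalanced counterpart, the snippet below drops the MSR rows labelled 0 and recomputes the label proportions. The label column name is an assumption: this README uses vul in the filename description and target in the proportions table, so it is left as a parameter. The helper names and sample rows are hypothetical and do not reproduce the numbers above.

```python
from collections import Counter


def balance(rows, label_column="vul"):
    """Drop MSR rows whose label is 0; keep every other row."""
    return [
        row for row in rows
        if not (row["dataset"] == "MSR" and float(row[label_column]) == 0)
    ]


def label_proportions(rows, label_column="vul"):
    """Return {label: fraction of rows carrying that label}."""
    counts = Counter(float(row[label_column]) for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up rows: three from MSR, one from Devign.
rows = [
    {"dataset": "MSR", "vul": "0"},
    {"dataset": "MSR", "vul": "0"},
    {"dataset": "MSR", "vul": "1"},
    {"dataset": "Devign", "vul": "0"},
]
balanced = balance(rows)
```

Non-vulnerable examples from Devign survive the filter, which is why the balanced files still contain label 0.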
Meaning of columns:
- dataset: MSR or Devign, which dataset this example came from
- example_index: numerical index for the example which is unique within a dataset (but not unique in combined datasets)
- project: metadata, name of the project this code came from
- commit_id: metadata, ID of the commit this code came from
- split: dataset split to which this example is allocated
cross-project and project_diversity
Note: fold_XXX_holdout is identical for the same fold index in both the cross-project and project_diversity settings. It is provided separately in each setting for convenience.
Label proportions of first fold:
dataset                total   target 0  target 1
cross_project_dataset  199536  0.945373  0.054627
diverse_dataset        92582   0.941911  0.058089
nondiverse_dataset     91590   0.952113  0.047887
cross-project
Meaning of filenames:
- fold_XXX denotes that the file belongs to fold number XXX.
- fold_XXX_holdout.csv is the holdout set, which is the set of ~10k examples whose projects are distinct from the projects in fold_XXX_dataset.csv.
- fold_XXX_dataset.csv is the mixed-project dataset, which is split into train/validation/test without separating projects.
Steps for the experiment. For each fold fold_XXX:
- Train on the train split in fold_XXX_dataset.csv and use the valid split for validation.
- Evaluate on both the test split of fold_XXX_dataset.csv and all the examples in fold_XXX_holdout.csv.
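The per-fold procedure above could be organized roughly as follows. This is a sketch, not the authors' pipeline: it assumes rows have already been loaded as dictionaries with split and project columns, and that split takes the values train, valid, and test. The function name and sample rows are hypothetical.

```python
from collections import defaultdict


def plan_cross_project_fold(dataset_rows, holdout_rows):
    """Group one fold's rows into training, validation, and evaluation sets.

    Both arguments are lists of dict rows. Also checks that no holdout
    project appears in the mixed-project dataset.
    """
    by_split = defaultdict(list)
    for row in dataset_rows:
        by_split[row["split"]].append(row)

    dataset_projects = {row["project"] for row in dataset_rows}
    overlap = dataset_projects & {row["project"] for row in holdout_rows}
    if overlap:
        raise ValueError(f"holdout shares projects with dataset: {sorted(overlap)}")

    return {
        "train": by_split["train"],
        "valid": by_split["valid"],
        # Two evaluation sets: in-distribution test split and unseen projects.
        "eval_sets": {"test": by_split["test"], "holdout": list(holdout_rows)},
    }


# Made-up rows standing in for fold_XXX_dataset.csv and fold_XXX_holdout.csv.
dataset_rows = [
    {"project": "projA", "split": "train"},
    {"project": "projB", "split": "valid"},
    {"project": "projA", "split": "test"},
]
holdout_rows = [{"project": "projC"}]
fold_plan = plan_cross_project_fold(dataset_rows, holdout_rows)
```

The disjointness check is an optional sanity test; it simply restates the property that holdout projects are distinct from the projects in the mixed-project dataset.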
project_diversity
Meaning of filenames:
- nondiverse.csv is the nondiverse set containing only examples from Chrome, which is split into training/validation splits. This set is the same for all folds, so it is not identified by fold_XXX.
- fold_XXX denotes that the file belongs to fold number XXX.
- fold_XXX_diverse.csv is the diverse set, which excludes Chrome and the projects in holdout, and is split into training/validation splits.
- fold_XXX_holdout.csv is the holdout set, which is the set of ~10k examples whose projects are distinct from the projects in the diverse and nondiverse sets.
Steps for the experiment:
- Train/validate the model on the train/valid splits, respectively, in nondiverse.csv.
- Train/validate a model on each fold_XXX_diverse.csv.
- Evaluate both the nondiverse and diverse models on fold_XXX_holdout.csv for each fold. Evaluate each diverse model only on the holdout set of its own fold.
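One way to enumerate the evaluation runs described above is sketched below. The fold identifiers are passed in as strings and substituted verbatim for XXX, since the exact numbering scheme of the files is not stated here; the function name is hypothetical.

```python
def diversity_eval_plan(folds):
    """List (training file, holdout file) pairs for the diversity experiment.

    The single nondiverse model is evaluated on every fold's holdout set;
    each diverse model is evaluated only on the holdout set of its own fold.
    """
    plan = []
    for fold in folds:
        holdout = f"fold_{fold}_holdout.csv"
        plan.append(("nondiverse.csv", holdout))
        plan.append((f"fold_{fold}_diverse.csv", holdout))
    return plan


# Example with two placeholder fold identifiers.
eval_plan = diversity_eval_plan(["0", "1"])
```

For n folds this yields 2n evaluation runs but only n + 1 trained models, because nondiverse.csv is shared across folds.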
bug_type
Each file contains the examples from a set of CWEs which have similar semantics.
All files are mutually exclusive.
All files contain train/valid/test splits.
Steps for the experiment. For each file bugtype_XXX:
- Train on the train split in bugtype_XXX.csv and use the valid split for validation.
- Evaluate on the test split of bugtype_XXX.csv.
- Evaluate on the test split of each other bugtype_YYY.csv.
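The resulting train/evaluate grid can be sketched as a mapping from each training file to the files whose test split it is evaluated on. The bug type names used in the example are placeholders, not the actual file names, and the function name is hypothetical.

```python
def bugtype_eval_plan(bug_types):
    """Map each training file to the files whose test split it is evaluated on.

    The model's own file comes first (same bug type), followed by every
    other file (cross bug type).
    """
    files = [f"bugtype_{name}.csv" for name in bug_types]
    return {
        train_file: [train_file] + [f for f in files if f != train_file]
        for train_file in files
    }


# Placeholder bug type names for illustration only.
grid = bugtype_eval_plan(["aaa", "bbb", "ccc"])
```

With k files this gives k trained models and k * k evaluations, one same-type and k - 1 cross-type per model.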