Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
This repository contains the Java obfuscation tool built with Spoon and the dataset pipeline described in:
Rhys Compton, Eibe Frank, Panos Patros, and Abigail Koay - Embedding Java Classes with code2vec: Improvements from Variable Obfuscation, in MSR '20 [ArXiv Preprint]
All models and data used in the paper are also included, for reproduction and further research.
Table of Contents
- Downloadable Assets
- Requirements
- Usage - Obfuscator
- Usage - Dataset Pipeline
- Trained code2vec Models
- Datasets
- Citation
Downloadable Assets
Requirements
- Java 8+
- Python 3
Usage: Obfuscator
cd java-obfuscator
- Locate a folder of .java files (e.g., from the code2seq repository)
- Set the input and output directories in obfs-script.sh, as well as the number of threads for your machine. If you're running this on a particularly large folder (e.g., millions of files), you may need to increase NUM_PARTITIONS to 3 or 4; otherwise memory issues can occur, grinding the obfuscator to a near halt.
- Run obfs-script.sh, i.e. $ source obfs-script.sh
This produces a new folder of obfuscated .java files that can be used to train a new obfuscated code2vec model (or any other model that learns from source code).
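To make the idea of variable obfuscation concrete, here is a toy Python sketch that consistently replaces a given set of variable names with opaque ones. It is only an illustration: the tool in java-obfuscator is built with Spoon and works on Java source, whereas this sketch uses a plain regular expression. The function name obfuscate_names and the var0, var1, ... naming scheme are assumptions made for the example, not the tool's behaviour.

```python
import re


def obfuscate_names(source, names):
    """Replace each identifier in `names` with an opaque name (var0, var1, ...).

    Toy illustration only: a regex cannot tell variables apart from
    identical words inside strings or comments, which a real parser can.
    """
    if not names:
        return source
    # Same original name always maps to the same opaque name.
    mapping = {name: f"var{i}" for i, name in enumerate(names)}
    # \b keeps us from rewriting substrings of longer identifiers.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], source)


java_snippet = "int total = 0; for (int count = 0; count < n; count++) { total += count; }"
print(obfuscate_names(java_snippet, ["total", "count"]))
```

The renaming is consistent within the snippet, so the program structure is untouched and only the information carried by the names is removed.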
Usage: Dataset Pipeline
The pipeline uses a trained code2vec model as a feature extractor, converting a classification dataset of .java files into a numerical form (.arff by default) that can then be used as input for any standard classifier.
All of the model-related code (common.py, model.py, PathContextReader.py) and the JavaExtractor folder are taken from the original code2vec repository. This code is used to invoke the trained code2vec models to create method embeddings, i.e. to use the code2vec model as a feature extractor.
The dataset should be in the form of those supplied with this paper i.e.:
dataset_name
|-- class1
    |-- file1.java
    |-- file2.java
    ...
|-- class2
    |-- file251.java
    |-- file252.java
    ...
...
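As a quick sanity check of this layout before running the pipeline, the following Python sketch walks a dataset folder and pairs every .java file with the name of its parent class folder. The function name list_labelled_files is hypothetical and the script is not part of the pipeline; it only assumes the one-folder-per-class structure shown above.

```python
from pathlib import Path


def list_labelled_files(dataset_dir):
    """Return sorted (class_label, java_file_path) pairs for a dataset folder.

    Assumes one sub-folder per class, each holding that class's .java files.
    """
    root = Path(dataset_dir)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Non-.java files in a class folder are ignored.
        for java_file in sorted(class_dir.glob("*.java")):
            pairs.append((class_dir.name, java_file))
    return pairs
```

Counting the pairs per label gives the class distribution, which is a cheap way to confirm that files ended up in the intended folders.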
To run the dataset pipeline and create class-level embeddings for a dataset of Java files:
cd pipeline
pip install -r requirements.txt
- Download a .java dataset (from the datasets supplied or your own) and put it in the java_files/ directory
- Download a code2vec model checkpoint and put the checkpoint folder in the models/ directory
- Change the paths and definitions in model_defs.py and the number of models in scripts/create_datasets.sh to match your setup
- Run create_datasets.sh (source scripts/create_datasets.sh). This will loop through each model and create class-level embeddings for the supplied datasets. The resulting datasets will be in .arff format in the weka_files/ folder.
You can now perform class-level classification on the dataset using any off-the-shelf WEKA classifier. Note that the dataset contains the original filename as a string attribute for debugging purposes; you'll likely need to remove this attribute before you pass the dataset into a classifier.
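If you would rather strip that attribute outside WEKA, the sketch below removes every attribute of type string from ARFF text using only the Python standard library. It is a minimal sketch under stated assumptions: the file uses the dense ARFF row format (not the sparse {index value} form), and the function names split_arff_row and drop_string_attributes are hypothetical helpers, not part of this repository.

```python
def split_arff_row(line):
    """Split one dense ARFF data row on commas that sit outside quotes."""
    fields, current, quote = [], [], None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(line):
                # Keep an escaped character as-is, including an escaped quote.
                current.append(line[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch == ",":
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current).strip())
    return fields


def drop_string_attributes(arff_text):
    """Return ARFF text with all string-typed attributes and their values removed."""
    out, drop, attr_index, in_data = [], set(), 0, False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if not stripped or stripped.startswith("%"):
            out.append(line)  # blank lines and comments pass through
        elif in_data:
            fields = split_arff_row(stripped)
            out.append(",".join(f for i, f in enumerate(fields) if i not in drop))
        elif lower.startswith("@attribute"):
            if lower.split()[-1] == "string":
                drop.add(attr_index)  # remember the column to remove
            else:
                out.append(line)
            attr_index += 1
        else:
            if lower.startswith("@data"):
                in_data = True
            out.append(line)
    return "\n".join(out) + "\n"
```

Matching on the attribute type rather than on a name or position means the sketch does not need to know what the filename attribute is called or where it sits in the header.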
Config
By default the pipeline uses the full range of values for each parameter, which creates a huge number of resulting .arff datasets (>1000). To reduce this number, remove (or comment out) some of the items in the arrays at the end of reduction_methods.py and selection_methods.py. Our experiments showed that the SelectAll selection method and the NoReduction reduction method performed best in most cases, so you may want to keep just these.
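The Python sketch below illustrates why trimming those arrays shrinks the output so quickly, assuming the pipeline produces one dataset per combination of model, selection method and reduction method (that one-to-one mapping is an assumption, not something this README states). Apart from SelectAll and NoReduction, every name in the lists is a hypothetical placeholder, and config_grid is not a function from the pipeline.

```python
from itertools import product


def config_grid(models, selection_methods, reduction_methods):
    """Every (model, selection, reduction) combination (assumed: one output each)."""
    return list(product(models, selection_methods, reduction_methods))


# Hypothetical placeholder names; only SelectAll and NoReduction come from this README.
models = ["model_a", "model_b"]
selections = ["SelectAll", "selection_b", "selection_c"]
reductions = ["NoReduction", "reduction_b", "reduction_c", "reduction_d"]

full = config_grid(models, selections, reductions)
trimmed = config_grid(models, ["SelectAll"], ["NoReduction"])
print(len(full), len(trimmed))  # 24 2
```

Because the count is a product of the list lengths, removing entries from either array reduces the number of generated files multiplicatively.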
Trained code2vec Models
The models are all available for download: Zenodo Link.
The .java datasets used to train each of the models (different versions of java-large from the code2seq repository), as well as the preprocessed code2vec-ready versions of those datasets, are also available: Google Drive Link
Datasets
The .java datasets collated for this research are all available for download: Zenodo Link.
For the interactive embedding visualisation links below, the best results are often seen with UMAP.
The class distributions shown below were generated by WEKA.
OpenCV/Spring
2 categories, 305 instances
Algorithm Classification
7 categories, 182 instances
Code Author Attribution
13 categories, 1062 instances
Bug Detection
2 categories, 31135 instances*
Duplicate File Detection
2 categories, 1669 instances
Duplicate Function Detection
2 categories, 1277 instances
Malware Classification
This dataset can't be shared for security reasons; however, you can request it from the original authors: http://amd.arguslab.org/
3 categories, 20927 instances*
Notes
* 2000 samples per class were randomly sampled during experiments, so the results in the paper are reported on a smaller dataset. The downloadable dataset is the full version.
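To work with a similarly reduced version of the starred datasets, a per-class random sample can be drawn as in the Python sketch below. This is not the script used for the paper: the function name sample_per_class, the fixed seed, and the choice to keep classes with fewer than 2000 files whole are all assumptions.

```python
import random


def sample_per_class(files_by_class, per_class=2000, seed=0):
    """Randomly keep at most `per_class` files for each class label.

    `files_by_class` maps a class label to a list of file names.
    The seed is an arbitrary choice that makes the sample repeatable.
    """
    rng = random.Random(seed)
    sampled = {}
    for label, files in files_by_class.items():
        files = sorted(files)  # fixed order so the seed gives a stable result
        if len(files) <= per_class:
            sampled[label] = files
        else:
            sampled[label] = rng.sample(files, per_class)
    return sampled
```

A sample drawn this way will generally differ from the one used in the paper, so results on it are comparable only approximately.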
Citation
Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
@inproceedings{10.1145/3379597.3387445,
author = {Compton, Rhys and Frank, Eibe and Patros, Panos and Koay, Abigail},
title = {Embedding Java Classes with Code2vec: Improvements from Variable Obfuscation},
year = {2020},
isbn = {9781450375177},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3379597.3387445},
doi = {10.1145/3379597.3387445},
booktitle = {Proceedings of the 17th International Conference on Mining Software Repositories},
pages = {243–253},
numpages = {11},
keywords = {machine learning, code obfuscation, neural networks, code2vec, source code},
location = {Seoul, Republic of Korea},
series = {MSR '20}
}