<!-- README -->

This is the official documentation for the JEMMA project.

JEMMA is an Extensible Java dataset for Many ML4Code Applications. It is primarily a dataset of Java code entities at multiple granularities, their properties, and representations. To help users interact and work with the data seamlessly, we have added Workbench capabilities to it as well.

This repository hosts the Workbench part of JEMMA. The raw data is hosted on Zenodo and can be downloaded at any time while using the Workbench. The following sections provide more details.



<a id="setup-instructions"></a>

Setup Instructions

<!-- > Getting started with jemma -->

First steps: Install jemma locally

1. $ git clone https://github.com/giganticode/jemma.git 
2. $ cd jemma/ 
3. $ pip install -r requirements.txt
4. $ pip install -e .

Next steps: Download all the datasets <br> Sign up at Zenodo.org and generate an API token [IMPORTANT!]

5. $ cd jemma/download/ 
6. $ nano config.ini (& replace the dummy `access_token` with your API key)
7. $ python3 download.py 
8. $ python3 sanity_checks.py
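
Before running `download.py`, it can help to confirm that `config.ini` really contains your own token. The following is a minimal sketch and not part of JEMMA: the function name `find_access_token` and the placeholder value `"dummy"` are hypothetical, and the code deliberately makes no assumption about which section of `config.ini` holds the `access_token` key.

```python
import configparser


def find_access_token(config_text, placeholder="dummy"):
    """Return the first non-placeholder `access_token` found in any section.

    `placeholder` is a hypothetical stand-in for whatever dummy value
    ships in config.ini; adjust it to match the real file.
    """
    # Disable interpolation so a token containing '%' is read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    for name in parser:  # iterates over all sections, including DEFAULT
        token = parser[name].get("access_token", "").strip()
        if token and token != placeholder:
            return token
    raise ValueError("no usable access_token found in config")
```

Reading the file and passing its text to `find_access_token` raises a `ValueError` if the token is missing or still set to the placeholder.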

Getting to know JEMMA Datasets

JEMMA Metadata

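| Link to metadata | columns |
|---|---|
| projects | *project_id* |
| | *project_path* |
| | *project_name* |
| packages | *project_id* |
| | *package_id* |
| | *package_path* |
| | *package_name* |
| classes | *project_id* |
| | *package_id* |
| | *class_id* |
| | *class_path* |
| | *class_name* |
| methods | *project_id* |
| | *package_id* |
| | *class_id* |
| | *method_id* |
| | *method_name* |
| | *start_line* |
| | *end_line* |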
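
To illustrate how the methods metadata columns can be used, here is a small sketch that computes how many lines each method spans. It assumes the metadata is a CSV file with a header row using the column names listed above, and that `start_line` and `end_line` are both inclusive. The sample rows and the helper name `method_line_spans` are made up for illustration.

```python
import csv
import io

# Toy rows using the column names of the methods metadata table;
# the ID values are invented for illustration.
SAMPLE_METHODS_CSV = """\
project_id,package_id,class_id,method_id,method_name,start_line,end_line
p1,k1,c1,m1,getName,10,12
p1,k1,c1,m2,setName,14,19
"""


def method_line_spans(csv_file):
    """Map each method_id to the number of lines it spans (inclusive)."""
    spans = {}
    for row in csv.DictReader(csv_file):
        spans[row["method_id"]] = int(row["end_line"]) - int(row["start_line"]) + 1
    return spans


spans = method_line_spans(io.StringIO(SAMPLE_METHODS_CSV))
print(spans)  # {'m1': 3, 'm2': 6}
```

The same function accepts an open file handle, so it can be pointed at a downloaded metadata file instead of the in-memory sample.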

JEMMA Representations

| Representation Code | Representation Name | Link to dataset |
|---|---|---|
| TEXT | raw_source_code | https://doi.org/10.5281/zenodo.5813705 |
| TKNA | code_tokens (spaced) | https://doi.org/10.5281/zenodo.5813717 |
| TKNB | code_tokens (comma) | https://doi.org/10.5281/zenodo.5813730 |
| C2VC | code2vec* | https://doi.org/10.5281/zenodo.5813993 |
| C2SQ | code2seq* | https://doi.org/10.5281/zenodo.5814059 |
| FTGR | feature_graph* | https://doi.org/10.5281/zenodo.5813933 |
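
Each dataset link above is a Zenodo DOI. As a hedged sketch (this is not necessarily how `download.py` works), the numeric suffix of a version-specific Zenodo DOI can be mapped to the record endpoint of the Zenodo REST API. The helper name `zenodo_record_api_url` is hypothetical, and concept DOIs that cover all versions of a record may resolve differently.

```python
import re

# Matches the numeric record suffix of a Zenodo DOI, e.g. 10.5281/zenodo.5813705
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.(\d+)$")


def zenodo_record_api_url(doi_url):
    """Turn a Zenodo DOI (or doi.org link) into a REST API record URL."""
    match = ZENODO_DOI.search(doi_url.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi_url!r}")
    return f"https://zenodo.org/api/records/{match.group(1)}"


print(zenodo_record_api_url("https://doi.org/10.5281/zenodo.5813705"))
# https://zenodo.org/api/records/5813705
```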

JEMMA Properties

| Property Code | Property Name | Link to dataset |
|---|---|---|
| RSLK | resource_leak | https://doi.org/10.5281/zenodo.1096082 |
| NLDF | null_dereference | https://doi.org/10.5281/zenodo.1096080 |
| NMLC | num_local_calls | https://doi.org/10.5281/zenodo.7020084 |
| NMNC | num_non_local_calls | https://doi.org/10.5281/zenodo.7019960 |
| NUCC | num_unique_callees | https://doi.org/10.5281/zenodo.7019176 |
| NUPC | num_unique_callers | https://doi.org/10.5281/zenodo.7019128 |
| CMPX | cyclomatic_complexity | https://doi.org/10.5281/zenodo.5813084 |
| MXIN | max_indent | https://doi.org/10.5281/zenodo.5813081 |
| NAME | method_name | https://doi.org/10.5281/zenodo.5813308 |
| NMLT | num_literals | https://doi.org/10.5281/zenodo.5813054 |
| NMOP | num_operators | https://doi.org/10.5281/zenodo.5813055 |
| NMPR | num_parameters | https://doi.org/10.5281/zenodo.5813053 |
| NMRT | num_returns | https://doi.org/10.5281/zenodo.5813034 |
| NMTK | num_tokens | https://doi.org/10.5281/zenodo.5813032 |
| NTID | num_identifiers | https://doi.org/10.5281/zenodo.5813029 |
| NUID | num_unique_identifiers | https://doi.org/10.5281/zenodo.5813028 |
| SLOC | source_lines_of_code | https://doi.org/10.5281/zenodo.5813094 |
| TLOC | total_lines_of_code | https://doi.org/10.5281/zenodo.5813102 |
<!-- \textit{Properties:} \texttt{[TLOC]} & \url{} & 335.5 MB\Tstrut{}\\ \textit{Properties:} \texttt{[SLOC]} & \url{} & 335.0 MB \\ \textit{Properties:} \texttt{[NUID]} & \url{} & 335.6 MB \\ \textit{Properties:} \texttt{[NTID]} & \url{} & 336.7 MB \\ \textit{Properties:} \texttt{[NMTK]} & \url{} & 342.5 MB \\ \textit{Properties:} \texttt{[NMRT]} & \url{} & 333.3 MB \\ \textit{Properties:} \texttt{[NMPR]} & \url{} & 333.3 MB \\ \textit{Properties:} \texttt{[NMOP]} & \url{} & 334.5 MB \\ \textit{Properties:} \texttt{[NMLT]} & \url{} & 333.4 MB \\ \textit{Properties:} \texttt{[NAME]} & \url{} & 432.0 MB \\ \textit{Properties:} \texttt{[MXIN]} & \url{} & 267.0 MB \\ \textit{Properties:} \texttt{[CMPX]} & \url{} & 267.1 MB\Bstrut{}\\ \textit{Properties:} \texttt{[NUPC]} & \url{} & 333.3 MB \\ \textit{Properties:} \texttt{[NUCC]} & \url{} & 333.6 MB \\ \textit{Properties:} \texttt{[NMNC]} & \url{} & 334.0 MB \\ \textit{Properties:} \texttt{[NMLC]} & \url{} & 333.2 MB \\ % \textit{Properties:} \texttt{[NMTC]} & \url{https://doi.org/10.5281/zenodo.7019246} & 334.0 MB\Bstrut{}\\ \textit{Properties:} \texttt{[NLDF]} & \url{} & 333.6 MB \\ \textit{Properties:} \texttt{[RSLK]} & \url{} & 334.0 MB\Bstrut{}\\ -->

JEMMA Callgraphs

| Link to callgraphs data | columns |
|---|---|
| Callgraphs | *caller_project_id* |
| | *caller_class_id* |
| | *caller_method_id* |
| | *call_direction* |
| | *callee_project_id* |
| | *callee_class_id* |
| | *callee_method_id* |
<!-- | | *call_type* | -->
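
As an illustration of the call graph columns, the sketch below groups edges by caller and collects the distinct callees of each method. It assumes the call graph is a CSV file with a header row using the column names above. The sample rows, the `call_direction` value `x`, and the helper name `unique_callees` are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Toy edges using the call graph column names. All values are made up,
# including the `call_direction` value, whose real vocabulary is not shown here.
SAMPLE_CALLGRAPH_CSV = """\
caller_project_id,caller_class_id,caller_method_id,call_direction,callee_project_id,callee_class_id,callee_method_id
p1,c1,m1,x,p1,c1,m2
p1,c1,m1,x,p1,c2,m3
p1,c1,m1,x,p1,c1,m2
p1,c2,m3,x,p1,c1,m2
"""


def unique_callees(csv_file):
    """Map each caller_method_id to the set of distinct callee_method_ids."""
    callees = defaultdict(set)
    for row in csv.DictReader(csv_file):
        callees[row["caller_method_id"]].add(row["callee_method_id"])
    return dict(callees)


graph = unique_callees(io.StringIO(SAMPLE_CALLGRAPH_CSV))
print({caller: len(targets) for caller, targets in graph.items()})
# {'m1': 2, 'm3': 1}
```

Duplicate edges collapse because callees are stored in a set, so the counts reflect distinct callees rather than call sites.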

Working with JEMMA Workbench

List of API calls


projects

classes

methods

basic utils

task utils

Use Case Tutorials

[COMING SOON]


Submitting Pull Requests

To contribute new data to the JEMMA Datasets, users must fork this repository and clone it locally. Once JEMMA is cloned locally, users can run the processing scripts on local projects, which generates a set of CSV files (metadata, representations, properties, and call graphs). These files are the new data.

The freshly generated CSV files are to be included in the next commit. Users are advised to review the data before committing. Users can then push the changes to their fork of the JEMMA repository and submit a pull request for the generated data files.
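
One simple way to review generated files before committing is to check that each metadata file has the expected header. The sketch below is not part of the JEMMA processing scripts. It assumes each generated metadata file is a CSV with a header row, and the helper name `missing_columns` is hypothetical. The expected columns are copied from the metadata table in this README.

```python
import csv
import io

# Expected columns, copied from the metadata table in this README.
EXPECTED_COLUMNS = {
    "projects": ["project_id", "project_path", "project_name"],
    "packages": ["project_id", "package_id", "package_path", "package_name"],
    "classes": ["project_id", "package_id", "class_id", "class_path", "class_name"],
    "methods": ["project_id", "package_id", "class_id", "method_id",
                "method_name", "start_line", "end_line"],
}


def missing_columns(table, csv_file):
    """Return the expected columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [col for col in EXPECTED_COLUMNS[table] if col not in header]


sample = io.StringIO("project_id,project_name\np1,demo\n")
print(missing_columns("projects", sample))  # ['project_path']
```

An empty list means the header contains every expected column for that table.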

Once a pull request (data contribution) is submitted, the generated data will be validated for errors and inconsistencies and, if approved, integrated into the original dataset. The updated dataset will then be published on Zenodo, which lets us host multiple versions.

Here's the step-by-step procedure for submitting a pull request to JEMMA:

  1. Fork the JEMMA repository
  2. Clone your fork of the JEMMA repository to your local workspace
  3. Create a new branch
  4. Make your changes (run the JEMMA processing scripts)
  5. Commit the changes (commit new files generated)
  6. Push the changes to your JEMMA fork
  7. Create a pull request on GitHub

When you encounter issues

This is the alpha release of JEMMA. We have tested it with several use cases. However, there might still be bugs in the implementation that we hope to iron out in the next few months.

If you encounter a bug, please open a GitHub issue describing it.