

NatGen: Generative Pre-training by "Naturalizing" Source Code.

Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T Devanbu, Baishakhi Ray. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’22), November 14-18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3540250.3549162.

<br/> <p align="center"> <a href="https://github.com/saikat107/NatGen/issues-raw"> <img src="https://img.shields.io/github/issues-raw/saikat107/NatGen"/> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/issues-closed-raw"> <img src="https://img.shields.io/github/issues-closed-raw/saikat107/NatGen" /> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/issues-pr-raw"> <img src="https://img.shields.io/github/issues-pr-raw/saikat107/NatGen"/> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/issues-pr-closed-raw"> <img src="https://img.shields.io/github/issues-pr-closed-raw/saikat107/NatGen"/> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/network/members"> <img src="https://img.shields.io/github/forks/saikat107/NatGen"/> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/stargazers"> <img src="https://img.shields.io/github/stars/saikat107/NatGen"/> </a> &nbsp; <a href="https://github.com/saikat107/NatGen/LICENSE"> <img src="https://img.shields.io/github/license/saikat107/NatGen"/> </a> &nbsp; <img src="https://img.shields.io/github/languages/count/saikat107/NatGen"/> &nbsp; <img src="https://img.shields.io/github/languages/top/saikat107/NatGen"/> &nbsp; <img src="https://img.shields.io/github/last-commit/saikat107/NatGen"/> </p>

<p align="center">The paper | Slide Deck</p>

Getting Started

Environment Requirements


To set up the environment, uncomment lines 35 and 36 (or run those commands in your shell), then run:

bash setup.sh

Download and preprocess the training data

cd scripts/pretraining;
bash process_data.sh

Data processing takes several parameters, which are passed through a JSON configuration file. The configuration file should be in the configs/pretraining/data_config directory.
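As a rough illustration of how such a configuration could be inspected before running the processing script, the sketch below lists and loads JSON files from the configs/pretraining/data_config directory. The helper names (`list_data_configs`, `load_data_config`) are hypothetical and not part of this repository, and the sketch makes no assumption about which keys the configuration contains.

```python
import json
from pathlib import Path
from typing import List

# Directory named in this README; everything else below is a hypothetical helper.
CONFIG_DIR = Path("configs/pretraining/data_config")


def list_data_configs(config_dir: Path = CONFIG_DIR) -> List[str]:
    """Return the names (without .json) of the config files found in config_dir."""
    config_dir = Path(config_dir)
    if not config_dir.is_dir():
        return []
    return sorted(path.stem for path in config_dir.glob("*.json"))


def load_data_config(name: str, config_dir: Path = CONFIG_DIR) -> dict:
    """Load <config_dir>/<name>.json and check that it is a JSON object."""
    path = Path(config_dir) / f"{name}.json"
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return config


if __name__ == "__main__":
    # Print every available config name and the parameters it defines.
    for config_name in list_data_configs():
        print(config_name, sorted(load_data_config(config_name)))
```

Checking the file this way can catch a malformed JSON file before a long preprocessing run starts.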

Pretrain the model

cd scripts/pretraining;
bash train.sh <EXPERIMENT_NAME> <GPUS>

In the training arguments JSON file, adjust per_device_train_batch_size and gradient_accumulation_steps, together with the number of GPUs you use, to reach the desired effective batch size. The effective batch size is per_device_train_batch_size * gradient_accumulation_steps * number of GPUs. We use distributed training for pre-training.
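The arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `effective_batch_size` is hypothetical, and the numbers in the example are illustrative values, not the settings used for the paper.

```python
import json


def effective_batch_size(training_args: dict, num_gpus: int) -> int:
    """Compute per_device_train_batch_size * gradient_accumulation_steps * num_gpus."""
    per_device = int(training_args["per_device_train_batch_size"])
    # Assumption: treat a missing gradient_accumulation_steps as 1 (no accumulation).
    accumulation = int(training_args.get("gradient_accumulation_steps", 1))
    if per_device <= 0 or accumulation <= 0 or num_gpus <= 0:
        raise ValueError("batch size, accumulation steps, and GPU count must be positive")
    return per_device * accumulation * num_gpus


# Illustrative training arguments, written the way they might appear in a JSON file.
example_args = json.loads(
    '{"per_device_train_batch_size": 8, "gradient_accumulation_steps": 4}'
)
print(effective_batch_size(example_args, num_gpus=4))  # 8 * 4 * 4 = 128
```

If the number of GPUs changes, scaling gradient_accumulation_steps in the opposite direction keeps the effective batch size constant.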

We reused source code from various open-source repositories:

  1. CodeT5
  2. Microsoft CodeXGLUE

Our sincere thanks to the authors of these repositories for open-sourcing their work.


If you use this repository, please cite:

    author = {Chakraborty, Saikat and Ahmed, Toufique and Ding, Yangruibo and Devanbu, Premkumar T. and Ray, Baishakhi},
    title = {NatGen: Generative Pre-Training by “Naturalizing” Source Code},
    year = {2022},
    isbn = {9781450394130},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3540250.3549162},
    doi = {10.1145/3540250.3549162},
    booktitle = {Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
    pages = {18–30},
    numpages = {13},
    keywords = {Neural Network, Semantic Preserving Transformation, Source Code Transformer, Source Code Pre-training},
    location = {Singapore, Singapore},
    series = {ESEC/FSE 2022}