Jam-CGPT: Distilled GPT for Source Code Summarization

Code for Distilled GPT for Source Code Summarization

Proposed by: Chia-Yi Su and Collin McMillan

This repository contains all the code and detailed instructions needed to rebuild the Jam-CGPT models hosted on our HuggingFace Automatic Program Comprehension Lab hub.

To set up your local environment, run the following command. We recommend using a virtual environment for running the experiments.

pip install -r requirements.txt
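
If you prefer an isolated environment, a minimal sketch is shown below (this assumes Python 3 with the built-in venv module; the environment name is arbitrary):

python3 -m venv jam-cgpt-env
source jam-cgpt-env/bin/activate
pip install -r requirements.txt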

Finetuning

These steps show how to fine-tune Jam-CGPT as we did in our paper.

Step 1: Download the finetuning dataset

You can download all of the datasets used in our paper from our HuggingFace repo. Please put train.bin and val.bin in the same directory as the --dataset path specified in config/finetune_model_350m_dataset_170k.py.
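
For example, if you have the huggingface_hub CLI installed, the download could look like the sketch below (the repository id is a placeholder for the actual Jam-CGPT dataset repo, and data/jam_cgpt_170k is assumed to match the --dataset path in the config; adjust both to your setup):

huggingface-cli download <your-hf-org>/<jam-cgpt-dataset-repo> train.bin val.bin --repo-type dataset --local-dir data/jam_cgpt_170k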

Step 2: Download the models for finetuning

Please download the checkpoint files named ckpt_pretrain.pt from our HuggingFace repo and put them in the same directory as --out_dir in config/finetune_model_350m_dataset_170k.py.
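
As with the dataset, a hedged sketch using the huggingface_hub CLI is shown below (the repository id is a placeholder, and the --local-dir value should be replaced with the --out_dir value from the config):

huggingface-cli download <your-hf-org>/<jam-cgpt-model-repo> ckpt_pretrain.pt --local-dir <out_dir-from-config>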

Step 3: Fine-tune the model

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='0' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=1 train.py config/finetune_model_350m_dataset_170k.py 
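
If you have more than one GPU on the node, the same command can be scaled by listing the GPUs and raising --nproc_per_node to match. A sketch for two GPUs (all other flags unchanged; this assumes a standard single-node torchrun setup):

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='0,1' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=2 train.py config/finetune_model_350m_dataset_170k.py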

Inference

After you download and extract the test set named jam_cgpt_test.tar.gz from our HuggingFace repo, you can simply run the command below for inference.

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='0' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=1 sample_jam_cgpt.py config/finetune_model_350m_dataset_170k.py  --prediction_filename=predict_data170k_model350m.txt
--outdir: directory of the model checkpoint that you want to use for inference
--prediction_filename: name of the file to write predictions to
--outfilename: name of the checkpoint file to load
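
For example, an invocation that sets all three options might look like the following (the directory and checkpoint names are placeholders; substitute the values that match your setup):

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='0' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=1 sample_jam_cgpt.py config/finetune_model_350m_dataset_170k.py --outdir=<model_dir> --outfilename=<checkpoint.pt> --prediction_filename=predict_data170k_model350m.txt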

Note that if you just want to run inference with our models, you need to download the checkpoint files from our HuggingFace repo and put them in the same directory as --out_dir in config/finetune_model_350m_dataset_170k.py.

Metrics

We provide scripts for calculating the metrics that we report in the paper. The following commands compute the METEOR and USE scores, respectively.

python3 meteor.py jam_cgpt_predictions/predict_data170k_model350m.txt --coms-filename=cgptcom.test --data=./data/jam_cgpt_170k
python3 use_score_v.py jam_cgpt_predictions/predict_170k_100mparameters.txt --gpu=0 --coms-filename=cgptcom.test --data=./data/jam_cgpt_170k

Dataset

We also release all of our raw datasets for the experiments in our HuggingFace repo, along with the scripts in this GitHub repo for compiling the raw data into bin files. Before running the preparation script, please create three directories (pkls, bins, and tmp) as in the sketch below, then run the command that follows it to generate train.bin and val.bin.
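
For example, assuming the script expects these directories relative to your working directory (check data/jam_cgpt_170k/prepare_fc_raw.py for the exact paths it uses):

mkdir -p pkls bins tmp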

python3 data/jam_cgpt_170k/prepare_fc_raw.py

Pretraining

We also release the config files for pretraining the Jam-CGPT 38m and 110m models: train_jam_cgpt_raw_38m.py and train_jam_cgpt_raw_110m.py. You can find the script for pretraining the 350m model and the pretraining instructions in the Jam repo. The data for pretraining is in our HuggingFace jam repo; a sketch of a pretraining invocation is given after the citation below. Please cite the use of the dataset as follows:

@inproceedings{su2023language,
      title={A Language Model of Java Methods with Train/Test Deduplication}, 
      author={Chia-Yi Su and Aakash Bansal and Vijayanta Jain and Sepideh Ghanavati and Collin McMillan},
      month={December},
      year={2023},
      booktitle={Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
      location = {San Francisco, CA, USA},
      series = {ESEC/FSE 2023}
}
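
As a rough sketch of a pretraining run, mirroring the fine-tuning command above (this assumes the pretraining config lives in config/ like the fine-tuning config; adjust CUDA_VISIBLE_DEVICES and --nproc_per_node to your hardware):

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='0' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=1 train.py config/train_jam_cgpt_raw_38m.py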

Citation

This work was accepted to Automated Software Engineering, an academic journal. If you use this work in an academic paper, please cite the following:

@article{su2024distilled,
      title={Distilled GPT for Source Code Summarization}, 
      author={Chia-Yi Su and Collin McMillan},
      journal={Automated Software Engineering},
      year={2024}
}

Preprint PDF available here: https://arxiv.org/abs/2308.14731