PPOCoder

Official Implementation of Execution-based Code Generation using Deep Reinforcement Learning

Overview

The use of programming language (PL) models pretrained on large-scale code corpora to automate software engineering processes has shown considerable promise in streamlining code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting sequence-level features of code such as compilability and syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that combines pretrained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning and incorporates execution feedback as an external source of knowledge in model optimization. PPOCoder is transferable across different code generation tasks and PLs.

<p align="center">
  <img src="https://github.com/reddy-lab-code-research/PPOCoder/blob/main/images/ppocoder_overview.jpg" width="80%" />
  <br>
  <b>Overview of PPOCoder with actor and critic models</b>: The action is sampled from the policy based on the given source data $x$ (NL or PL). Then, a reward is obtained for each action to guide and control policy updates. The reward function is composed of four elements: (a) compiler feedback; (b) a syntactic matching score based on ASTs; (c) a semantic matching score based on DFGs; and (d) a KL-divergence penalty between the active policy and the reference pretrained model. The critic model estimates the value based on the obtained reward, and PPOCoder is optimized with PPO, which takes into account both value and policy optimization.
</p>
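
For intuition, here is a minimal sketch (not the repository's code) of how the four reward terms above could be combined for a generated sample. The helper scores, their equal weighting, and the default `kl_coef` value are assumptions for illustration only.

```python
# Minimal illustrative sketch of the composite reward described above.
# The individual scores are assumed to be precomputed in [0, 1]; the equal
# weighting of the first three terms is an assumption, not the paper's
# exact formulation.

def combined_reward(compiler_feedback: float,
                    ast_match: float,
                    dfg_match: float,
                    kl_to_ref: float,
                    kl_coef: float = 0.1) -> float:
    """Execution/compiler feedback plus syntactic (AST) and semantic (DFG)
    matching scores, minus a KL-divergence penalty that keeps the active
    policy close to the reference pretrained model."""
    return compiler_feedback + ast_match + dfg_match - kl_coef * kl_to_ref


# Example: a compilable sample with partial AST/DFG overlap with the target.
print(combined_reward(compiler_feedback=1.0, ast_match=0.6,
                      dfg_match=0.4, kl_to_ref=2.5))  # 1.75
```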

Environment Installation

To run the code, install the dependencies in requirements.txt:

```bash
pip install -r requirements.txt
```

Datasets

We finetune/evaluate models on the major dataset benchmarks for different code generation tasks; the run configuration below uses the XLCoST Java-to-C++ code translation split as an example.

We preprocess the data and construct input/output sequences in the same manner as outlined in the original benchmark papers. Unzip and place all benchmarks in the data folder.
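
For example, with the XLCoST Java-to-C++ split used in the run configuration below, the data is expected under the default `data_path` value (the exact subfolder names depend on the benchmark archive; this layout is an assumption inferred from that value):

```
data/
└── xlcost/
    └── java-cpp/
```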

Run

We provide the run.sh script to execute PPO-based fine-tuning of the PL model with the compiler signal. To run the script for different code generation tasks, configure the following parameters:

| Parameters | Description | Example Values |
|---|---|---|
| `l1` | Source language | java |
| `l2` | Target language | cpp |
| `asp` | Action space size | 5 |
| `ns` | Number of synthetic samples | 10 |
| `data_path` | Path to the original data samples | data/xlcost/java-cpp/ |
| `output_path` | Path to save generations and outputs | saved_results/java-cpp/ |
| `baseline_output_dir` | Path to the outputs of the base finetuned CodeT5 (before RL) | baselines/saved_models/java-cpp/ |
| `load_model_path` | Path to the base finetuned CodeT5 model (before RL) for each downstream task | baselines/saved_models/java-cpp/pytorch_model.bin |
| `max_source_length` | Maximum source length | 400 |
| `max_target_length` | Maximum target length | 400 |
| `train_batch_size` | Training batch size | 32 |
| `test_batch_size` | Testing batch size | 48 |
| `lr` | Learning rate | 1e-6 |
| `kl_coef` | Initial coefficient of the KL-divergence penalty in the reward | 0.1 |
| `kl_target` | Target KL value that adaptively controls the KL coefficient (see the sketch after this table) | 1 |
| `vf_coef` | Coefficient of the value-function (vf) error in the PPO loss | 1e-3 |
| `run` | Index of the run | 1 |
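
As a rough illustration of how `kl_target` can adaptively control `kl_coef`, the sketch below shows one common adaptive-KL scheme (in the spirit of the adaptive-KL variant of PPO). It is an assumption for intuition, not the repository's exact update rule; the clipping bound and horizon constant are hypothetical.

```python
# Illustrative adaptive-KL controller (an assumption, not PPOCoder's exact rule):
# raise the KL penalty coefficient when the observed KL between the active
# policy and the reference model overshoots kl_target, lower it otherwise.

def update_kl_coef(kl_coef: float, observed_kl: float, kl_target: float,
                   n_steps: int, horizon: int = 10_000) -> float:
    # Proportional error, clipped to keep updates smooth (bounds are hypothetical).
    proportional_error = max(min(observed_kl / kl_target - 1.0, 0.2), -0.2)
    return kl_coef * (1.0 + proportional_error * n_steps / horizon)


# Example: observed KL is twice the target, so the coefficient grows slightly.
print(update_kl_coef(kl_coef=0.1, observed_kl=2.0, kl_target=1.0, n_steps=256))  # ≈ 0.1005
```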

Running run.sh saves the generated programs to a .txt file and the model weights at the end of each epoch.

Alternatively, you can call rl_run.py directly; the arguments mirror the parameters in the table above:

```bash
cd PPOCoder
python rl_run.py \
    --run 1 \
    --l1 java \
    --l2 cpp \
    --asp 5 \
    --ns 10 \
    --data_path DATA-PATH \
    --output_path OUTPUT-PATH \
    --load_model_path LOAD-MODEL-PATH \
    --baseline_out_dir BASELINE-PATH \
    --max_source_length 400 \
    --max_target_length 400 \
    --train_batch_size 32 \
    --test_batch_size 48 \
    --lr 1e-6 \
    --kl_coef 0.1 \
    --kl_target 1 \
    --vf_coef 1e-3
```

You can apply this code to different tasks by modifying these parameters.

Citation

If you find the paper or the repo useful, please cite it with:

<pre>
@article{shojaee2023ppocoder,
  title={Execution-based code generation using deep reinforcement learning},
  author={Shojaee, Parshin and Jain, Aneesh and Tipirneni, Sindhu and Reddy, Chandan K},
  journal={arXiv preprint arXiv:2301.13816},
  year={2023}
}
</pre>