

From Commit Message Generation to History-Aware Commit Message Completion


This repository provides a replication package for our paper :scroll: From Commit Message Generation to History-Aware Commit Message Completion, ASE 2023.

How to use



Step 1: Prepare raw data

:star2: Useful links: our dataset and/or the repo we used for data preparation.

<details> <summary>:yellow_heart: click here for more information on required data format</summary>

This project expects each dataset part to be stored in a separate JSON Lines file:

 ├── ...  # data directory
 │   ├── train.jsonl
 │   ├── val.jsonl
 │   └── test.jsonl
 └── ...

In our case, each input example is a commit. Note that commits from each author must appear in chronological order. Specifically, the following keys are expected in each row:
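
The chronological-order requirement can be checked before training. The snippet below is a minimal sketch, not part of this repository: the function name and the key names `author` and `date` are hypothetical placeholders, so substitute the keys your data actually uses. It also assumes that date values are directly comparable (for example, numeric timestamps or ISO 8601 strings in one consistent format).

```python
import json
from pathlib import Path


def find_out_of_order_authors(path, author_key="author", date_key="date"):
    """Return the set of authors whose rows are not in non-decreasing date order.

    `author_key` and `date_key` are hypothetical key names; replace them with
    the keys present in your dataset. Dates are compared as-is.
    """
    last_seen = {}   # author -> date of that author's most recent row so far
    offenders = set()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            author, date = row[author_key], row[date_key]
            if author in last_seen and date < last_seen[author]:
                offenders.add(author)
            last_seen[author] = date
    return offenders
```

An empty result for `train.jsonl`, `val.jsonl`, and `test.jsonl` means every author's commits appear in order.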

</details>


Step 2: Choose configuration

Model architecture

This project supports the following models:

For details, refer to the classes provided in src/model/configurations or the base configs provided in conf/model/base_configs.py.

You can find specific configs for the following models in conf/model/configs.py:

Input type

This project explores two kinds of input for a commit message completion task: diff and commit message history.

Step 3: Train

  1. Define the training configuration in conf/train_config.py.
  2. Choose one of the available model configs or add your own.
  3. Note that you have to define the missing parameters from InputConfig. You can do this via the CLI or by editing the config directly. An example of defining parameters via the CLI is given below.

To launch training of a model defined as XXXModelConfig and registered via ConfigStore.store(name="XXX", group="model", node=XXXModelConfig), run the following command (with actual values instead of X's):

python train.py +model=XXX ++input.train_with_history=X ++input.encoder_input_type=X
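
When several configurations need to be trained, it can help to assemble this command programmatically. The helper below is an illustrative sketch and not part of the repository: `build_train_command` is a hypothetical name, it assumes `train_with_history` is a boolean, and the values `codet5` and `diff` in the example are stand-ins for whatever your configs actually define.

```python
import shlex


def build_train_command(model, train_with_history, encoder_input_type, script="train.py"):
    """Assemble the training command as an argument list.

    Assumption: `train_with_history` is a boolean; it is rendered in
    lowercase so that it is parsed as a boolean override.
    """
    return [
        "python",
        script,
        f"+model={model}",
        f"++input.train_with_history={str(train_with_history).lower()}",
        f"++input.encoder_input_type={encoder_input_type}",
    ]


# Example with placeholder values; adjust them to your own configs.
command = build_train_command("codet5", True, "diff")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch training from a script.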

Additional steps for RACE model

Experiments with the RACE model require a slightly different procedure.

  1. Fine-tune the CodeT5 model. Refer to the instructions above for details.

  2. Use the encoder from the fine-tuned CodeT5 checkpoint to perform retrieval.

    Define the configuration in conf/retrieval_config.py. You have to either provide a local path to the checkpoint in ckpt_path or use a W&B artifact. In the latter case, the artifact name will be inferred from the model configuration.

    An example with a local path:

    python retrieve.py ++ckpt_path=<local_path>

    An example with a W&B artifact:

    python retrieve.py +model=codet5 ++input.train_with_history=X ++input.encoder_input_type=X

  3. Initialize RACE with the fine-tuned CodeT5 weights and use the retrieved examples to train the model.

    For the checkpoint, you have to either provide a path to a checkpoint in :hugs: Transformers format as name_or_path in RACEConfig or define logger.checkpoint in the train config so that it is downloaded from W&B Artifacts.

    For the retrieved examples, you have to either provide them locally or define logger.retrieval in the train config so that they are downloaded from W&B Artifacts.

    To provide retrieved examples locally, place them inside the root dataset directory in a folder named retrieval_with_history or retrieval_without_history (depending on whether the encoder used for retrieval was trained with history or not).

     ├── ...  # data directory
     │   ├── retrieval_with_history
     │   │    ├── train_predictions.jsonl
     │   │    ├── val_predictions.jsonl
     │   │    ├── test_predictions.jsonl
     │   ├── retrieval_without_history
     │   │    ├── train_predictions.jsonl
     │   │    ├── val_predictions.jsonl
     │   │    ├── test_predictions.jsonl
     │   ├── train.jsonl
     │   ├── val.jsonl
     │   └── test.jsonl
     └── ...
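
Before training RACE with local retrieval results, it may be worth confirming that the directory matches the layout above. The following is a small sketch under that assumption; `missing_retrieval_files` is a hypothetical helper, not a function provided by this repository.

```python
from pathlib import Path

PARTS = ("train", "val", "test")


def missing_retrieval_files(data_dir, with_history):
    """List the files from the layout above that are absent from `data_dir`.

    `with_history` selects between the retrieval_with_history and
    retrieval_without_history folders.
    """
    folder = "retrieval_with_history" if with_history else "retrieval_without_history"
    root = Path(data_dir)
    # Dataset parts sit at the top level of the data directory.
    expected = [root / f"{part}.jsonl" for part in PARTS]
    # Retrieved examples sit inside the selected retrieval folder.
    expected += [root / folder / f"{part}_predictions.jsonl" for part in PARTS]
    return [path for path in expected if not path.is_file()]
```

An empty list means every expected file is in place; otherwise the returned paths show what still has to be provided locally or downloaded from W&B Artifacts.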

Step 4: Evaluate

Step 4.1: Generate predictions

  1. Define the evaluation configuration in conf/eval_config.py.

  2. Note that you have to either provide a local path to the checkpoint in ckpt_path or use a W&B artifact.

    In the latter case, the artifact name will be inferred from the model configuration. Choose one of the available model configs or add your own.

  3. Note that you have to define all parameters from InputConfig. You can do this via the CLI or by editing the config directly. An example of defining parameters via the CLI is given below.

To launch evaluation of a model defined as XXXModelConfig and registered via ConfigStore.store(name="XXX", group="model", node=XXXModelConfig), run the following command:

python eval.py +model=XXX ++input.train_with_history=X ++input.encoder_input_type=X ++input.generate_with_history=X ++input.context_ratio=X

Step 4.2: Compute metrics

  1. Define the configuration for metrics computation in conf/metrics_config.py.

  2. Note that you have to either provide a local path to the model predictions in preds_path or use a W&B artifact and define the following parameters from ArtifactMetricsConfig: name, version. You can do this via the CLI or by editing the config directly. Examples of defining parameters via the CLI are given below.

To launch metrics computation for local predictions:

python compute_metrics.py ++preds_path=XXX

To launch metrics computation for W&B artifact with predictions:

python compute_metrics.py ++logger.artifact_config.name=XXX ++logger.artifact_config.version=XXX