# From Commit Message Generation to History-Aware Commit Message Completion

<p align="center">
  <a href="https://commit-chronicle.github.io/">:globe_with_meridians: Website</a> |
  <a href="https://arxiv.org/abs/2308.07655">:scroll: Preprint</a> |
  <a href="https://huggingface.co/datasets/JetBrains-Research/commit-chronicle">:hugs: Dataset</a> |
  <a href="https://huggingface.co/JetBrains-Research/cmg-codet5-without-history#available-checkpoints">:hugs: Models</a>
</p>

This repository provides a replication package for our ASE 2023 paper :scroll: From Commit Message Generation to History-Aware Commit Message Completion.
- Code
  - Models experiments – this repository
    - The most recent version – `main` branch
    - The exact replication package for CMG experiments for our ASE 2023 paper – `appendix_cmg` tag
    - The exact replication package for LLM experiments for our ASE 2023 paper – `appendix_llm` tag
  - Data collection and processing – separate repo
- Dataset: also available on Zenodo
- Model checkpoints: also available on Zenodo
- Other
  - Check the `appendix` folder for:
    - models' predictions;
    - comprehensive metrics for all our experiments;
    - implementations of frequent filters from CMG research;
    - and other details mentioned in the paper!
## How to use
### Requirements
- :snake: Python
- :floppy_disk: Dependencies
  - This project provides dependencies for two Python dependency managers:
    - Poetry: `poetry.lock`, `pyproject.toml` (preferred)
    - pip: `requirements.txt` (obtained through `poetry export --with dev,retrieval --output requirements.txt`)
### Usage
#### Step 1: Prepare raw data
<details>
<summary>:yellow_heart: click here for more information on required data format</summary>

:star2: Useful links: our dataset and/or the repo we used for data preparation.
This project expects each dataset part to be stored in a separate JSONLines file:

```
├── ...  # data directory
│   ├── train.jsonl
│   ├── val.jsonl
│   └── test.jsonl
└── ...
```
In our case, each input example is a commit. Also note that commits from each author should be in chronological order. Specifically, the following keys are expected in each row:
- `author`: Unique identifier of the commit's author.
- `message`: Commit message.
- `mods`: A list of modifications made in the commit. Each modification should contain the following keys:
  - `change_type`: Type of modification (string, one of `MODIFY`, `ADD`, `DELETE`, `RENAME`, `COPY`, `UNKNOWN`).
  - `old_path`: Path to the file before the commit (`None` when `change_type` is `ADD`).
  - `new_path`: Path to the file after the commit (`None` when `change_type` is `DELETE`).
  - `diff`: Output of the `git diff` command for this specific file.

</details>
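As an illustration, one row of such a JSONLines file can be sketched as follows. The concrete values below are invented for the example; only the keys and their meanings follow the description above:

```python
import json

# A hypothetical commit record following the expected schema.
commit = {
    "author": 42,  # unique identifier of the commit's author
    "message": "Fix off-by-one error in pagination",
    "mods": [
        {
            "change_type": "MODIFY",  # one of MODIFY, ADD, DELETE, RENAME, COPY, UNKNOWN
            "old_path": "src/pager.py",
            "new_path": "src/pager.py",
            "diff": "@@ -10,7 +10,7 @@\n-    return items[:limit + 1]\n+    return items[:limit]\n",
        }
    ],
}

# Each line of train.jsonl / val.jsonl / test.jsonl is one such JSON object.
line = json.dumps(commit)
print(line)
```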
#### Step 2: Choose configuration
##### Model architecture
This project supports the following models. For details, refer to the classes provided in `src/model/configurations` or the base configs provided in `conf/model/base_configs.py`. You can find specific configs for these models in `conf/model/configs.py`:
- distilGPT-2
- randomly initialized Transformer
- CodeT5 from :scroll: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
- CodeReviewer from :scroll: Automating Code Review Activities by Large-Scale Pre-training
- RACE + T5 from :scroll: RACE: Retrieval-Augmented Commit Message Generation
##### Input type
This project explores two kinds of input for the commit message completion task: diffs and commit message history.

- For decoder-only models, there is only one supported option: concatenate the commit message history with the current commit message and pass it as the context.
- For seq2seq models, there are three supported options:
  - Diff-only: pass the diff to the encoder and the current message to the decoder.
  - History-only: pass the history to the encoder and the current message to the decoder.
  - Diff + history: pass the diff to the encoder and the commit message history concatenated with the current message to the decoder.
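The three seq2seq options can be sketched roughly as follows. The function name and the separator token are illustrative, not the project's actual implementation:

```python
# Illustrative sketch of the three seq2seq input options described above.
# The SEP token and function name are hypothetical, not taken from the project.
SEP = "[SEP]"

def build_inputs(diff, history, current, option):
    """Return (encoder_input, decoder_input) for a seq2seq model."""
    if option == "diff_only":
        return diff, current
    if option == "history_only":
        return SEP.join(history), current
    if option == "diff_and_history":
        return diff, SEP.join(history + [current])
    raise ValueError(f"unknown option: {option}")

enc, dec = build_inputs(
    diff="- old line\n+ new line",
    history=["Add pagination", "Fix tests"],
    current="Fix off-by-one",
    option="diff_and_history",
)
print(enc)  # the diff goes to the encoder
print(dec)  # history concatenated with the current message goes to the decoder
```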
#### Step 3: Train
1. Define the configuration for training at `conf/train_config.py`.
2. Choose one of the available model configs or add your own.
3. Note that you have to define the missing parameters from `InputConfig`. You can do that via CLI or simply rewrite them. Below is an example of defining parameters via CLI.
To launch training of a model defined as `XXXModelConfig` and registered via `ConfigStore.store(name="XXX", group="model", node=XXXModelConfig)`, run the following command (with actual values instead of X's):

```
python train.py +model=XXX ++input.train_with_history=X ++input.encoder_input_type=X
```
##### Additional steps for the RACE model

Experiments with the RACE model require a slightly different procedure.
1. Fine-tune a CodeT5 model. Refer to the instructions above for details.

2. Use the encoder from the fine-tuned CodeT5 checkpoint to perform retrieval. Define the configuration in `conf/retrieval_config.py`. You have to either provide a local path to the checkpoint in `ckpt_path` or use a W&B artifact; in the latter case, the artifact name will be inferred from the model configuration.

   An example with a local path:

   ```
   python retrieve.py ++ckpt_path=<local_path>
   ```

   An example with a W&B artifact:

   ```
   python retrieve.py +model=codet5 ++input.train_with_history=X ++input.encoder_input_type=X
   ```

3. Initialize RACE with the fine-tuned CodeT5 weights and use the retrieved examples to train the model.

   For the checkpoint, you have to either provide a path to a checkpoint in the :hugs: Transformers format as `name_or_path` in `RACEConfig` or define `logger.checkpoint` in the train config correctly to download it from W&B Artifacts.

   For the retrieved examples, you have to either provide them locally or define `logger.retrieval` in the train config correctly to download them from W&B Artifacts.

   To provide the retrieved examples locally, place them inside the root dataset directory in a folder named `retrieval_with_history` or `retrieval_without_history` (depending on whether the encoder used for retrieval was trained with history or not):

   ```
   ├── ...  # data directory
   │   ├── retrieval_with_history
   │   │   ├── train_predictions.jsonl
   │   │   ├── val_predictions.jsonl
   │   │   └── test_predictions.jsonl
   │   ├── retrieval_without_history
   │   │   ├── train_predictions.jsonl
   │   │   ├── val_predictions.jsonl
   │   │   └── test_predictions.jsonl
   │   ├── train.jsonl
   │   ├── val.jsonl
   │   └── test.jsonl
   └── ...
   ```
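For intuition, the retrieval step boils down to nearest-neighbor search over encoder embeddings. A minimal sketch with toy vectors (in the project, the vectors would come from the fine-tuned CodeT5 encoder, and the search is over the whole training set):

```python
import math

# Toy sketch of retrieval: for a query commit, find the most similar
# training commit by cosine similarity of encoder embeddings.

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve_nearest(query_vec, train_vecs):
    """Return the index of the most similar training example."""
    return max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))

# Invented toy embeddings for illustration only.
train_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx = retrieve_nearest([0.9, 0.1], train_vecs)
print(idx)  # → 0: the query is closest to the first training vector
```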
#### Step 4: Evaluate
##### Step 4.1: Generate predictions
1. Define the configuration for evaluation at `conf/eval_config.py`.
2. Note that you have to either provide a local path to a checkpoint in `ckpt_path` or use a W&B artifact; in the latter case, the artifact name will be inferred from the model configuration. Choose one of the available model configs or add your own.
3. Note that you have to define all parameters from `InputConfig`. You can do that via CLI or simply rewrite them. Below is an example of defining parameters via CLI.
To launch evaluation of a model defined as `XXXModelConfig` and registered via `ConfigStore.store(name="XXX", group="model", node=XXXModelConfig)`, run the following command (with actual values instead of X's):

```
python eval.py +model=XXX ++input.train_with_history=X ++input.encoder_input_type=X ++input.generate_with_history=X ++input.context_ratio=X
```
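The `context_ratio` option controls the completion setting: a fraction of each reference message is given to the model as context, and the model is expected to complete the rest. A minimal sketch of such a split (the project's actual splitting logic may differ):

```python
# Illustrative sketch of splitting a message by a context ratio for the
# completion setting; the project's actual splitting logic may differ.

def split_by_context_ratio(message, context_ratio):
    """Give the first `context_ratio` share of characters as context;
    the model should complete the remainder."""
    cut = int(len(message) * context_ratio)
    return message[:cut], message[cut:]

context, target = split_by_context_ratio("Fix NPE in parser", 0.5)
print(repr(context))  # the prefix fed to the model as context
print(repr(target))   # the part the model is expected to complete
```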
##### Step 4.2: Compute metrics
1. Define the configuration for metrics computation at `conf/metrics_config.py`.
2. Note that you have to either provide a local path to model predictions in `preds_path` or use a W&B artifact and define the following parameters from `ArtifactMetricsConfig`: `name` and `version`. You can do that via CLI or simply rewrite them. Below are examples of defining parameters via CLI.
To launch metrics computation for local predictions:

```
python compute_metrics.py ++preds_path=XXX
```

To launch metrics computation for a W&B artifact with predictions:

```
python compute_metrics.py ++logger.artifact_config.name=XXX ++logger.artifact_config.version=XXX
```
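For intuition, comparing predictions against reference messages can be sketched with the standard library alone. This is a generic illustration, not one of the project's actual metrics (the full metric set is described in the paper and the `appendix` folder):

```python
import difflib

# Generic sketch of comparing predictions against reference messages;
# not one of the project's actual metrics.

def exact_match(preds, refs):
    """Fraction of predictions identical to their reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def avg_similarity(preds, refs):
    """Average character-level similarity ratio via difflib's SequenceMatcher."""
    return sum(
        difflib.SequenceMatcher(None, p, r).ratio() for p, r in zip(preds, refs)
    ) / len(refs)

preds = ["Fix typo in README", "Add tests"]
refs = ["Fix typo in README", "Add unit tests"]
print(exact_match(preds, refs))  # → 0.5
```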