Fine-tuning SantaCoder for Code/Text Generation💻

Fine-tune SantaCoder on code and text generation datasets, for example on new programming languages from The Stack dataset or on a code-to-text dataset like GitHub-Jupyter. SantaCoder is a 1B-parameter model pre-trained on Python, Java, and JavaScript, so we suggest fine-tuning on programming languages close to these; otherwise, the model might not converge well.

Setup & Fine-Tuning with The Stack

We provide code to fine-tune the pre-trained SantaCoder model on code/text datasets such as The Stack dataset. Check this repository for fine-tuning models on other code tasks such as code classification.

  1. To begin with, clone the repository locally, install all the required packages, and log in to the Hugging Face Hub and Weights & Biases.

First, you can clone this repo with:

git clone https://github.com/bigcode/santacoder-finetuning.git
cd santacoder-finetuning

Second, install the required packages. The packages are listed in the requirements.txt file and can be installed with

pip install -r requirements.txt

Third, make sure you are logged in to the Hugging Face Hub and Weights & Biases:

huggingface-cli login
wandb login
  2. Next, take a look at the train.py script to get an understanding of how it works. In short, the script does the following (see the sketch after this list):

    • Load the given dataset
    • Load the model with given hyperparameters
    • Pre-process the dataset to input into the model
    • Run training
    • Run evaluation
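
As a rough, untested sketch of the first two steps (loading the dataset and the model), using the arguments from the commands below, the script does roughly the following:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the raw code dataset; the-stack-dedup is gated, so the earlier
# huggingface-cli login is required (data_dir selects the language subset)
dataset = load_dataset("bigcode/the-stack-dedup", data_dir="data/shell", split="train")

# load the pre-trained SantaCoder tokenizer and model;
# trust_remote_code=True is needed for the custom modeling file hosted on the Hub
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True)
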
  3. The following examples show how you can launch fine-tuning on The Stack dataset. Here we run the script on the shell subset of the dataset for demonstration purposes:

python train.py \
        --model_path="bigcode/santacoder" \
        --dataset_name="bigcode/the-stack-dedup" \
        --subset="data/shell" \
        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 30000 \
        --batch_size 2 \
        --gradient_accumulation_steps 8 \
        --learning_rate 5e-5 \
        --num_warmup_steps 500 \
        --eval_freq 3000 \
        --save_freq 3000 \
        --log_freq 1 \
        --num_workers="$(nproc)" \
        --no_fp16
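
With these settings, each optimizer step processes batch_size × gradient_accumulation_steps = 2 × 8 = 16 packed sequences of 2,048 tokens.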

To launch the training on multiple GPUs, use the following command (we simply prepend python -m torch.distributed.launch --nproc_per_node number_of_gpus):

python -m torch.distributed.launch \
        --nproc_per_node number_of_gpus train.py \
        --model_path="bigcode/santacoder" \
        --dataset_name="bigcode/the-stack-dedup" \
        --subset="data/shell" \
        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 30000 \
        --batch_size 2 \
        --gradient_accumulation_steps 8 \
        --learning_rate 5e-5 \
        --num_warmup_steps 500 \
        --eval_freq 3000 \
        --save_freq 3000 \
        --log_freq 1 \
        --num_workers="$(nproc)" \
        --no_fp16
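
Note: recent PyTorch releases deprecate python -m torch.distributed.launch in favor of torchrun; if your installed version complains, the same run can be launched with torchrun --nproc_per_node number_of_gpus train.py followed by the same arguments as above.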

Note: The checkpoints saved from this training command will have the use_cache argument set to False in config.json. For fast inference, you should change it to True like in this commit, or set it each time you load the model.
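
For example, a minimal way to override it at load time (assuming your checkpoint was pushed to username/your-model-name, a placeholder path) is:

from transformers import AutoModelForCausalLM

# use_cache=True enables the key/value cache for faster autoregressive generation;
# trust_remote_code=True is needed for SantaCoder's custom modeling file
model = AutoModelForCausalLM.from_pretrained(
    "username/your-model-name",
    use_cache=True,
    trust_remote_code=True,
)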

If you want to fine-tune on other text datasets, you just need to change the data_column argument to the name of the column containing the code/text you want to fine-tune on.
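
If you are unsure which column holds the text, a quick way to check with the datasets library (a small sketch; streaming avoids downloading the full dataset) is:

from datasets import load_dataset

# print the column names of the first example to find the right data_column value
ds = load_dataset("codeparrot/github-jupyter-code-to-text", split="train", streaming=True)
print(next(iter(ds)).keys())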

For example, we fine-tuned the model on the GitHub-Jupyter dataset on 4 A100 GPUs using the following command:

python -m torch.distributed.launch \
        --nproc_per_node 4 train.py \
        --model_path="bigcode/santacoder" \
        --dataset_name="codeparrot/github-jupyter-code-to-text" \
        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 1000 \
        --batch_size 2 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-5 \
        --num_warmup_steps 100 \
        --eval_freq 100 \
        --save_freq 100 \
        --log_freq 1 \
        --num_workers="$(nproc)" \
        --no_fp16

The resulting model can be found here with an associated space.

Can I use another model? Yes! You can use other CLM models from the Hub such as GPT-2, CodeParrot, CodeGen, or InCoder. Just make sure to change the seq_length and eos_token_id arguments accordingly.
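
For instance, you can look up the eos_token_id of another model from its tokenizer (illustrated here with GPT-2):

from transformers import AutoTokenizer

# the printed value is what you would pass as the eos_token_id argument of train.py
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.eos_token_id)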

How to upload my trained checkpoint

To upload your trained checkpoint, you have to create a new model repository on the 🤗 model hub, from this page: https://huggingface.co/new

You can also follow the more in-depth instructions here if needed.

Having created your model repository on the hub, you should clone it locally:

git lfs install

git clone https://huggingface.co/username/your-model-name

Then add the files that fully define a SantaCoder checkpoint (the configuration, model weights, and tokenizer files) to the repository.

Note: As previously stated, the checkpoints saved by this training command (which uses gradient checkpointing and no caching) will have use_cache set to False in config.json; for fast inference, you should change it to True like in this commit.

You can get the tokenizer files by cloning the model repo and copying them to your directory. SantaCoder currently has a custom modeling file and config file on the Hub, but they will be included with the saved checkpoints if you used the transformers branch in requirements.txt.
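
If you prefer to re-export these files from a loaded checkpoint instead of copying them, a minimal sketch (assuming model and tokenizer are your fine-tuned objects and your-model-name is the cloned repository directory) is:

# write the config, weights, and tokenizer files into the cloned repository
model.save_pretrained("your-model-name")
tokenizer.save_pretrained("your-model-name")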

Having added the above files, you should run the following to push files to your model repository.

git add . && git commit -m "Add model files" && git push

The next important step is to create the model card. For people to use your fine-tuned model, it is important that they understand what data it was trained on and how it is meant to be used. These details belong in the model card, which is the first thing people see when visiting your model on the Hub at https://huggingface.co/{your_username}/{your_modelname}.

Don't hesitate to also create a Gradio Demo for your model to showcase its capabilities 🚀. You can find more information on how to do that here.
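
As a rough starting point, a minimal text-generation demo (a sketch only; username/your-model-name is a placeholder for your checkpoint) could look like this:

import gradio as gr
from transformers import pipeline

# load the fine-tuned checkpoint; trust_remote_code is needed for SantaCoder's custom modeling file
generator = pipeline("text-generation", model="username/your-model-name", trust_remote_code=True)

def complete(prompt):
    # return the prompt plus the generated continuation
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

gr.Interface(fn=complete, inputs="text", outputs="text").launch()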

Acknowledgments

This is inspired by the Wav2Vec2 fine-tuning week by Hugging Face.