
OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA

Team: Yixuan Su*, Tian Lan*, and Deng Cai (The first two members* contributed equally.)

This is the repo for the OpenAlpaca project, which aims to build and share an instruction-following model based on OpenLLaMA. We note that, following OpenLLaMA, OpenAlpaca is permissively licensed under the Apache 2.0 license. We also highlight that the training of OpenAlpaca only takes around 30 minutes on 8xA100 GPUs.

This repo contains:

The <a href='#weights'>weights</a> for the fine-tuned model.
The <a href='#data'>data</a> used for fine-tuning the model.
The <a href='#example_usage'>example usage</a> of OpenAlpaca.
The <a href='#code'>code</a> for fine-tuning the model.

Usage and License Notices: OpenAlpaca follows the distribution permission of OpenLLaMA, i.e. the Apache 2.0 license, which means OpenAlpaca can be used for any academic or commercial purpose free of charge.

News:

[2023/05/27] Updated the training scripts and released models based on the latest checkpoints of OpenLLaMA.
[2023/05/04] Open-sourced OpenAlpaca.

Model Weights:

Model Name	Model Card	Maximum Length	Model Description
`openllmplayground/openalpaca_3b_600bt_preview`	[Link]	1536	`The OpenAlpaca model fine-tuned from the previewed version of OpenLLaMA-3B that is trained with 600 billion tokens.`
`openllmplayground/openalpaca_7b_700bt_preview`	[Link]	1024	`The OpenAlpaca model fine-tuned from the previewed version of OpenLLaMA-7B that is trained with 700 billion tokens.`

Data:

The data we use to fine-tune the model, i.e. openalpaca.json, contains ~15k instances and is constructed from the databricks-dolly-15k dataset by removing samples that are too long. Following the original databricks-dolly-15k dataset, our data is also licensed under the CC BY-SA 3.0 license, which allows it to be used for any academic or commercial purpose.

Format: Following Stanford Alpaca, our json file is a list of dictionaries, each of which contains the following fields.

instruction: it describes the task the model should perform.
input: optional context or input for the task (e.g. the document for summarization task).
output: the answer to the instruction (and the optional input), written by a human.
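
As a minimal sketch of the format described above, the snippet below checks that every instance carries the three fields as strings. The function name `find_format_errors` and the two sample records are made up for illustration; they are not part of the repository or of openalpaca.json.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def find_format_errors(records):
    """Return a list of problems; an empty list means every instance has the three fields."""
    errors = []
    for idx, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            # every field must be present and be a string ("input" may be an empty string)
            if not isinstance(record.get(field), str):
                errors.append(f"instance {idx}: field '{field}' is missing or not a string")
    return errors

# two illustrative records (hypothetical, not taken from the real data)
sample = json.loads("""[
  {"instruction": "Summarize the text.", "input": "Alpacas are camelids.", "output": "Alpacas are camelids."},
  {"instruction": "Name a camelid.", "input": "", "output": "Alpaca"}
]""")
print(find_format_errors(sample))
```

To check the real file, the same function could be applied to the list returned by `json.load` on openalpaca.json.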

We use the following prompts to fine-tune the OpenAlpaca model:

for examples with an empty input field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

for examples with a non-empty input field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
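
The two templates above can be selected programmatically. The sketch below is one way to do it; `build_prompt` is a hypothetical helper name, and the exact whitespace of the with-input template is assumed by analogy with the no-input prompt string used in the usage example of this README.

```python
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

def build_prompt(instruction, input_text=""):
    """Pick the template based on whether the optional input field is empty."""
    if input_text.strip():
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)

print(build_prompt("What is the capital of Tanzania?"))
```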

Reproduce the data: To reproduce the data, simply run python3 process_dataset.py.
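
For readers who want a feel for what such a conversion involves, here is a rough sketch. It is not the repository's process_dataset.py. It assumes the databricks-dolly-15k field names `instruction`, `context`, and `response`, and it uses a word-count threshold (`max_words`, a hypothetical parameter) as a stand-in for the actual "too long" criterion, which is not described here.

```python
def dolly_to_alpaca(record):
    """Map one databricks-dolly-15k record to the Alpaca-style fields."""
    return {
        "instruction": record["instruction"],
        "input": record.get("context", ""),
        "output": record["response"],
    }

def convert(records, max_words=512):
    """Convert records and drop those whose total length exceeds max_words.

    max_words is an illustrative threshold, not the value used by the authors.
    """
    converted = []
    for record in records:
        item = dolly_to_alpaca(record)
        n_words = sum(len(item[key].split()) for key in ("instruction", "input", "output"))
        if n_words <= max_words:
            converted.append(item)
    return converted
```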

Example Usage:

[Note] Unlike LLaMA, OpenAlpaca uses the token id of 1 as the bos (beginning of the sequence) token. This follows the definition of OpenLLaMA. Please refer to the authors' original implementations for more information.

The example below shows how to use OpenAlpaca.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# the previewed version of OpenAlpaca
model_path = r'openllmplayground/openalpaca_3b_600bt_preview'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path).cuda()
tokenizer.bos_token_id, tokenizer.eos_token_id = 1,2 # see https://github.com/openlm-research/open_llama#preview-weights-release-and-usage

# same prompt as provided in https://crfm.stanford.edu/2023/03/13/alpaca.html
instruction = r'What is an alpaca? How is it different from a llama?'
# other prompts to try:
# instruction = r'Write an e-mail to congratulate new Stanford admits and mention that you are excited about meeting all of them in person.'
# instruction = r'What is the capital of Tanzania?'
# instruction = r'Write a well-thought out abstract for a machine learning paper that proves that 42 is the optimal seed for training neural networks.'

prompt_no_input = f'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:'
tokens = tokenizer.encode(prompt_no_input)

# move the prompt to the same device as the model before generating
tokens = torch.LongTensor(tokens).unsqueeze(0).to(model.device)
instance = {'input_ids': tokens,
            'top_k': 50,
            'top_p': 0.9,
            'generate_len': 128}

length = len(tokens[0])
with torch.no_grad():
    rest = model.generate(
            input_ids=tokens, 
            max_length=length+instance['generate_len'], 
            use_cache=True, 
            do_sample=True, 
            top_p=instance['top_p'], 
            top_k=instance['top_k']
        )
        
output = rest[0][length:]
string = tokenizer.decode(output, skip_special_tokens=True)
print(f'[!] Generation results: {string}')

[Model Output]

[!] Generation results: Alpacas are a species of South American camelid, the smallest of the three living species native 
to South America (llamas and guanaco are the other two). Alpacas are slightly larger than llamas at 50 to 70 pounds 
(22 to 31 kilograms). Their tails have a tuft of hair at the end, whereas llamas' tails are just knobby. Alpacas have 
brown or black coats.
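
The `top_k` and `top_p` arguments in the example above restrict sampling to the most likely tokens. The sketch below illustrates the idea on a plain list of probabilities: keep the k most likely tokens, then keep the smallest prefix of them whose renormalized probability mass reaches `top_p`. It is a simplified illustration with a hypothetical function name, not the implementation used inside `transformers`.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Return {token_id: renormalized probability} after top-k, then top-p filtering."""
    # top-k: rank token ids by probability and keep the k most likely ones
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose share of the remaining mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # renormalize so the surviving probabilities sum to 1
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# toy distribution over four tokens (illustrative numbers)
print(top_k_top_p_filter([0.6, 0.25, 0.1, 0.05], top_k=50, top_p=0.8))
```

Sampling then draws the next token from the returned distribution, so lower values of either parameter make the output more conservative.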

Fine-tuning the Model:

1. Environment Setup:

The fine-tuning of OpenAlpaca takes place on a machine with 8xA100 (40G) GPUs and CUDA 11.7.

To install the required environment, simply run the following command.

pip install -r requirements.txt

If any error occurs when installing torch, you can install torch manually with the command below.

pip install torch==1.13.1+cu117 -f https://download.pytorch.org/whl/torch/

2. Model Training:

In our experiments, we train our model using DeepSpeed with ZeRO-3 on 8xA100 GPUs. To start the training of the 3B model, run the following command.

cd ./scripts/
chmod +x train_openalpaca_3b.sh 
cd ..
./scripts/train_openalpaca_3b.sh

To start the training of the 7B model, run the following command.

cd ./scripts/
chmod +x train_openalpaca_7b.sh 
cd ..
./scripts/train_openalpaca_7b.sh

The key arguments of the training script are as follows:

--max_length: The maximum sequence length of training instances.
--data_path: The path of training data.
--save_path: The path to save the fine-tuned OpenAlpaca checkpoint.

The table below shows the hyperparameters of the learning process.

Model Size	Batch Size	Learning Rate	Epoch Number	Maximum length
3B	64	2e-5	3	1536
7B	64	2e-5	3	1024

The batch_size and learning_rate can be adjusted in ./dsconfig/openllama.json. The epoch_number can be adjusted in ./config/openllama.yaml.
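
To make the numbers in the table concrete, the sketch below works through the arithmetic. It relies on two assumptions: the dataset size is rounded to 15,000 (the text above only says ~15k), and the last partial batch of each epoch is kept. The relation between global batch size, per-GPU micro-batch size, and gradient accumulation follows DeepSpeed's documented `train_batch_size` convention; the micro-batch values are illustrative, not the repository's settings.

```python
import math

def optimizer_steps(num_examples, global_batch_size, epochs):
    """Number of optimizer updates, assuming the last partial batch of each epoch is kept."""
    return math.ceil(num_examples / global_batch_size) * epochs

def grad_accumulation_steps(global_batch_size, micro_batch_per_gpu, num_gpus):
    """Solve global = micro_batch_per_gpu * accumulation_steps * num_gpus for the accumulation steps."""
    per_step = micro_batch_per_gpu * num_gpus
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must be divisible by micro_batch_per_gpu * num_gpus")
    return global_batch_size // per_step

print(optimizer_steps(15000, 64, 3))       # 705 updates for 3 epochs at batch size 64
print(grad_accumulation_steps(64, 8, 8))   # 1: a micro-batch of 8 per GPU needs no accumulation
print(grad_accumulation_steps(64, 4, 8))   # 2: halving the micro-batch doubles the accumulation
```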

After the training completes, you will find the tokenizer, configuration, and DeepSpeed checkpoints in --save_path. Run the following command to convert the DeepSpeed checkpoints to a torch model.

python {--save_path}/zero_to_fp32.py {--save_path} {--save_path}/pytorch_model.bin

Then, you can find the torch model pytorch_model.bin in --save_path.

The resulting checkpoint pytorch_model.bin is quite large. If you would like to split it into multiple shards, you can run the command below.

./scripts/make_shards.sh

After splitting, the directory of saved checkpoints should look like:

.
└── ./ckpt/openalpaca/             
    ├── config.json
    ├── generation_config.json
    ├── pytorch_model-00001-of-00003.bin
    ├── pytorch_model-00002-of-00003.bin
    ├── pytorch_model-00003-of-00003.bin
    ├── pytorch_model.bin.index.json
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.model
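
A quick way to sanity-check a sharded directory like the one above is to read pytorch_model.bin.index.json and confirm that every shard it references exists. The sketch below assumes the usual Hugging Face index layout, in which a `weight_map` dictionary maps parameter names to shard file names. `missing_shards` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

def missing_shards(ckpt_dir):
    """Return shard file names listed in the index but absent from ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    index = json.loads((ckpt_dir / "pytorch_model.bin.index.json").read_text())
    # weight_map maps each parameter name to the shard file that stores it
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (ckpt_dir / name).exists()]
```

For example, `missing_shards("./ckpt/openalpaca/")` should return an empty list when all three shards are in place.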

Now the model is good to go! Enjoy playing with OpenAlpaca!

Future Plans:

The current model is fine-tuned on the previewed version of OpenLLaMA. We expect the performance of the base OpenLLaMA model to improve as the training continues. We will update OpenAlpaca whenever a newer checkpoint is released by the authors of OpenLLaMA.
We also plan to do a rigorous evaluation of OpenAlpaca and compare it with other publicly accessible models.

Reference:

If you find OpenAlpaca useful in your research or applications, please cite it using the following BibTeX:

@misc{openalpaca,
  author = {Yixuan Su and Tian Lan and Deng Cai},
  title = {OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yxuansu/OpenAlpaca}},
}

@software{openlm2023openllama,
  author = {Xinyang Geng and Hao Liu},
  title = {OpenLLaMA: An Open Reproduction of LLaMA},
  month = May,
  year = 2023,
  url = {https://github.com/openlm-research/open_llama}
}

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}

@article{touvron2023llama,
  title={Llama: Open and efficient foundation language models},
  author={Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie{-}Anne Lachaux and Timoth{\'{e}}e Lacroix and Baptiste Rozi{\`{e}}re and Naman Goyal and Eric Hambro and Faisal Azhar and Aur{\'{e}}lien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}

Acknowledgements:

This repo benefits from OpenLLaMA, Alpaca, and Databricks. Thanks for their wonderful work!