<h1 align="center"> <p>🐫 MBZUAI Bactrian-X</p></h1>
<h3 align="center">
<p>A Multilingual Replicable Instruction-Following Model</p>
</h3>
<p align="center"> <a href="https://haonan-li.github.io/" target="_blank">Haonan Li</a>*, <a href="http://www.fajrikoto.com" target="_blank">Fajri Koto</a>*, <a href="https://twitter.com/WuMinghao_nlp" target="_blank">Minghao Wu</a>, <a href="https://afaji.github.io/" target="_blank">Alham Fikri Aji</a>, <a href="https://people.eng.unimelb.edu.au/tbaldwin/" target="_blank">Timothy Baldwin</a> (*equal contribution) </p>
## :fire: News
## Overview
<h3 align="center">
<img src="https://github.com/fajri91/eval_picts/blob/master/BactrianX_full.jpg" width="1000" align="center">
</h3>
The Bactrian-X dataset contains 3.4M pairs of instructions and responses in 52 languages. The instructions were obtained from alpaca-52k and dolly-15k, and translated into 52 languages (52 languages x 67k instances = 3.4M instances). The responses in the 52 languages were generated with the gpt-3.5-turbo model.

The Bactrian-X models are a series of LLMs fine-tuned with low-rank adaptation (LoRA) on the Bactrian-X dataset.
<!--
Specifically, this repository contains:
- The [67K instruction data](#data-and-model-release) in 52 languages.
- Multilingual [Bactrian-X](#data-and-model-release), trained on combined language-instruction pairs (3.4M instances).
- 52 monolingual Bactrian models, trained on each of the 52 languages (67k instances).
- The code for [training the model](#model-training-and-inference) using [low-rank adaptation (LoRA)](https://arxiv.org/pdf/2106.09685.pdf).
-->
**Usage and License Notices**: Bactrian-X is intended and licensed for research use only. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
## Dataset
We curate our Bactrian instruction dataset with the following steps (a sketch of loading the finished dataset follows the list):
- **Collecting English instructions**: The English instructions are obtained from alpaca-52k and dolly-15k, and they are saved to `instructions.json`.
- **Translating the English instructions into foreign languages**: The instructions (and the corresponding inputs, if any) are translated into 51 languages using the Google Translate API (conducted in April 2023).
- **Generating the responses**: We generate output from gpt-3.5-turbo for the instructions in each language (conducted in April 2023).
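The resulting per-language splits are released on the Hugging Face Hub as `MBZUAI/Bactrian-X`. As a quick sanity check, a split can be loaded with the `datasets` library; this is a minimal sketch in which the config name (`"en"`) and the `instruction`/`input`/`output` field names are assumptions based on the dataset viewer, so please check the dataset card.

```python
# Minimal sketch: load one language split of Bactrian-X from the Hugging Face Hub.
# The config name ("en") and the instruction/input/output field names are
# assumptions based on the dataset viewer; consult the dataset card to confirm.
from datasets import load_dataset

ds = load_dataset("MBZUAI/Bactrian-X", "en", split="train")
print(len(ds), ds.column_names)

example = ds[0]
print(example["instruction"])  # translated instruction
print(example["input"])        # optional input, may be empty
print(example["output"])       # gpt-3.5-turbo response
```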
<!--
we noticed that performance for languages not covered by the original LLM pretraining may not be satisfactory.
So we recommend that users choose a model by considering whether their languages are covered.
The datasets, Bactrian ISO codes, and the language coverage of the LLMs are listed below.
| No | Languages | Code and Data | LLaMA | Bloom | mT5 |
| ---|---------------- | ------------------------------------------------------------------------ | ------ | --------- | -------- |
| 1 | Afrikaans | [af_ZA](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/af/train) | | | ✓ |
| 2 | Arabic | [ar_AR](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ar/train) | | ✓ | ✓ |
| 3 | Azerbaijani | [az_AZ](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/az/train) | | | ✓ |
| 4 | Bengali | [bn_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/bn/train) | | | ✓ |
| 5 | Czech | [cs_CZ](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/cs/train) | ✓ | | ✓ |
| 6 | German | [de_DE](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/de/train) | ✓ | | ✓ |
| 7 | English | [en_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/en/train) | ✓ | ✓ | ✓ |
| 8 | Spanish | [es_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/es/train) | ✓ | ✓ | ✓ |
| 9 | Estonian | [et_EE](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/et/train) | | | ✓ |
| 10 | Persian | [fa_IR](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fa/train) | | | ✓ |
| 11 | Finnish | [fi_FI](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fi/train) | | | ✓ |
| 12 | French | [fr_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fr/train) | ✓ | ✓ | ✓ |
| 13 | Galician | [gl_ES](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/gl/train) | | | ✓ |
| 14 | Gujarati | [gu_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/gu/train) | | ✓ | ✓ |
| 15 | Hebrew | [he_IL](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/he/train) | | | |
| 16 | Hindi | [hi_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/hi/train) | | ✓ | ✓ |
| 17 | Croatian | [hr_HR](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/hr/train) | ✓ | | |
| 18 | Indonesian | [id_ID](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/id/train) | | ✓ | ✓ |
| 19 | Italian | [it_IT](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/it/train) | ✓ | | ✓ |
| 20 | Japanese | [ja_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ja/train) | | | ✓ |
| 21 | Georgian | [ka_GE](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ka/train) | | | ✓ |
| 22 | Kazakh | [kk_KZ](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/kk/train) | | | ✓ |
| 23 | Khmer | [km_KH](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/km/train) | | | ✓ |
| 24 | Korean | [ko_KR](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ko/train) | | | ✓ |
| 25 | Lithuanian | [lt_LT](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/lt/train) | | | ✓ |
| 26 | Latvian | [lv_LV](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/lv/train) | | | ✓ |
| 27 | Macedonian | [mk_MK](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mk/train) | | | ✓ |
| 28 | Malayalam | [ml_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ml/train) | | ✓ | ✓ |
| 29 | Mongolian | [mn_MN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mn/train) | | | ✓ |
| 30 | Marathi | [mr_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mr/train) | | ✓ | ✓ |
| 31 | Burmese | [my_MM](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/my/train) | | | ✓ |
| 32 | Nepali | [ne_NP](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ne/train) | | ✓ | ✓ |
| 33 | Dutch | [nl_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/nl/train) | ✓ | | ✓ |
| 34 | Polish | [pl_PL](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/pl/train) | ✓ | | ✓ |
| 35 | Pashto | [ps_AF](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ps/train) | | | ✓ |
| 36 | Portuguese | [pt_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/pt/train) | ✓ | ✓ | ✓ |
| 37 | Romanian | [ro_RO](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ro/train) | ✓ | | ✓ |
| 38 | Russian | [ru_RU](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ru/train) | ✓ | | ✓ |
| 39 | Sinhala | [si_LK](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/si/train) | | | ✓ |
| 40 | Slovene | [sl_SI](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sl/train) | ✓ | | ✓ |
| 41 | Swedish | [sv_SE](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sv/train) | ✓ | | |
| 42 | Swahili | [sw_KE](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sw/train) | | ✓ | ✓ |
| 43 | Tamil | [ta_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ta/train) | | ✓ | ✓ |
| 44 | Telugu | [te_IN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/te/train) | | ✓ | ✓ |
| 45 | Thai | [th_TH](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/th/train) | | | ✓ |
| 46 | Tagalog | [tl_XX](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/tl/train) | | | |
| 47 | Turkish | [tr_TR](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/tr/train) | | | ✓ |
| 48 | Ukrainian | [uk_UA](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/uk/train) | ✓ | | ✓ |
| 49 | Urdu | [ur_PK](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ur/train) | | ✓ | ✓ |
| 50 | Vietnamese | [vi_VN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/vi/train) | | ✓ | ✓ |
| 51 | Xhosa | [xh_ZA](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/xh/train) | | ✓ | ✓ |
| 52 | Chinese | [zh_CN](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/zh/train) | | ✓ | ✓ |
-->
## Models
With our dataset and Low-Rank Adaptation (LoRA), we present a family of multilingual and monolingual models based on LLaMA and BLOOM.
Our instruction-tuned multilingual Bactrian-X models are available on the Hugging Face Hub (e.g., [MBZUAI/bactrian-x-llama-7b-lora](https://huggingface.co/MBZUAI/bactrian-x-llama-7b-lora)).

Note: We are continually updating this repository. We plan to cover more than 52 languages in the future, and the current models are mostly 7B in size. We welcome any collaborators who are willing to contribute larger models.
## Hands-on Bactrian-X
### Setting up the Environment
```bash
conda create -n bactrian python=3.9
conda activate bactrian
pip install -r requirements.txt
```
### Training
Models are trained with the following hyperparameters (a corresponding `LoraConfig` sketch follows the table):

| Hyper-parameter | Bactrian-X |
|-----------------|------------|
| batch_size      | 128        |
| num_epochs      | 4          |
| learning_rate   | 3e-4       |
| cutoff_len      | 768        |
| lora_r          | 64         |
| lora_alpha      | 16         |
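For orientation, these settings roughly correspond to a PEFT `LoraConfig` along the following lines. This is a sketch only; `finetune.py` is the authoritative implementation and may construct the configuration differently.

```python
# Sketch of a peft LoraConfig mirroring the hyperparameters above.
# Illustrative only; finetune.py is the authoritative implementation.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # lora_r
    lora_alpha=16,      # lora_alpha
    lora_dropout=0.05,  # matches the training command below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```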
Below is a command to train a LLaMA-7B adapter with our dataset in specific language(s). Replace `<lang_iso>` with a comma-separated list of one or more ISO 639-1 language codes (e.g., `en,zh` for English and Chinese), and `<your_output_dir>` to specify where to store the outputs.
```bash
# Script to train on 4x NVIDIA A100 80GB GPUs.
# Effective batch size = 32 per device x 4 GPUs x 1 accumulation step = 128 (matches the table above).
WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 finetune.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --lang <lang_iso> \
    --output_dir <your_output_dir> \
    --load_in_8bit \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 8 \
    --model_max_length 768 \
    --learning_rate 3e-4 \
    --val_set_size 2000 \
    --warmup_steps 200 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules 'q_proj,k_proj,v_proj,o_proj' \
    --group_by_length
```
### Inference
This is example code that loads both the foundation model and Bactrian LoRA weights from the Hugging Face model hub, and runs a Gradio interface for inference on a specified input.
```bash
python generate.py \
    --load_8bit \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_weights 'MBZUAI/bactrian-x-llama-7b-lora' \
    --share_gradio
```
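If you prefer to load the weights programmatically instead of going through `generate.py`, the sketch below shows one way to do it with the `transformers` and `peft` libraries. The Alpaca-style prompt here is an assumption for illustration; `generate.py` defines the exact prompt template used by our models.

```python
# Sketch: load the base model plus the Bactrian-X LoRA adapter and generate.
# The prompt template is an Alpaca-style approximation, not necessarily the
# exact template used in generate.py.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(base, "MBZUAI/bactrian-x-llama-7b-lora")
model.eval()

prompt = "### Instruction:\nList three famous landmarks in Paris.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```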
### Checkpoint export
To merge the LoRA weights back into the base model for export to Hugging Face format and to PyTorch `state_dict`s, go to Alpaca-LoRA.
This should help users who want to run inference in projects like llama.cpp or alpaca.cpp.
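As a rough illustration of what that merge does, the `peft` API exposes `merge_and_unload()`, which folds the LoRA deltas into the base weights. The Alpaca-LoRA export scripts remain the reference for producing consumable checkpoints; the sketch below only illustrates the idea.

```python
# Rough sketch: merge LoRA weights into the base model and save a plain
# Hugging Face checkpoint. The Alpaca-LoRA export scripts are the reference
# implementation; this only illustrates the idea.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "MBZUAI/bactrian-x-llama-7b-lora")
merged = model.merge_and_unload()  # fold LoRA deltas into the base weights
merged.save_pretrained("bactrian-x-llama-7b-merged")
LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf").save_pretrained(
    "bactrian-x-llama-7b-merged"
)
```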
## Output Examples
Please check output examples here.
## Citation
Please cite the repo if you use the data, model or code in this repo. A paper will be released very soon.
```bibtex
@misc{li2023bactrianx,
    title={Bactrian-X: A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation},
    author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
    year={2023},
    eprint={2305.15011},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Naturally, you should also cite the original LLaMA paper, the Self-Instruct paper, and the Stanford Alpaca repo.
## Acknowledgements
We are standing on the shoulders of giants and would like to especially acknowledge the previous efforts of the following works:
- Stanford Alpaca
- Alpaca-LoRA
- Low-Rank Adaptation (LoRA)
- PEFT
- LLM.int8()