<div align=center><img src="assets/bachelor.png" width="200"></div>

OpenBA🎓

This is the official code for *OpenBA: An Open-Sourced 15B Bilingual Asymmetric Seq2Seq Model Pre-trained from Scratch*.



<p align="center" width="100%"> <a target="_blank"><img src="assets/downstream.png" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a> </p>

Open Source Checklist

We are excited to unveil two versions of our model, OpenBA-LM (the pre-trained model) and OpenBA-Flan (further fine-tuned on the bilingual Flan collection), with another version on the horizon.

Overview of the Training Process

<p align="center" width="100%"> <a target="_blank"><img src="assets/training_process.png" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a> </p>

Evaluation Results

C-EVAL

Model performance on the C-Eval benchmark, where #Param. denotes the number of model parameters, $*$ denotes chain-of-thought prompting, and Avg. denotes average accuracy. We report 5-shot and 0-shot performance.

| Model | #Param. | STEM | Social Science | Humanities | Others | Avg. | Avg. (Hard) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA | 65B | 37.8 | 45.6 | 36.1 | 37.1 | 38.8 | 31.7 |
| ChatGLM | 6B | 33.3 | 48.3 | 41.3 | 38.0 | 38.9 | 29.2 |
| Baichuan | 7B | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | 31.5 |
| MOSS-moon-sft | 16B | 31.6 | 37.0 | 33.4 | 32.1 | 33.1 | 28.4 |
| GLM-130B | 130B | 36.7 | 55.8 | 47.7 | 43.0 | 44.0 | 30.7 |
| OpenBA | 15B | 34.8 | 46.6 | 41.1 | 41.5 | 39.8 | 31.1 |

BBH

Model performance on the BBH benchmark, where #Param. denotes the number of model parameters. We report the accuracy score for all models.

| Model | #Param. | BBH |
| --- | --- | --- |
| ChatGLM | 6B | 31.3 |
| Baichuan | 7B | 31.9 |
| BatGPT | 15B | 34.1 |
| MOSS | 16B | 29.3 |
| OpenBA | 15B | 34.1 |

Reading Comprehension

Model performance on the BELEBELE benchmark, where #Param. denotes the number of model parameters, $\dagger$ denotes the 5-shot setting, $\ddagger$ denotes full fine-tuning in English, and $*$ denotes the zero-shot setting for instruction-tuned models. We report the accuracy score for all models.

| Model | #Param. | eng_Latn | zho_Hans | zho_Hant | Avg. |
| --- | --- | --- | --- | --- | --- |
| Falcon $\dagger$ | 40B | 77.2 | 66.0 | 62.2 | 68.5 |
| LLaMA $\dagger$ | 70B | 82.5 | 64.6 | 57.7 | 68.2 |
| InfoXLM $\ddagger$ | 550M | 79.3 | 74.6 | 72.4 | 75.4 |
| XLM-V $\ddagger$ | 1.2B | 76.2 | 71.0 | 67.1 | 71.4 |
| LLaMA2-Chat $*$ | 70B | 78.8 | 62.4 | 59.3 | 66.8 |
| OpenBA $*$ | 15B | 78.6 | 75.2 | 73.7 | 75.8 |

Machine Translation

Model performance on a Flores subset containing 50 sentences sampled from the Flores benchmark, where #Param. denotes the number of model parameters. We report BLEU scores for all models.

| Model | #Param. | Zh $\Rightarrow$ En | En $\Rightarrow$ Zh |
| --- | --- | --- | --- |
| ChatGLM | 6B | 17.2 | 32.5 |
| Alpaca | 7B | 15.1 | 9.8 |
| Alpaca-LoRA | 7B | 16.4 | 14.5 |
| PARROT | 7B | 19.6 | 24.8 |
| BatGPT | 15B | 23.1 | 38.7 |
| MOSS | 16B | 17.2 | 32.5 |
| OpenBA | 15B | 23.3 | 37.4 |

Usage🚀

DEMO

You should first install the requirements below:

pip install transformers==4.31.0 "torch>=2.0" sentencepiece

NOTICE: Make sure that the version of the transformers library is no higher than 4.33.2!
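If you want a quick sanity check that your environment satisfies these constraints, something like the following works (an optional convenience, not part of the released code):

```python
# Optional environment check for the version constraints above.
import torch
import transformers
from packaging import version  # installed as a dependency of transformers

assert version.parse(transformers.__version__) <= version.parse("4.33.2"), \
    "transformers must be no higher than 4.33.2"
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("2.0"), \
    "torch >= 2.0 is required"
print("transformers", transformers.__version__, "| torch", torch.__version__)
```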

For inference, note that the task token `<S>` and the special token `<extra_id_0>` are used during the length-adaptation and fine-tuning stages, so you should format your instruction input as `<S> {your input} <extra_id_0>` to get a better answer.

Below is a sentence completion example using OpenBA-LM.

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-LM", trust_remote_code=True)
>>> model = AutoModelForSeq2SeqLM.from_pretrained("OpenBA/OpenBA-LM", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> query = "<S>" + "苏州处太湖平原,沿江为高沙平原,河" + "<extra_id_0>"
>>> inputs = tokenizer(query, return_tensors="pt").to("cuda")
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=32)
>>> response = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> print(response)
流两侧为河淤平原,苏州平原是江苏平原主体,地势低平,土地肥沃,气候温和
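The same `<S> {your input} <extra_id_0>` layout is used in every example, so if you build prompts in several places, a small helper keeps the format in one spot. Note that `build_prompt` below is a hypothetical convenience function, not part of the released code:

```python
def build_prompt(text: str) -> str:
    # Wrap raw input with the task token <S> and the sentinel token <extra_id_0>,
    # matching the format used during length adaptation and fine-tuning.
    return "<S>" + text + "<extra_id_0>"

query = build_prompt("苏州处太湖平原,沿江为高沙平原,河")
```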

Below is an instruction-following example using OpenBA-Flan.

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True)
>>> model = AutoModelForSeq2SeqLM.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> query = "<S>" + "介绍一下中国的四大名著,并分别概括其主要内容" + "<extra_id_0>"
>>> inputs = tokenizer(query, return_tensors="pt").to("cuda")
>>> outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
>>> response = tokenizer.decode(outputs[0], skip_special_tokens=True)
>>> print(response)
中国的四大名著分别是《红楼梦》、《西游记》、《水浒传》和《三国演义》。它们分别包括故事情节、文化内涵和历史背景等方面的不同特点。《红楼梦》是一部中国古典小说,讲述了贾宝玉、林黛玉、薛宝钗等一群人物在贾府的生活和爱情故事。《西游记》是中国著名小说,描述了孙悟空、猪八戒、沙悟净等一众妖魔鬼怪的冒险历程和故事。《水浒传》是一部中国古典小说,描述了宋江等一百零八位好汉的反抗故事。《三国演义》是中国古代著名小说,讲述了三国时期的历史和战争故事。这些小说在文学、历史、哲学和文化等方面都有着不同的影响和地位。
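Because do_sample=True is used above, the answer will vary from run to run. If you prefer more deterministic output, you can switch to beam search; the snippet below continues from the example above and only relies on standard transformers generate arguments, nothing OpenBA-specific.

```python
# Beam-search decoding: more deterministic than sampling.
outputs = model.generate(
    **inputs,
    do_sample=False,   # disable sampling
    num_beams=4,       # beam search with 4 beams
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```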

You can run the chat and code demos as follows:

python gradio_chat_demo.py # run chat demo
python gradio_code_demo.py # run code demo
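For reference, here is a minimal sketch of what a chat demo along these lines could look like, assuming gradio is installed and reusing the OpenBA-Flan loading code from above; the actual gradio_chat_demo.py in this repository may differ:

```python
# Minimal chat demo sketch (assumes: pip install gradio). Illustrative only;
# see gradio_chat_demo.py in this repository for the real implementation.
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("OpenBA/OpenBA-Flan", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "OpenBA/OpenBA-Flan", trust_remote_code=True
).half().cuda().eval()

def chat(message, history):
    # Wrap the user message with the task token and the sentinel token,
    # following the prompt format described above. The chat history is not
    # used in this simple sketch.
    query = "<S>" + message + "<extra_id_0>"
    inputs = tokenizer(query, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# gr.ChatInterface provides the chat UI and passes (message, history) to chat().
gr.ChatInterface(chat).launch()
```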

Training

Our training code is located in the training folder. It is built on top of Megatron-LM with our own modifications.

For pre-training, the relevant requirements should be installed beforehand as stated in Megatron-LM. Then you can simply run the following commands to serialize the texts into binary files, which can be read much faster through a memory-mapped (MMap) dataset:

cd training
bash scripts/data_process_span_corr.sh  # process pre-train data
bash scripts/data_process_flan.sh  # process fine-tune data
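As a toy illustration of why the memory-mapped format helps (this is not the actual file format produced by the scripts above): once token ids sit in a flat binary file, numpy can map it and slice training windows lazily, without loading or re-tokenizing the whole corpus.

```python
import numpy as np

# Write some fake pre-tokenized ids to a flat binary file (illustrative only;
# the real Megatron-style dataset also ships an accompanying index file).
np.arange(1_000_000, dtype=np.uint16).tofile("corpus.bin")

# Memory-map the file: slices are read lazily from disk, so corpora much
# larger than RAM can still be indexed cheaply by every data-loader worker.
token_ids = np.memmap("corpus.bin", dtype=np.uint16, mode="r")
window = token_ids[4096:4096 + 512]  # one training window of 512 tokens
print(window.shape, window.dtype)
```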

Then you can run distributed training across multiple nodes with:

bash scripts/run_pretrain.sh  # pre-train
bash scripts/run_stretch.sh  # length adaptation
bash scripts/run_flan.sh   # fine-tune

Details

Model Structure

Generally, the OpenBA model follows the standard encoder-decoder architecture, but the encoder and decoder serve different roles: the encoder endows the model with strong comprehension capability, while the decoder provides its generative ability. Existing works indicate that an encoder-decoder model with more encoder layers can achieve powerful performance. To fill the gap left by LLMs with deeper decoders, we design an asymmetric structure with a shallow encoder and a deep decoder; the hyper-parameters are listed in the table below.

| Encoder Layers | Decoder Layers | Attn Heads | $d_{model}$ | $d_{ff}$ | #Param. (B) | Vocab Size | Training Tokens | Pos Emb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 36 | 40 | 4096 | 16384 | 14.6 | 251000 | 380B | RoPE |

Data Collection

<p align="center" width="100%"> <a target="_blank"><img src="assets/data.png" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a> </p> The composition of Data collection. Figure (a) represents the composition ratio of the pre-training dataset. Figure (b) represents the composition of the bilingual Flan dataset. Figure (c) represents the finer-grained composition of the Chinese Flan dataset.

Disclaimers📌

The use of the OpenBA-LM should adhere to societal norms and not be used for any activities that jeopardize national or social security or violate the law. Additionally, we also request users not to use the OpenBA-LM for internet services that have not undergone appropriate security review and documentation. We hope that all users will abide by this principle to ensure that technological development occurs in a regulated and legal environment.

We have done our best to ensure the compliance of the data used during the model training process. However, despite our significant efforts, unforeseen issues may still arise due to the complexity of the model and data. If misleading or harmful statements are generated through the use of the models included in this project or their modified versions while providing services, the responsibility lies with the service provider and is not associated with this project.

Citation

Please cite us if our paper or code helps you.

@article{li2023openba,
  title={OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch},
  author={Li, Juntao and Tang, Zecheng and Ding, Yuyang and Wang, Pinzheng and Guo, Pei and You, Wangjie and Qiao, Dan and Chen, Wenliang and Fu, Guohong and Zhu, Qiaoming and others},
  journal={arXiv preprint arXiv:2309.10706},
  year={2023}
}