
<div id="top"></div>

Alpaca-CoT

Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface for Instruction Collection, Parameter-efficient Methods, and Large Language Models


This is the repository for the Alpaca-CoT project, which aims to build an instruction finetuning (IFT) platform with extensive instruction collection (especially the CoT datasets) and a unified interface for various large language models and parameter-efficient methods. We are constantly expanding our instruction-tuning data collection, and integrating more LLMs and more parameter-efficient methods. In addition, we created a new branch tabular_llm to build a Tabular LLM for solving Table Intelligence Tasks.

You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) we have not yet collected. We will format them uniformly, train the Alpaca model (and other LLMs, in the near future) on them, open-source the model checkpoints, and conduct extensive empirical studies. We hope that our project can make a modest contribution to the open-sourcing of large language models and lower the barrier for NLP researchers to get started.

<img src="./figures/wechat.jpg" width = "100" height = "100" align=right /> You can also join our WeChat group chat to communicate with more people who share the same interests. The group is currently too large to join directly through the QR code, so please contact me first and I will add you to the group.

News


Overview


LLaMA [1] is a great work that demonstrates amazing zero-shot and few-shot abilities. It significantly reduces the cost of training, finetuning, and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B. Recently, to boost LLaMA's instruction-following ability, Stanford Alpaca [2] finetuned LLaMA-7B on 52K instruction-following examples generated with the Self-Instruct [3] technique. However, the LLM research community still faces three challenges: 1. even LLaMA-7B still has high computing-resource requirements; 2. there are few open-source datasets for instruction finetuning; and 3. there is a lack of empirical study on how various types of instructions affect model abilities, such as the ability to respond to Chinese instructions and CoT reasoning.

To this end, we propose this project, which leverages various improvements that were subsequently proposed, with the following advantages:

To the best of our knowledge, this work is the first to study CoT reasoning based on LLaMA and Alpaca. Therefore, we abbreviate our work to Alpaca-CoT.

Data Collection

The relative sizes of the collected datasets are shown in the following graph:

(figure: relative sizes of the collected datasets)

Following this (@yaodongC), we label each collected dataset according to the following rules:

(Lang) Lingual-Tags:

- EN: instruction datasets in English
- CN: instruction datasets in Chinese
- ML: [Multi-lingual] instruction datasets in multiple languages

(Task) Task-Tags:

- MT: [Multi-task] datasets containing multiple tasks
- TS: [Task-specific] datasets tailored for specific tasks

(Gen) Generation-method:

- HG: [Human Generated Dataset] datasets created by humans
- SI: [Self-Instruct] datasets generated with self-instruct methods
- MIX: [Mixed Dataset] datasets containing both human- and machine-generated data
- COL: [Collection of Dataset] datasets assembled from a collection of other datasets

Statistics

| Dataset | Nums | Lang | Task | Gen | Type | Src | Url |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chain of Thought | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | download |
| GPT4all | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | download |
| GPTeacher | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download |
| Guanaco | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download |
| HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download |
| alpaca | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download |
| Natural Instructions | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | download |
| belle_cn | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download |
| instinwild | 52191 | EN/CN | MT | SI | generation, open-qa, mind-storm | text-davinci-003 | download |
| prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans feedback manually | download |
| finance_en | 68912 | EN | TS | COL | financial related qa | GPT3.5 | download |
| xP3 | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 languages & 16 NLP tasks | human annotated datasets collection | download |
| firefly | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | download |
| instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, open-source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | download |
| Code Alpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download |
| Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download |
| webGPT | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3; each instruction has two outputs, the better one is selected | download |
| dolly 2.0 | 15015 | EN | TS | HG | closed QA, summarization, etc., with Wikipedia as references | human annotated | download |
| baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | download |
| hh-rlhf | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | download |
| OIG(part) | 49237 | EN | MT | COL | created from various tasks, such as question answering | using data augmentation, human annotated datasets collection | download |
| GAOKAO | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and open-ended questions from examinations | human annotated | download |
| camel | 760620 | EN | MT | SI | role-playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | download |
| FLAN-Muffin | 1764800 | EN | MT | COL | 60 nlp tasks | human annotated datasets collection | download |
| COIG(FlagInstruct) | 298428 | CN | MT | COL | collected from Exam, Translated, Human Value Alignment Instructions and Counterfactual Correction Multi-round Chat | automatic tools and manual verification | download |
| GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
| ShareChat | 1663241 | EN | MT | MIX | general instruct | crowdsourced conversations between people and ChatGPT (ShareGPT) | download |
| Auto CoT | 5816 | EN | MT | COL | arithmetic, commonsense, symbolic, and other logical reasoning tasks | human annotated datasets collection | download |
| MOSS | 1583595 | EN/CN | TS | SI | general instruct | text-davinci-003 | download |
| ultrachat | 28247446 | EN | | | questions about the world, writing and creation, assistance on existent materials | two separate gpt-3.5-turbo instances | download |
| Chinese-medical | 792099 | CN | TS | COL | questions about medical advice | crawl | download |
| CSL | 396206 | CN | MT | COL | paper text generation, keyword extraction, text summarization and text classification | crawl | download |
| pCLUE | 1200705 | CN | MT | COL | general instruct | | download |
| news_commentary | 252776 | CN | TS | COL | translation | | download |
| StackLLaMA | todo | EN | | | | | |

Download

You can download all the formatted data here. Then you should put them in the data folder.

You can download all checkpoints trained on various types of instruction data from here. Then, after setting LoRA_WEIGHTS (in generate.py) to the local checkpoint path, you can directly run model inference.
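
For example, a minimal sketch of this step (the checkpoint path below is hypothetical):

```python
# In generate.py, point LoRA_WEIGHTS at the local directory that holds the
# downloaded LoRA checkpoint (this path is a hypothetical example).
LoRA_WEIGHTS = "./saved_models/llama-7b-hf_alpaca-belle-cot"
```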

Data Formatting

All data in our collection is formatted into the same template, where each sample looks as follows:

[
  {
    "instruction": instruction string,
    "input": input string,    # (may be empty)
    "output": output string
  }
]

Note that for CoT datasets, we first use the template provided by FLAN to convert the original dataset into various Chain-of-Thought forms, and then convert it to the above format. The formatting script can be found here.
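
As a quick sanity check, a minimal sketch of loading one of the formatted files from the data folder (the file name is just an example) and inspecting a sample:

```python
import json

# Load a uniformly formatted instruction file from the data folder.
with open("data/alpaca.json", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["instruction"])  # the instruction string
print(sample["input"])        # may be an empty string
print(sample["output"])       # the target output string
```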

Multi-interface Unified Platform

Setup

pip install -r requirements.txt

Note: make sure Python>=3.9 when finetuning ChatGLM.

PEFT

pip install -e ./peft

Instruction Finetuning

To enable researchers to conduct systematic IFT research on LLMs, we have collected different types of instruction data, integrated multiple LLMs, and unified the interfaces, making it easy to customize the desired combination:

Single GPU

python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1

Note: for multiple datasets, you can pass multiple paths to --data, e.g., --data ./data/alpaca.json ./data/finance.json <path2yourdata_1>

python3 uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so the batch size must be smaller than for the other models.

python3 uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
python3 uniform_finetune.py   --model_type moss --model_name_or_path fnlp/moss-moon-003-sft  \
    --data alpaca --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 1 \
    --learning_rate 3e-4 --epochs 3
python3 uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Note that you can also pass a local path (where the LLM weights are saved) to --model_name_or_path, and the instruction data --data can be freely chosen according to your interests.

Multiple GPUs

torchrun --nnodes 1 --nproc_per_node $ngpu uniform_finetune.py $args --data $data 
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
    --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so the batch size must be smaller than for the other models.

python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Inference

python3 generate.py  --data alpaca-belle-cot --model_type llama

python3 generate.py  --data alpaca-belle-cot --model_type chatglm

python3 generate.py  --data alpaca-belle-cot --model_type bloom

More details of instruction finetuning and inference can be found here, which our code is modified from. Note that the folders saved-xxx7b are the save paths for LoRA weights, and the LLaMA weights are automatically downloaded from Hugging Face.

Inference Hyper-parameter Explanation

top_p=0.9,
        # Moderately raise the nucleus-sampling probability threshold to enlarge the candidate token set and increase generation diversity.

temperature=1.0,
        # A low temperature would severely polarize the probability distribution over generated tokens, degenerating the sampling strategy into greedy decoding.

do_sample=True,
        # do_sample defaults to False; setting it to True switches generation to a sampling-based (multinomial) decoding strategy.

no_repeat_ngram_size=6,
        # Set the probability of any already-seen 6-gram to 0, ensuring that no 6-gram appears twice. This value is an empirical preliminary choice.

repetition_penalty=1.8,
        # Reduce the probability of re-generating tokens that have already appeared by applying a repetition penalty. This value is an empirical preliminary choice.
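
Putting these together, a minimal sketch of how such hyper-parameters are passed to a Hugging Face generate() call (model and tokenizer loading are omitted, and input_ids is assumed to be an already-tokenized prompt):

```python
generation_output = model.generate(
    input_ids=input_ids,
    top_p=0.9,               # nucleus-sampling probability threshold
    temperature=1.0,         # keep the distribution from collapsing to greedy
    do_sample=True,          # enable sampling-based decoding
    no_repeat_ngram_size=6,  # forbid any 6-gram from appearing twice
    repetition_penalty=1.8,  # penalize tokens that have already appeared
    max_new_tokens=256,      # an illustrative cap on generation length
)
```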

Parameter merging

python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy
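
Conceptually, merging folds the trained LoRA weights into the base model so that a standalone checkpoint can be saved. A minimal sketch using the peft library (paths are illustrative; the actual merge.py may differ in details):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and apply the trained LoRA weights on top of it.
base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(base, "saved-llama7b")  # LoRA weight dir

# Fold the LoRA deltas into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("merged-llama7b")
```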

Local chatting

python3 server.py --model_type chatglm --size 6b --lora_dir xxx

Batch predicting

python3 predict.py --model_type chatglm --size 6b --data for_dict_data --lora_dir xxx --result_dir yyy

Web service building

python3 web.py --model_type chatglm --size 6b --lora_dir xxx

Empirical Study of Instruction-tuning Open LLMs in Chinese (As of June 25th)

<details><summary>Note: The following experimental results are all obtained from ___An Empirical Study of Instruction-tuning Large Language Models in Chinese___.</summary> <p>

1. Benchmarks

This paper selects two evaluation benchmarks, Belle-eval and MMCU, to comprehensively evaluate LLM competencies in Chinese.

Belle-eval is constructed by self-instruct with ChatGPT; it has 1,000 diverse instructions covering 10 categories, from common NLP tasks (e.g., QA) to challenging ones (e.g., code and math). ChatGPT is used to rate the model responses against the golden answers. This benchmark can be regarded as assessing AGI (instruction-following) capability.

MMCU is a collection of Chinese multiple-choice questions in four professional disciplines: medicine, law, psychology, and education (e.g., the Gaokao examination). It lets LLMs take human-society exams in a multiple-choice format, making it suitable for evaluating the breadth and depth of LLMs' knowledge across disciplines.

<p align="center"> <img src="./figures/chinesellms-benchmarks.png" width="35%"> </p>

Data statistics of Belle-eval and MMCU are shown in the table above.

2. Main Factors

We conduct experiments to study the three main factors in instruction-tuning LLMs: LLM bases, parameter-efficient methods, and Chinese instruction datasets.

2.1 LLM Bases

For LLM bases, we test both existing open LLMs and LLMs fine-tuned with LoRA on Alpaca-GPT4, on Belle-eval and MMCU.

<p align="center"> <img src="./figures/chinesellms-llms1.png" width="80%"> <img src="./figures/chinesellms-llms2.png" width="40%"> </p>

Table 2 shows the scores of open LLMs on Belle-eval. Table 3 shows the accuracy of LLMs on MMCU. They fine-tune all the open LLMs with the same parameter-efficient method LoRA and the same instruction dataset Alpaca-GPT4.

Experimental Results:

  1. Evaluation of Existing LLMs

    Performance on Belle-eval

    (1) For base LLMs, Bloom performs the best.

    (2) For sft LLMs, ChatGLM outperforms others by large margins, thanks to the fact that it is trained with the most Chinese tokens and HFRL.

    (3) The Open QA, Math, CloseQA and Extract categories are still very challenging for existing open LLMs.

    (4) Vicuna and moss-sft have clear improvements compared to their bases, LLaMA and moss-base, respectively.

    (5) In contrast, the performance of the sft models Bloomz and Bloomz-mt is reduced compared to the base model Bloom, because they tend to generate shorter responses.

    Performance on MMCU

    (1) All base LLMs perform poorly, because before fine-tuning they can hardly generate content in the specified format, e.g., outputting option numbers.

    (2) All sft LLMs outperform their corresponding base LLMs. In particular, Bloomz performs the best (even beating ChatGLM), because it can generate the option number directly as required without producing other irrelevant content, which is due to the data characteristics of its supervised fine-tuning dataset xP3.

    (3) Among the four disciplines, law is the most challenging for LLMs.

    <p align="center"> <img src="./figures/chinesellms-llms3.png" width="40%">
</p>

The performance results of LLMs after instruction-tuning on Alpaca-GPT4-zh are shown in Figure 1.

  2. Instruction-tuning Different LLMs

    (1) On Belle-eval, the performance improvement of sft LLMs brought by instruction-tuning is not as significant as that of base LLMs, except for sft Bloomz and Bloomz-mt.

    (2) Vicuna and ChatGLM encounter performance drops after instruction-tuning, because Vicuna is trained on real human-ChatGPT conversations, whose quality is better than Alpaca-GPT4's, and ChatGLM adopts HFRL, which may make it unsuitable for further instruction-tuning.

    (3) On MMCU, most LLMs achieve performance boosts after instruction-tuning, with the exception of Bloomz and Bloomz-mt, whose performance unexpectedly drops significantly.

    (4) After instruction-tuning, Bloom shows significant improvements and performs well on both benchmarks. Although ChatGLM beats Bloom consistently, it suffers a performance drop during instruction-tuning. Therefore, among all open LLMs, Bloom is the most suitable foundation model for the subsequent Chinese instruction-tuning experiments.

2.2 Parameter-efficient Methods

Beyond LoRA, the paper collects a range of other parameter-efficient methods and uses them to instruction-tune Bloom on the Alpaca-GPT4 dataset.

<p align="center"> <img src="./figures/chinesellms-para1.png" width="40%"> <img src="./figures/chinesellms-para2.png" width="40%"> </p>

Experimental Results:

  1. Comparison of Parameter-efficient Methods

    (1) SadapterH performs the best among all parameter-efficient methods, which can be used as an alternative to LoRA.

    (2) P-tuning and prompt-tuning underperform the others by large margins, indicating that only adding trainable parameters in the embedding layer is not enough to support LLMs on generation tasks.

    (3) Although AdaLoRA is an improvement over LoRA, its performance drops clearly, possibly because LoRA's trainable parameter budget for LLMs is not suitable for further reduction.

    (4) Comparing the upper and lower parts, it can be seen that increasing the number of trainable parameters of sequential adapters (i.e., SadapterP and SadapterH) does not bring gains, while the opposite is observed for parallel adapters (i.e., P-adapter).

  2. Training Loss

    (1) Prompt-tuning and P-tuning converge the slowest and have the highest losses after convergence. This shows that embedding-only adapters are not suitable for instruction-tuning LLMs.

    (2) The initial loss of AdaLoRA is very high because it requires simultaneous learning of parameter budget allocation, which makes the model unable to fit the training data well.

    (3) The other methods can quickly converge on training data and fit it well.

2.3 Chinese instruction Datasets

For the impact of various types of Chinese instruction datasets, the authors gather popular open Chinese instruction datasets (as shown in Table 5) to fine-tune Bloom with LoRA.

<p align="center"> <img src="./figures/chinesellms-data1.png" width="80%"> <img src="./figures/chinesellms-data2.png" width="80%"> <img src="./figures/chinesellms-data3.png" width="40%"> </p>

Table 6 and Table 7 show the results of fine-tuning Bloom on the different instruction datasets.

Experimental Results:

  1. Performance on Belle-eval

    (1) The instruction data constructed by ChatGPT (e.g., using self-instruct methods or collecting real human-ChatGPT conversations) consistently enhances the instruction-following ability, with 3.1 ∼ 11-point score increases.

    (2) Among these datasets, Belle performs the best, thanks to its largest amount of instruction data. However, the performance of models trained on moss-sft-data, which contains even more data built in a similar way, is unsatisfactory.

    (3) The Alpaca-GPT4 instructions bring the second-best performance, with only 49K samples achieving results comparable to Belle's 1.54M.

    (4) Instinwild brings the smallest performance gains among them, because the seed instructions it crawls from Twitter ("in the wild") are not as comprehensive as those carefully designed by humans (like Alpaca's).

    (5) These ChatGPT-based data mainly bring significant improvements on open generation tasks such as Brain Storm and Generation, while causing significant decreases on tasks that require high reading-comprehension skills, such as Close QA and Extract.

    (6) In contrast, the instruction datasets built from collections of NLP tasks or examinations damage the model's instruction-following ability, because the form and intent of each such dataset are unitary and can easily be overfitted.

    (7) Among them, COIG-trans performs the best because it involves over 2,000 different tasks with a wide variety of task instructions. In contrast, xP3 and COIG-ccmc have the worst negative impact on model performance. Both cover only a few types of tasks (translation and QA for the former, counterfactual correction conversations for the latter), which hardly match the instructions and tasks popular among humans.

  2. Performance on MMCU

    (1) Instruction-tuning on each dataset can always result in performance improvement.

    (2) Among the ChatGPT-based data shown in the upper part, ShareGPT-zh underperforms the others by large margins. This may be because real users rarely ask multiple-choice questions about academic topics.

    (3) Among the dataset-collection data shown in the lower part, HC3 and COIG-ccmc result in the lowest accuracy, because HC3 has only 13K unique questions and the task format of COIG-ccmc differs significantly from MMCU's.

    (4) COIG-exam brings the greatest accuracy improvement, benefiting from a task format similar to MMCU's.

3. Other Factors

Four Other Factors: CoT, Expansion of Chinese Vocabulary, Language of Prompts and Human-value Alignment

3.1 CoT

For CoT, the authors compare the performance before and after adding CoT data during instruction-tuning.

Experiment Settings:

The authors collect 9 CoT datasets and their prompts from FLAN, and then translate them into Chinese using Google Translate.

The setting that directly adds CoT data is denoted "Alpaca-GPT4+CoT". In addition, adding the sentence "先思考,再决定" ("think step by step" in Chinese) at the end of each instruction induces the model to respond based on CoT; this setting is denoted "Alpaca-GPT4+CoT*".
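
A minimal sketch of the "Alpaca-GPT4+CoT*" variant (field names follow the unified template above; the file paths are hypothetical):

```python
import json

TRIGGER = "先思考,再决定"  # the CoT trigger sentence

# Append the trigger to every instruction before instruction-tuning.
with open("data/alpaca_gpt4_cot.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    sample["instruction"] = sample["instruction"].rstrip() + TRIGGER

with open("data/alpaca_gpt4_cot_star.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```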

<p align="center"> <img src="./figures/chinesellms-cot.png" width="40%"> </p>

Experimental Results:

  1. "Alpaca-GPT4+CoT" outperforms "Alpaca-GPT4" in Code and Math tasks that require strong reasoning ability. Besides, there is also a significant improvement in the MMCU Education task.

  2. As shown in the "Alpaca-GPT4+CoT*" row, the simple appended sentence further improves performance on the reasoning tasks Code and Education, while Math performance is slightly inferior to "Alpaca-GPT4+CoT". This may call for further exploration of more robust prompts.

3.2 Expansion of Chinese Vocabulary

For expansion of the Chinese vocabulary, the authors test how the number of Chinese tokens in the tokenizer's vocabulary affects LLMs' ability to express Chinese. For example, if a Chinese character is in the vocabulary, it can be represented by a single token; otherwise, it may require multiple tokens.

Experiment Settings: The authors mainly conduct experiments on LLaMA, whose SentencePiece vocabulary (32K tokens) covers far fewer Chinese characters than Bloom's (250K).
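
To see the effect concretely, a minimal sketch comparing how many tokens each tokenizer needs for the same Chinese sentence (the exact counts are illustrative, not measured):

```python
from transformers import AutoTokenizer

text = "今天天气很好"  # "The weather is nice today"

# LLaMA's 32K SentencePiece vocabulary covers few Chinese characters, so a
# character may fall back to several byte-level tokens; Bloom's 250K
# vocabulary usually maps one character to a single token.
for name in ["decapoda-research/llama-7b-hf", "bigscience/bloomz-7b1-mt"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok.tokenize(text)))
```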

<p align="center"> <img src="./figures/chinesellms-voc.png" width="45%"> </p>

Experimental Results:

  1. Pre-training on more Chinese corpora with an expanded Chinese vocabulary consistently helps instruction-following ability.

  2. Counterintuitively, "llama-voc-pre-l" (100B tokens) is inferior to "llama-voc-pre" (20B tokens) on MMCU, which shows that pre-training on more data does not necessarily lead to higher performance on academic exams.

3.3 Language of Prompts

For the language of prompts, the authors test whether Chinese or English prompts are more suitable for instruction fine-tuning.

<p align="center"> <img src="./figures/chinesellms-lan.png" width="60%"> </p>

Figure 4 shows the results of using Chinese and English prompts based on LLaMA and Bloom. When instruction-tuning LLaMA, using Chinese prompts can improve the performance on both benchmarks compared to English prompts, while the opposite phenomenon can be observed on Bloom.

Experimental Results:

  1. For models with weaker Chinese abilities (e.g., LLaMA), using Chinese prompts can effectively help them respond in Chinese.

  2. For models with good Chinese abilities (e.g., Bloom), using English prompts (the language they are better at) can better guide the model to understand the instruction fine-tuning process.

3.4 Human-value Alignment

To avoid LLMs generating toxic content, aligning them with human values is a crucial issue. We add human-value alignment data built by COIG into instruction-tuning to explore its impact.

<p align="center"> <img src="./figures/chinesellms-human.png" width="30%"> </p>

Figure 5 compares the results of instruction-tuning with and without human-value alignment.

Experimental Results: The human-value alignment results in a slight performance drop. How to balance the harmlessness and performance of LLMs is a research direction worth exploring in the future.

</p> </details>

Quantitative Analysis

<details><summary>Note: The following figure shows the statistics of the dataset collected as of March 26, which is shown only as motivation for data collection. Many more datasets have since been collected, such as finance-related instruction datasets.</summary> <p>

(figure: data collection statistics)

The current collection of instruction-finetuning datasets consists mainly of three parts: English instruction data, Chinese instruction data, and CoT instruction data.

Ablation of CoT and Chinese Instructions

(figure: ablation-cot)

"w/o CoT" and "w/o CN" denote models whose instruction-finetuning data excludes CoT data and Chinese instructions, respectively.

The above table shows two examples (involving numerical calculations) that require a certain amount of reasoning ability to answer correctly. As shown in the middle column, Ours w/o CoT fails to generate the correct response, which shows that once the finetuning data contains no CoT data, the model's reasoning ability drops significantly. This further demonstrates that CoT data is essential for LLMs.

(figure: ablation-cot)

The above table shows two examples that require the ability to respond to Chinese instructions. As shown in the right column, Ours w/o CN either generates unreasonable content or answers the Chinese instructions in English. This shows that removing Chinese data during finetuning leaves the model unable to handle Chinese instructions, and further demonstrates the need to collect Chinese instruction-finetuning data.

(figure: ablation-cot)

The above table shows a relatively difficult example, which requires both a certain accumulation of knowledge of Chinese history and the ability to state historical events logically and completely. As shown in this table, Ours w/o CN can only generate a short and erroneous response because, lacking Chinese finetuning data, it naturally lacks the corresponding knowledge of Chinese history. Although Ours w/o CoT lists some relevant Chinese historical events, its logic of expression is self-contradictory, which is caused by the lack of CoT data.

In summary, models finetuned on our complete dataset (English, Chinese, and CoT instruction data) show significantly improved reasoning and Chinese instruction-following abilities.

The Effect of CoT Data

(figure: CoT-comparison)

Samples in odd-numbered rows do not apply the CoT prompt, such as "step-by-step reasoning". Both Ours (w/CoT) and Alpaca are based on LLaMA-7B; the only difference between the two is that the instruction-finetuning data of Ours (w/CoT) includes extra CoT data.

From the above table, we find that:

The Effect of Chinese Instruction Data

Quantitative comparison of responses to Chinese instructions.

(figure: CN_compare_CN)

Our model is finetuned from a 7B LLaMA on 52K English instructions and 0.5M Chinese instructions. Stanford Alpaca (our reimplementation) is finetuned from a 7B LLaMA on 52K English instructions. BELLE is finetuned from a 7B BLOOM on 2B Chinese instructions.

From the above table, several observations can be found:

Quantitative comparison of responses to English instructions. The purpose of this subsection is to explore whether finetuning on Chinese instructions has a negative impact on Alpaca.

(figure: CN_compare_EN)

From the above table, we find that:

</p> </details>

Citation

Please cite this repo if you use its data collection, code, or experimental findings.

@misc{si2023empirical,
      title={An Empirical Study of Instruction-tuning Large Language Models in Chinese}, 
      author={Qingyi Si and Tong Wang and Zheng Lin and Xu Zhang and Yanan Cao and Weiping Wang},
      year={2023},
      eprint={2310.07328},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

For data and models, please also cite the original sources of the data, the parameter-efficient methods, and the LLMs.

We would like to express our special gratitude to APUS AilMe Lab for sponsoring the 8 A100 GPUs for the experiments.

<p align="right">(<a href="#top">back to top</a>)</p>

All Thanks To Our Contributors

<a href="https://github.com/PhoebusSi/Alpaca-CoT/graphs/contributors"> <img src="https://contrib.rocks/image?repo=PhoebusSi/Alpaca-CoT" /> </a>