
<p align="center">
  <img src="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/assets/llama-3-syne-logo.png" width="400"/>
</p>

<p align="center">
  📄 <a href="https://arxiv.org/abs/2407.18743">Report</a>&nbsp; | &nbsp;🤗 <a href="https://huggingface.co/survivi/Llama-3-SynE">Model on Hugging Face</a>&nbsp; | &nbsp;📊 <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset">CPT Dataset</a>
</p>

<p align="center">
  🔍 <a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README.md">English</a>&nbsp; | &nbsp;<a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README_zh.md">简体中文 (Simplified Chinese)</a>
</p>

News

Model Introduction

Llama-3-SynE (<ins>Syn</ins>thetic data <ins>E</ins>nhanced Llama-3) is an enhanced version of Llama-3 (8B), obtained through continual pre-training (CPT) to improve its Chinese language ability and scientific reasoning capability. By employing a carefully designed data mixture and curriculum strategy, Llama-3-SynE acquires these new abilities while maintaining the original model's performance. The training data combines existing datasets with synthesized high-quality datasets designed specifically for the targeted tasks.

Key features of Llama-3-SynE include:

Model List

| Model | Type | Seq Length | Download |
| --- | --- | --- | --- |
| Llama-3-SynE | Base | 8K | [🤗 Huggingface](https://huggingface.co/survivi/Llama-3-SynE) |

Benchmark

We divide all evaluation benchmarks into two groups. The first group is major benchmarks, which aim to evaluate the comprehensive capacities of LLMs. Note that we include commonly used math and code benchmarks in this group because it is standard practice to use these benchmarks for evaluating various general-purpose LLMs.

The second group is scientific benchmarks, which have a broader coverage of multidisciplinary scientific knowledge.

We report eight-shot performance on GSM8K, ASDiv, and MAWPS; five-shot performance on C-Eval, CMMLU, MMLU, MATH, GaoKao, SciQ, SciEval, SAT-Math, and AQUA-RAT; and three-shot performance on MBPP. For HumanEval and ARC, we report zero-shot performance. The best and second-best results are in **bold** and <ins>underlined</ins>, respectively.
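For readers who want to reproduce these settings, the shot counts listed above can be captured in a small lookup table. This is only a sketch: the names `SHOTS_BY_BENCHMARK` and `num_shots` are hypothetical and do not come from the repository or from any evaluation framework.

```python
# Few-shot settings transcribed from the paragraph above.
# SHOTS_BY_BENCHMARK and num_shots are hypothetical names used for illustration.
SHOTS_BY_BENCHMARK = {
    **dict.fromkeys(["GSM8K", "ASDiv", "MAWPS"], 8),
    **dict.fromkeys(
        ["C-Eval", "CMMLU", "MMLU", "MATH", "GaoKao",
         "SciQ", "SciEval", "SAT-Math", "AQUA-RAT"],
        5,
    ),
    "MBPP": 3,
    **dict.fromkeys(["HumanEval", "ARC"], 0),
}


def num_shots(benchmark: str) -> int:
    """Return the number of in-context examples used for a benchmark."""
    try:
        return SHOTS_BY_BENCHMARK[benchmark]
    except KeyError:
        raise KeyError(f"No few-shot setting recorded for {benchmark!r}") from None


print(num_shots("GSM8K"))
print(num_shots("HumanEval"))
```

Keeping the settings in one mapping makes it harder to evaluate a benchmark with the wrong number of demonstrations by accident.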

Major Benchmarks

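| Models | MMLU | C-Eval | CMMLU | MATH | GSM8K | ASDiv | MAWPS | SAT-Math | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B | **66.60** | 49.43 | 51.03 | 16.20 | 54.40 | 72.10 | 89.30 | 38.64 | <ins>36.59</ins> | **47.00** |
| DCLM-7B | 64.01 | 41.24 | 40.89 | 14.10 | 39.20 | 67.10 | 83.40 | <ins>41.36</ins> | 21.95 | 32.60 |
| Mistral-7B-v0.3 | 63.54 | 42.74 | 43.72 | 12.30 | 40.50 | 67.50 | 87.50 | 40.45 | 25.61 | 36.00 |
| Llama-3-Chinese-8B | 64.10 | <ins>50.14</ins> | <ins>51.20</ins> | 3.60 | 0.80 | 1.90 | 0.60 | 36.82 | 9.76 | 14.80 |
| MAmmoTH2-8B | 64.89 | 46.56 | 45.90 | **34.10** | **61.70** | **82.80** | <ins>91.50</ins> | <ins>41.36</ins> | 17.68 | 38.80 |
| Galactica-6.7B | 37.13 | 26.72 | 25.53 | 5.30 | 9.60 | 40.90 | 51.70 | 23.18 | 7.31 | 2.00 |
| Llama-3-SynE (ours) | <ins>65.19</ins> | **58.24** | **57.34** | <ins>28.20</ins> | <ins>60.80</ins> | <ins>81.00</ins> | **94.10** | **43.64** | **42.07** | <ins>45.60</ins> |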

On Chinese evaluation benchmarks, Llama-3-SynE clearly outperforms the base model Llama-3 (8B): C-Eval rises from 49.43 to 58.24 and CMMLU from 51.03 to 57.34, indicating that our method is effective in improving Chinese language capabilities.

On English evaluation benchmarks (such as MMLU, MATH, and the code benchmarks HumanEval and MBPP), Llama-3-SynE performs comparably to or better than the base model: MATH and HumanEval improve (16.20 to 28.20 and 36.59 to 42.07), while MMLU and MBPP each drop by about 1.4 points. This indicates that our method largely mitigates catastrophic forgetting during the CPT process.
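One simple way to check for forgetting is to subtract the base model's scores from the Llama-3-SynE scores and list the benchmarks that went down. The sketch below does this for the Major Benchmarks rows of Llama-3-8B and Llama-3-SynE. The helper names and the `tolerance` parameter are illustrative assumptions and are not part of the original evaluation code.

```python
# Scores copied from the Major Benchmarks table (Llama-3-8B vs. Llama-3-SynE).
BASE = {
    "MMLU": 66.60, "C-Eval": 49.43, "CMMLU": 51.03, "MATH": 16.20,
    "GSM8K": 54.40, "ASDiv": 72.10, "MAWPS": 89.30, "SAT-Math": 38.64,
    "HumanEval": 36.59, "MBPP": 47.00,
}
SYNE = {
    "MMLU": 65.19, "C-Eval": 58.24, "CMMLU": 57.34, "MATH": 28.20,
    "GSM8K": 60.80, "ASDiv": 81.00, "MAWPS": 94.10, "SAT-Math": 43.64,
    "HumanEval": 42.07, "MBPP": 45.60,
}


def score_deltas(base: dict, tuned: dict) -> dict:
    """Per-benchmark difference (tuned minus base), in score points."""
    return {name: round(tuned[name] - base[name], 2) for name in base}


def regressions(deltas: dict, tolerance: float = 0.0) -> dict:
    """Benchmarks whose score dropped by more than `tolerance` points (hypothetical threshold)."""
    return {name: d for name, d in deltas.items() if d < -tolerance}


deltas = score_deltas(BASE, SYNE)
for name, delta in deltas.items():
    print(f"{name:10s} {delta:+.2f}")
print("regressions:", regressions(deltas))
```

With the default tolerance of zero, only MMLU and MBPP are reported as regressions. Raising `tolerance` to 2.0 points would report none.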

Scientific Benchmarks

"PHY", "CHE", and "BIO" denote the physics, chemistry, and biology sub-tasks of the corresponding benchmarks.

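| Models | SciEval PHY | SciEval CHE | SciEval BIO | SciEval Avg. | SciQ | GaoKao MathQA | GaoKao CHE | GaoKao BIO | ARC Easy | ARC Challenge | ARC Avg. | AQUA-RAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B | 46.95 | 63.45 | 74.53 | 65.47 | 90.90 | 27.92 | 32.85 | 43.81 | 91.37 | 77.73 | 84.51 | <ins>27.95</ins> |
| DCLM-7B | **56.71** | 64.39 | 72.03 | 66.25 | **92.50** | 29.06 | 31.40 | 37.14 | 89.52 | 76.37 | 82.94 | 20.08 |
| Mistral-7B-v0.3 | 48.17 | 59.41 | 68.89 | 61.51 | 89.40 | 30.48 | 30.92 | 41.43 | 87.33 | 74.74 | 81.04 | 23.23 |
| Llama-3-Chinese-8B | 48.17 | 67.34 | 73.90 | <ins>67.34</ins> | 89.20 | 27.64 | 30.43 | 38.57 | 88.22 | 70.48 | 79.35 | 27.56 |
| MAmmoTH2-8B | 49.39 | **69.36** | <ins>76.83</ins> | **69.60** | 90.20 | **32.19** | <ins>36.23</ins> | <ins>49.05</ins> | **92.85** | **84.30** | **88.57** | 27.17 |
| Galactica-6.7B | 34.76 | 43.39 | 54.07 | 46.27 | 71.50 | 23.65 | 27.05 | 24.76 | 65.91 | 46.76 | 56.33 | 20.87 |
| Llama-3-SynE (ours) | <ins>53.66</ins> | <ins>67.81</ins> | **77.45** | **69.60** | <ins>91.20</ins> | <ins>31.05</ins> | **51.21** | **69.52** | <ins>91.58</ins> | <ins>80.97</ins> | <ins>86.28</ins> | **28.74** |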

On scientific evaluation benchmarks (such as SciEval, GaoKao, and ARC), Llama-3-SynE outperforms the base model on every reported score, with the largest gains on Chinese scientific benchmarks (for example, the GaoKao biology subtest improves by 25.71 points, from 43.81 to 69.52).
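The 25.71 figure for GaoKao BIO is the absolute difference between the two scores in the table, not a relative change. A short calculation makes the distinction explicit. The variable names are illustrative; the two scores are taken from the Scientific Benchmarks table.

```python
base_score = 43.81  # Llama-3-8B on GaoKao BIO
syne_score = 69.52  # Llama-3-SynE on GaoKao BIO

# Absolute gain: difference in score points.
absolute_gain = syne_score - base_score

# Relative gain: the same difference as a percentage of the base score.
relative_gain = absolute_gain / base_score * 100

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gain is 25.71 points, which corresponds to a relative gain of roughly 58.7% over the base model's score.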

Quick Start

Use the transformers backend for inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "survivi/Llama-3-SynE"
device = "cuda:0"  # one device string for both the model and the inputs

# Load the tokenizer and the model weights in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.to(device)
model.eval()

prompt = "Hello world!"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation of the prompt.
pred = model.generate(
    **inputs,
    max_new_tokens=2048,
    repetition_penalty=1.05,
    temperature=0.5,
    top_k=5,
    top_p=0.85,
    do_sample=True,
)

# Drop the prompt tokens so that only the newly generated text is decoded.
pred = pred[0][inputs.input_ids.shape[1]:]
output = tokenizer.decode(pred, skip_special_tokens=True)
print(output)
```

Use the vLLM backend for inference:

```python
from vllm import LLM, SamplingParams

model_path = "survivi/Llama-3-SynE"

# Sampling settings for generation.
sampling_params = SamplingParams(
    max_tokens=2048,
    repetition_penalty=1.05,
    temperature=0.5,
    top_k=5,
    top_p=0.85,
)
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,  # number of GPUs to shard the model across
    trust_remote_code=True,
)

prompt = "Hello world!"
outputs = llm.generate(prompt, sampling_params)
# generate() returns one result per prompt; take the first completion of the first prompt.
output = outputs[0].outputs[0].text
print(output)
```
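Because Llama-3-SynE is released as a base model (see the Model List), it continues text rather than following chat-style instructions, so task prompts usually work better with a few demonstrations in front of the question. The following is a minimal sketch of a few-shot prompt builder. The function name `build_few_shot_prompt` and the "Question:/Answer:" template are assumptions made for illustration, not the prompt format used in the report.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Join (question, answer) demonstrations and append the new question.

    The "Question:/Answer:" layout is an illustrative assumption, not the
    template used for the reported benchmark numbers.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # Leave the final answer empty so the model completes it.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


demos = [("What is 2 + 3?", "5"), ("What is 7 - 4?", "3")]
prompt = build_few_shot_prompt(demos, "What is 6 + 1?")
print(prompt)
```

The resulting string can be used as `prompt` with either the transformers or the vLLM backend. For short answers like these, a much smaller generation limit than 2048 tokens is usually enough.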

License

This project is built upon Meta's Llama-3 model. The use of Llama-3-SynE model weights must follow the Llama-3 license agreement. The code in this open-source repository follows the Apache 2.0 license.

Citation

If you find our work helpful, please consider citing the following paper:

@article{jie2024llama3syne,
  title={Towards Effective and Efficient Continual Pre-training of Large Language Models},
  author={Chen, Jie and Chen, Zhipeng and Wang, Jiapeng and Zhou, Kun and Zhu, Yutao and Jiang, Jinhao and Min, Yingqian and Zhao, Wayne Xin and Dou, Zhicheng and Mao, Jiaxin and others},
  journal={arXiv preprint arXiv:2407.18743},
  year={2024}
}