
<p align="left"> <a href="README_CN.md">中文</a>&nbsp | English</a> </p> <br><br> <p align="center"> <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br> </p><p></p> <p align="center"> 🫣&nbsp<a href="https://huggingface.co/tencent/Tencent-Hunyuan-Large"><b>Hugging Face</b></a>&nbsp&nbsp | &nbsp&nbsp🖥️&nbsp&nbsp<a href="https://llm.hunyuan.tencent.com/" style="color: red;"><b>official website</b></a>&nbsp&nbsp|&nbsp&nbsp🕖&nbsp&nbsp <a href="https://cloud.tencent.com/product/hunyuan" ><b>HunyuanAPI</b></a>&nbsp&nbsp|&nbsp&nbsp🐳&nbsp&nbsp <a href="https://gitee.com/Tencent/Tencent-Hunyuan-Large" ><b>Gitee</b></a> </p><p align="center"> <a href="https://arxiv.org/abs/2411.02265" style="color: red;"><b>Technical Report</b></a>&nbsp&nbsp|&nbsp&nbsp <a href="https://huggingface.co/spaces/tencent/Hunyuan-Large"><b>Demo</b></a>&nbsp&nbsp&nbsp|&nbsp&nbsp <a href="https://cloud.tencent.com/document/product/851/112032" style="color: red;"><b>Tencent Cloud TI</b></a>&nbsp&nbsp&nbsp</p> <p><br></p> <p> <table align="center"> <tbody> <tr> <td align="center" colspan="3"><strong>Download Models</strong></td> </tr> <tr> <td align="center" style="width: 100px;" >Models</td> <td align="center" style="width: 500px;">Huggingface Download URL</td> <td align="center" style="width: 500px;">Tencent Cloud Download URL</td> </tr> <tr> <td style="width: 100px;">Hunyuan-A52B-Instruct-FP8</td> <td style="width: 500px;"><a href="https://huggingface.co/tencent/Tencent-Hunyuan-Large/tree/main/Hunyuan-A52B-Instruct-FP8" style="color: red;">Hunyuan-A52B-Instruct-FP8</a></td> <td style="width: 500px;"><a href="https://cdn-large-model.hunyuan.tencent.com/Hunyuan-A52B-Instruct-128k-fp8-20241116.zip" style="color: red;">Hunyuan-A52B-Instruct-FP8</a></td> </tr> <tr> <td style="width: 100px;">Hunyuan-A52B-Instruct</td> <td style="width: 500px;"><a 
href="https://huggingface.co/tencent/Tencent-Hunyuan-Large/tree/main/Hunyuan-A52B-Instruct" style="color: red;">Hunyuan-A52B-Instruct</a></td> <td style="width: 500px;"><a href="https://cdn-large-model.hunyuan.tencent.com/Hunyuan-A52B-Instruct-128k-20241116.zip" style="color: red;">Hunyuan-A52B-Instruct</a></td> </tr> <tr> <td style="width: 100px;">Hunyuan-A52B-Pretrain</td> <td style="width: 500px;"><a href="https://huggingface.co/tencent/Tencent-Hunyuan-Large/tree/main/Hunyuan-A52B-Pretrain" style="color: red;">Hunyuan-A52B-Pretrain</a></td> <td style="width: 500px;"><a href="https://cdn-large-model.hunyuan.tencent.com/Hunyuan-A52B-Pretrain-256k.zip" style="color: red;">Hunyuan-A52B-Pretrain</a></td> </tr> </tbody> </table> </p> <p></p>

Model Introduction

With the rapid development of artificial intelligence technology, large language models (LLMs) have made significant progress in fields such as natural language processing, computer vision, and scientific tasks. However, as the scale of these models increases, optimizing resource consumption while maintaining high performance has become a key challenge. To address this challenge, we have explored Mixture of Experts (MoE) models. The newly released Hunyuan-Large (Hunyuan-MoE-A52B) model is currently the largest open-source Transformer-based MoE model in the industry, featuring a total of 389 billion parameters and 52 billion active parameters.
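As a rough illustration of what the two parameter counts imply, the short calculation below derives the share of parameters that is active for a given token. It uses only the figures quoted above; the variable names are illustrative.

```python
# Parameter counts quoted in the introduction, in billions.
TOTAL_PARAMS_B = 389
ACTIVE_PARAMS_B = 52

# Fraction of the model's parameters that are active for a given token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active share: {active_fraction:.1%}")
```

In other words, roughly 13% of the parameters take part in each forward pass.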

By open-sourcing the Hunyuan-Large model and revealing related technical details, we hope to inspire more researchers with innovative ideas and collectively advance the progress and application of AI technology. We welcome you to join our open-source community to explore and optimize future AI models together!

Introduction to Technical Advantages

Model

Inference Framework

Training Framework

 

Related News

Benchmark Evaluation

The Hunyuan-Large pre-trained model achieves the best overall performance compared to both dense and MoE-based competitors with similar activated parameter sizes. On aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, and TriviaQA).
In mathematics, Hunyuan-Large outperforms all baselines on the GSM8K and MATH datasets and also achieves the best result on CMATH in Chinese. We also observe that Hunyuan-Large achieves the best overall performance across all Chinese tasks (e.g., CMMLU, C-Eval).

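| Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
| --- | --- | --- | --- | --- | --- |
| MMLU | 85.2 | 79.3 | 77.8 | 78.5 | 88.4 |
| MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
| BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
| HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
| CommonsenseQA | 85.8 | 84.1 | 82.4 | - | 92.9 |
| WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | 88.7 |
| PIQA | - | - | 83.6 | 83.7 | 88.3 |
| NaturalQuestions | - | - | 39.6 | 38.7 | 52.8 |
| DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
| ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95.0 |
| TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
| CMMLU | - | - | 60.0 | 84.0 | 90.2 |
| C-Eval | - | - | 59.6 | 81.7 | 91.9 |
| C3 | - | - | 71.4 | 77.4 | 82.3 |
| GSM8K | 89.0 | 83.7 | 83.7 | 79.2 | 92.8 |
| MATH | 53.8 | 41.4 | 42.5 | 43.6 | 69.8 |
| CMATH | - | - | 72.3 | 78.7 | 91.3 |
| HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | 71.4 |
| MBPP | 73.4 | 68.6 | 64.2 | 66.6 | 72.6 |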

Hunyuan-Large-Instruct achieves consistent improvements on most types of tasks compared to LLMs with similar activated parameters, indicating the effectiveness of our post-training. Looking at performance across benchmark categories, our instruct model achieves the best performance on the MMLU and MATH datasets.
Notably, on MMLU, our model outperforms the LLama3.1-405B model by 2.6 points (89.9 vs. 87.3).
This improvement indicates Hunyuan-Large-Instruct's strong understanding and reasoning capabilities across a wide range of language understanding tasks. On the MATH dataset, it surpasses LLama3.1-405B by 3.6 points (77.4 vs. 73.8).
These gains are achieved with only 52 billion activated parameters, underscoring the efficiency of our model.

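| Model | LLama3.1 405B Inst. | LLama3.1 70B Inst. | Mixtral 8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
| --- | --- | --- | --- | --- | --- |
| MMLU | 87.3 | 83.6 | 77.8 | 80.4 | 89.9 |
| CMMLU | - | - | 61.0 | - | 90.4 |
| C-Eval | - | - | 60.0 | - | 88.6 |
| BBH | - | - | 78.4 | 84.3 | 89.5 |
| HellaSwag | - | - | 86.0 | 90.3 | 88.5 |
| ARC-C | 96.9 | 94.8 | 90.0 | - | 94.6 |
| GPQA_diamond | 51.1 | 46.7 | - | - | 42.4 |
| MATH | 73.8 | 68.0 | 49.8 | 74.7 | 77.4 |
| HumanEval | 89.0 | 80.5 | 75.0 | 89.0 | 90.0 |
| AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | 8.3 |
| MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | 9.4 |
| IFEval strict-prompt | 86.0 | 83.6 | 71.2 | - | 85.0 |
| Arena-Hard | 69.3 | 55.7 | - | 76.2 | 81.8 |
| AlpacaEval-2.0 | 39.3 | 34.3 | 30.9 | 50.5 | 51.8 |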
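The margins quoted for MMLU and MATH can be re-derived from the instruct-model scores above. The snippet below is a small sketch of that arithmetic; the `margin` helper is illustrative and not part of the repository.

```python
# Instruct-model scores copied from the table above.
scores = {
    "MMLU": {"LLama3.1 405B Inst.": 87.3, "Hunyuan-Large Inst.": 89.9},
    "MATH": {"LLama3.1 405B Inst.": 73.8, "Hunyuan-Large Inst.": 77.4},
}


def margin(benchmark: str) -> float:
    """Return the Hunyuan-Large minus LLama3.1-405B score gap, in points."""
    row = scores[benchmark]
    return round(row["Hunyuan-Large Inst."] - row["LLama3.1 405B Inst."], 1)


for name in scores:
    print(f"{name}: +{margin(name)} points")
```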

 

Data

Hunyuan-Large provides the processes needed for model training. This section describes how to prepare training data in the format used for model training.

Training Data Format and Processing

Training data is processed into a messages format as follows. The default system prompt for training and inference is "You are a helpful assistant." Below are examples of single-turn and multi-turn data:

# Single-Turn Data Format

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "海水为什么是咸的" },
    {"role": "assistant", "content": "海水是咸的主要是因为其中含有许多溶解在水中的盐类和矿物质。这些盐类和矿物质来自于地球表面的岩石和土壤中的化学物质,随着时间的推移,它们被带到了海洋中。当海水蒸发时,水分蒸发掉了,但盐类和矿物质仍然留在水中,导致海水变得更加咸味。因此,海水的咸度是由其中的盐类和矿物质的含量决定的。"}
]

# Multi-Turn Data Format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "知道重庆森林这部电影吗?" },
    {"role": "assistant", "content": "知道呀,是一部由王家卫导演的片子。" },
    {"role": "user", "content": " 而主演里更是有王菲,一上映便受到追捧。" },
    {"role": "assistant", "content": "所以此片获得了第14届香港电影金像奖最佳影片奖。"}
]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models", use_fast=False, trust_remote_code=True)

ids = tokenizer.apply_chat_template(messages)

For more usage references, see the ./models/test.py file.
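If your raw data is stored as question/answer pairs, a small helper can produce the messages format shown above. This is a minimal sketch: `build_messages` is a hypothetical helper (not part of the repository) and the example text is illustrative.

```python
from typing import Dict, List, Tuple

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def build_messages(
    turns: List[Tuple[str, str]],
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, str]]:
    """Convert (user, assistant) pairs into the messages format shown above."""
    if not turns:
        raise ValueError("at least one (user, assistant) turn is required")
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    return messages


example = build_messages(
    [("Why is seawater salty?", "Because it contains dissolved salts.")]
)
print([m["role"] for m in example])
```

One pair yields single-turn data; several pairs yield multi-turn data with alternating user and assistant roles.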

 

Quick Start

You can quickly get started by referring to the content in the <a href="examples/README.md">Quick Start Guide</a>.

Model Training

To simplify the training process, HunyuanLLM provides a pre-built Docker image: hunyuaninfer/hunyuan-large.

Hardware Requirements

Tested on H20 GPUs with make_moe_param_leaf_module disabled, zero3+offload enabled, and a max_seq_length of 2048: full fine-tuning requires at least 32 GPUs, and LoRA fine-tuning requires at least 8 GPUs.

Training Performance

With the minimum configuration (8 GPUs for LoRA fine-tuning), per_device_train_batch_size set to 1, and gradient_accumulation_steps set to 1, each iteration takes approximately 35 seconds.
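To put the iteration time in perspective, the sketch below estimates an upper bound on training throughput. It rests on two assumptions that are not stated above: all 8 GPUs act as data-parallel ranks (as is the case under zero3), and every sample fills the full max_seq_length of 2048.

```python
# Figures quoted above for the minimum LoRA configuration.
NUM_GPUS = 8
PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 1
MAX_SEQ_LENGTH = 2048
SECONDS_PER_ITERATION = 35.0

# Assumption: all GPUs act as data-parallel ranks.
samples_per_iteration = NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS

# Upper bound: assumes every sample fills the full sequence length.
max_tokens_per_iteration = samples_per_iteration * MAX_SEQ_LENGTH
max_tokens_per_second = max_tokens_per_iteration / SECONDS_PER_ITERATION

print(samples_per_iteration, max_tokens_per_iteration, round(max_tokens_per_second))
```

Real throughput will be lower whenever samples are shorter than the maximum sequence length.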

Launch Method

Refer to: HuggingFace Transformers Trainer

Single-Machine Training

In the train directory, execute:

pip install -r requirements.txt
bash train.sh

Multi-Machine Training

To start training on multiple machines, follow the steps below and ensure that all machines are within the same cluster.

Configure Passwordless SSH Login Between Machines

The following steps use two machines as an example, with their IPs represented as ${ip1} and ${ip2}. These operations are performed within a Docker container.

First, configure passwordless SSH between containers on each machine.

ssh-keygen  # Generate id_rsa and id_rsa.pub for passwordless login
ssh-keygen -t rsa -A  # Generate /etc/ssh/ssh_host_rsa_key and ssh_host_ecdsa_key for starting 'SSH listen' later
/usr/sbin/sshd -p 36005 -o ListenAddress=0.0.0.0  # Start SSH listen
echo "Port 36005" > ~/.ssh/config  # Change SSH connection port to 36005
passwd root  # Set root password to avoid alerts from monitoring platforms

Note: The 36005 here is an example. You can choose any port, but ensure that the port is open and not occupied by other processes.
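Before starting sshd, you may want to confirm that the chosen port is not already occupied. The following is a minimal local check using Python's standard socket module; `is_port_free` is a hypothetical helper, and it only tests whether the port can be bound on this machine, not whether a firewall allows traffic to it.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


print(is_port_free(36005))
```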

Next, within the container on each machine, execute:

cat ~/.ssh/id_rsa.pub

Copy the output SSH public key and paste it into the ~/.ssh/authorized_keys file, with one public key per line. This must be done on every machine. Ultimately, the ~/.ssh/authorized_keys file on each machine should be identical and contain the public keys of all machines.
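The key-collection step can be sketched as a small merge function that keeps one unique public key per line. `merge_authorized_keys` is a hypothetical helper and the keys below are placeholders; writing the result to ~/.ssh/authorized_keys on each machine is left to you.

```python
from typing import Iterable, List


def merge_authorized_keys(existing: str, public_keys: Iterable[str]) -> str:
    """Return authorized_keys content with one unique public key per line."""
    merged: List[str] = []
    for line in list(existing.splitlines()) + list(public_keys):
        key = line.strip()
        if key and key not in merged:
            merged.append(key)
    return "".join(key + "\n" for key in merged)


# Placeholder keys for illustration only.
keys = ["ssh-rsa KEY_OF_NODE_1 root@node1", "ssh-rsa KEY_OF_NODE_2 root@node2"]
print(merge_authorized_keys("ssh-rsa KEY_OF_NODE_1 root@node1\n", keys), end="")
```

Running the same merge with the same set of keys on every machine produces identical files, which is the end state described above.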

It's important to note that during multi-node training, the code executed on each node must be consistent. It is recommended to mount a shared network drive. If mounting a shared drive is not possible, you need to manually copy the dataset, scripts, and code to the same directory on all machines.

Start Multi-Machine Training

Once the preparation steps are completed and dependencies are confirmed to be installed (if not, execute pip install -r requirements.txt to install), you can add the following configuration at the beginning of train.sh:

export HOST_GPU_NUM=8
# Current machine IP
export LOCAL_IP=${ip1}
# Multi-node machine IPs, separated by commas
export NODE_IP_LIST="${ip1}:8,${ip2}:8"
# Number of machine nodes
export NODES=2
export NODE_NUM=$((${NODES} * ${HOST_GPU_NUM}))

Note: Replace ${ip1} and ${ip2} with the actual IP addresses!
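A quick consistency check on these values can catch typos before launching. The sketch below parses a NODE_IP_LIST-style string and verifies that it agrees with NODES and HOST_GPU_NUM; `parse_node_ip_list` is a hypothetical helper and the IP addresses are placeholders.

```python
from typing import Dict


def parse_node_ip_list(node_ip_list: str) -> Dict[str, int]:
    """Parse "ip:gpus,ip:gpus" into a mapping from IP address to GPU count."""
    nodes: Dict[str, int] = {}
    for entry in node_ip_list.split(","):
        ip, _, gpus = entry.strip().rpartition(":")
        if not ip or not gpus.isdigit():
            raise ValueError(f"malformed entry: {entry!r}")
        nodes[ip] = int(gpus)
    return nodes


# 10.0.0.1 and 10.0.0.2 are placeholder addresses.
nodes = parse_node_ip_list("10.0.0.1:8,10.0.0.2:8")
host_gpu_num, num_nodes = 8, 2
assert len(nodes) == num_nodes
assert sum(nodes.values()) == num_nodes * host_gpu_num  # same value as NODE_NUM
print(nodes)
```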

Then, on the machine with ${ip1}, execute bash train.sh in the train/ directory. Note that on the first run, you might see the following output:

The authenticity of host '[ip]:36005 ([ip]:36005)' can't be established.
ECDSA key fingerprint is xxxxxx.
ECDSA key fingerprint is MD5:xxxxxx.
Are you sure you want to continue connecting (yes/no)?

At this point, type yes to continue.

Key Parameters

The key parameters in the script are as follows:

Note:

What to Do If Out of Memory?

Refer to: DeepSpeed Configuration

You can try modifying the DeepSpeed configuration by removing the auto attribute from these parameters and reducing their values:

Merging LoRA Models

The saved LoRA weights cannot be merged into the zero3 model during training because, with zero3 enabled, model weights are split across different data parallel ranks. If you want to merge LoRA weights into the base model, you can do so offline to obtain the merged weight file. Execute merge_lora_weight.sh to merge the LoRA weights with the base model weights. The parameters include:

 

Inference and Deployment

HunyuanLLM supports deployment with TRT-LLM and vLLM. We are open-sourcing the vLLM-backend deployment (see Using vLLM for Inference); the TRT-LLM deployment (see Using TRT-LLM for Inference) will be available in the near future.

Using TRT-LLM for Inference

To be released.

Using vLLM for Inference

Docker:

To simplify the deployment process, HunyuanLLM provides a pre-built Docker image:

hunyuaninfer/hunyuan-large. You only need to download the model files and start the Docker container using the code below to begin model inference.

docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:infer-open-source

Note: Docker container privilege management. The above code uses privileged mode (--privileged) to start the Docker container, which grants the container higher privileges, increasing the risk of data leakage and cluster security threats. It is recommended to avoid using privileged mode unless necessary to reduce security risks. For scenarios where privileged mode is required, conduct a thorough security assessment and implement appropriate security monitoring and hardening measures.

Configure Passwordless SSH Login Between Machines

The following steps use two machines as an example, with their IPs represented as ${ip1} and ${ip2}. These operations are performed within a Docker container.

First, run passwd on both machines to set a password, for example: Tmp123,./

Copy inference/login_ssh.py into the container and execute the following command, ensuring the IP and password are correctly entered.

python3 login_ssh.py --ips ${ip1},${ip2} --port 36000 --password=Tmp123,./

Note 📢: Before starting, be sure to verify multi-machine communication using vLLM's debugging script: https://docs.vllm.ai/en/latest/getting_started/debugging.html

BF16 Deployment

BF16 requires 16 H20 GPUs for deployment. After verifying that multi-machine communication is correct, execute the following steps:

Before running the commands, set the following environment variables:

${LOCAL_IP}: The IP corresponding to bond1 on the current machine
${MODEL_PATH}: Path to the Hunyuan LLM model

Step 1: Start Ray

Ray is an open-source library for parallel and distributed Python. In this section, we use Ray to achieve multi-machine communication.

Ray Component Configuration Hardening: The default configuration of Ray components does not enable authentication mechanisms for service ports (e.g., 6379, 8265), posing risks of unauthorized access and command execution. It is recommended to deploy Ray components only in trusted internal network environments or ensure strict access control list (ACL) policies are implemented for these ports to prevent unauthorized network access.

First, start Ray on each node (either in the background or by keeping the terminal running):

On the head node:

export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
ray start --block --head --node-ip-address=${LOCAL_IP} --port=6379

On all worker nodes:

Note: Replace {HEAD NODE $LOCAL_IP} with the actual ${LOCAL_IP} of the head node.

export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
ray start --block --address={HEAD NODE $LOCAL_IP}:6379 --node-ip-address=${LOCAL_IP}

If Ray fails to start, execute ray stop and then run the above commands again.
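Once Ray is running on every node, you can confirm from the head node that the cluster sees all GPUs. The sketch below uses Ray's `ray.init(address="auto")` and `ray.cluster_resources()`; `has_enough_gpus` and `check_ray_cluster` are hypothetical helpers, and the default of 16 comes from the BF16 requirement stated above.

```python
from typing import Mapping


def has_enough_gpus(resources: Mapping[str, float], required: int = 16) -> bool:
    """Check a Ray resource mapping for at least `required` GPUs."""
    return resources.get("GPU", 0) >= required


def check_ray_cluster(required: int = 16) -> bool:
    """Attach to the running Ray cluster and check its total GPU count."""
    import ray  # imported lazily so has_enough_gpus works without Ray installed

    ray.init(address="auto")  # connect to the cluster started with `ray start`
    return has_enough_gpus(ray.cluster_resources(), required)
```

Calling `check_ray_cluster()` on the head node should return True once all worker nodes have joined.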

Step 2: Execute Inference

Method 1: Command Line Inference

Below is a code snippet demonstrating how to quickly request the chat model using vLLM:

Note: vLLM Component Remote Code Execution Protection. In the code below, if the trust-remote-code configuration option of the vLLM component is enabled, it will allow loading and executing code from remote model repositories, which may lead to the execution of malicious code. Unless explicitly required by business needs, it is recommended to keep this configuration option disabled to reduce potential security threats.

import os
from vllm import LLM, SamplingParams

model_path=os.environ.get('MODEL_PATH')

llm = LLM(model=model_path,
        tokenizer=model_path,
        trust_remote_code=True,
        max_model_len=10240,
        dtype='bfloat16',
        tensor_parallel_size=16,
        pipeline_parallel_size=1,
        disable_log_stats=False,
        gpu_memory_utilization=0.98,
        disable_custom_all_reduce=True,
        #distributed_executor_backend='ray',
        enforce_eager=True,
        max_num_seqs=8,
        use_v2_block_manager=True,
        quantization=None)

prompts = ["海水为什么是咸的"]

sampling_params = SamplingParams(
    temperature=0.7, top_p=0.6, max_tokens=200, top_k=20, repetition_penalty=1.05)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Method 2: Service-Based Inference

Below we demonstrate how to deploy the model using vLLM in a service-based manner and make requests.

Run the following on the head node:

export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1

Next, start the service by running:

cd inference
sh run_server.sh

Troubleshooting tip: if you encounter the following error:

ray.exceptions.RaySystemError: System error: No module named 'transformers_modules' traceback: Traceback (most recent call last):
ModuleNotFoundError: No module named 'transformers_modules'

Copy the ~/.cache/huggingface/modules/ directory from the head node to the corresponding path on all worker nodes.

After successfully running run_server.sh, execute the request script:

sh openapi.sh

Be sure to modify ${LOCAL_IP} and ${MODEL_PATH} in openapi.sh to values that match the corresponding service.
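If you prefer to issue the request from Python, the sketch below builds an equivalent HTTP request with the standard library. It makes assumptions that should be checked against run_server.sh and openapi.sh: that the service exposes vLLM's OpenAI-compatible /v1/chat/completions route, and that it listens on port 8000 (a hypothetical value). `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(
    local_ip: str, model_path: str, prompt: str, port: int = 8000
) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request.

    Assumptions: the service exposes vLLM's OpenAI-compatible
    /v1/chat/completions route, and `port` is a hypothetical value.
    """
    payload = {
        "model": model_path,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        url=f"http://{local_ip}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("127.0.0.1", "/path/to/model", "Why is seawater salty?")
print(request.full_url)
```

To send it against a running service, pass the request to `urllib.request.urlopen(request)` and decode the JSON response.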

Quantized Model Deployment:

This section describes the process of deploying a quantized model using vLLM.

Image: The deployment image is the same as for BF16.

Int8 Quantized Model Deployment:

To deploy the Int8 weight-only version of the Hunyuan-Large model, set the environment variables in run_server_int8.sh:

${MODEL_PATH}: Path to the BF16 model
${LOCAL_IP}: The IP corresponding to bond1 on the current machine

Then, start the Int8 service by running:

sh run_server_int8.sh

After successfully running run_server_int8.sh, execute the request script:

sh openapi.sh

FP8 Quantized Model Deployment:

To deploy the W8A8C8 version of the Hunyuan-Large model, set the environment variables in run_server_fp8.sh:

${MODEL_PATH}: Path to the FP8 model
${LOCAL_IP}: The IP corresponding to bond1 on the current machine

Then, start the FP8 service by running:

sh run_server_fp8.sh

After successfully running run_server_fp8.sh, execute the request script:

sh openapi.sh

FP8 Benchmark

This section presents benchmark results for the Hunyuan-Large-Instruct FP8 quantized model.

| Dataset | BF16 | W8A8C8-FP8 |
| --- | --- | --- |
| ARC-C | 94.6 | 94.2 |
| C-Eval | 88.6 | 89.2 |
| CMMLU | 90.4 | 89.8 |
| MMLU | 89.9 | 88.9 |
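To summarize the effect of quantization, the sketch below computes the per-dataset score change (FP8 minus BF16) and the mean change from the figures above. The variable names are illustrative.

```python
# (BF16, W8A8C8-FP8) scores copied from the table above.
results = {
    "ARC-C": (94.6, 94.2),
    "C-Eval": (88.6, 89.2),
    "CMMLU": (90.4, 89.8),
    "MMLU": (89.9, 88.9),
}

# Positive values mean the FP8 model scored higher than BF16.
deltas = {name: round(fp8 - bf16, 1) for name, (bf16, fp8) in results.items()}
mean_delta = round(sum(deltas.values()) / len(deltas), 2)
print(deltas, mean_delta)
```

On these four datasets the change stays within about one point in either direction.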

Inference Performance

This section presents the efficiency test results of deploying various models (original and quantized) using vLLM, including inference speed (tokens/s) under different batch sizes.

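| Inference Framework | Model | Number of GPUs (H20) | input_length | batch=1 (tokens/s) | batch=4 (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| vLLM | Hunyuan-Large | 16 | 2048 | 20.2 | 75.5 |
| vLLM | Hunyuan-Large(int8 weight only) | 8 | 2048 | 19.3 | 73.6 |
| vLLM | Hunyuan-Large(W8A8C8-FP8) | 8 | 2048 | 19.8 | 74.9 |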
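Because the deployments use different numbers of GPUs, it can help to normalize the figures. The sketch below derives per-GPU throughput at batch=4 and the speed-up from batch=1 to batch=4, using only the numbers in the table above; the variable names are illustrative.

```python
# (GPUs, batch=1 tokens/s, batch=4 tokens/s) copied from the table above.
runs = {
    "Hunyuan-Large": (16, 20.2, 75.5),
    "Hunyuan-Large(int8 weight only)": (8, 19.3, 73.6),
    "Hunyuan-Large(W8A8C8-FP8)": (8, 19.8, 74.9),
}

summary = {}
for model, (gpus, tps_b1, tps_b4) in runs.items():
    per_gpu_b4 = tps_b4 / gpus       # batch=4 throughput per GPU
    batch_scaling = tps_b4 / tps_b1  # speed-up from batch=1 to batch=4
    summary[model] = (per_gpu_b4, batch_scaling)
    print(f"{model}: {per_gpu_b4:.2f} tokens/s per GPU, {batch_scaling:.2f}x scaling")
```

Under these figures, the 8-GPU quantized deployments deliver roughly twice the per-GPU throughput of the 16-GPU BF16 deployment.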

Tokenizer

The tokenizer used in the Hunyuan-Large model balances compression rate and effectiveness, ensuring that embeddings are sufficiently trained. The vocabulary includes 100K tokens integrated from tiktoken. In addition, we trained an extra 29K Chinese tokens on a large amount of high-quality Chinese training data to enhance the model's Chinese capabilities and the tokenizer's compression rate. Combined, the new tokenizer improves the compression rate over the LLaMA3 tokenizer from 2.78 characters/token to 3.13 characters/token.
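The compression rate here is simply the number of input characters divided by the number of tokens produced. The sketch below shows that definition and what the two quoted rates imply; `chars_per_token` is a hypothetical helper, and the whitespace split is a toy stand-in for a real tokenizer.

```python
from typing import Callable, List


def chars_per_token(text: str, tokenize: Callable[[str], List]) -> float:
    """Compression rate: characters of input per produced token."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text) / len(tokens)


# Figures quoted above.
LLAMA3_RATE, HUNYUAN_RATE = 2.78, 3.13
relative_gain = HUNYUAN_RATE / LLAMA3_RATE - 1    # more characters per token
token_reduction = 1 - LLAMA3_RATE / HUNYUAN_RATE  # fewer tokens for the same text

print(f"{relative_gain:.1%} more characters per token")
print(f"{token_reduction:.1%} fewer tokens for the same text")

# Toy whitespace "tokenizer", used only to exercise the helper.
print(chars_per_token("large language model", str.split))
```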

Hunyuan API

You can experience our Hunyuan-Large model on Tencent Cloud. For details, please visit: https://cloud.tencent.com/document/product/1729/97730.

Interactive Demo Web

The Hunyuan-Large web demo is now open. Visit https://huggingface.co/spaces/tencent/Hunyuan-Large to easily experience our model.

Training/Inference on TI

Tencent Cloud's TI Platform is a comprehensive machine learning platform tailored for AI engineers. With the Hunyuan-Large model already integrated, you can easily train and deploy it in just a few steps. Visit Chat with Hunyuan-Large to experience real-time conversations with the model, and explore Hunyuan-Large Best Practice on TI to create your own customized Hunyuan-Large model.

Citation

If you find our work helpful, please consider citing it.

@misc{sun2024hunyuanlargeopensourcemoemodel,
      title={Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent}, 
      author={Xingwu Sun and Yanfeng Chen and Yiqing Huang and Ruobing Xie and Jiaqi Zhu and Kai Zhang and Shuaipeng Li and Zhen Yang and Jonny Han and Xiaobo Shu and Jiahao Bu and Zhongzhi Chen and Xuemeng Huang and Fengzong Lian and Saiyong Yang and Jianfeng Yan and Yuyuan Zeng and Xiaoqin Ren and Chao Yu and Lulu Wu and Yue Mao and Tao Yang and Suncong Zheng and Kan Wu and Dian Jiao and Jinbao Xue and Xipeng Zhang and Decheng Wu and Kai Liu and Dengpeng Wu and Guanghui Xu and Shaohua Chen and Shuang Chen and Xiao Feng and Yigeng Hong and Junqiang Zheng and Chengcheng Xu and Zongwei Li and Xiong Kuang and Jianglu Hu and Yiqi Chen and Yuchi Deng and Guiyang Li and Ao Liu and Chenchen Zhang and Shihui Hu and Zilong Zhao and Zifan Wu and Yao Ding and Weichao Wang and Han Liu and Roberts Wang and Hao Fei and Peijie She and Ze Zhao and Xun Cao and Hai Wang and Fusheng Xiang and Mengyuan Huang and Zhiyuan Xiong and Bin Hu and Xuebin Hou and Lei Jiang and Jiajia Wu and Yaping Deng and Yi Shen and Qian Wang and Weijie Liu and Jie Liu and Meng Chen and Liang Dong and Weiwen Jia and Hu Chen and Feifei Liu and Rui Yuan and Huilin Xu and Zhenxiang Yan and Tengfei Cao and Zhichao Hu and Xinhua Feng and Dong Du and Tinghao She and Yangyu Tao and Feng Zhang and Jianchen Zhu and Chengzhong Xu and Xirui Li and Chong Zha and Wen Ouyang and Yinben Xia and Xiang Li and Zekun He and Rongpeng Chen and Jiawei Song and Ruibin Chen and Fan Jiang and Chongqing Zhao and Bo Wang and Hao Gong and Rong Gan and Winston Hu and Zhanhui Kang and Yong Yang and Yuhong Liu and Di Wang and Jie Jiang},
      year={2024},
      eprint={2411.02265},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02265}, 
}

Contact Us

If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also contact us via email (hunyuan_opensource@tencent.com).