
LLM-TPU

<p align="center"> <a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a> <a href=""><img src="https://img.shields.io/badge/os-x86%2C%20aarch-pink.svg"></a> <a href="https://github.com/sophgo/LLM-TPU/graphs/contributors"><img src="https://img.shields.io/github/contributors/sophgo/LLM-TPU?color=9ea"></a> <a href="https://github.com/sophgo/LLM-TPU/issues"><img src="https://img.shields.io/github/issues/sophgo/LLM-TPU?color=9cc"></a> <a href="https://github.com/sophgo/LLM-TPU/commits"><img src="https://img.shields.io/github/commit-activity/y/sophgo/LLM-TPU?color=3af"></a> </p> <p align="center"> <a href="https://github.com/sophgo/LLM-TPU/forks"><img src="https://img.shields.io/github/forks/sophgo/LLM-TPU?color=9cc"></a> <a href="https://github.com/sophgo/LLM-TPU/stargazers"><img src="https://img.shields.io/github/stars/sophgo/LLM-TPU?color=9cc"></a> </p>


Introduction

This project deploys a variety of open-source generative AI models, mainly LLMs, on the Sophgo BM1684X chip. Models are converted to bmodel format with the TPU-MLIR compiler and then deployed with C++ code in either a PCIe or an SoC environment. A walkthrough using ChatGLM2-6B as an example is available on Zhihu to help readers understand the source code: ChatGLM2流程解析与TPU-MLIR部署 (ChatGLM2 pipeline analysis and TPU-MLIR deployment).
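The compile-and-deploy flow described above can be sketched with TPU-MLIR's command-line tools. This is only an illustrative sketch: the file names (`block_0.onnx` etc.) and the F16 quantization choice are placeholders, and the real per-model compile scripts live under the `models` subdirectory of this repo.

```shell
# Illustrative sketch of the TPU-MLIR flow for one exported LLM block.
# File names are placeholders; see the scripts under models/ for the
# real per-model commands and options.
if command -v model_deploy.py >/dev/null 2>&1; then
    # 1. Translate the exported ONNX graph into TPU-MLIR's MLIR form.
    model_transform.py \
        --model_name block_0 \
        --model_def block_0.onnx \
        --mlir block_0.mlir
    # 2. Lower the MLIR to a bmodel for the BM1684X, quantized to F16.
    model_deploy.py \
        --mlir block_0.mlir \
        --quantize F16 \
        --chip bm1684x \
        --model block_0.bmodel
    status="compiled"
else
    # TPU-MLIR is not installed here; the commands above are for reference.
    status="toolchain not found"
fi
echo "$status"
```

The resulting bmodel files are what the C++ demos load at run time on the PCIe card or SoC board.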

Models

The models deployed so far are listed below (in alphabetical order):

| Model | INT4 | INT8 | FP16/BF16 | Huggingface Link |
|:-|:-:|:-:|:-:|:-:|
| Baichuan2-7B | :white_check_mark: | | | LINK |
| ChatGLM3-6B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| CodeFuse-7B | :white_check_mark: | :white_check_mark: | | LINK |
| DeepSeek-6.7B | :white_check_mark: | :white_check_mark: | | LINK |
| Falcon-40B | :white_check_mark: | :white_check_mark: | | LINK |
| Phi-3-mini-4k | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen-7B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen-14B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen-72B | :white_check_mark: | | | LINK |
| Qwen1.5-0.5B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen1.5-1.8B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen1.5-7B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen2-7B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Qwen2.5-7B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Llama2-7B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Llama2-13B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Llama3-8B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Llama3.1-8B | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| LWM-Text-Chat | :white_check_mark: | :white_check_mark: | :white_check_mark: | LINK |
| Mistral-7B-Instruct | :white_check_mark: | :white_check_mark: | | LINK |
| Stable Diffusion | :white_check_mark: | | | LINK |
| Stable Diffusion XL | :white_check_mark: | | | LINK |
| WizardCoder-15B | :white_check_mark: | | | LINK |
| Yi-6B-chat | :white_check_mark: | :white_check_mark: | | LINK |
| Yi-34B-chat | :white_check_mark: | :white_check_mark: | | LINK |
| Qwen-VL-Chat | :white_check_mark: | :white_check_mark: | | LINK |
| InternVL2-4B | :white_check_mark: | :white_check_mark: | | LINK |
| InternVL2-2B | :white_check_mark: | :white_check_mark: | | LINK |
| MiniCPM-V-2_6 | :white_check_mark: | :white_check_mark: | | LINK |

For conversion details and source code, see the per-model deployment notes in the models subdirectory of this project.

If you are interested in our chips, you can also contact us through the official SOPHGO website.

Quick Start

Clone the LLM-TPU project and execute the run.sh script:

```shell
git clone https://github.com/sophgo/LLM-TPU.git
./run.sh --model llama2-7b
```

For details, see the Quick Start guide.

Demo

A successful run looks like the image below.

Command Table

The full set of commands for the models currently used in the demos is shown in the table below:

| Model | SoC | PCIE |
|:-|:-|:-|
| ChatGLM3-6B | `./run.sh --model chatglm3-6b --arch soc` | `./run.sh --model chatglm3-6b --arch pcie` |
| Llama2-7B | `./run.sh --model llama2-7b --arch soc` | `./run.sh --model llama2-7b --arch pcie` |
| Llama3-7B | `./run.sh --model llama3-7b --arch soc` | `./run.sh --model llama3-7b --arch pcie` |
| Qwen-7B | `./run.sh --model qwen-7b --arch soc` | `./run.sh --model qwen-7b --arch pcie` |
| Qwen1.5-1.8B | `./run.sh --model qwen1.5-1.8b --arch soc` | `./run.sh --model qwen1.5-1.8b --arch pcie` |
| Qwen2.5-7B | \ | `./run.sh --model qwen2.5-7b --arch pcie` |
| LWM-Text-Chat | `./run.sh --model lwm-text-chat --arch soc` | `./run.sh --model lwm-text-chat --arch pcie` |
| WizardCoder-15B | `./run.sh --model wizardcoder-15b --arch soc` | `./run.sh --model wizardcoder-15b --arch pcie` |
| InternVL2-4B | `./run.sh --model internvl2-4b --arch soc` | `./run.sh --model internvl2-4b --arch pcie` |
| MiniCPM-V-2_6 | `./run.sh --model minicpmv2_6 --arch soc` | `./run.sh --model minicpmv2_6 --arch pcie` |

Advanced Features

The advanced features are summarized below:

| Feature | Directory | Description |
|:-|:-|:-|
| Multi-chip | ChatGLM3/parallel_demo | ChatGLM3 on 2 chips |
| | Llama2/demo_parallel | Llama2 on 4/6/8 chips |
| | Qwen/demo_parallel | Qwen on 4/6/8 chips |
| | Qwen1_5/demo_parallel | Qwen1_5 on 4/6/8 chips |
| Speculative sampling | Qwen/jacobi_demo | LookaheadDecoding |
| | Qwen1_5/speculative_sample_demo | Speculative sampling |
| Prefill reuse | Qwen/prompt_cache_demo | Prefill reuse for shared prompt prefixes |
| | Qwen/share_cache_demo | Prefill reuse for shared prompt prefixes |
| | Qwen1_5/share_cache_demo | Prefill reuse for shared prompt prefixes |
| Model encryption | Qwen/share_cache_demo | Model encryption |
| | Qwen1_5/share_cache_demo | Model encryption |

FAQ

Please refer to the LLM-TPU FAQ.

Resource Links