
TinyChatEngine: On-Device LLM/VLM Inference Library

Running large language models (LLMs) and visual language models (VLMs) on the edge is useful for copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, since the data stays local.

This is enabled by LLM model compression techniques, SmoothQuant and AWQ (Activation-aware Weight Quantization), co-designed with TinyChatEngine, which implements the compressed low-precision models.

Feel free to check out our slides for more details!

Code LLaMA Demo on NVIDIA GeForce RTX 4070 laptop:

coding_demo_gpu

VILA Demo on Apple MacBook M1 Pro:

vlm_demo_m1

LLaMA Chat Demo on Apple MacBook M1 Pro:

chat_demo_m1

Overview

LLM Compression: SmoothQuant and AWQ

SmoothQuant: Smooth the activation outliers by migrating the quantization difficulty from activations to weights, using a mathematically equivalent transformation (e.g., 100 × 1 = 10 × 10).

smoothquant_intuition
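A minimal numpy sketch of the smoothing idea (the toy shapes, the outlier channel, and the migration strength alpha = 0.5 are illustrative choices; SmoothQuant picks the scales from calibration statistics):

```python
import numpy as np

# Toy shapes: X holds activations (tokens x channels), W a weight (channels x out).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 1, 100, 1, 1, 1, 1, 1])  # one outlier channel
W = rng.normal(size=(8, 16))

# Per-channel smoothing scales with migration strength alpha.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Mathematically equivalent factorization: (X / s) @ (s * W) == X @ W,
# but X / s no longer has the large outlier channel, so it quantizes better.
X_smooth = X / s
W_smooth = W * s[:, None]
assert np.allclose(X_smooth @ W_smooth, X @ W)
```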

AWQ (Activation-aware Weight Quantization): Protect salient weight channels, selected by activation magnitude rather than weight magnitude.
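A minimal numpy sketch of the activation-aware idea (the salient-channel count and the fixed scale of 2 are illustrative assumptions; AWQ searches for the scales that minimize quantization error):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))              # calibration activations (tokens x channels)
W = rng.normal(size=(8, 16))              # weight matrix to be quantized to int4

# Rank weight input channels by average activation magnitude, not weight magnitude.
act_magnitude = np.abs(X).mean(axis=0)
salient = np.argsort(act_magnitude)[-1:]  # AWQ protects only a small fraction of channels

# Protect salient channels by scaling them up before quantization and folding the
# inverse scale into the activations -- the product is unchanged.
scale = np.ones(8)
scale[salient] = 2.0
W_scaled = W * scale[:, None]             # this matrix is what gets int4-quantized
X_scaled = X / scale
assert np.allclose(X_scaled @ W_scaled, X @ W)
```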

LLM Inference Engine: TinyChatEngine

overview


Prerequisites

macOS

For macOS, install boost and llvm via Homebrew:

```bash
brew install boost
brew install llvm
```
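Homebrew installs llvm as keg-only, so it may not be on your PATH by default. If the build cannot find LLVM's clang/clang++, one common fix (an assumption about your setup, not a required step) is:

```bash
# Put Homebrew's LLVM toolchain on the PATH for the current shell session.
export PATH="$(brew --prefix llvm)/bin:$PATH"
```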

For M1/M2 users, install Xcode from the App Store to enable the Metal compiler for GPU support.

Windows with CPU

For Windows, download and install the GCC compiler with MSYS2. Follow this tutorial for installation: https://code.visualstudio.com/docs/cpp/config-mingw. Then, in the MSYS2 shell, install the required packages:

```bash
pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git
```

Windows with Nvidia GPU (Experimental)

Step-by-step to Deploy Llama-3-8B-Instruct with TinyChatEngine

Here, we provide step-by-step instructions to deploy Llama-3-8B-Instruct with TinyChatEngine from scratch.


Deploy vision language model (VLM) chatbot with TinyChatEngine


TinyChatEngine supports not only LLMs but also VLMs. We introduce a sophisticated chatbot for VLMs. Here, we provide easy-to-follow instructions to deploy a vision language model chatbot (VILA-7B) with TinyChatEngine; a rough end-to-end sketch follows. We recommend using M1/M2 MacBooks for this VLM feature.
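A hypothetical end-to-end sketch, reusing the model ID from the model zoo below and the same download helper and chat binary as the LLaMA flow (the helper path, model-name argument, and chat arguments here are assumptions; the actual VILA deployment steps may differ):

```bash
# Download the int4 VILA-7B checkpoint (QM_ARM layout for M1/M2 Macs);
# the script path and arguments below are assumptions for illustration.
python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_ARM

# Build and start the chat program with the VLM model.
make chat -j
./chat VILA_7B INT4 8
```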

<!-- - (Optional) To enable the speech-to-speech chatbot for VLM, please follow the [instruction above](#deploy-speech-to-speech-chatbot-with-tinychatengine-demo) to run the shell script to set up the environment. ```bash cd llm ./voicechat_setup.sh ``` -->

Backend Support

| Precision | x86<br />(Intel/AMD CPU) | ARM<br />(Apple M1/M2 & RPi) | Nvidia GPU |
| --- | --- | --- | --- |
| FP32 | ✅ | ✅ | |
| W4A16 | | | ✅ |
| W4A32 | ✅ | ✅ | |
| W4A8 | ✅ | ✅ | |
| W8A8 | ✅ | ✅ | |

Quantization and Model Support

The goal of TinyChatEngine is to support various quantization methods on various devices. At present, it supports quantized weights for int8 OPT models that originate from SmoothQuant, using the provided conversion script opt_smooth_exporter.py. For LLaMA models, scripts are available for converting Hugging Face checkpoints to our int4 weight format and for quantizing them with the method that matches your device. Before converting and quantizing your models, we recommend applying the fake quantization from AWQ to achieve better accuracy. We are currently working on supporting more models; please stay tuned!
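A hypothetical invocation of the exporter named above (the flags and the model/output paths here are assumptions for illustration; check the script's --help for its actual interface):

```bash
# Convert a SmoothQuant int8 OPT checkpoint into TinyChatEngine's weight format
# (hypothetical arguments).
python opt_smooth_exporter.py --model facebook/opt-1.3b --output models/OPT_1.3B
```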

Device-specific int4 Weight Reordering

To mitigate the runtime overheads associated with weight reordering, TinyChatEngine conducts this process offline during model conversion. In this section, we explore the weight layouts of QM_ARM and QM_x86. These layouts are tailored for ARM and x86 CPUs, supporting 128-bit SIMD and 256-bit SIMD operations, respectively; an illustrative packing sketch follows the table below. We also support QM_CUDA for Nvidia GPUs, including server and edge GPUs.

| Platforms | ISA | Quantization methods |
| --- | --- | --- |
| Intel & AMD | x86-64 | QM_x86 |
| Apple M1/M2 Mac & Raspberry Pi | ARM | QM_ARM |
| Nvidia GPU | CUDA | QM_CUDA |
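An illustrative Python sketch of the offline packing idea (in the spirit of these layouts; the exact TinyChatEngine packing may differ): weights are paired so that one SIMD load plus a mask and a shift yields two decoded vectors, and this reordering cost is paid once at conversion time.

```python
import numpy as np

def pack_int4(w, lanes):
    """Pack 2*lanes int4 values (0..15) into `lanes` bytes.

    Weight i shares a byte with weight i + lanes, so one SIMD load of
    `lanes` bytes decodes with a mask (low nibbles) and a shift (high
    nibbles) -- no shuffles are needed at runtime.
    """
    assert len(w) == 2 * lanes and w.max() < 16
    return (w[:lanes] | (w[lanes:] << 4)).astype(np.uint8)

w32 = np.arange(32, dtype=np.uint8) % 16
w64 = np.arange(64, dtype=np.uint8) % 16
packed_arm_block = pack_int4(w32, 16)   # 16 bytes: one 128-bit NEON register
packed_x86_block = pack_int4(w64, 32)   # 32 bytes: one 256-bit AVX2 register

# Runtime decode (scalar stand-in for the SIMD mask/shift):
low = packed_arm_block & 0x0F           # recovers w32[:16]
high = packed_arm_block >> 4            # recovers w32[16:]
assert np.array_equal(low, w32[:16]) and np.array_equal(high, w32[16:])
```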

TinyChatEngine Model Zoo

We offer a selection of models that have been tested with TinyChatEngine. These models can be readily downloaded and deployed on your device. To download a model, locate the target model's ID in the table below and use the associated script. Check out our model zoo here.

| Models | Precision | ID | x86 backend | ARM backend | CUDA backend |
| --- | --- | --- | --- | --- | --- |
| LLaMA_3_8B_Instruct | fp32 | LLaMA_3_8B_Instruct_fp32 | ✅ | ✅ | |
| | int4 | LLaMA_3_8B_Instruct_awq_int4 | ✅ | ✅ | |
| LLaMA2_13B_chat | fp32 | LLaMA2_13B_chat_fp32 | ✅ | ✅ | |
| | int4 | LLaMA2_13B_chat_awq_int4 | ✅ | ✅ | ✅ |
| LLaMA2_7B_chat | fp32 | LLaMA2_7B_chat_fp32 | ✅ | ✅ | |
| | int4 | LLaMA2_7B_chat_awq_int4 | ✅ | ✅ | ✅ |
| LLaMA_7B | fp32 | LLaMA_7B_fp32 | ✅ | ✅ | |
| | int4 | LLaMA_7B_awq_int4 | ✅ | ✅ | ✅ |
| CodeLLaMA_13B_Instruct | fp32 | CodeLLaMA_13B_Instruct_fp32 | ✅ | ✅ | |
| | int4 | CodeLLaMA_13B_Instruct_awq_int4 | ✅ | ✅ | ✅ |
| CodeLLaMA_7B_Instruct | fp32 | CodeLLaMA_7B_Instruct_fp32 | ✅ | ✅ | |
| | int4 | CodeLLaMA_7B_Instruct_awq_int4 | ✅ | ✅ | ✅ |
| Mistral-7B-Instruct-v0.2 | fp32 | Mistral_7B_v0.2_Instruct_fp32 | ✅ | ✅ | |
| | int4 | Mistral_7B_v0.2_Instruct_awq_int4 | ✅ | ✅ | |
| VILA-7B | fp32 | VILA_7B_CLIP_ViT-L_fp32 | ✅ | ✅ | |
| | int4 | VILA_7B_awq_int4_CLIP_ViT-L | ✅ | ✅ | |
| LLaVA-v1.5-13B | fp32 | LLaVA_13B_CLIP_ViT-L_fp32 | ✅ | ✅ | |
| | int4 | LLaVA_13B_awq_int4_CLIP_ViT-L | ✅ | ✅ | |
| LLaVA-v1.5-7B | fp32 | LLaVA_7B_CLIP_ViT-L_fp32 | ✅ | ✅ | |
| | int4 | LLaVA_7B_awq_int4_CLIP_ViT-L | ✅ | ✅ | |
| StarCoder | fp32 | StarCoder_15.5B_fp32 | ✅ | ✅ | |
| | int4 | StarCoder_15.5B_awq_int4 | ✅ | ✅ | |
| opt-6.7B | fp32 | opt_6.7B_fp32 | ✅ | ✅ | |
| | int8 | opt_6.7B_smooth_int8 | ✅ | ✅ | |
| | int4 | opt_6.7B_awq_int4 | ✅ | ✅ | |
| opt-1.3B | fp32 | opt_1.3B_fp32 | ✅ | ✅ | |
| | int8 | opt_1.3B_smooth_int8 | ✅ | ✅ | |
| | int4 | opt_1.3B_awq_int4 | ✅ | ✅ | |
| opt-125m | fp32 | opt_125m_fp32 | ✅ | ✅ | |
| | int8 | opt_125m_smooth_int8 | ✅ | ✅ | |
| | int4 | opt_125m_awq_int4 | ✅ | ✅ | |

For instance, to download the quantized LLaMA-2-7B-chat model (for int4 models, use --QM to choose the quantized layout for your device):
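A sketch of the command, assuming the repository ships a tools/download_model.py helper (the script path is an assumption; check the repository for the exact location):

```bash
# Download the AWQ int4 LLaMA-2-7B-chat checkpoint in the QM_ARM layout
# (use QM_x86 or QM_CUDA for other devices).
python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
```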

To deploy a quantized model with TinyChatEngine, compile and run the chat program.

```bash
make chat -j
# Usage: ./chat <model_name> <precision> <num_threads>
./chat LLaMA2_7B_chat INT4 8

# The thread count can also be omitted:
# Usage: ./chat <model_name> <precision>
./chat LLaMA2_7B_chat INT4
```

Related Projects

TinyEngine: Memory-efficient and High-performance Neural Network Library for Microcontrollers

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Acknowledgement

llama.cpp

whisper.cpp

transformers