Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance, universal deployment solution that allows any large language model to be deployed natively, with native APIs and compiler acceleration. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's devices using ML compilation techniques.

Universal deployment. MLC LLM supports the following platforms and hardware:

<table style="width:100%"> <thead> <tr> <th style="width:15%"> </th> <th style="width:20%">AMD GPU</th> <th style="width:20%">NVIDIA GPU</th> <th style="width:20%">Apple GPU</th> <th style="width:24%">Intel GPU</th> </tr> </thead> <tbody> <tr> <td>Linux / Win</td> <td>✅ Vulkan, ROCm</td> <td>✅ Vulkan, CUDA</td> <td>N/A</td> <td>✅ Vulkan</td> </tr> <tr> <td>macOS</td> <td>✅ Metal (dGPU)</td> <td>N/A</td> <td>✅ Metal</td> <td>✅ Metal (iGPU)</td> </tr> <tr> <td>Web Browser</td> <td colspan=4>✅ WebGPU and WASM </td> </tr> <tr> <td>iOS / iPadOS</td> <td colspan=4>✅ Metal on Apple A-series GPU</td> </tr> <tr> <td>Android</td> <td colspan=2>✅ OpenCL on Adreno GPU</td> <td colspan=2>✅ OpenCL on Mali GPU</td> </tr> </tbody> </table>

Scalable. MLC LLM scales across NVIDIA and AMD GPUs, on both cloud and gaming hardware. The figures below show single-batch decoding performance with prefilling = 1 and decoding = 256.

Performance of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 and two AMD Radeon 7900 XTX:

<p float="left"> <img src="site/img/multi-gpu/figure-1.svg" width="40%"/> <img src="site/img/multi-gpu/figure-3.svg" width="30%"/> </p>

Scaling of fp16 and 4-bit CodeLlama-34B and Llama2-70B on A100-80G-PCIe and A10G-24G-PCIe, up to 8 GPUs:

<p float="center"> <img src="site/img/multi-gpu/figure-2.svg" width="100%"/> </p>

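For readers who want to reproduce a comparable number, the single-batch setup described above (prefilling = 1, decoding = 256) can be sketched as a small timing loop. This is a minimal illustration, not MLC LLM's benchmarking code: `decode_throughput` and the `step` callable are hypothetical names, and `step` stands in for whatever function produces the next token on your runtime. The sketch also assumes that the prefill call is excluded from the timed region.

```python
import time
from typing import Callable, List


def decode_throughput(
    step: Callable[[List[int]], int],
    prompt: List[int],
    decode_len: int = 256,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decoded tokens per second for a single sequence (batch size 1).

    `step` is a hypothetical stand-in for a model call: it receives all
    tokens so far and returns the next token id.
    """
    tokens = list(prompt)

    # Prefill: the first call consumes the prompt. It is excluded from the
    # timing here (an assumption of this sketch).
    tokens.append(step(tokens))

    # Decode: time exactly `decode_len` single-token steps.
    start = clock()
    for _ in range(decode_len):
        tokens.append(step(tokens))
    elapsed = clock() - start

    return decode_len / elapsed
```

With a one-token prompt and the default `decode_len`, calling `decode_throughput(step, [bos_id])` matches the prefilling = 1, decoding = 256 configuration, where `bos_id` is whatever start token your tokenizer uses. The `clock` parameter exists only so the timer can be swapped out, for example in tests.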

Getting Started

Please visit our documentation for detailed instructions.

Model Support

MLC LLM supports a wide range of model architectures and variants. The prebuilt models listed below can be used off the shelf. Visit Prebuilt Models to see the full list, and Compile Models via MLC to see how to use models not on this list.

<table style="width:100%"> <thead> <tr> <th style="width:40%">Architecture</th> <th style="width:60%">Prebuilt Model Variants</th> </tr> </thead> <tbody> <tr> <td>Llama</td> <td>Llama-2, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored</td> </tr> <tr> <td>GPT-NeoX</td> <td>RedPajama</td> </tr> <tr> <td>GPT-J</td> <td></td> </tr> <tr> <td>RWKV</td> <td>RWKV-raven</td> </tr> <tr> <td>MiniGPT</td> <td></td> </tr> <tr> <td>GPTBigCode</td> <td>WizardCoder</td> </tr> <tr> <td>ChatGLM</td> <td></td> </tr> <tr> <td>StableLM</td> <td></td> </tr> <tr> <td>Mistral</td> <td></td> </tr> <tr> <td>Phi</td> <td></td> </tr> </tbody> </table>

Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include


The underlying techniques of MLC LLM include:

