bitnet.cpp

License: MIT

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater performance gains. It also reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.

<img src="./assets/m2_performance.jpg" alt="m2_performance" width="800"/> <img src="./assets/intel_performance.jpg" alt="m2_performance" width="800"/>

The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp.

Demo

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1

What's New:

Acknowledgements

This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.

Supported Models

❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. These models are neither trained nor released by Microsoft. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs at large scale, in terms of both model size and training tokens.

<table>
  <tr>
    <th rowspan="2">Model</th>
    <th rowspan="2">Parameters</th>
    <th rowspan="2">CPU</th>
    <th colspan="3">Kernel</th>
  </tr>
  <tr>
    <th>I2_S</th>
    <th>TL1</th>
    <th>TL2</th>
  </tr>
  <tr>
    <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-large">bitnet_b1_58-large</a></td>
    <td rowspan="2">0.7B</td>
    <td>x86</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
    <td>&#9989;</td>
  </tr>
  <tr>
    <td>ARM</td>
    <td>&#9989;</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
  </tr>
  <tr>
    <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-3B">bitnet_b1_58-3B</a></td>
    <td rowspan="2">3.3B</td>
    <td>x86</td>
    <td>&#10060;</td>
    <td>&#10060;</td>
    <td>&#9989;</td>
  </tr>
  <tr>
    <td>ARM</td>
    <td>&#10060;</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
  </tr>
  <tr>
    <td rowspan="2"><a href="https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens">Llama3-8B-1.58-100B-tokens</a></td>
    <td rowspan="2">8.0B</td>
    <td>x86</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
    <td>&#9989;</td>
  </tr>
  <tr>
    <td>ARM</td>
    <td>&#9989;</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
  </tr>
  <tr>
    <td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026">Falcon3 Family</a></td>
    <td rowspan="2">1B-10B</td>
    <td>x86</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
    <td>&#9989;</td>
  </tr>
  <tr>
    <td>ARM</td>
    <td>&#9989;</td>
    <td>&#9989;</td>
    <td>&#10060;</td>
  </tr>
</table>
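
As a quick illustration of how the table maps to quantization choices, the sketch below uses the setup_env.py script described in the Installation section further down; the repository names and -q values are taken from that script's usage text, and the kernel availability follows the table (e.g., TL1 is listed as ARM-only):

# Illustrative: I2_S is available for bitnet_b1_58-large on both x86 and ARM (see table above)
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large -q i2_s

# Illustrative: on ARM, bitnet_b1_58-3B is supported via the TL1 kernel
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-3B -q tl1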

Installation

Requirements

Build from source

[!IMPORTANT] If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 when running the following commands

  1. Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
  2. Install the dependencies
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt
  3. Build the project
# Download the model from Hugging Face, convert it to quantized gguf format, and build the project
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s

# Or you can manually download the model and run with local path
huggingface-cli download tiiuae/Falcon3-7B-Instruct-1.58bit --local-dir models/Falcon3-7B-Instruct-1.58bit
python setup_env.py -md models/Falcon3-7B-Instruct-1.58bit -q i2_s
<pre>
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}]
                    [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd] [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
</pre>
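
For example, the options shown in the usage above can be combined. The sketch below assumes a model already downloaded to a local directory and uses the pretuned kernel parameters; the directory names are illustrative:

# Illustrative: build from a local model directory with pretuned kernel parameters and a custom log directory
python setup_env.py -md models/Falcon3-7B-Instruct-1.58bit -q i2_s --use-pretuned --log-dir logs/setup-falcon3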

Usage

Basic usage

# Run inference with the quantized model
python run_inference.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
<pre>
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
</pre>
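
As a non-conversational variant, the generation length, thread count, context size, and temperature documented above can be set explicitly; the prompt and values below are only illustrative:

# Illustrative: single-prompt run with explicit generation settings
python run_inference.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "Explain what a 1.58-bit LLM is." -n 128 -t 8 -c 2048 -temp 0.7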

Benchmark

We provide a script to run the inference benchmark for a given model.

<pre>
usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help
                        Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.
</pre>

Here's a brief explanation of each argument:

- -m, --model: The path to the model file. This is a required argument.
- -n, --n-token: The number of tokens to generate during the benchmark. Optional.
- -p, --n-prompt: The number of prompt tokens to use for generating text. Optional.
- -t, --threads: The number of threads to use. Optional.

For example:

python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4  

This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256-token prompt, using 4 threads.
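
The same benchmark can also be pointed at the quantized model produced in the Installation section above; the token counts and thread count below are illustrative:

# Illustrative: benchmark the Falcon3 model quantized earlier, using 8 threads
python utils/e2e_benchmark.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -n 128 -p 512 -t 8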

For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run the benchmark with the generated model; use -m to specify the model path, -p to specify the number of prompt tokens, and -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128