
<h1 align="center"> mistral.rs </h1> <h3 align="center"> Blazingly fast LLM inference. </h3> <p align="center"> | <a href="https://ericlbuehler.github.io/mistral.rs/mistralrs/"><b>Rust Documentation</b></a> | <a href="https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-pyo3/API.md"><b>Python Documentation</b></a> | <a href="https://discord.gg/SZrecqK8qw"><b>Discord</b></a> | <a href="https://matrix.to/#/#mistral.rs:matrix.org"><b>Matrix</b></a> | </p>

Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices and quantization methods, with easy-to-use integration via an OpenAI API compatible HTTP server and Python bindings.

Please submit requests for new models here.

Get started fast 🚀

  1. Install

  2. Get models

  3. Deploy with our easy-to-use APIs

Quick examples

After following the installation instructions:

Mistral.rs supports several model categories:

Description

Fast:

Quantization:

Easy:

Powerful:

This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized via ISQ to Q4K.


https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
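As a sketch of the command behind this demo (the --isq Q4K flag placement and the model ID are assumptions based on the Phi 3 examples later on this page):

./mistralrs_server -i --isq Q4K plain -m microsoft/Phi-3-mini-128k-instruct -a phi3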

Support matrix

Note: See supported models for more information.

|Model|Supports quantization|Supports adapters|Supports device mapping|Supported by AnyMoE|
|--|--|--|--|--|
|Mistral v0.1/v0.2/v0.3|✅|✅|✅|✅|
|Gemma|✅|✅|✅|✅|
|Llama 2/3|✅|✅|✅|✅|
|Mixtral|✅|✅|✅| |
|Phi 2|✅|✅|✅|✅|
|Phi 3|✅|✅|✅|✅|
|Phi 3.5 MoE|✅| |✅| |
|Qwen 2|✅| |✅|✅|
|Phi 3 Vision|✅| |✅|✅|
|Idefics 2|✅| |✅|✅|
|Gemma 2|✅|✅|✅|✅|
|Starcoder 2|✅|✅|✅|✅|
|LLaVa Next|✅| |✅|✅|
|LLaVa|✅| |✅|✅|

APIs and Integrations

Rust Crate

Rust multithreaded/async API for easy integration into any application.

Python API

Python API for mistral.rs.

HTTP Server

OpenAI API compatible HTTP server.
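As a rough sketch, assuming the server was started with --port 1234 as in the examples below, a chat completion request can be sent to the OpenAI-style endpoint (the model name here is just a placeholder):

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello!"}]}'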

Llama Index integration (Python)


Supported accelerators

Enabling features is done by passing --features ... to the build system. When using cargo run or maturin develop, pass the --features flag before the -- separating build flags from runtime flags.
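For example, a sketch of running the server directly via cargo run with CUDA enabled; the -p mistralrs-server package selection is an assumption about the workspace layout, and the runtime flags mirror the examples below:

cargo run --release --features cuda -p mistralrs-server -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3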

Benchmarks

|Device|Mistral.rs Completion T/s|Llama.cpp Completion T/s|Model|Quant|
|--|--|--|--|--|
|A10 GPU, CUDA|86|83|mistral-7b|4_K_M|
|Intel Xeon 8358 CPU, AVX|11|23|mistral-7b|4_K_M|
|Raspberry Pi 5 (8GB), Neon|2|3|mistral-7b|2_K|
|A100 GPU, CUDA|131|134|mistral-7b|4_K_M|
|RTX 6000 GPU, CUDA|103|96|mistral-7b|4_K_M|

Note: All CUDA tests for mistral.rs were conducted with PagedAttention enabled, block size = 32.

Please submit more benchmarks by raising an issue!

Installation and Build

Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/

Note: You can use pre-built mistralrs-server binaries here

  1. Install required packages

    • OpenSSL (Example on Ubuntu: sudo apt install libssl-dev)
    • <b>Linux only:</b> pkg-config (Example on Ubuntu: sudo apt install pkg-config)
  2. Install Rust: https://rustup.rs/

    Example on Ubuntu:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source $HOME/.cargo/env
    
  3. <b>Optional:</b> Set the HF token correctly (skip if it is already set, your model is not gated, or you want to use the token_source parameters in Python or on the command line).

    • Note: you can install huggingface-cli as documented here.
    huggingface-cli login
    
  4. Download the code

    git clone https://github.com/EricLBuehler/mistral.rs.git
    cd mistral.rs
    
  5. Build or install

    • Base build command

      cargo build --release
      
    • Build with CUDA support

      cargo build --release --features cuda
      
    • Build with CUDA and Flash Attention V2 support

      cargo build --release --features "cuda flash-attn"
      
    • Build with Metal support

      cargo build --release --features metal
      
    • Build with Accelerate support

      cargo build --release --features accelerate
      
    • Build with MKL support

      cargo build --release --features mkl
      
    • Install with cargo install for easy command line usage

      Pass the same values to --features as you would for cargo build

      cargo install --path mistralrs-server --features cuda
      
  6. The build process will output the binary mistralrs-server at ./target/release/mistralrs-server, which may be copied into the working directory with the following command:

    Example on Ubuntu:

    cp ./target/release/mistralrs-server ./mistralrs_server
    
  7. Use our APIs and integrations

    APIs and integrations list

Getting models

There are 2 ways to run a model with mistral.rs:

Getting models from Hugging Face Hub

Mistral.rs can automatically download models from the HF Hub. To access gated models, you should provide a token source. It may be one of:

This is passed in the following ways:

./mistralrs_server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Here is an example of setting the token source.

If the token cannot be loaded, no token will be used (i.e., effectively using none).

Loading models from local files:

You can also instruct mistral.rs to load models fully locally by modifying the *_model_id arguments or options:

./mistralrs_server --port 1234 plain -m . -a mistral

Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:

Running GGUF models locally

To run GGUF models fully locally, the only mandatory arguments are the quantized model ID and the quantized filename.
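For instance, a minimal sketch of a fully local run, reusing the quantized filename from the example below; interactive mode via -i is used here just for illustration, and the . model ID points at the current directory:

./mistralrs_server -i gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf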

Chat template

The chat template can be automatically detected and loaded from the GGUF file if no other chat template source is specified, including the tokenizer model ID.

To provide your own chat template, you do not need to specify the tokenizer model ID argument; instead, pass a path to a chat template JSON file (examples here; you will need to create your own by specifying the chat template and bos/eos tokens) as well as a local model ID. For example:

./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf

If you do not specify a chat template, then the --tok-model-id/-t tokenizer model ID argument is expected, and that model ID should provide the tokenizer_config.json file. If that model ID contains a tokenizer.json, it will be used over the GGUF tokenizer.
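For example, a sketch that supplies a tokenizer model ID for the same GGUF file; assuming the original Hugging Face repository is used as the tokenizer source:

./mistralrs_server gguf -t microsoft/Phi-3-mini-128k-instruct -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf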

Tokenizer

The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.

Supported GGUF tokenizer types

Run with the CLI

Mistral.rs uses subcommands to control the model type. They are generally of the format <XLORA/LORA>-<QUANTIZATION>. Please run ./mistralrs_server --help to see the subcommands.

Additionally, for models without quantization, the model architecture should be provided as the --arch or -a argument, in contrast to GGUF models, which encode the architecture in the file.

Architecture for plain models

Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device. This is specified in the --dtype/-d parameter after the model architecture (plain).
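For example, a sketch forcing bf16 for the plain Phi 3 model used elsewhere on this page; placing --dtype after the plain subcommand is an assumption based on the note above:

./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3 --dtype bf16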

If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.

Architecture for vision models

Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device. This is specified in the --dtype/-d parameter after the model architecture (vision-plain).

Supported GGUF architectures

Plain:

With adapters:

Interactive mode:

You can launch interactive mode, a simple chat application running in the terminal, by passing -i:

./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Interactive mode for vision models:

You can launch interactive mode for vision models, a simple chat application running in the terminal, by passing --vi:

./mistralrs_server --vi vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v

More quick examples:

To start an X-LoRA server configured exactly as presented in the paper:

./mistralrs_server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora

To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):

./mistralrs_server --port 1234 lora-gguf -o orderings/xlora-paper-ordering.json -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf -a lamm-mit/x-lora

Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.

To start a server running Mistral from GGUF:

./mistralrs_server --port 1234 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

To start a server running Llama from GGML:

./mistralrs_server --port 1234 ggml -t meta-llama/Llama-2-13b-chat-hf -m TheBloke/Llama-2-13B-chat-GGML -f llama-2-13b-chat.ggmlv3.q4_K_M.bin

To start a server running Mistral from safetensors:

./mistralrs_server --port 1234 plain -m mistralai/Mistral-7B-Instruct-v0.1 -a mistral

Structured selection with a .toml file

We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.

Example:

./mistralrs_server --port 1234 toml -f toml-selectors/gguf.toml

Supported models

Quantization support

|Model|GGUF|GGML|ISQ|
|--|--|--|--|
|Mistral|✅| |✅|
|Gemma| | |✅|
|Llama|✅|✅|✅|
|Mixtral|✅| |✅|
|Phi 2|✅| |✅|
|Phi 3|✅| |✅|
|Phi 3.5 MoE| | |✅|
|Qwen 2| | |✅|
|Phi 3 Vision| | |✅|
|Idefics 2| | |✅|
|Gemma 2| | |✅|
|Starcoder 2|✅| |✅|
|LLaVa Next| | |✅|
|LLaVa| | |✅|

Device mapping support

|Model category|Supported|
|--|--|
|Plain|✅|
|GGUF|✅|
|GGML| |
|Vision Plain|✅|

X-LoRA and LoRA support

|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
|--|--|--|--|
|Mistral|✅|✅| |
|Gemma|✅| | |
|Llama|✅|✅|✅|
|Mixtral|✅|✅| |
|Phi 2|✅| | |
|Phi 3|✅|✅| |
|Phi 3.5 MoE| | | |
|Qwen 2| | | |
|Phi 3 Vision| | | |
|Idefics 2| | | |
|Gemma 2|✅| | |
|Starcoder 2|✅| | |
|LLaVa Next| | | |
|LLaVa| | | |

AnyMoE support

|Model|AnyMoE|
|--|--|
|Mistral 7B|✅|
|Gemma|✅|
|Llama|✅|
|Mixtral| |
|Phi 2|✅|
|Phi 3|✅|
|Phi 3.5 MoE| |
|Qwen 2|✅|
|Phi 3 Vision| |
|Idefics 2| |
|Gemma 2|✅|
|Starcoder 2|✅|
|LLaVa Next|✅|
|LLaVa|✅|

Using derivative models

To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass --help after the subcommand. For example, when using a model other than the default, specify the following for each type of model:

See this section to determine whether it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.

It is also important to check the chat template style of the model. If the HF Hub repo has a tokenizer_config.json file, it is not necessary to specify a chat template. Otherwise, templates can be found in chat_templates and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.

For example, when using a Zephyr model:

./mistralrs_server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf

Adapter model support: X-LoRA and LoRA

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here

Chat Templates and Tokenizer

Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.

Contributing

Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.

FAQ

Credits

This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.