llama.cpp

Roadmap / Project status / Manifesto / ggml

Inference of Meta's LLaMA model (and others) in pure C/C++

Recent API changes

Hot topics


Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

The llama.cpp project is the main playground for developing new features for the ggml library.

<details> <summary>Models</summary>

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

</details>

<details>
<summary>Bindings</summary>
</details>

<details>
<summary>UIs</summary>

(to have a project listed here, it should clearly state that it depends on llama.cpp)

</details>

<details>
<summary>Tools</summary>
</details>

<details>
<summary>Infrastructure</summary>
</details>

<details>
<summary>Games</summary>
</details>

Supported backends

| Backend | Target devices |
| --- | --- |
| Metal | Apple Silicon |
| BLAS | All |
| BLIS | All |
| SYCL | Intel and Nvidia GPU |
| MUSA | Moore Threads MTT GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| Vulkan | GPU |
| CANN | Ascend NPU |

Building the project

The main product of this project is the llama library. Its C-style interface can be found in include/llama.h. The project also includes many example programs and tools using the llama library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
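One common route is building from source with CMake; a minimal sketch follows (backend-specific CMake options such as `-DGGML_CUDA=ON` are optional and may differ between releases, so check the build documentation for your target):

```bash
# clone the repository and build the library together with the tools and examples
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# optionally enable a GPU backend at configure time, e.g. CUDA:
#   cmake -B build -DGGML_CUDA=ON

# the resulting binaries (llama-cli, llama-server, ...) are placed in build/bin
```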

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

After downloading a model, use the CLI tools to run it locally - see below.
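As a hedged sketch, recent builds can also fetch a GGUF model straight from a Hugging Face repository via the `-hf` flag; the repository name below is only a placeholder:

```bash
# download (and cache) a GGUF model from Hugging Face, then run it
llama-cli -hf <user>/<model>-GGUF

# or point the tools at an already-downloaded local file
llama-cli -m ./models/my-model.gguf
```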

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
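For example, a Hugging Face transformers checkpoint can typically be converted with `convert_hf_to_gguf.py` (a sketch; the script's options may change between versions):

```bash
# install the conversion dependencies, then convert a local HF checkpoint to GGUF
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16
```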

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation.
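As a rough sketch, a converted F16 model can be quantized locally with the `llama-quantize` tool (the available quantization types, such as `Q4_K_M` below, are listed in the tool's help output):

```bash
# quantize an F16 GGUF model to roughly 4.5 bits per weight (Q4_K_M)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```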

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.
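A minimal invocation might look like this (model path, prompt, and parameters are placeholders):

```bash
# run a single prompt, generating up to 128 tokens
llama-cli -m model.gguf -p "Explain GGUF in one paragraph." -n 128

# offload layers to the GPU if a GPU backend was built in
# llama-cli -m model.gguf -p "..." -n 128 -ngl 99
```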

llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
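A hedged sketch of starting the server and querying its OpenAI-compatible chat endpoint:

```bash
# serve a local model on port 8080
llama-server -m model.gguf --port 8080

# in another shell: query the OpenAI-compatible chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```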

llama-perplexity

A tool for measuring the perplexity[^1][^2] (and other quality metrics) of a model over a given text.
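For example (a sketch; `wiki.test.raw` stands for whatever evaluation text you supply):

```bash
# compute perplexity of a model over a raw text file
llama-perplexity -m model.gguf -f wiki.test.raw
```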

llama-bench

Benchmark the performance of the inference for various parameters.
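For instance (all parameters are placeholders; the tool's README documents the full option list):

```bash
# benchmark prompt processing (512 tokens) and text generation (128 tokens)
llama-bench -m model.gguf -p 512 -n 128
```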

llama-run

A comprehensive example for running llama.cpp models. Useful for inferencing. Used with RamaLama[^3].
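A minimal, hedged invocation (llama-run also accepts model references such as hf:// and ollama:// URLs; the exact syntax may vary between versions):

```bash
# run a one-off prompt against a local GGUF file
llama-run file://model.gguf "Tell me a joke."
```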

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.
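Its invocation is intentionally bare-bones, roughly along these lines (a sketch; check the example's source for the exact argument handling):

```bash
# generate a short continuation of the given prompt
llama-simple -m model.gguf "Hello my name is"
```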

Contributing

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

References

Footnotes

[^1]: examples/perplexity/README.md

[^2]: https://huggingface.co/docs/transformers/perplexity

[^3]: RamaLama