cortex.llamacpp

cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.

It is a dynamic library that can be loaded by any server at runtime.

Repo Structure

.
├── base -> Engine interface
├── examples -> Server example to integrate engine
├── llama.cpp -> Upstream llama C++
├── src -> Engine implementation
└── third-party -> Dependencies of the cortex.llamacpp project

Build from source

This guide provides step-by-step instructions for building cortex.llamacpp from source on Linux, macOS, and Windows systems.

Clone the Repository

First, you need to clone the cortex.llamacpp repository:

git clone --recurse-submodules https://github.com/janhq/cortex.llamacpp.git

If you don't have git, you can download the source code as a file archive from the cortex.llamacpp GitHub page.
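A clone made without the recursive flag may leave the llama.cpp directory empty, assuming (as the recursive clone suggests) that it is tracked as a git submodule. The following is a minimal sketch that checks for this and fetches the submodules if needed. `dir_is_empty` and `ensure_submodules` are hypothetical helper names, and the default directory is assumed to be the one `git clone` creates.

```shell
# Succeed when the given directory is missing or contains no entries.
dir_is_empty() {
  [ ! -d "$1" ] || [ -z "$(ls -A "$1")" ]
}

# Fetch submodules only if llama.cpp was not populated by the clone.
# Assumption: llama.cpp is a git submodule of the repository.
ensure_submodules() {
  repo_dir="${1:-cortex.llamacpp}"
  if dir_is_empty "$repo_dir/llama.cpp"; then
    git -C "$repo_dir" submodule update --init --recursive
  fi
}

# Usage, from the directory that contains the clone:
# ensure_submodules cortex.llamacpp
```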

Build library with server example

Quickstart

Step 1: Downloading a Model

mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"

Step 2: Start server

Step 3: Load model

curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "ctx_len": 512,
    "ngl": 100,
    "model_type": "llm"
  }'
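The request above passes `/model/llama-2-7b-model.gguf`, which only matches the download from Step 1 if the `model` directory sits at the filesystem root. A minimal sketch that builds the same request body from the real location of the file, on the assumption that an absolute path is the safest value to send. `loadmodel_payload` is a hypothetical helper, and it does no JSON escaping, so it assumes the path contains no double quotes or backslashes.

```shell
# Print a /loadmodel JSON body for a model file, resolved to an absolute path.
# $1: path to the .gguf file, $2: alias to register the model under.
loadmodel_payload() {
  model_file="$1"
  alias_name="$2"
  # Resolve the containing directory; fail if it does not exist.
  abs_dir="$(cd "$(dirname "$model_file")" && pwd)" || return 1
  printf '{"llama_model_path": "%s/%s", "model_alias": "%s", "ctx_len": 512, "ngl": 100, "model_type": "llm"}\n' \
    "$abs_dir" "$(basename "$model_file")" "$alias_name"
}

# Usage (needs the server from Step 2 listening on port 3928):
# curl http://localhost:3928/loadmodel \
#   -H 'Content-Type: application/json' \
#   -d "$(loadmodel_payload model/llama-2-7b-model.gguf llama-2-7b-model)"
```

The other fields are copied unchanged from the request above; adjust `ctx_len` and `ngl` to suit the model and hardware.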

Step 4: Making an Inference

curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ],
    "model": "llama-2-7b-model"
  }'

Table of parameters

| Parameter | Type | Description |
|---|---|---|
| llama_model_path | String | The file path to the LLaMA model. |
| ngl | Integer | The number of GPU layers to use. |
| ctx_len | Integer | The context length for the model operations. |
| embedding | Boolean | Whether to use embedding in the model. |
| n_parallel | Integer | The number of parallel operations. |
| cont_batching | Boolean | Whether to use continuous batching. |
| user_prompt | String | The prompt to use for the user. |
| ai_prompt | String | The prompt to use for the AI assistant. |
| system_prompt | String | The prompt to use for system rules. |
| pre_prompt | String | The prompt to use for internal configuration. |
| cpu_threads | Integer | The number of threads to use for inferencing (CPU MODE ONLY) |
| n_batch | Integer | The batch size for prompt eval step |
| caching_enabled | Boolean | To enable prompt caching or not |
| grp_attn_n | Integer | Group attention factor in self-extend |
| grp_attn_w | Integer | Group attention width in self-extend |
| mlock | Boolean | Prevent system swapping of the model to disk in macOS |
| grammar_file | String | You can constrain the sampling using GBNF grammars by providing path to a grammar file |
| model_type | String | Model type we want to use: llm or embedding, default value is llm |
| model_alias | String | Used as model_id if specified in request, mandatory in loadmodel |
| model | String | Used as model_id if specified in request, mandatory in chat/embedding request |
| flash_attn | Boolean | To enable Flash Attention, default is true |
| cache_type | String | KV cache type: f16, q8_0, q4_0, default is f16 |
| use_mmap | Boolean | To enable mmap, default is true |
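As a rough illustration of how several of these parameters might be combined, the sketch below writes a load-model request body to a file so it can be sent with curl's `-d @file` form. `write_loadmodel_config` is a hypothetical helper, the values are illustrative rather than recommended, and whether a given combination suits a particular model or machine is not something this example can confirm.

```shell
# Write an example /loadmodel body to the file given as $1.
# All values are illustrative assumptions; tune them for your setup.
write_loadmodel_config() {
  cat > "$1" <<'EOF'
{
  "llama_model_path": "/model/llama-2-7b-model.gguf",
  "model_alias": "llama-2-7b-model",
  "model_type": "llm",
  "ctx_len": 2048,
  "ngl": 100,
  "n_parallel": 4,
  "cont_batching": true,
  "caching_enabled": true,
  "flash_attn": true,
  "cache_type": "q8_0",
  "use_mmap": true
}
EOF
}

# Usage (needs a running server on port 3928):
# write_loadmodel_config loadmodel.json
# curl http://localhost:3928/loadmodel \
#   -H 'Content-Type: application/json' \
#   -d @loadmodel.json
```

Keeping the body in a file avoids shell quoting problems and makes it easy to keep one configuration per model.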