CTransformers

Python bindings for the Transformer models implemented in C/C++ using the GGML library.

Also see ChatDocs

Supported Models

| Models | Model Type | CUDA | Metal |
|:-------|:-----------|:----:|:-----:|
| GPT-2 | gpt2 | | |
| GPT-J, GPT4All-J | gptj | | |
| GPT-NeoX, StableLM | gpt_neox | | |
| Falcon | falcon | ✅ | |
| LLaMA, LLaMA 2 | llama | ✅ | ✅ |
| MPT | mpt | ✅ | |
| StarCoder, StarChat | gpt_bigcode | ✅ | |
| Dolly V2 | dolly-v2 | | |
| Replit | replit | | |

Installation

```sh
pip install ctransformers
```

Usage

It provides a unified interface for all models:

```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```


To stream the output, set stream=True:

```py
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```

You can load models directly from the Hugging Face Hub:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

<a id="transformers"></a>

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create model and tokenizer using:

```py
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```


You can use 🤗 Transformers text generation pipeline:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation parameters:

```py
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:

```py
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

LangChain

It is integrated into LangChain. See LangChain docs.
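
For reference, a minimal sketch using LangChain's CTransformers wrapper (the exact import path and invocation style depend on the LangChain version):

```py
from langchain.llms import CTransformers  # langchain_community.llms.CTransformers in newer releases

# Load a GGML model from the Hugging Face Hub through the LangChain wrapper.
llm = CTransformers(model="marella/gpt-2-ggml")

print(llm.invoke("AI is going to"))  # older LangChain versions: llm("AI is going to")
```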

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)


CUDA

Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```

ROCm

To enable ROCm support, install the ctransformers package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

Metal

To enable Metal support, install the ctransformers package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using ExLlama.

Install additional dependencies using:

```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")


If the model name or path doesn't contain the word gptq, specify model_type="gptq" explicitly, as in the sketch below.
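
For example (the path here is only a hypothetical placeholder for a local GPTQ model):

```py
# Hypothetical local path; model_type="gptq" is required because "gptq" does
# not appear in the path itself.
llm = AutoModelForCausalLM.from_pretrained("/path/to/quantized-model", model_type="gptq")
```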

GPTQ models can also be used with LangChain. Note that the low-level APIs are not fully supported for GPTQ models.

Documentation

<!-- API_DOCS -->

Config

| Parameter | Type | Description | Default |
|:----------|:-----|:------------|:--------|
| top_k | int | The top-k value to use for sampling. | 40 |
| top_p | float | The top-p value to use for sampling. | 0.95 |
| temperature | float | The temperature to use for sampling. | 0.8 |
| repetition_penalty | float | The repetition penalty to use for sampling. | 1.1 |
| last_n_tokens | int | The number of last tokens to use for repetition penalty. | 64 |
| seed | int | The seed value to use for sampling tokens. | -1 |
| max_new_tokens | int | The maximum number of new tokens to generate. | 256 |
| stop | List[str] | A list of sequences to stop generation when encountered. | None |
| stream | bool | Whether to stream the generated text. | False |
| reset | bool | Whether to reset the model state before generating text. | True |
| batch_size | int | The batch size to use for evaluating tokens in a single prompt. | 8 |
| threads | int | The number of threads to use for evaluating tokens. | -1 |
| context_length | int | The maximum context length to use. | -1 |
| gpu_layers | int | The number of layers to run on GPU. | 0 |

Note: Currently only LLaMA, MPT and Falcon models support the context_length parameter.
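
These parameters can be passed as keyword arguments when loading a model. A minimal sketch (the values here are illustrative, using one of the GGML repos shown above):

```py
from ctransformers import AutoModelForCausalLM

# Config values are passed as keyword arguments to from_pretrained();
# generation-time options such as temperature or stream can also be passed
# per call to llm(...), as shown in the LLM.__call__ docs below.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",
    model_type="llama",
    context_length=2048,
    gpu_layers=50,
)
```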

<kbd>class</kbd> AutoModelForCausalLM


<kbd>classmethod</kbd> AutoModelForCausalLM.from_pretrained

```py
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

Args:

- model_path_or_repo_id: The path to a model file or directory, or the name of a Hugging Face Hub model repo.
- model_type: The model type.
- model_file: The name of the model file in the repo or directory.
- config: AutoConfig object.
- lib: The path to a shared library, or one of avx2, avx, basic.
- local_files_only: Whether to only look at local files (i.e. don't try to download the model).
- revision: The specific model version to use (a branch name, tag name, or commit id).
- hf: Whether to create a 🤗 Transformers compatible model.

Returns: LLM object.

<kbd>class</kbd> LLM

<kbd>method</kbd> LLM.__init__

```py
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

Args:

- model_path: The path to a model file.
- model_type: The model type.
- config: Config object.
- lib: The path to a shared library, or one of avx2, avx, basic.


<kbd>property</kbd> LLM.bos_token_id

The beginning-of-sequence token.


<kbd>property</kbd> LLM.config

The config object.


<kbd>property</kbd> LLM.context_length

The context length of the model.


<kbd>property</kbd> LLM.embeddings

The input embeddings.


<kbd>property</kbd> LLM.eos_token_id

The end-of-sequence token.


<kbd>property</kbd> LLM.logits

The unnormalized log probabilities.


<kbd>property</kbd> LLM.model_path

The path to the model file.


<kbd>property</kbd> LLM.model_type

The model type.


<kbd>property</kbd> LLM.pad_token_id

The padding token.


<kbd>property</kbd> LLM.vocab_size

The number of tokens in the vocabulary.


<kbd>method</kbd> LLM.detokenize

```py
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

Args:

- tokens: The list of tokens.
- decode: Whether to decode the text as UTF-8; if False, raw bytes are returned.

Returns: The combined text of all tokens.


<kbd>method</kbd> LLM.embed

```py
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.

Args:

- input: The input text or list of tokens to get embeddings for.
- batch_size: The batch size to use for evaluating tokens in a single prompt.
- threads: The number of threads to use for evaluating tokens.

Returns: The input embeddings.
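
A minimal sketch, assuming llm was loaded from a LLaMA or Falcon model (the only types that currently support embeddings):

```py
# Compute an embedding vector for a prompt.
embedding = llm.embed("AI is going to")
print(len(embedding))  # dimensionality of the model's embedding space
```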


<kbd>method</kbd> LLM.eval

```py
eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.

Args:

- tokens: The list of tokens to evaluate.
- batch_size: The batch size to use for evaluating tokens in a single prompt.
- threads: The number of threads to use for evaluating tokens.


<kbd>method</kbd> LLM.generate

```py
generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.

Args:

- tokens: The list of tokens to generate tokens from.
- top_k: The top-k value to use for sampling.
- top_p: The top-p value to use for sampling.
- temperature: The temperature to use for sampling.
- repetition_penalty: The repetition penalty to use for sampling.
- last_n_tokens: The number of last tokens to use for repetition penalty.
- seed: The seed value to use for sampling tokens.
- batch_size: The batch size to use for evaluating tokens in a single prompt.
- threads: The number of threads to use for evaluating tokens.
- reset: Whether to reset the model state before generating text.

Returns: The generated tokens.
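
A minimal sketch of token-level generation with the llm object from the earlier examples, capped at a fixed number of tokens for brevity:

```py
from itertools import islice

# Token-level generation loop: generate() yields token ids one at a time,
# which are decoded back to text with detokenize().
tokens = llm.tokenize("AI is going to")
for token in islice(llm.generate(tokens), 64):
    print(llm.detokenize([token]), end="", flush=True)
```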


<kbd>method</kbd> LLM.is_eos_token

```py
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.

Args:

- token: The token to check.

Returns: True if the token is an end-of-sequence token, else False.


<kbd>method</kbd> LLM.prepare_inputs_for_generation

```py
prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]
```

Removes input tokens that were already evaluated in the past and updates the LLM context accordingly.

Args:

- tokens: The list of tokens.
- reset: Whether to reset the model state before generating text.

Returns: The list of tokens to evaluate.


<kbd>method</kbd> LLM.reset

```py
reset() → None
```

Deprecated since 0.2.27.


<kbd>method</kbd> LLM.sample

```py
sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int
```

Samples a token from the model.

Args:

- top_k: The top-k value to use for sampling.
- top_p: The top-p value to use for sampling.
- temperature: The temperature to use for sampling.
- repetition_penalty: The repetition penalty to use for sampling.
- last_n_tokens: The number of last tokens to use for repetition penalty.
- seed: The seed value to use for sampling tokens.

Returns: The sampled token.
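
A minimal sketch of a manual generation loop built from eval(), sample(), is_eos_token() and detokenize() (an alternative to generate() above; the sampling settings are illustrative):

```py
# Manual generation loop: evaluate the prompt once, then repeatedly sample a
# token, print it, and feed it back into the model.
tokens = llm.tokenize("AI is going to")
llm.eval(tokens)
for _ in range(64):
    token = llm.sample(temperature=0.8)
    if llm.is_eos_token(token):
        break
    print(llm.detokenize([token]), end="", flush=True)
    llm.eval([token])
```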


<kbd>method</kbd> LLM.tokenize

```py
tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]
```

Converts a text into a list of tokens.

Args:

- text: The text to tokenize.
- add_bos_token: Whether to add the beginning-of-sequence token.

Returns: The list of tokens.


<kbd>method</kbd> LLM.__call__

```py
__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.

Args:

- prompt: The prompt to generate text from.
- max_new_tokens: The maximum number of new tokens to generate.
- top_k: The top-k value to use for sampling.
- top_p: The top-p value to use for sampling.
- temperature: The temperature to use for sampling.
- repetition_penalty: The repetition penalty to use for sampling.
- last_n_tokens: The number of last tokens to use for repetition penalty.
- seed: The seed value to use for sampling tokens.
- batch_size: The batch size to use for evaluating tokens in a single prompt.
- threads: The number of threads to use for evaluating tokens.
- stop: A list of sequences to stop generation when encountered.
- stream: Whether to stream the generated text.
- reset: Whether to reset the model state before generating text.

Returns: The generated text.
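
For example, combining several of these parameters (the values are illustrative, not recommendations):

```py
# High-level generation with explicit sampling settings and a stop sequence.
text = llm(
    "AI is going to",
    max_new_tokens=128,
    temperature=0.8,
    top_p=0.95,
    stop=["\n\n"],
)
print(text)
```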

<!-- API_DOCS -->

License

MIT