LLM-Engines

Author: Dongfu Jiang (Twitter, PyPI Package)

A unified inference engine for large language models (LLMs), covering open-source models served through vLLM, SGLang, or Together as well as commercial models from OpenAI, Mistral, Claude, and Gemini.

Inference correctness has been verified by comparing the outputs of the same model across different engines with temperature=0.0 and max_tokens=None. For example, a single model served through three engines (vLLM, SGLang, Together) produces identical outputs under these settings. Try the examples below to compare the engines yourself.
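
For example, a minimal cross-engine check might look like the sketch below. It only reuses the load_model / call_model / unload_model calls shown in the Usage section; it assumes both engines are installed locally.

from llm_engines import LLMEngine

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
prompt = "What is the capital of France?"
outputs = {}
for engine in ["vllm", "sglang"]:
    llm = LLMEngine()
    llm.load_model(model_name=model_name, engine=engine, num_workers=1, num_gpu_per_worker=1, use_cache=False)
    outputs[engine] = llm.call_model(model_name, prompt, temperature=0.0, max_tokens=None)
    llm.unload_model(model_name)  # free the GPU before starting the next engine
print(outputs["vllm"] == outputs["sglang"])  # expected to print True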

Installation

pip install llm-engines # or
# pip install git+https://github.com/jdf-prog/LLM-Engines.git

For development:

pip install -e . # for development
# Add-ons
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ # required for sglang
pip install flash-attn --no-build-isolation

Usage

Engines

from llm_engines import LLMEngine
model_name="meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLMEngine()
llm.load_model(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
    num_workers=1, # number of workers
    num_gpu_per_worker=1, # tensor parallelism size for each worker
    engine="vllm", # or "sglang"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

# export TOGETHER_API_KEY="your_together_api_key"
from llm_engines import LLMEngine
model_name="meta-llama/Llama-3-8b-chat-hf"
llm = LLMEngine()
llm.load_model(
    model_name="meta-llama/Llama-3-8b-chat-hf", 
    engine="together", # or "openai", "mistral", "claude"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

# export OPENAI_API_KEY="your_openai_api_key"
from llm_engines import LLMEngine
model_name="gpt-3.5-turbo"
llm = LLMEngine()
llm.load_model(
    model_name="gpt-3.5-turbo", 
    engine="openai", # or "vllm", "together", "mistral", "claude"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

# export MISTRAL_API_KEY="your_mistral_api_key"
from llm_engines import LLMEngine
model_name="mistral-large-latest"
llm = LLMEngine()
llm.load_model(
    model_name="mistral-large-latest", 
    engine="mistral", # or "vllm", "together", "openai", "claude"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

# export ANTHROPIC_API_KEY="your_claude_api_key"
from llm_engines import LLMEngine
model_name="claude-3-opus-20240229"
llm = LLMEngine()
llm.load_model(
    model_name="claude-3-opus-20240229", 
    engine="claude", # or "vllm", "together", "openai", "mistral"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

# export GOOGLE_API_KEY="your_gemini_api_key"
from llm_engines import LLMEngine
model_name="gemini-1.5-flash"
llm = LLMEngine()
llm.load_model(
    model_name="gemini-1.5-flash", 
    engine="gemini", # or "vllm", "together", "openai", "mistral", "claude"
    use_cache=False
)
response = llm.call_model(model_name, "What is the capital of France?", temperature=0.0, max_tokens=None)
print(response)

Unload model

Remember to unload a model after using it to free up resources. By default, all workers are unloaded when the program exits. If you want to use different models in the same program, unload the current model before loading a new one whenever the new model needs GPU resources.

llm.unload_model(model_name) # unload all the workers named model_name
llm.unload_model() # unload all the workers
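
For example, a sketch of switching from a GPU-backed model to an API model in one program (model names taken from the examples above):

from llm_engines import LLMEngine

llm = LLMEngine()
llm.load_model(model_name="meta-llama/Meta-Llama-3-8B-Instruct", engine="vllm", num_workers=1, num_gpu_per_worker=1, use_cache=False)
print(llm.call_model("meta-llama/Meta-Llama-3-8B-Instruct", "What is the capital of France?", temperature=0.0, max_tokens=None))
llm.unload_model("meta-llama/Meta-Llama-3-8B-Instruct")  # free the GPU before loading anything else
llm.load_model(model_name="gpt-3.5-turbo", engine="openai", use_cache=False)  # API model, no local GPU needed
print(llm.call_model("gpt-3.5-turbo", "What is the capital of France?", temperature=0.0, max_tokens=None))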

Multi-turn conversation

from llm_engines import LLMEngine
model_name="meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLMEngine()
llm.load_model(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
    num_workers=1, # number of workers
    num_gpu_per_worker=1, # tensor parallelism size for each worker
    engine="vllm", # or "sglang"
    use_cache=False
)
messages = [
    "Hello", # user message 
    "Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", # previous model response
    "What is the capital of France?" # user message
]
# or you can use OpenAI's multi-turn conversation format.
messages = [
    {"role": "user", "content": "Hello"}, # user message 
    {"role": "assistant", "content": "Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?"}, # previous model response
    {"role": "user", "content": "What is the capital of France?"} # user message
]
response = llm.call_model(model_name, messages, temperature=0.0, max_tokens=None)
print(response)

The messages should either be a plain list of strings alternating between user messages and model responses (starting and ending with a user message), or a list of OpenAI-style role/content dicts, as shown above.
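
If you keep conversations as plain alternating strings, a small helper like the one below (hypothetical, not part of the package) converts them into the OpenAI-style format:

def to_openai_messages(turns):
    # even indices are user turns, odd indices are previous model responses
    roles = ["user", "assistant"]
    return [{"role": roles[i % 2], "content": turn} for i, turn in enumerate(turns)]

print(to_openai_messages([
    "Hello",
    "Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?",
    "What is the capital of France?",
]))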

Batch inference

from llm_engines import LLMEngine
model_name="meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLMEngine()
llm.load_model(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
    num_workers=1, # number of workers
    num_gpu_per_worker=1, # tensor parallelism size for each worker
    engine="vllm", # or "sglang"
    use_cache=False
)
batch_messages = [
    "Hello", # user message 
    "Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", # previous model response
    "What is the capital of France?" # user message
] * 100
response = llm.batch_call_model(model_name, batch_messages, num_proc=32, temperature=0.0, max_tokens=None)
print(response)
# List of responses [response1, response2, ...]

Example inference file: ./examples/batch_inference_wildchat.py

python examples/batch_inference_wildchat.py

OpenAI Batch API: with the code above, the batch API is used automatically for OpenAI models. If you don't want the batch API and prefer the normal API, set disable_batch_api=True when loading the model. num_proc is ignored when the batch API is used.
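
For example, a sketch that opts out of the batch API for an OpenAI model (disable_batch_api is the flag just mentioned; the other arguments mirror the earlier examples):

from llm_engines import LLMEngine

llm = LLMEngine()
llm.load_model(
    model_name="gpt-3.5-turbo",
    engine="openai",
    use_cache=False,
    disable_batch_api=True,  # use the normal API; num_proc will be honored
)
responses = llm.batch_call_model("gpt-3.5-turbo", ["What is the capital of France?"] * 10, num_proc=8, temperature=0.0, max_tokens=None)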

OpenAI's batch API costs half as much as the normal API. It is only used for models with max_batch_size > 1.

LLM-Engines calculates a hash of the inputs and generation parameters and only sends a new batch request when they differ from previous requests. You can check the list of submitted batches in the ~/llm_engines/generation_cache/openai_batch_cache/batch_submission_status.json file.
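
To see what has already been submitted, you can read that file directly (a sketch; the exact structure of each entry is not documented here):

import json, os

status_path = os.path.expanduser("~/llm_engines/generation_cache/openai_batch_cache/batch_submission_status.json")
with open(status_path) as f:
    batch_status = json.load(f)  # assumed to map a request hash to submission info
for request_hash, info in batch_status.items():
    print(request_hash, info)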

Parallel inference through Hugging Face dataset map

Check out ./examples/mp_inference_wildchat.py for parallel inference with multiple models.

python examples/mp_inference_wildchat.py

Cache

If use_cache=True, all queries and responses are cached in the generation_cache folder, so no duplicate query is sent to the model. The cache for each model is saved to generation_cache/{model_name}.jsonl.

Example items in the cache:

{"cb0b4aaf80c43c9973aefeda1bd72890": {"input": ["What is the capital of France?"], "output": "The capital of France is Paris."}}

The hash key here is the hash of the concatenated inputs.
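
The hashing scheme itself is an implementation detail; the 32-character key above is consistent with an MD5 digest of the concatenated inputs, so a lookup might be sketched as follows (the MD5 choice here is an assumption, not guaranteed by the package):

import hashlib
import json

def cache_key(inputs):
    # assumption: MD5 of the concatenated input strings
    return hashlib.md5("".join(inputs).encode("utf-8")).hexdigest()

key = cache_key(["What is the capital of France?"])
with open("generation_cache/meta-llama/Meta-Llama-3-8B-Instruct.jsonl") as f:
    for line in f:
        item = json.loads(line)
        if key in item:
            print(item[key]["output"])
            break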

Chat template

For open-source models, we apply the model's default chat template as follows:

prompt = self.tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=add_generation_prompt,
    tokenize=False,
    chat_template=chat_template,
)

An error will be raised if the model's tokenizer does not define a chat template.
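
The snippet above runs inside a worker, where self.tokenizer is the model's tokenizer. Outside of LLM-Engines you can preview what the template produces with transformers directly, for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)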

Worker initialization parameters (load_model)
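
A summary of the load_model keyword arguments used throughout this document (a sketch; the values shown come from the examples above, not documented defaults):

from llm_engines import LLMEngine

llm = LLMEngine()
llm.load_model(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",  # HF model id or API model name
    engine="vllm",            # "vllm", "sglang", "together", "openai", "mistral", "claude", or "gemini"
    num_workers=1,            # number of workers (open-source engines)
    num_gpu_per_worker=1,     # tensor parallelism size for each worker
    use_cache=False,          # cache queries and responses under generation_cache/
    # disable_batch_api=True, # OpenAI models only: skip the batch API (see Batch inference above)
)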

Generation parameters (call_model, batch_call_model)
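
Likewise, the generation parameters used in the examples (a sketch covering only the arguments that appear in this document):

from llm_engines import LLMEngine

llm = LLMEngine()
llm.load_model(model_name="meta-llama/Meta-Llama-3-8B-Instruct", engine="vllm", use_cache=False)

# single call: the query can be a string or a list of conversation turns
response = llm.call_model(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "What is the capital of France?",
    temperature=0.0,  # 0.0 gives deterministic output
    max_tokens=None,  # None lets the model decide when to stop
)

# batch call: one entry per query, num_proc parallel requests
responses = llm.batch_call_model(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    ["What is the capital of France?"] * 4,
    num_proc=4,
    temperature=0.0,
    max_tokens=None,
)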

Launch a separate vllm/sglang model worker

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host "127.0.0.1" --port 34200 --tensor-parallel-size 1 --disable-log-requests &
# address: http://127.0.0.1:34200
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host "127.0.0.1" --port 34201 --tp-size 1 &
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host "127.0.0.1" --port 34201 --tp-size 1 --disable-flashinfer & # disable flashinfer if it's not installed
# address: http://127.0.0.1:34201
from llm_engines import get_call_worker_func
call_worker_func = get_call_worker_func(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
    worker_addrs=["http://127.0.0.1:34200", "http://127.0.0.1:34201"], # many workers can be used, will be load balanced
    engine="sglang", 
    use_cache=False
)
response = call_worker_func(["What is the capital of France?"], temperature=0.0, max_tokens=None)
print(response)
# The capital of France is Paris.

Test notes

When setting temperature=0.0 and max_tokens=None, long generations were tested to check that the different engines produce consistent outputs.

Star History

Star History Chart

Citation

@misc{jiang2024llmengines,
  title = {LLM-Engines: A unified and parallel inference engine for large language models},
  author = {Dongfu Jiang},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jdf-prog/LLM-Engines}},
}