OpenAI-Compatible vLLM Serverless Endpoint Worker

Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the vLLM Inference Engine on RunPod Serverless with just a few clicks.

1. UI for Deploying vLLM Worker on RunPod console:

Demo of Deploying vLLM Worker on RunPod console with new UI

2. Worker vLLM v1.6.0 with vLLM 0.6.3 now available under stable tags

Update v1.6.0 is now available. Use the image tag runpod/worker-v1-vllm:v1.6.0stable-cuda12.1.0.

3. OpenAI-Compatible Embedding Worker Released

Deploy your own OpenAI-compatible Serverless Endpoint on RunPod with multiple embedding models and fast inference for RAG and more!

4. Caching Across RunPod Machines

Worker vLLM is now cached on all RunPod machines, resulting in near-instant deployment! Previously, downloading and extracting the image took 3-5 minutes on average.

Setting up the Serverless Worker

Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]

[!NOTE] You can now deploy from the dedicated UI on the RunPod console, which exposes all of the settings and choices listed below. Try it from the Explore or Serverless pages on the RunPod console!

We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:

RunPod Worker Images

Below is a summary of the available RunPod Worker images, categorized by image stability and CUDA version compatibility.

| CUDA Version | Stable Image Tag | Development Image Tag | Note |
|---|---|---|---|
| 12.1.0 | runpod/worker-v1-vllm:v1.6.0stable-cuda12.1.0 | runpod/worker-v1-vllm:v1.6.0dev-cuda12.1.0 | When creating an Endpoint, select CUDA Version 12.3, 12.2 and 12.1 in the filter. |
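The two tags in the table share a visible pattern (worker version, then `stable` or `dev`, then the CUDA version). As a minimal sketch, assuming that pattern holds, a hypothetical helper `worker_image_tag` can compose them. Only the v1.6.0 / CUDA 12.1.0 combination is confirmed by the table; other combinations may not exist.

```python
def worker_image_tag(version: str = "v1.6.0", channel: str = "stable", cuda: str = "12.1.0") -> str:
    """Compose an image tag following the pattern visible in the table above.

    Hypothetical helper: it only formats a string and does not check
    that the resulting tag has actually been published.
    """
    if channel not in ("stable", "dev"):
        raise ValueError("channel must be 'stable' or 'dev'")
    return f"runpod/worker-v1-vllm:{version}{channel}-cuda{cuda}"


print(worker_image_tag())
print(worker_image_tag(channel="dev"))
```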


Environment Variables/Settings

Note: 0 is equivalent to False and 1 is equivalent to True for boolean as int values.

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| MODEL_NAME | 'facebook/opt-125m' | str | Name or path of the Hugging Face model to use. |
| TOKENIZER | None | str | Name or path of the Hugging Face tokenizer to use. |
| SKIP_TOKENIZER_INIT | False | bool | Skip initialization of tokenizer and detokenizer. |
| TOKENIZER_MODE | 'auto' | ['auto', 'slow'] | The tokenizer mode. |
| TRUST_REMOTE_CODE | False | bool | Trust remote code from Hugging Face. |
| DOWNLOAD_DIR | None | str | Directory to download and load the weights. |
| LOAD_FORMAT | 'auto' | str | The format of the model weights to load. |
| HF_TOKEN | - | str | Hugging Face token for private and gated models. |
| DTYPE | 'auto' | ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'] | Data type for model weights and activations. |
| KV_CACHE_DTYPE | 'auto' | ['auto', 'fp8'] | Data type for KV cache storage. |
| QUANTIZATION_PARAM_PATH | None | str | Path to the JSON file containing the KV cache scaling factors. |
| MAX_MODEL_LEN | None | int | Model context length. |
| GUIDED_DECODING_BACKEND | 'outlines' | ['outlines', 'lm-format-enforcer'] | Which engine will be used for guided decoding by default. |
| DISTRIBUTED_EXECUTOR_BACKEND | None | ['ray', 'mp'] | Backend to use for distributed serving. |
| WORKER_USE_RAY | False | bool | Deprecated, use --distributed-executor-backend=ray. |
| PIPELINE_PARALLEL_SIZE | 1 | int | Number of pipeline stages. |
| TENSOR_PARALLEL_SIZE | 1 | int | Number of tensor parallel replicas. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches. |
| RAY_WORKERS_USE_NSIGHT | False | bool | If specified, use nsight to profile Ray workers. |
| ENABLE_PREFIX_CACHING | False | bool | Enables automatic prefix caching. |
| DISABLE_SLIDING_WINDOW | False | bool | Disables sliding window, capping to sliding window size. |
| USE_V2_BLOCK_MANAGER | False | bool | Use BlockSpaceManagerV2. |
| NUM_LOOKAHEAD_SLOTS | 0 | int | Experimental scheduling config necessary for speculative decoding. |
| SEED | 0 | int | Random seed for operations. |
| NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, ignore GPU profiling result and use this number of GPU blocks. |
| MAX_NUM_BATCHED_TOKENS | None | int | Maximum number of batched tokens per iteration. |
| MAX_NUM_SEQS | 256 | int | Maximum number of sequences per iteration. |
| MAX_LOGPROBS | 20 | int | Max number of log probs to return when logprobs is specified in SamplingParams. |
| DISABLE_LOG_STATS | False | bool | Disable logging statistics. |
| QUANTIZATION | None | ['awq', 'squeezellm', 'gptq'] | Method used to quantize the weights. |
| ROPE_SCALING | None | dict | RoPE scaling configuration in JSON format. |
| ROPE_THETA | None | float | RoPE theta. Use with rope_scaling. |
| TOKENIZER_POOL_SIZE | 0 | int | Size of tokenizer pool to use for asynchronous tokenization. |
| TOKENIZER_POOL_TYPE | 'ray' | str | Type of tokenizer pool to use for asynchronous tokenization. |
| TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra config for tokenizer pool. |
| ENABLE_LORA | False | bool | If True, enable handling of LoRA adapters. |
| MAX_LORAS | 1 | int | Max number of LoRAs in a single batch. |
| MAX_LORA_RANK | 16 | int | Max LoRA rank. |
| LORA_EXTRA_VOCAB_SIZE | 256 | int | Maximum size of extra vocabulary for LoRA adapters. |
| LORA_DTYPE | 'auto' | ['auto', 'float16', 'bfloat16', 'float32'] | Data type for LoRA. |
| LONG_LORA_SCALING_FACTORS | None | tuple | Specify multiple scaling factors for LoRA adapters. |
| MAX_CPU_LORAS | None | int | Maximum number of LoRAs to store in CPU memory. |
| FULLY_SHARDED_LORAS | False | bool | Enable fully sharded LoRA layers. |
| SCHEDULER_DELAY_FACTOR | 0.0 | float | Apply a delay before scheduling next prompt. |
| ENABLE_CHUNKED_PREFILL | False | bool | Enable chunked prefill requests. |
| SPECULATIVE_MODEL | None | str | The name of the draft model to be used in speculative decoding. |
| NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
| SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | Number of tensor parallel replicas for the draft model. |
| SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
| SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disable speculative decoding if the number of enqueued requests is larger than this value. |
| NGRAM_PROMPT_LOOKUP_MAX | None | int | Max size of window for ngram prompt lookup in speculative decoding. |
| NGRAM_PROMPT_LOOKUP_MIN | None | int | Min size of window for ngram prompt lookup in speculative decoding. |
| SPEC_DECODING_ACCEPTANCE_METHOD | 'rejection_sampler' | ['rejection_sampler', 'typical_acceptance_sampler'] | Specify the acceptance method for draft token verification in speculative decoding. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Set the lower bound threshold for the posterior probability of a token to be accepted. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
| MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra config for model loader. |
| PREEMPTION_MODE | None | str | If 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into the CPU memory as preemption happens. |
| PREEMPTION_CHECK_PERIOD | 1.0 | float | How frequently the engine checks if a preemption happens. |
| PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory used for the saved activations. |
| DISABLE_LOGGING_REQUEST | False | bool | Disable logging requests. |
| MAX_LOG_LEN | None | int | Max number of prompt characters or prompt ID numbers being printed in log. |

Tokenizer Settings

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| TOKENIZER_NAME | None | str | Tokenizer repository to use a different tokenizer than the model's default. |
| TOKENIZER_REVISION | None | str | Tokenizer revision to load. |
| CUSTOM_CHAT_TEMPLATE | None | str of single-line jinja template | Custom chat jinja template. |

System, GPU, and Tensor Parallelism (Multi-GPU) Settings

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| GPU_MEMORY_UTILIZATION | 0.95 | float | Sets GPU VRAM utilization. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. |
| BLOCK_SIZE | 16 | 8, 16, 32 | Token block size for contiguous chunks of tokens. |
| SWAP_SPACE | 4 | int | CPU swap space size (GiB) per GPU. |
| ENFORCE_EAGER | False | bool | Always use eager-mode PyTorch. If False (0), will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. |
| MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
| DISABLE_CUSTOM_ALL_REDUCE | 0 | int | Enables or disables custom all reduce. |

Streaming Batch Size Settings

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| DEFAULT_BATCH_SIZE | 50 | int | Default and maximum batch size for token streaming to reduce HTTP calls. |
| DEFAULT_MIN_BATCH_SIZE | 1 | int | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
| DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | Growth factor for dynamic batch size. |

The first request has a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request has a batch size of previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR, until the batch size reaches DEFAULT_BATCH_SIZE. For example, with the default values the batch sizes are 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request with the inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This is unrelated to vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.

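The growth rule for streaming batch sizes can be reproduced in a few lines. This is only a sketch of the rule as described here, not the worker's actual code; the function name `batch_sizes` is hypothetical, and truncating non-integer sizes with `int()` is an assumption.

```python
from itertools import islice
from typing import Iterator


def batch_sizes(min_size: int = 1, max_size: int = 50, growth: float = 3.0) -> Iterator[int]:
    """Yield the streaming batch size used for each successive HTTP call.

    Defaults mirror DEFAULT_MIN_BATCH_SIZE, DEFAULT_BATCH_SIZE and
    DEFAULT_BATCH_SIZE_GROWTH_FACTOR. Truncation via int() is an assumption.
    """
    size = float(min_size)
    while True:
        # Never exceed the configured maximum.
        yield min(int(size), max_size)
        # Stop growing once the cap has been reached.
        if size < max_size:
            size *= growth


print(list(islice(batch_sizes(), 7)))  # [1, 3, 9, 27, 50, 50, 50]
```

Small early batches keep the time to first token low, while larger later batches reduce the number of HTTP calls.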
OpenAI Settings

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| RAW_OPENAI_OUTPUT | 1 | boolean as int | Enables raw OpenAI SSE format string output when streaming. Required to be enabled (which it is by default) for OpenAI compatibility. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the name of the served model from model repo/path to the specified name, which you can then use as the value of the model parameter when making OpenAI requests. |
| OPENAI_RESPONSE_ROLE | assistant | str | Role of the LLM's Response in OpenAI Chat Completions. |

Serverless Settings

| Name | Default | Type/Choices | Description |
|---|---|---|---|
| MAX_CONCURRENCY | 300 | int | Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM; this setting is for improving scaling/load balancing efficiency. |
| DISABLE_LOG_STATS | False | bool | Enables or disables vLLM stats logging. |
| DISABLE_LOG_REQUESTS | False | bool | Enables or disables vLLM request logging. |

[!TIP] If you are facing issues when using Mixtral 8x7B, Quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.
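Because every setting is passed as an environment variable, values end up as strings, and booleans follow the 0/1 convention noted above. The sketch below shows one way to prepare such a set of variables before entering them in the console; `to_env` is a hypothetical helper, and the chosen values are illustrative only.

```python
def to_env(settings: dict) -> dict:
    """Convert Python values to the string form used for environment variables.

    Hypothetical helper: booleans become "1"/"0", None entries are skipped
    so that the worker's default applies, everything else is stringified.
    """
    env = {}
    for name, value in settings.items():
        if value is None:
            continue
        if isinstance(value, bool):  # must be checked before any int handling
            env[name] = "1" if value else "0"
        else:
            env[name] = str(value)
    return env


# Illustrative values, not recommendations.
settings = {
    "MODEL_NAME": "openchat/openchat_3.5",
    "MAX_MODEL_LEN": 8192,
    "GPU_MEMORY_UTILIZATION": 0.95,
    "TRUST_REMOTE_CODE": True,
    "TOKENIZER": None,
}

env = to_env(settings)
for key, value in env.items():
    print(f"{key}={value}")
```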

Option 2: Build Docker Image with Model Inside

To build an image with the model baked in, you must specify the following docker arguments when building the image.



For the remaining settings, you may apply them as environment variables when running the container. Supported environment variables are listed in the Environment Variables section.

Example: Building an image with OpenChat-3.5

sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg BASE_PATH="/models" .

(Optional) Including Hugging Face Token

If the model you would like to deploy is private or gated, you will need to provide your Hugging Face token at build time as a Docker secret, which keeps the token from being exposed in the image and on DockerHub.

  1. Enable Docker BuildKit (required for secrets).
export DOCKER_BUILDKIT=1
  2. Export your Hugging Face token as an environment variable.
export HF_TOKEN="your_token_here"
  3. Add the token as a secret when building.
docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="openchat/openchat_3.5" .

Compatible Model Architectures

Below are all supported model architectures (and examples of each) that you can deploy using the vLLM Worker. You can deploy any model on HuggingFace, as long as its base architecture is one of the following:

Usage: OpenAI Compatibility

The vLLM Worker is compatible with OpenAI's API, so you can use it with any OpenAI codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins> and <ins>Models</ins>, with both streaming and non-streaming responses.

Modifying your OpenAI Codebase to use your deployed vLLM Worker

Python (similar to Node.js, etc.):

  1. When initializing the OpenAI Client in your code, change the api_key to your RunPod API Key and the base_url to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1, filling in your deployed endpoint ID. For example, if your Endpoint ID is abc1234, the URL would be https://api.runpod.ai/v2/abc1234/openai/v1.

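    • Before:
    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    • After:
    import os
    from openai import OpenAI
    client = OpenAI(
        api_key=os.environ.get("RUNPOD_API_KEY"),  # assumes your RunPod API Key is exported as RUNPOD_API_KEY
        base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
    )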
  2. Change the model parameter to your deployed model's name whenever using Completions or Chat Completions.

    • Before:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    )
    • After:
    response = client.chat.completions.create(
        model="<YOUR DEPLOYED MODEL REPO/NAME>",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    )
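The base URL format from step 1 is easy to get wrong by hand, so it can help to build it from the endpoint ID. The helper below is a minimal sketch; `runpod_openai_base_url` is a hypothetical name, and it only formats the URL pattern given above.

```python
def runpod_openai_base_url(endpoint_id: str) -> str:
    """Return the OpenAI-compatible base URL for a RunPod Serverless endpoint."""
    endpoint_id = endpoint_id.strip()
    if not endpoint_id:
        raise ValueError("endpoint_id must not be empty")
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"


print(runpod_openai_base_url("abc1234"))
# https://api.runpod.ai/v2/abc1234/openai/v1
```

The returned string can be passed as `base_url` when constructing the OpenAI client, alongside your RunPod API Key as `api_key`.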

Using HTTP requests:

  1. Change the Authorization header to your RunPod API Key and the url to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1
    • Before:
    curl https://api.openai.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
    }'
    • After:
    curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
    -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
    }'

OpenAI Request Input Parameters:

When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:

Chat Completions [RECOMMENDED]

<details> <summary>Supported Chat Completions Inputs and Descriptions</summary>

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| messages | Union[str, List[Dict[str, str]]] | | List of messages, where each message is a dictionary with a role and content. The model's chat template will be applied to the messages automatically, so the model must have one or it should be specified via the CUSTOM_CHAT_TEMPLATE env var. |
| model | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the Examples: Using your RunPod endpoint with OpenAI section. |
| temperature | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
| top_p | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| n | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
| max_tokens | Optional[int] | None | Maximum number of tokens to generate per output sequence. |
| seed | Optional[int] | None | Random seed to use for the generation. |
| stop | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| stream | Optional[bool] | False | Whether to stream or not. |
| presence_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| frequency_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| logit_bias | Optional[Dict[str, float]] | None | Unsupported by vLLM. |
| user | Optional[str] | None | Unsupported by vLLM. |

Additional parameters supported by vLLM:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| best_of | Optional[int] | None | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n. |
| top_k | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| ignore_eos | Optional[bool] | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| use_beam_search | Optional[bool] | False | Whether to use beam search instead of sampling. |
| stop_token_ids | Optional[List[int]] | list | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
| skip_special_tokens | Optional[bool] | True | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | Optional[bool] | True | Whether to add spaces between special tokens in the output. Defaults to True. |
| add_generation_prompt | Optional[bool] | True | Whether to append the chat template's generation prompt (the tokens that mark the start of an assistant reply) to the input. |
| echo | Optional[bool] | False | Echo back the prompt in addition to the completion. |
| repetition_penalty | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
| min_p | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| length_penalty | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
| include_stop_str_in_output | Optional[bool] | False | Whether to include the stop strings in output text. Defaults to False. |

</details>
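The official OpenAI Python client (v1 and later) only accepts the standard parameters as keyword arguments, but it forwards anything passed in `extra_body` into the JSON request. As a sketch, assuming that client is used, the vLLM-specific parameters from the second table can be separated from the standard ones like this; `split_chat_params` and `OPENAI_FIELDS` are hypothetical names.

```python
# Standard Chat Completions parameters from the first table above.
OPENAI_FIELDS = {
    "messages", "model", "temperature", "top_p", "n", "max_tokens", "seed",
    "stop", "stream", "presence_penalty", "frequency_penalty", "logit_bias", "user",
}


def split_chat_params(params: dict) -> tuple:
    """Split parameters into (standard kwargs, vLLM-specific extras)."""
    standard = {k: v for k, v in params.items() if k in OPENAI_FIELDS}
    extra = {k: v for k, v in params.items() if k not in OPENAI_FIELDS}
    return standard, extra


standard, extra = split_chat_params({
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [{"role": "user", "content": "Why is RunPod the best platform?"}],
    "temperature": 0,
    "top_k": 40,
    "repetition_penalty": 1.1,
})
print(sorted(extra))  # ['repetition_penalty', 'top_k']
# With a configured client:
# response = client.chat.completions.create(**standard, extra_body=extra)
```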

Examples: Using your RunPod endpoint with OpenAI

First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:

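from openai import OpenAI
import os

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),  # assumes your RunPod API Key is exported as RUNPOD_API_KEY
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)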

Chat Completions:

This is the format used for GPT-4 and is focused on instruction-following and chat. Examples of Open Source chat/instruct models include meta-llama/Llama-2-7b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1, openchat/openchat-3.5-0106, NousResearch/Nous-Hermes-2-Mistral-7B-DPO and more. However, if your model is a completion-style model with no chat/instruct fine-tune and/or does not have a chat template, you can still use this route if you provide a chat template with the environment variable CUSTOM_CHAT_TEMPLATE.
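A minimal sketch of both call styles follows, written as functions that take an already initialized client so that nothing is sent until you call them. The names `chat_once` and `chat_stream` are hypothetical; the calls themselves use the `chat.completions.create` method of the OpenAI Python client, and the sampling values are illustrative.

```python
from typing import Iterator


def chat_once(client, model: str, messages: list) -> str:
    """Non-streaming: return the full reply text of the first choice."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=100,
    )
    return response.choices[0].message.content


def chat_stream(client, model: str, messages: list) -> Iterator[str]:
    """Streaming: yield text fragments as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=100,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no text (for example the final one), so skip them.
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            yield delta


# Usage with the client initialized above (requires a live endpoint):
# messages = [{"role": "user", "content": "Why is RunPod the best platform?"}]
# print(chat_once(client, "<YOUR DEPLOYED MODEL REPO/NAME>", messages))
# for piece in chat_stream(client, "<YOUR DEPLOYED MODEL REPO/NAME>", messages):
#     print(piece, end="", flush=True)
```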

Getting a list of names for available models:

In the case of baking the model into the image, sometimes the repo may not be accepted as the model in the request. In this case, you can list the available models as shown below and use that name.

models_response = client.models.list()
list_of_models = [model.id for model in models_response]
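Building on the listing above, a small sketch can choose the name to send as `model`. The helper `pick_model_name` is hypothetical, and falling back to the first reported model when the preferred name is not served is an assumption made for illustration, not behavior defined by the worker.

```python
def served_model_names(client) -> list:
    """Return the model IDs reported by the endpoint's Models route."""
    return [model.id for model in client.models.list()]


def pick_model_name(client, preferred: str) -> str:
    """Use the preferred name if the endpoint serves it, else the first served name."""
    names = served_model_names(client)
    if not names:
        raise RuntimeError("the endpoint reported no models")
    return preferred if preferred in names else names[0]


# Usage with an initialized client (requires a live endpoint):
# model_name = pick_model_name(client, "openchat/openchat_3.5")
```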

Usage: Standard (Non-OpenAI)

Request Input Parameters

<details> <summary>Click to expand table</summary>

You may either use a prompt or a list of messages as input. If you use messages, the model's chat template will be applied to the messages automatically, so the model must have one. If you use prompt, you may optionally apply the model's chat template to the prompt by setting apply_chat_template to true.

| Argument | Type | Default | Description |
|---|---|---|---|
| prompt | str | | Prompt string to generate text based on. |
| messages | list[dict[str, str]] | | List of messages, which will automatically have the model's chat template applied. Overrides prompt. |
| apply_chat_template | bool | False | Whether to apply the model's chat template to the prompt. |
| sampling_params | dict | {} | Sampling parameters to control the generation, like temperature, top_p, etc. You can find all available parameters in the Sampling Parameters section below. |
| stream | bool | False | Whether to enable streaming of output. If True, responses are streamed as they are generated. |
| max_batch_size | int | env var DEFAULT_BATCH_SIZE | The maximum number of tokens to stream every HTTP POST call. |
| min_batch_size | int | env var DEFAULT_MIN_BATCH_SIZE | The minimum number of tokens to stream every HTTP POST call. |
| batch_size_growth_factor | float | env var DEFAULT_BATCH_SIZE_GROWTH_FACTOR | The growth factor by which min_batch_size will be multiplied for each call until max_batch_size is reached. |

</details>

Sampling Parameters

Below are all available sampling parameters that you can specify in the sampling_params dictionary. If you do not specify any of these parameters, the default values will be used.

<details> <summary>Click to expand table</summary>

| Argument | Type | Default | Description |
|---|---|---|---|
| n | int | 1 | Number of output sequences generated from the prompt. The top n sequences are returned. |
| best_of | Optional[int] | n | Number of output sequences generated from the prompt. The top n sequences are returned from these best_of sequences. Must be ≥ n. Treated as beam width in beam search. Default is n. |
| presence_penalty | float | 0.0 | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| repetition_penalty | float | 1.0 | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition. |
| temperature | float | 1.0 | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. |
| top_p | float | 1.0 | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| min_p | float | 0.0 | Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| use_beam_search | bool | False | Whether to use beam search instead of sampling. |
| length_penalty | float | 1.0 | Penalizes sequences based on their length. Used in beam search. |
| early_stopping | Union[bool, str] | False | Controls stopping condition in beam search. Can be True, False, or "never". |
| stop | Union[None, str, List[str]] | None | List of strings that stop generation when produced. The output will not contain these strings. |
| stop_token_ids | Optional[List[int]] | None | List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens. |
| ignore_eos | bool | False | Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation. |
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. |
| skip_special_tokens | bool | True | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | bool | True | Whether to add spaces between special tokens in the output. |

</details>
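A few of the constraints stated in the table (best_of ≥ n, top_p in (0, 1], min_p in [0, 1]) can be checked on the client before a request is sent. The function below is a sketch covering only those three documented rules; `check_sampling_params` is a hypothetical name, and the worker may enforce further rules of its own.

```python
def check_sampling_params(params: dict) -> list:
    """Return a list of violations of the constraints documented in the table."""
    errors = []

    n = params.get("n", 1)
    best_of = params.get("best_of")  # defaults to n when omitted
    if best_of is not None and best_of < n:
        errors.append("best_of must be >= n")

    top_p = params.get("top_p", 1.0)
    if not (0 < top_p <= 1):
        errors.append("top_p must be in (0, 1]")

    min_p = params.get("min_p", 0.0)
    if not (0 <= min_p <= 1):
        errors.append("min_p must be in [0, 1]")

    return errors


print(check_sampling_params({"n": 2, "best_of": 1, "top_p": 0}))
print(check_sampling_params({"temperature": 0.7, "max_tokens": 256}))
```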

Text Input Formats

You may either use a prompt or a list of messages as input.

  1. prompt: The prompt string can be any string. The model's chat template is not applied to it unless apply_chat_template is set to true, in which case the prompt is treated as a user message.

    "prompt": "..."

  2. messages: Your list can contain any number of messages, and each message can usually have any of the following roles:

    • user
    • assistant
    • system

    However, some models use different roles, so check the model's chat template to see which roles it expects.

    The model's chat template will be applied to the messages automatically, so the model must have one.

    "messages": [
      {
        "role": "system",
        "content": "..."
      },
      {
        "role": "user",
        "content": "..."
      },
      {
        "role": "assistant",
        "content": "..."
      }
    ]

Worker Config

The worker config is a JSON file that is used to build the form that helps users configure their serverless endpoint on the RunPod Web Interface.

Note: This is a new feature and only works for workers that use one model.

Writing your worker-config.json

The JSON consists of two main parts, schema and versions.

Example of schema

{
  "schema": {
    "TOKENIZER": {
      "env_var_name": "TOKENIZER",
      "value": "",
      "title": "Tokenizer",
      "description": "Name or path of the Hugging Face tokenizer to use.",
      "required": false,
      "type": "text"
    },
    "TOKENIZER_MODE": {
      "env_var_name": "TOKENIZER_MODE",
      "value": "auto",
      "title": "Tokenizer Mode",
      "description": "The tokenizer mode.",
      "required": false,
      "type": "select",
      "options": [
        { "value": "auto", "label": "auto" },
        { "value": "slow", "label": "slow" }
      ]
    }
  }
}

Example of versions

{
  "versions": {
    "0.5.4": {
      "imageName": "runpod/worker-v1-vllm:v1.2.0stable-cuda12.1.0",
      "minimumCudaVersion": "12.1",
      "categories": [
        {
          "title": "LLM Settings",
          "settings": [ ... ]
        },
        {
          "title": "Tokenizer Settings",
          "settings": [ ... ]
        }
      ]
    }
  }
}