Awesome

🚀 TensorRT-LLM as OpenAI API on Windows 🤖

Drop-in replacement REST API compatible with OpenAI API spec using TensorRT-LLM as the inference backend.

Setup a local Llama 2 or Code Llama web server using TRT-LLM for compatibility with the OpenAI Chat and legacy Completions API. This enables accelerated inference on Windows natively, while retaining compatibility with the wide array of projects built using the OpenAI API.

Follow this README to setup your own web server for Llama 2 and Code Llama.

Getting Started

Ensure you have the pre-requisites in place:

Install TensorRT-LLM for Windows using the instructions here.
Ensure you have access to the Llama 2 repository on Huggingface
- CodeLlama-13b-Instruct-hf
- Llama-2-13b-chat-hf
In this repo, we provide instructions to set up an OpenAI API compatible server with either the LLama 2 13B or Code Llama 13B model, both optimized using AWQ 4-bit quantization. To begin, it's necessary to compile a TensorRT Engine tailored to your specific GPU. Refer to the given instructions for constructing your TRT Engine instructions.

<h3 id="building-trt-engine">Building TRT Engine</h3>

Follow these steps to build your TRT engine:

Download models and quantized weights

CodeLlama-13B-instruct AWQ int
- Download CodeLlama 13B model from CodeLlama-13b-Instruct-hf
- Download CodeLlama 13B AWQ int4 checkpoints from here
Llama-2-13b-chat AWQ int4
- Download Llama-2-13b-chat model from Llama-2-13b-chat-hf
- Download Llama-2-13b-chat AWQ int4 checkpoints from here

Clone the TensorRT LLM repository:

git clone https://github.com/NVIDIA/TensorRT-LLM.git

For Code Llama engine, navigate to the examples\llama directory and run the following script:

python build.py --model_dir <path to CodeLlama model> --quant_ckpt_path <path to CodeLlama .npz file> --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --max_batch_size 1 --max_input_len 15360 --max_output_len 1024 --output_dir <TRT engine folder> --rotary_base 1000000 --vocab_size 32064

For Llama2 engine, navigate to the examples\llama directory and run the following script:

python build.py --model_dir <path to llama13_chat model> --quant_ckpt_path <path to Llama2 .npz file> --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3500 --max_output_len 1024 --output_dir <TRT engine folder>

<h3 id="setup"> Setup Steps </h3>

Clone this repository:

https://github.com/NVIDIA/trt-llm-as-openai-windows
cd trt-llm-as-openai-windows

Download the tokenizer and config.json from HuggingFace and place them in the model/ directory.
- CodeLlama-13b-Instruct-hf
- Llama-2-13b-chat-hf.
Build the TRT engine by following the instructions provided here and place the TensorRT engine files for the Llama2/CodeLlama model in the 'model/engine' directory
Install the necessary libraries:
```
pip install -r requirements.txt
```

Launch the application using the following command:

Llama-2-13B-chat model

python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokernizer folder> --port <optional port>

CodeLlama-13B-instruct model needs additional parameter mentioned below to the command above :

--no_system_prompt True

In our case, that will be (for CodeLlama):

python app.py --trt_engine_path model/ --trt_engine_name llama_float16_tp1_rank0.engine --tokenizer_dir_path model/ --port 8081 --no_system_prompt True

Test the API

Install the 'openai' client library in your Python environment.
```
pip install openai==0.28
```
Run the following code inside your Python env.
Set a random API key and the base URL.

<pre> openai.api_key="ABC" openai.api_base="http://127.0.0.1:8081" response = openai.ChatCompletion.create( model = "Llama2", prompt = "Hello! How are you?") print(response) </pre>

Detailed Command References

python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokernizer folder> --port <port>

Arguments

Name	Details
--trt_engine_path <>	Directory of TensorRT engine (built TRT engine using instructions)
--trt_engine_name <>	Engine file name (e.g. llama_float16_tp1_rank0.engine)
--tokenizer_dir_path <>	HF downloaded model files for tokenizer & config.json e.g. Llama-2-13b-chat-hf or CodeLlama-13b-Instruct-hf
--port <>	OpenAI compatible server hosted on localhost and 8081 port as default. Optionally, allows to specify a different port.
--max_output_tokens	Optional override to maximum output token sizes otherwise it defaults to 2048
--max_input_tokens	Optional override to maximum input token sizes otherwise it defaults to 2048
--no_system_prompt	App uses default system prompt and optionally supported to avoid it.

<h3 id="supported-apis">Supported APIs</h3>

/completions
/chat/completions
/v1/completions
/v1/chat/completions

<h2 id="use-cases">Examples</h3> <h3> <a href="https://continue.dev">Continue.dev</a> Visual Studio Code Extension with CodeLlama-13B </h3>

Run this app with CodeLlama-13B-instruct AWQ int4 model as described above.
Install Continue.dev from Visual Studio Marketplace
Configure to use OpenAI API compatible local inference from UI
1. Open Continue.dev plugin from Visual Studio Code left panel
2. Click "+" to add new model
3. Select "Other OpenAI-compatible API"
4. Expand "Advanced (optional)"
  1. apiBase: update to local host url like http://localhost:8081/v1
  2. update contextLength: 16384
5. Select CodeLlama 13b instruct option

Alternatively config.json can be modified directly to include below

Open c:\users\<user>\.continue\config.json in any editor

Add below model config

{
   "models": [
      {
         "title": "CodeLlama-13b-Instruct",
         "provider": "openai",
         "model": "codellama:13b-instruct",
         "apiBase": "http://localhost:8081",
         "contextLength": 16384,
      }
   ]
}

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.