Home

Awesome

English 中文

News

About

Features

Production Proven

Applied in numerous LLM scenarios, such as:

High Performance

Flexibility and Ease of Use

Advanced Acceleration Techniques

How to Use

Requirements

Startup example

  1. docker
cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh

cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
  1. whl
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the corresponding whl from the release version, here's an example for the cuda11 version 0.1.0, for the cuda12 whl package please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service

cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'

Docker Relelase Note

FAQ

  1. libcufft.so

    Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory

    Resolution: Please check whether cuda and rtp-llm versions are matched

  2. libth_transformer.so

    Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory

    Resolution: If installed via whl or docker(which means not a bazel build), please check your current directory is not rtp-llm, or python will use relative path package instead of installed whl

  3. Bazel build time out

    Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)

    Resolution:

    1. change pip mirror repository in open_source/deps/pip.bzl, add extra_pip_args=["--index_url=xxx"]
    2. pip install requirements manually, especially for pytorch, for that bazel build has a 600s timeout by default, which may not be enough for pytorch downloading
  4. Curl error Error log: thread '<unnamed>' panicked at 'index out of bounds: the len is 1 but the index is 1', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.8.1/src/dfa.rs:1415:45

    Resolution: upgrade tiktoken to 0.7.0

Documentation

Acknowledgments

Our project is mainly based on FasterTransformer, and on this basis, we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM have provided us with reliable performance guarantees. Flash-Attention2 and cutlass have also provided a lot of help in our continuous performance optimization process. Our continuous batching and increment decoding draw on the implementation of vllm; sampling draws on transformers, with speculative sampling integrating Medusa's implementation, and the multimodal part integrating implementations from llava and qwen-vl. We thank these projects for their inspiration and help.

External Application Scenarios (Continuously Updated)

Supported Model List

LLM

LLM + Multimodal

Contact Us

DingTalk Group

<img src=picture/dingding.png width="200px">

WeChat Group

<img src=picture/weixin.png width="200px">