Home

Awesome

ZhiLight大模型推理引擎

ZhiLight ✨is a highly optimized LLM inference engine developed by Zhihu and ModelBest Inc. The "Zhi" in its name stands for Zhihu. ZhiLight can accelerate the inference of models like Llama and its variants, especially on PCIe-based GPUs. Compared to mainstream open-source inference engines, for example, vllm, it has significant performance advantages.

🎉🎉 Main Features

🔧 Basic Usage

# Concurrently compile the wheel package, and turn off the unit test
CMAKE_BUILD_PARALLEL_LEVEL=32 TESTING=0 python setup.py bdist_wheel

# Compile with ninja backend
CMAKE_GENERATER="Ninja" python setup.py bdist_wheel

# Install directly
cd ./ZhiLight && pip install -e .

# Start OpenAI compatible server
python -m zhilight.server.openai.entrypoints.api_server [options]

✈️ Docker Image

ZhiLight only depends on the CUDA runtime, cuBLAS, NCCL, and a few Python packages in requirements.txt. You can use the image below for running or building it. You can also directly refer to docker/Dockerfile.

docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124

📈 Performance Notes

We conducted performance reviews on various mainstream NVIDIA GPUs with different model sizes and precisions. For dense models ranging from 2B to 110B parameters on PCIe devices, ZhiLight demonstrates significant performance advantages compared to mainstream open-source inference engines.

Test Description:

MiniCPM-2B-sft-bf16

Inference EngineQPSTTFT MeanTTFT P95TPOT MeanTPOT P95
vLLM1.67527.551062.9616.7131.95
SGLang1.67466.191181.533.9659.44
ZhiLight1.67434.64989.0326.161.14

Qwen2-72B-Instruct-GPTQ-Int4

Inference EngineQPSTTFT MeanTTFT P95TPOT MeanTPOT P95
vLLM0.183493.976852.0735.4761.74
SGLang0.182276.13820.738.1265.16
ZhiLight0.181111.81882.526.7541.81
Inference EngineQPSTTFT MeanTTFT P95TPOT MeanTPOT P95
vLLM0.181457.652136.522.1428.96
SGLang0.361113.061850.5730.4143.65
ZhiLight0.181227.371968.9531.9548.53

Qwen1.5-110B-Chat-GPTQ-Int4

Inference EngineQPSTTFT MeanTTFT P95TPOT MeanTPOT P95
vLLM0.093085.744274.0330.3444.08
SGLang0.092418.563187.7331.3953.1
ZhiLight0.181671.382669.8239.6864.35
Inference EngineQPSTTFT MeanTTFT P95TPOT MeanTPOT P95
vLLM0.091899.072719.5923.833.02
SGLang0.181514.492135.7528.547.28
ZhiLight0.11574.852086.827.0738.82

more benchmarks can be found in benchmarks.md

License

Apache License 2.0

Contributors