InferLLM

Chinese README

InferLLM is a lightweight LLM inference framework that mainly references and borrows from the llama.cpp project. llama.cpp puts almost all of its core code and kernels in a single file and uses a large number of macros, which makes it difficult for developers to read and modify. InferLLM has the following features:

In short, InferLLM is a simple and efficient CPU inference framework for LLMs that can deploy quantized models locally with good inference speed.

Latest News

How to use

Download model

Currently, InferLLM uses the same models as llama.cpp, so models can be downloaded from the llama.cpp project. Models can also be downloaded directly from Hugging Face at kewin4933/InferLLM-Model. That repository currently hosts alpaca, llama2, chatglm/chatglm2 and baichuan models; of the two alpaca models, one is a Chinese int4 model and the other is an English int4 model.

Compile InferLLM

Local compilation

mkdir build
cd build
cmake ..
make

GPU support is disabled by default. To enable it, configure with cmake -DENABLE_GPU=ON .. instead. Only CUDA is supported at present, so install the CUDA toolkit before building with GPU support.
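The configure and build steps above can be wrapped in a small shell helper. This is a minimal sketch, not part of the project: the function names configure_cmd and build_inferllm are hypothetical, and only the -DENABLE_GPU=ON option comes from the text above.

```shell
# Print the cmake configure command for a given GPU setting.
# Usage: configure_cmd [ON|OFF]   (defaults to OFF, the documented default)
configure_cmd() {
    gpu="${1:-OFF}"
    if [ "$gpu" = "ON" ]; then
        # Requires the CUDA toolkit to be installed first.
        echo "cmake -DENABLE_GPU=ON .."
    else
        echo "cmake .."
    fi
}

# Hypothetical wrapper: configure and build inside ./build.
# Runs in a subshell so the caller's working directory is unchanged.
build_inferllm() {
    (
        mkdir -p build && cd build || exit 1
        # Left unquoted on purpose so the printed command splits into words.
        $(configure_cmd "$1") && make
    )
}
```

Calling build_inferllm from the repository root builds for CPU, and build_inferllm ON builds with CUDA enabled.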

Android cross compilation

For Android cross compilation, you can use the provided tools/android_build.sh script. Install the NDK in advance and set the NDK_ROOT environment variable to its path.

export NDK_ROOT=/path/to/ndk
./tools/android_build.sh
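Because the script depends on NDK_ROOT, it can help to validate the variable before starting the build. The following is a sketch under that assumption; check_ndk_root and android_build are hypothetical helper names, and the check only confirms that the path is an existing directory, not that it holds a valid NDK.

```shell
# Return 0 only when NDK_ROOT is set and points at an existing directory.
check_ndk_root() {
    if [ -z "${NDK_ROOT:-}" ]; then
        echo "NDK_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$NDK_ROOT" ]; then
        echo "NDK_ROOT is not a directory: $NDK_ROOT" >&2
        return 1
    fi
    return 0
}

# Hypothetical wrapper: validate first, then run the build script
# from the repository root.
android_build() {
    check_ndk_root && ./tools/android_build.sh
}
```

With this in place, android_build stops with a short message instead of failing partway through the cross compilation.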

Run InferLLM

To run the ChatGLM model, please refer to the ChatGLM model documentation.

To run locally, execute ./chatglm -m chatglm-q4.bin -t 4 directly. To run on a mobile phone, use the adb command to copy the executable and the model file to the phone, and then execute adb shell ./chatglm -m chatglm-q4.bin -t 4.
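The phone workflow can be spelled out as a short sequence of adb commands. The helper below only prints them (a dry run) so they can be reviewed first. It is a sketch with assumptions: adb_deploy_cmds is a hypothetical name, /data/local/tmp is an assumed writable device directory, and the chmod +x step is a precaution that may not be needed on every device.

```shell
# Print the adb commands that copy the executable and model to a device
# and run the model there. Nothing is executed; the output is for review.
# $1: executable name, $2: model file, $3: device directory (assumed default)
adb_deploy_cmds() {
    exe="$1"
    model="$2"
    dir="${3:-/data/local/tmp}"
    echo "adb push $exe $dir/"
    echo "adb push $model $dir/"
    # cd first so the relative paths resolve inside the target directory.
    echo "adb shell \"cd $dir && chmod +x $exe && ./$exe -m $model -t 4\""
}
```

For example, adb_deploy_cmds chatglm chatglm-q4.bin prints three commands that can be pasted into a terminal once a device is connected.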

The default device is CPU. To run inference on the GPU, use ./chatglm -m chatglm-q4.bin -g GPU to select the GPU device.

Based on x86 profiling results, we strongly recommend using 4 threads.
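The run options mentioned above (-m for the model, -t for the thread count, -g for the device) can be assembled by a small helper. This is an illustrative sketch: run_cmd is a hypothetical name, and passing -t together with -g GPU is an assumption, since the text above shows the two options only separately.

```shell
# Print the run command for a model, thread count, and device.
# Defaults: 4 threads (the recommended value) and CPU (the default device).
# $1: model file, $2: thread count, $3: CPU or GPU
run_cmd() {
    model="$1"
    threads="${2:-4}"
    device="${3:-CPU}"
    cmd="./chatglm -m $model -t $threads"
    if [ "$device" = "GPU" ]; then
        # Assumes the binary was built with -DENABLE_GPU=ON.
        cmd="$cmd -g GPU"
    fi
    echo "$cmd"
}
```

For example, run_cmd chatglm-q4.bin prints the 4-thread CPU command, and run_cmd chatglm-q4.bin 4 GPU appends the GPU selection.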

Supported model

InferLLM currently supports the following models:

License

InferLLM is licensed under the Apache License, Version 2.0.