Awesome
ArkVale: Efficient Gener<ins>a</ins>tive LLM Inference with <ins>R</ins>ecallable <ins>K</ins>ey-<ins>Val</ins>ue <ins>E</ins>viction
[Link] [Paper] [Poster] [Slides]
Download
git clone https://github.com/pku-liang/ArkVale.git --recursive
or
git clone https://github.com/pku-liang/ArkVale.git
cd ArkVale
git submodule update --init --recursive --depth 1
Install
pip install -r requirements.txt
cd source && python3 setup.py [develop|install]
Usage
from transformers import AutoModelForCausalLM
from arkvale import adapter
path = ...
dev = torch.device("cuda:0")
dtype = torch.float16
model = (
AutoModelForCausalLM
.from_pretrained(path, torch_dtype=dtype, device_map=dev)
.eval()
)
adapter.enable_arkvale(
model,
dtype=dtype,
device=dev,
page_size=32,
# page_budgets=None, # page_budgets=None means "full" (no eviction & recall)
page_budgets=4096 // 32,
page_topks=32,
n_max_bytes=40 * (1 << 30),
n_max_cpu_bytes=80 * (1 << 30),
)
...