Web-RWKV

<p align='center'><image src="assets/logo-ba.png"></p>

Web-RWKV is an inference engine for the RWKV language model, implemented in pure WebGPU.

Features

<p align='center'> <image src="screenshots/chat.gif"> <image src="screenshots/batch.gif"> </p>

Note that web-rwkv is only an inference engine. It only provides the following functionalities:

It does not provide the following:

Compile

  1. Install Rust.
  2. Download the model from HuggingFace, and convert it using convert_safetensors.py. Put the .st model under assets/models.
  3. Compile
    $ cargo build --release --examples
    

Examples

Performance Test

The test generates 500 tokens and measures the time cost.

$ cargo run --release --example rt-gen
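
To make "time cost" concrete, the following is a minimal sketch of how a throughput figure can be derived from such a measurement. It uses only the standard library and is not the code of the rt-gen example; `tokens_per_second`, `time_generation`, and the `step` callback are hypothetical names introduced for illustration.

```rust
use std::time::{Duration, Instant};

/// Tokens generated per second of wall-clock time.
fn tokens_per_second(num_tokens: usize, elapsed: Duration) -> f64 {
    num_tokens as f64 / elapsed.as_secs_f64()
}

/// Calls `step` once per token and returns the total elapsed time.
/// `step` is a hypothetical stand-in for one generation step.
fn time_generation(num_tokens: usize, mut step: impl FnMut(usize)) -> Duration {
    let start = Instant::now();
    for index in 0..num_tokens {
        step(index);
    }
    start.elapsed()
}
```

For instance, 500 tokens generated in 10 seconds corresponds to 50 tokens per second.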

Chat Demo

To chat with the model, run

$ cargo run --release --example rt-chat

In this demo, type + to retry the last round's generation; type - to exit.

Batched Inference

This demo generates 4 batches of text of various lengths simultaneously.

$ cargo run --release --example rt-batch

Inspector

The inspector demo is a guide to an advanced feature called hooks. Hooks allow the user to inject arbitrary tensor ops into the model's inference process, fetching and modifying the contents of the runtime buffer, the state, and even the model parameters. Hooks enable certain third-party implementations such as dynamic LoRA, ControlNet, and so on.

(De)serialization

All versions of models implement serde::ser::Serialize and serde::de::DeserializeSeed<'de>, which means that one can save a quantized or LoRA-merged model to a file and load it afterwards.

Use in Your Project

To use the crate in your own Rust project, add web-rwkv = "0.8" as a dependency in your Cargo.toml. See the examples for how to create the environment and the tokenizer, and how to run the model.

Explanations

Inference Runtime

Since v0.7, the crate has a runtime feature. When it is enabled, applications can use the infrastructure of the asynchronous runtime API.

In general, a runtime is an asynchronous task that is driven by tokio. It allows the CPU and the GPU to work in parallel, maximizing the utilization of GPU computing resources.

Check examples starting with rt for more information, and compare the generation speed with their non-rt counterparts.
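
The idea of a runtime as a background task can be illustrated without the crate. The sketch below is an assumption-laden stand-in: it uses a standard-library thread and channels instead of tokio, and `Job`, `fake_infer`, `spawn_worker`, and `submit` are hypothetical names that are not part of web-rwkv. It only shows the pattern of submitting work, continuing on the caller's side, and collecting the result later.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// A job for the background worker: input tokens plus a channel for the reply.
struct Job {
    tokens: Vec<u32>,
    reply: Sender<u32>,
}

/// Hypothetical stand-in for GPU inference: it only sums the tokens.
fn fake_infer(tokens: &[u32]) -> u32 {
    tokens.iter().sum()
}

/// Spawns the worker and returns a handle for submitting jobs.
/// The worker exits once every `Sender<Job>` has been dropped.
fn spawn_worker() -> Sender<Job> {
    let (sender, receiver): (Sender<Job>, Receiver<Job>) = channel();
    thread::spawn(move || {
        for job in receiver {
            // Ignore the error if the caller no longer waits for the reply.
            let _ = job.reply.send(fake_infer(&job.tokens));
        }
    });
    sender
}

/// Submits a job and returns immediately with a receiver for the result.
fn submit(worker: &Sender<Job>, tokens: Vec<u32>) -> Receiver<u32> {
    let (reply, result) = channel();
    worker
        .send(Job { tokens, reply })
        .expect("worker thread has stopped");
    result
}
```

Because `submit` returns before the work is done, the caller can prepare the next input (for example, sampling or tokenizing) while the worker is busy, and call `recv()` on the returned receiver only when the result is needed.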

Batched Inference

Since v0.2.4, the engine supports batched inference, i.e., inference of a batch of prompts (of different lengths) in parallel. This is achieved by a modified WKV kernel.

When building the model, the user specifies token_chunk_size (default: 32, but for powerful GPUs this could be much higher), which is the maximum number of tokens the engine can process in one run() call.

After creating the model, the user creates a ModelState with num_batch specified. This means that there are num_batch slots that can consume inputs in parallel.

Before calling run(), the user fills each slot with some tokens as prompt. If a slot is empty, no inference will be run for it.

After calling run(), some (but possibly not all) input tokens are consumed, and logits appear in the corresponding returned slots if the inference of that slot finished during this run. Since only token_chunk_size tokens are processed during each run() call, the results may contain no logits at all.
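
The slot and chunk behaviour described above can be sketched as a toy simulation. This is not the engine's code: the `Slots` type and this `run` function are hypothetical, the "logits" are replaced by a placeholder (the last consumed token), and the policy of visiting slots in order until the budget runs out is an assumption made for simplicity.

```rust
use std::collections::VecDeque;

/// One pending prompt per slot; an empty queue means the slot is idle.
type Slots = Vec<VecDeque<u32>>;

/// Toy stand-in for one `run()` call.
/// Consumes at most `token_chunk_size` tokens in total, visiting slots in order
/// (an assumed policy). A slot yields `Some(output)` only if its prompt was
/// fully consumed during this call.
fn run(slots: &mut Slots, token_chunk_size: usize) -> Vec<Option<u32>> {
    let mut budget = token_chunk_size;
    let mut outputs = vec![None; slots.len()];
    for (index, slot) in slots.iter_mut().enumerate() {
        let take = slot.len().min(budget);
        if take == 0 {
            continue; // empty slot, or no budget left in this call
        }
        // Placeholder for logits: the last token consumed from this slot.
        let last = slot.drain(..take).last();
        budget -= take;
        if slot.is_empty() {
            outputs[index] = last;
        }
    }
    outputs
}
```

With three slots holding 3, 0, and 5 tokens and a chunk size of 4, the first call finishes only the first slot, the second call finishes the third slot, and the empty slot never produces output. A single prompt longer than the chunk size yields no output at all on the first call.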

Hooks

Hooks are a very powerful tool for customizing the model inference process. The library provides the Model::run_with_hooks function, which takes a HookMap as a parameter.

A HookMap is essentially a hashmap from Model::Hook to functions. A Model::Hook defines a certain place where a hook function can be injected. A model generally has dozens of hooking points. A hook function has the type Fn(&Model<'_>, &ModelState, &Runtime) -> Result<TensorOp, TensorError>; inside it you can create tensor ops that read or write any of the tensors passed in.

An example that reads out every layer's output:

let info = model.info();
// create a buffer to store each layer's output
let buffer = Buffer::new(&context, &info);
let mut hooks = HookMap::default();
for layer in 0..info.num_layer {
    let buffer = buffer.clone();
    hooks.insert(
        v5::Hook::PostFfn(layer),
        Box::new(
            move |_model, _state, runtime: &v5::Runtime| -> Result<TensorOp, TensorError> {
                // figure out how many tokens this run has
                let shape = runtime.ffn_x.shape();
                let num_token = shape[1];
                // "steal" the layer's output (activation), and put it into our buffer
                TensorOp::blit(
                    runtime.ffn_x.view(.., num_token - 1, .., ..)?,
                    buffer.ffn_x.view(.., layer, .., ..)?,
                )
            },
        ),
    );
}
let output = model.run_with_hooks(&mut tokens, &state, &hooks).await?;
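
For readers who want to experiment with the pattern without a GPU, the following is a self-contained toy version of the hook-map idea. It is a sketch under stated assumptions, not the crate's implementation: `Hook`, `HookFn`, `HookMap`, and `run_with_hooks` are redefined here as hypothetical stand-ins, the activation is a plain `Vec<f32>` instead of a GPU tensor, each "layer" merely doubles its input, and a hook mutates the activation directly instead of returning a TensorOp.

```rust
use std::collections::HashMap;

/// Hypothetical hooking points; the real crate defines its own per model version.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Hook {
    PreFfn(usize),
    PostFfn(usize),
}

/// A hook sees the current activation and may read or modify it.
type HookFn = Box<dyn Fn(&mut Vec<f32>)>;
type HookMap = HashMap<Hook, HookFn>;

/// Toy forward pass: each layer doubles the activation, with hooks run around it.
fn run_with_hooks(num_layer: usize, input: Vec<f32>, hooks: &HookMap) -> Vec<f32> {
    let mut x = input;
    for layer in 0..num_layer {
        if let Some(hook) = hooks.get(&Hook::PreFfn(layer)) {
            hook(&mut x);
        }
        x.iter_mut().for_each(|value| *value *= 2.0);
        if let Some(hook) = hooks.get(&Hook::PostFfn(layer)) {
            hook(&mut x);
        }
    }
    x
}
```

Registering a closure under `Hook::PostFfn(layer)` that copies the activation into a shared buffer mirrors the layer-output example above, while a closure under `Hook::PreFfn(layer)` that edits the activation shows how a hook can change the result of later layers.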

Convert Models

If you are building from source, you must download a model and put it in assets/models before running. You can now download the converted models here.

You may download the official RWKV World series models from HuggingFace, and convert them via the provided convert_safetensors.py.

$ python convert_safetensors.py --input /path/to/model.pth --output /path/to/model.st

If you don't have Python installed or don't want to install it, there is a pure Rust converter. You can clone that repo and run

$ cd /path/to/web-rwkv-converter
$ cargo run --release --example converter -- --input /path/to/model.pth --output /path/to/model.st

Troubleshoot

Credits