Rust Candle Demo

An interactive command-line tool that demonstrates how to use Candle, Hugging Face's Rust ML framework, to run an LLM.

By default, this demo uses a quantized version of the OpenChat 3.5 LLM: https://huggingface.co/TheBloke/openchat_3.5-GGUF

Prepare

Make sure the Hugging Face CLI is installed. If it is not, install it:

pip install -U "huggingface_hub[cli]"

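Then download the quantized model file, along with the tokenizer.json file from the original openchat repository. In the commands below, HF_ENDPOINT points the download at a mirror and can be omitted to use the default Hugging Face endpoint. HF_HUB_ENABLE_HF_TRANSFER=1 requires the hf_transfer Python package, so either install it or drop that variable: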

mkdir hf_hub
HF_HUB_ENABLE_HF_TRANSFER=1 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download TheBloke/openchat_3.5-GGUF openchat_3.5.Q8_0.gguf  --local-dir hf_hub
HF_HUB_ENABLE_HF_TRANSFER=1 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download openchat/openchat_3.5 tokenizer.json --local-dir hf_hub
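
Before running the demo, it can help to confirm that the model download is complete and is really a GGUF file. The sketch below is not part of this repository. It relies only on the GGUF layout, in which a file starts with the 4-byte magic "GGUF" followed by a little-endian u32 version. The function names (has_gguf_magic, gguf_version, check_file) are hypothetical, and the path is assumed from the download command above.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Returns true when `header` starts with the 4-byte GGUF magic.
fn has_gguf_magic(header: &[u8]) -> bool {
    header.starts_with(b"GGUF")
}

/// Returns the format version (a little-endian u32 stored right after the
/// magic), or None if the header is too short or the magic does not match.
fn gguf_version(header: &[u8]) -> Option<u32> {
    if header.len() < 8 || !has_gguf_magic(header) {
        return None;
    }
    Some(u32::from_le_bytes([header[4], header[5], header[6], header[7]]))
}

/// Reads the first 8 bytes of a file and reports its GGUF version, if any.
fn check_file(path: &Path) -> io::Result<Option<u32>> {
    let mut header = [0u8; 8];
    File::open(path)?.read_exact(&mut header)?;
    Ok(gguf_version(&header))
}

fn main() {
    // Assumed path: the --local-dir used in the download command above.
    let path = Path::new("hf_hub/openchat_3.5.Q8_0.gguf");
    match check_file(path) {
        Ok(Some(v)) => println!("{} looks like GGUF v{}", path.display(), v),
        Ok(None) => println!("{} is not a GGUF file", path.display()),
        Err(e) => println!("could not read {}: {}", path.display(), e),
    }
}
```

A file that fails this check is most likely a truncated download or an error page saved in its place, and downloading it again is usually the simplest fix.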

Run

There are two example binaries, simple and cli:

cargo run --release --bin simple
cargo run --release --bin cli -- --model=xxxxxxx --tokenizer=xxxx

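Use --help to list the configurable parameters. Note that the defaults in the output below point to ../hf_hub/openchat_3.5.Q8_0.gguf and ../hf_hub/openchat_3.5_tokenizer.json, while the download commands in the Prepare section save the tokenizer as hf_hub/tokenizer.json, so you will likely need to pass --model and --tokenizer explicitly or rename the file.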

$ cargo run --release --bin cli -- --help
    Finished release [optimized] target(s) in 0.04s
     Running `target/release/cli --help`
avx: false, neon: false, simd128: false, f16c: false
Usage: cli [OPTIONS]

Options:
      --tokenizer <TOKENIZER>            [default: ../hf_hub/openchat_3.5_tokenizer.json]
      --model <MODEL>                    [default: ../hf_hub/openchat_3.5.Q8_0.gguf]
  -n, --sample-len <SAMPLE_LEN>          [default: 1000]
      --temperature <TEMPERATURE>        [default: 0.8]
      --seed <SEED>                      [default: 299792458]
      --repeat-penalty <REPEAT_PENALTY>  [default: 1.1]
      --repeat-last-n <REPEAT_LAST_N>    [default: 64]
      --gqa <GQA>                        [default: 8]
  -h, --help                             Print help
  -V, --version                          Print version
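
To give a feel for what --temperature, --repeat-penalty and --repeat-last-n control, here is a minimal std-only sketch of how such knobs are commonly applied to a model's output logits before sampling. It is not taken from this repository's source, and the function names and example values are assumptions. The penalty rule shown (divide non-negative logits, multiply negative ones) is one widely used convention.

```rust
use std::collections::HashSet;

/// Lowers the logits of tokens that appear in `recent`, each token once.
/// Non-negative logits are divided by `penalty`, negative ones multiplied,
/// so a penalty above 1.0 always makes a repeated token less likely.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, recent: &[usize]) {
    let mut seen = HashSet::new();
    for &tok in recent {
        if tok < logits.len() && seen.insert(tok) {
            let l = logits[tok];
            logits[tok] = if l >= 0.0 { l / penalty } else { l * penalty };
        }
    }
}

/// Converts logits to probabilities after scaling by `temperature`
/// (assumed > 0). Lower temperatures sharpen the distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Toy logits over a 4-token vocabulary and a short generation history.
    let mut logits = vec![2.0, 1.0, -1.0, 0.5];
    let history = [0usize, 2, 0];

    // Only the last `repeat_last_n` generated tokens are penalized.
    let repeat_last_n = 64;
    let start = history.len().saturating_sub(repeat_last_n);
    apply_repeat_penalty(&mut logits, 1.1, &history[start..]);

    let probs = softmax_with_temperature(&logits, 0.8);
    println!("penalized logits: {:?}", logits);
    println!("probabilities:    {:?}", probs);
}
```

In a sampler of this kind, the next token is then drawn from the resulting probabilities with a random number generator, and fixing the generator's seed (as --seed suggests) makes a run reproducible.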

License

None.

Feedback

Feel free to submit issues to this repository.