SKVQ: Sliding-window Key and Value Cache Quantization

This is the official implementation of SKVQ.

<div class="autocb" style="text-align:center;"><img src="./media/score.jpg" style="zoom: 50%;box-shadow: rgba(0, 0, 0, 0.5) 10px 10px 10px; border-radius: 10px;" /></div>

SKVQ achieves extremely low-bit quantization of the KV cache with minimal accuracy loss by exploiting the locality of the attention module.

<div class="autocb" style="text-align:center;"><img src="./media/overview.jpg" style="zoom: 50%;box-shadow: rgba(0, 0, 0, 0.5) 10px 10px 10px; border-radius: 10px;" /></div>

Usage

  1. Environment:

    conda create -n skvq python=3.10
    conda activate skvq
    pip install -r requirements.txt
    # install the CUDA extension
    cd kernels && pip install -e .
    
  2. Calibration:

    python calibration.py --model [MODEL]
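
     Calibration runs a small dataset through the model to collect
     per-channel key/value statistics, which drive the channel reordering
     and clipping applied at inference time. A rough sketch of the kind of
     statistics involved (names are illustrative, not the repo's API):

        import torch

        def collect_channel_stats(kv_batches):
            # Accumulate per-channel min/max over calibration batches.
            # Channels with similar ranges can then be grouped together
            # (reordering), and a clip ratio chosen to tame outliers.
            mins = maxs = None
            for kv in kv_batches:  # each tensor: [tokens, channels]
                b_min, b_max = kv.min(dim=0).values, kv.max(dim=0).values
                mins = b_min if mins is None else torch.minimum(mins, b_min)
                maxs = b_max if maxs is None else torch.maximum(maxs, b_max)
            # sort channels by dynamic range -> a grouping permutation
            order = torch.argsort(maxs - mins)
            return order, mins, maxs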
    
  3. Test:

    • PPL:

      python eval_ppl.py --model [MODEL]
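
      The reported perplexity is the standard exp of the mean token-level
      negative log-likelihood. A generic sketch of that computation for an
      HF-style model (not the script's exact loop):

        import torch
        import torch.nn.functional as F

        @torch.no_grad()
        def perplexity(model, input_ids, chunk=2048):
            # exp(mean token NLL) over the sequence, processed in chunks
            total_nll, n_tokens = 0.0, 0
            for i in range(0, input_ids.size(1) - 1, chunk):
                piece = input_ids[:, i : i + chunk + 1]
                logits = model(piece[:, :-1]).logits
                nll = F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),
                    piece[:, 1:].reshape(-1),
                    reduction="sum",
                )
                total_nll += nll.item()
                n_tokens += piece[:, 1:].numel()
            return torch.exp(torch.tensor(total_nll / n_tokens))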
      
    • Needle-in-a-haystack test:

      # SKVQ
      python eval_needle.py \
          --model_name llama2-7b-80k \
          --quant k2-v2-w128-g128-reorder-pre_rope-clip-sink5 \
          --ctx_len 32000
      
      # To reproduce KIVI
      python eval_needle.py \
          --model_name llama2-7b-80k \
          --quant k2-v2-w128-g128-KIVI \
          --ctx_len 32000
      
    • LongBench:

      # SKVQ
      python eval_longbench.py \
          --model_name llama3-70b-instruct \
          --quant k2-v2-g128-w128-reorder-pre_rope-clip-sink5-fp8
      
      # To reproduce KIVI
      python eval_longbench.py \
          --model_name llama3-70b-instruct \
          --quant k2-v2-g128-w128-KIVI
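
     The --quant strings above encode the whole scheme: k2/v2 are the
     key/value bit-widths, w128 the full-precision sliding window, g128 the
     quantization group size, and the remaining fields toggle channel
     reordering, pre-RoPE key quantization, clipping, the number of
     attention-sink tokens kept in full precision (sink5), and variants such
     as KIVI or fp8. This reading is inferred from the examples; a small
     illustrative parser (not the repo's actual one):

        import re

        def parse_quant_spec(spec: str) -> dict:
            # e.g. "k2-v2-w128-g128-reorder-pre_rope-clip-sink5"
            names = {"k": "k_bits", "v": "v_bits", "w": "window", "g": "group_size"}
            cfg = {"reorder": False, "pre_rope": False, "clip": False, "sink": 0}
            for field in spec.split("-"):
                if m := re.fullmatch(r"([kvwg])(\d+)", field):
                    cfg[names[m.group(1)]] = int(m.group(2))
                elif m := re.fullmatch(r"sink(\d+)", field):
                    cfg["sink"] = int(m.group(1))
                else:
                    cfg[field] = True  # flags: reorder, pre_rope, clip, KIVI, fp8
            return cfg

        print(parse_quant_spec("k2-v2-w128-g128-reorder-pre_rope-clip-sink5"))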
      

⚠️ Note