MLKV: Multi-Layer Key-Value Sharing

Experiments on EleutherAI's Pythia models

Setup

git clone https://github.com/zaydzuhri/pythia-mlkv.git
cd pythia-mlkv
pip install -r requirements.txt

Convert Pythia models to MQA/GQA/MLKV models

git lfs install
git clone https://huggingface.co/EleutherAI/pythia-160m-deduped
rm -rf pythia-160m-deduped/.git
python3 convert_to_mlkv.py --weights_path pythia-160m-deduped --num-key-value-layers 6 --num-key-value-heads 1

Here are all the 8+1 configs needed for all experiments:

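| Name | Num. of layers | Num. of attention heads | Num. of layers with KV heads (num-key-value-layers) | Num. of KV heads in a layer (num-key-value-heads) | Total num. of KV heads | Num. of parameters |
|---|---|---|---|---|---|---|
| MHA-144 | 12 | 12 | 12 | 12 | 144 | 160M |
| GQA-48 | 12 | 12 | 12 | 4 | 48 | 160M |
| MLKV-48 | 12 | 12 | 4 | 12 | 48 | 160M |
| MQA-12 | 12 | 12 | 12 | 1 | 12 | 160M |
| MLKV-12 | 12 | 12 | 4 | 3 | 12 | 160M |
| MLKV-6 | 12 | 12 | 6 | 1 | 6 | 160M |
| MLKV-4 | 12 | 12 | 4 | 1 | 4 | 160M |
| MLKV-2 | 12 | 12 | 2 | 1 | 2 | 160M |
| MLKV-1 | 12 | 12 | 1 | 1 | 1 | 160M |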
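
The number in each config name is the total KV head count, i.e. num-key-value-layers multiplied by num-key-value-heads. The sketch below recomputes that total for every row and prints the matching conversion command, reusing only the flags shown in the command above. The helper names (`total_kv_heads`, `conversion_command`) are hypothetical and not part of the repository. MHA-144 is skipped on the assumption that it is the unconverted baseline. The printed percentage is simply the ratio of KV head counts relative to MHA-144.

```python
# Rows of the config table: (name, layers with KV heads, KV heads per layer).
CONFIGS = [
    ("MHA-144", 12, 12),
    ("GQA-48", 12, 4),
    ("MLKV-48", 4, 12),
    ("MQA-12", 12, 1),
    ("MLKV-12", 4, 3),
    ("MLKV-6", 6, 1),
    ("MLKV-4", 4, 1),
    ("MLKV-2", 2, 1),
    ("MLKV-1", 1, 1),
]


def total_kv_heads(kv_layers, kv_heads):
    """Total KV heads = layers that own KV heads x KV heads per such layer."""
    return kv_layers * kv_heads


def conversion_command(kv_layers, kv_heads, weights_path="pythia-160m-deduped"):
    """Build the convert_to_mlkv.py call using the flags shown in this README."""
    return (
        f"python3 convert_to_mlkv.py --weights_path {weights_path} "
        f"--num-key-value-layers {kv_layers} --num-key-value-heads {kv_heads}"
    )


baseline = total_kv_heads(12, 12)  # MHA-144
for name, kv_layers, kv_heads in CONFIGS:
    total = total_kv_heads(kv_layers, kv_heads)
    print(f"{name}: {total} KV heads ({total / baseline:.1%} of MHA-144)")
    if name != "MHA-144":  # assumption: the baseline needs no conversion
        print("  " + conversion_command(kv_layers, kv_heads))
```

If the per-head dimension stays the same across configs (an assumption, not something stated here), the KV cache size should scale with the same ratio.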

Uptraining

The dataset has already been prepared and uploaded to Hugging Face, so you can start uptraining directly:

CUDA_VISIBLE_DEVICES=0,1 python3 uptrain.py --output-dir pythia-160m-mlkv-6-b12-g2-v1 --model pythia-160m-deduped_mlkv_6_1 --batch-size 12 --gradient-accumulate-every 1 --learning-rate 6e-4 --warmup-ratio 0.2 --wandb pythia-160m-mlkv-6-b12-g2-v1