Llama 3 is an amazing open large language model. The 70B variant's weights were published as 130 GB of bfloat16 tensors in safetensors format. The smaller variant, 8B, weighs 15 GB. Thanks to quantization methods, we can run these models on consumer hardware while retaining good quality. I tested how much quantization affects the Instruct variant of these models, using the MMLU test.

Results

Quick intro

<details> <summary>What's MMLU?</summary>

The "Massive Multitask Language Understanding" test is composed of 14042 multiple choice questions, non-uniformly distributed among 57 categories. "Correctness" in this article refers to the % of questions the model answered correctly.

<details> <summary>Example question</summary>

Question 45 from the "high school mathematics" category, formatted for Llama 3-Instruct:

<|start_header_id|>user<|end_header_id|>

Question: To place the first paving stone in a path, Alex starts at the crate of stones, walks three feet, places the stone, and returns to the crate. For each subsequent stone, Alex walks two feet farther each way. Alex will place the first 50 stones in a path. After returning to the crate from placing the $50^\text{th}$ stone, what is the total distance Alex walked, in feet?

Choices:
A: 100
B: 90950
C: 5200
D: 50<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Answer:

The model is expected to reply with a single token: A, B, C, or D. Here, C is correct.
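
For reference, here is a minimal sketch of how a question and its answer choices can be assembled into this format (the exact whitespace handling in my test script may differ slightly):

```python
def format_question(question: str, choices: list[str]) -> str:
    """Format one MMLU question in the Llama 3-Instruct style shown above."""
    choice_lines = "\n".join(f"{letter}: {choice}" for letter, choice in zip("ABCD", choices))
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"Question: {question}\n\n"
        f"Choices:\n{choice_lines}"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        "Answer:"
    )
```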

</details> <details> <summary>Question count per category</summary>

| Question Count | Category |
|---:|---|
| 100 | abstract algebra |
| 135 | anatomy |
| 152 | astronomy |
| 100 | business ethics |
| 265 | clinical knowledge |
| 144 | college biology |
| 100 | college chemistry |
| 100 | college computer science |
| 100 | college mathematics |
| 173 | college medicine |
| 102 | college physics |
| 100 | computer security |
| 235 | conceptual physics |
| 114 | econometrics |
| 145 | electrical engineering |
| 378 | elementary mathematics |
| 126 | formal logic |
| 100 | global facts |
| 310 | high school biology |
| 203 | high school chemistry |
| 100 | high school computer science |
| 165 | high school european history |
| 198 | high school geography |
| 193 | high school government and politics |
| 390 | high school macroeconomics |
| 270 | high school mathematics |
| 238 | high school microeconomics |
| 151 | high school physics |
| 545 | high school psychology |
| 216 | high school statistics |
| 204 | high school us history |
| 237 | high school world history |
| 223 | human aging |
| 131 | human sexuality |
| 121 | international law |
| 108 | jurisprudence |
| 163 | logical fallacies |
| 112 | machine learning |
| 103 | management |
| 234 | marketing |
| 100 | medical genetics |
| 783 | miscellaneous |
| 346 | moral disputes |
| 895 | moral scenarios |
| 306 | nutrition |
| 311 | philosophy |
| 324 | prehistory |
| 282 | professional accounting |
| 1534 | professional law |
| 272 | professional medicine |
| 612 | professional psychology |
| 110 | public relations |
| 245 | security studies |
| 201 | sociology |
| 100 | us foreign policy |
| 166 | virology |
| 171 | world religions |
</details> </details> <details> <summary>What's quantization?</summary>

"Quantizing" a model means converting parts of it to lower precision numerical representations to lower its memory use. This can allow running large models on limited hardware, but may hurt quality. Learn more!

</details> <details> <summary>Bits per weight, bpw?</summary>

Quantization methods typically use mixed precision, expressing different parts of a model in different ways. A way to characterize quantization in one number is to divide its size (or the size of quantized parts of the model) in bits by its number of parameters (weights). Mind that the number of parameters is typically expressed in metric "engineering" units (powers of 1000), and file size in JEDEC units (powers of 1024), so the formula is:

bpw = 8 × (1024/1000)^3 × (size in GB) / (billions of parameters) ≈
    ≈ 8.59 × (size in GB) / (billions of parameters)
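
As a quick sanity check, here is the same formula as a small Python helper (the example numbers below are illustrative, not taken from the result tables):

```python
def bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Approximate bpw from a file size in GB (1024^3 bytes) and a parameter count in billions (10^9)."""
    size_bits = size_gb * 1024**3 * 8      # GB -> bytes -> bits
    n_weights = params_billion * 1000**3   # billions -> individual weights
    return size_bits / n_weights

print(bits_per_weight(4.3, 8.0))  # ~4.6 bpw for a hypothetical 4.3 GB file of an 8B model
```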
</details> <details> <summary>EXL2? GGUF?</summary>

These are popular quantized LLM file formats, working with Exllama v2 and llama.cpp, respectively.

</details>

Correctness vs Model Size

The following plot shows how the models slowly lose the ability to answer MMLU questions correctly the more quantized they are.

<img src="./plots/MMLU-Correctness-vs-Model-Size.svg"> <details> <summary>Data table</summary>

* Note: the 70B model was evaluated with only 50 questions per category, the 8B with full MMLU.

bpw here was calculated only considering Llama 3's model.layers.*.weight layers, as the approach to quantizing the rest of the model differs significantly between methods.

| Model size [GB] | MMLU [%] | bpw | Model | Quant | Type |
|---:|---:|---:|---|---|---|
| 45.84 | * 80.82 | 5.66 | 70B | Q5_K_M | GGUF |
| 34.77 | * 80.46 | 4.26 | 70B | IQ4_XS | GGUF |
| 29.32 | * 80.06 | 3.50 | 70B | IQ3_M | GGUF |
| 25.16 | * 79.09 | 3.04 | 70B | IQ3_XXS | GGUF |
| 22.04 | * 77.01 | 2.62 | 70B | IQ2_M | GGUF |
| 20.29 | * 76.05 | 2.38 | 70B | IQ2_S | GGUF |
| 19.36 | * 74.94 | 2.35 | 70B | IQ2_XS | GGUF |
| 17.46 | * 72.31 | 2.11 | 70B | IQ2_XXS | GGUF |
| 15.27 | * 65.21 | 1.81 | 70B | IQ1_M | GGUF |
| 13.98 | 65.20 | 16.00 | 8B | fp16 | GGUF |
| 13.98 | 65.20 | 16.00 | 8B | fp16 | Exl2 |
| 13.98 | 65.21 | 16.00 | 8B | bf16 | transformers |
| 13.96 | * 61.18 | 1.63 | 70B | IQ1_S | GGUF |
| 7.43 | 65.23 | 8.50 | 8B | Q8_0 | GGUF |
| 6.99 | 64.53 | 8.00 | 8B | 8bit | transformers |
| 6.99 | 65.20 | 7.99 | 8B | 8bit | Exl2 |
| 5.77 | 64.99 | 6.49 | 8B | 8bit | Exl2 |
| 5.73 | 65.06 | 6.56 | 8B | Q6_K | GGUF |
| 5.00 | 64.90 | 5.67 | 8B | Q5_K_M | GGUF |
| 4.87 | 64.88 | 5.50 | 8B | Q5_K_S | GGUF |
| 4.45 | 64.27 | 5.00 | 8B | Q5_K_S | Exl2 |
| 4.30 | 64.64 | 4.82 | 8B | Q4_K_M | GGUF |
| 4.09 | 64.63 | 4.54 | 8B | Q4_K_S | GGUF |
| 4.07 | 64.33 | 4.52 | 8B | IQ4_NL | GGUF |
| 3.87 | 64.39 | 4.28 | 8B | IQ4_XS | GGUF |
| 3.84 | 63.36 | 4.25 | 8B | IQ4_XS | Exl2 |
| 3.81 | 62.85 | 4.08 | 8B | Q3_K_L | GGUF |
| 3.53 | 62.89 | 3.79 | 8B | Q3_K_M | GGUF |
| 3.49 | 63.42 | 4.00 | 8B | 4bit nf4 | transformers |
| 3.49 | 61.75 | 4.00 | 8B | 4bit fp4 | transformers |
| 3.31 | 62.55 | 3.50 | 8B | IQ3_M | GGUF |
| 3.23 | 60.28 | 3.50 | 8B | IQ3_M | Exl2 |
| 3.21 | 62.13 | 3.46 | 8B | IQ3_S | GGUF |
| 3.20 | 59.14 | 3.44 | 8B | Q3_K_S | GGUF |
| 3.06 | 61.19 | 3.26 | 8B | IQ3_XS | GGUF |
| 2.83 | 60.52 | 3.04 | 8B | IQ3_XXS | GGUF |
| 2.79 | 55.90 | 2.90 | 8B | Q2_K | GGUF |
| 2.53 | 57.56 | 2.64 | 8B | IQ2_M | GGUF |
| 2.35 | 53.98 | 2.40 | 8B | IQ2_S | GGUF |
| 2.26 | 49.98 | 2.37 | 8B | IQ2_XS | GGUF |
| 2.07 | 43.50 | 2.14 | 8B | IQ2_XXS | GGUF |
| 1.85 | 28.83 | 1.84 | 8B | IQ1_M | GGUF |
| 1.71 | 26.47 | 1.66 | 8B | IQ1_S | GGUF |
</details> <details> <summary>Selected results per category</summary>

This table shows the average confidence per category. Since the 70B models were evaluated on only 50 questions per category, while some categories have 500+ questions, the individual results may not be directly comparable between 70B and 8B.

| category | 70B-Q5_K_M | 70B-IQ2_XXS | 8B-Q8_0 | 8B-IQ2_M |
|---|---:|---:|---:|---:|
| marketing | 98.1% | 94.2% | 89.0% | 83.2% |
| high school government and politics | 98.1% | 97.8% | 90.1% | 80.8% |
| medical genetics | 96.6% | 85.3% | 82.7% | 71.0% |
| jurisprudence | 96.0% | 93.7% | 78.0% | 71.0% |
| high school us history | 95.3% | 89.0% | 80.0% | 70.3% |
| high school psychology | 94.8% | 91.7% | 84.1% | 76.5% |
| high school microeconomics | 93.9% | 80.8% | 75.9% | 62.0% |
| human sexuality | 93.5% | 81.3% | 77.7% | 66.1% |
| astronomy | 93.4% | 81.2% | 70.9% | 62.3% |
| business ethics | 93.2% | 76.0% | 66.6% | 60.0% |
| us foreign policy | 92.6% | 91.2% | 85.9% | 78.3% |
| prehistory | 92.5% | 85.2% | 73.7% | 64.5% |
| nutrition | 92.0% | 89.9% | 76.3% | 64.1% |
| high school world history | 91.3% | 88.7% | 82.8% | 73.8% |
| college biology | 90.9% | 85.1% | 79.3% | 67.0% |
| high school geography | 90.9% | 85.7% | 83.8% | 74.2% |
| miscellaneous | 90.5% | 86.8% | 82.8% | 75.6% |
| high school computer science | 90.3% | 83.3% | 71.2% | 62.9% |
| management | 89.9% | 88.4% | 83.5% | 73.8% |
| sociology | 89.5% | 83.6% | 84.4% | 79.4% |
| international law | 87.9% | 86.4% | 78.2% | 69.7% |
| conceptual physics | 87.4% | 82.4% | 57.0% | 47.8% |
| world religions | 87.2% | 82.1% | 82.5% | 77.5% |
| professional medicine | 86.8% | 76.4% | 71.7% | 58.2% |
| philosophy | 86.7% | 73.2% | 71.4% | 66.1% |
| computer security | 86.5% | 86.5% | 76.7% | 73.7% |
| moral scenarios | 86.0% | 49.3% | 43.5% | 33.3% |
| human aging | 85.4% | 83.8% | 71.8% | 64.7% |
| high school biology | 84.7% | 79.0% | 80.3% | 71.1% |
| college medicine | 84.4% | 74.3% | 65.6% | 59.6% |
| logical fallacies | 84.3% | 77.6% | 77.8% | 69.6% |
| professional psychology | 83.8% | 75.5% | 69.4% | 60.9% |
| high school european history | 83.2% | 79.6% | 77.9% | 72.6% |
| clinical knowledge | 82.5% | 72.6% | 75.0% | 65.2% |
| high school macroeconomics | 82.1% | 79.2% | 66.3% | 55.9% |
| anatomy | 81.7% | 66.2% | 69.6% | 56.7% |
| electrical engineering | 81.0% | 71.8% | 62.8% | 55.7% |
| security studies | 78.7% | 77.5% | 72.9% | 68.5% |
| high school statistics | 77.9% | 53.1% | 52.6% | 49.8% |
| public relations | 77.7% | 64.8% | 70.2% | 61.3% |
| elementary mathematics | 75.7% | 63.6% | 46.0% | 39.1% |
| machine learning | 74.3% | 62.8% | 49.8% | 42.5% |
| high school physics | 72.2% | 60.7% | 37.9% | 33.5% |
| moral disputes | 69.5% | 60.6% | 72.5% | 64.7% |
| high school chemistry | 65.8% | 59.4% | 52.0% | 44.4% |
| college computer science | 65.6% | 57.5% | 55.0% | 50.6% |
| college physics | 65.2% | 49.9% | 45.7% | 43.0% |
| formal logic | 62.5% | 50.1% | 49.7% | 42.2% |
| econometrics | 61.9% | 52.2% | 53.0% | 41.5% |
| abstract algebra | 60.7% | 41.7% | 29.7% | 29.5% |
| college mathematics | 60.0% | 43.1% | 36.6% | 31.1% |
| virology | 59.1% | 55.6% | 51.6% | 49.6% |
| professional law | 58.0% | 52.6% | 46.8% | 41.6% |
| global facts | 56.7% | 44.6% | 39.1% | 33.0% |
| professional accounting | 54.8% | 45.4% | 52.1% | 47.1% |
| high school mathematics | 54.1% | 44.4% | 34.9% | 29.6% |
| college chemistry | 52.7% | 49.4% | 45.1% | 40.7% |
</details>

Key takeaways:

Is ExLlamaV2 under-performing?

At lower bpw, it seems to score lower on MMLU than GGUF quants of the same file size. However, file size does not correlate exactly with the memory a model will use, so comparing by the average bpw of the quantized layers (as in the next figure) may be fairer. Still, ExLlamaV2 offers some advantages of its own.

Confidence vs bpw

Confidence here is the average normalized probability that the model assigns to the correct answer, considering only the 4 tokens corresponding to valid answers. Random noise would give 25% confidence (and 25% correctness), because the probabilities of the 4 possible answers are normalized to add up to 100%.
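
For example (with made-up numbers): if a model assigns raw next-token probabilities of 0.30, 0.05, 0.10, and 0.05 to " A", " B", " C", and " D", and A is the correct answer, the confidence recorded for that question is 0.30 / (0.30 + 0.05 + 0.10 + 0.05) = 60%.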

<img src="./plots/Confidence-vs-bpw-no-head.svg">

The main takeaway here is that the 70B model is less affected by quantization; perhaps it is more sparse than the 8B one. Extremely low-bpw quants of the 70B remain somewhat usable, whereas the 8B IQ1_M and IQ1_S quants are near the random-noise threshold.

<img src="./plots/Confidence-loss-vs-bpw.svg">

Here I plotted the loss of confidence (the change from the maximum). It appears to scale like $\propto \text{bpw}^{-4.25}$. I had to include this because no one can resist "things looking linear on a log-log plot."
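
One way to estimate such an exponent is an ordinary least-squares fit in log-log space, roughly like this (the arrays are placeholders, not the measured values):

```python
import numpy as np

bpw = np.array([2.1, 2.6, 3.5, 4.3, 5.7])          # placeholder bpw values
loss = np.array([0.20, 0.08, 0.02, 0.008, 0.002])  # placeholder confidence loss

slope, intercept = np.polyfit(np.log(bpw), np.log(loss), 1)
print(slope)  # the fitted exponent; the measured data gives about -4.25
```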

Methodology

Shortcomings of this test

Applicability of this test

From anecdotal experience, it seems that quantization affects "rigorous" tasks like writing working source code more than open-ended tasks like creative writing. It would be interesting to methodically measure the effect of quantization on demanding programming benchmarks.

Shortcomings of MMLU

It's okay for this purpose

MMLU is fine for this purpose because this is essentially an ablation study: even if MMLU is a flawed quality benchmark, it's good enough to show how a model's answers change with quantization.

Is MMLU still relevant?

Whether MMLU is a good benchmark used to be debatable, but it hardly matters any more, as top models already score around 90%. There's not much room left for improvement, but fortunately harder benchmarks are being proposed.

It's partially broken

Some MMLU questions are broken, of arguable usefulness, opinionated, or lacking necessary context. For example, here is question 133 from the "high school psychology" category:

As a result of an accident, Abdul lost sight in his right eye. To judge the distance of vehicles when he is driving, Abdul is able to rely on cues of

A. I only
B. II only
C. III only
D. I and II only

This question lacks the statements numbered I, II, and III that are necessary to answer it.

Inference code

I based my code on the test included in ExLlamaV2's repository, but modified it heavily.

You can find pre-compiled Python wheels for inference libraries listed in the text-generation-webui repository.

The MMLU dataset can be found on HuggingFace and read with pandas.

The snippets assume that you have already loaded the formatted questions and the correct answer letters as two lists of strings, `prompts` and `answers`.
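
For example, the test split can be loaded roughly like this (the dataset id cais/mmlu and its column names are assumptions; adjust them to whichever copy you use). Prompts are then built by applying the chat template shown in the example question above.

```python
import datasets

mmlu = datasets.load_dataset("cais/mmlu", "all", split="test").to_pandas()

questions = list(mmlu["question"])
choices = [list(c) for c in mmlu["choices"]]        # four options per question
answers = ["ABCD"[int(i)] for i in mmlu["answer"]]  # correct letter per question
categories = list(mmlu["subject"])                  # 57 category names
```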

transformers

<details> <summary>Simplified transformers source code</summary>
import torch
import transformers

model_path = "path/to/model"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
config = transformers.AutoConfig.from_pretrained(model_path)
config.max_position_embeddings = 2048  # 2048 positions are plenty for MMLU prompts

quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = transformers.LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    config=config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)

# Token ids of " A", " B", " C", " D" (note the leading spaces), as a 1-D tensor
answer_tokens = tokenizer.encode(
    " A B C D", add_special_tokens=False, return_tensors="pt"
)[0]

with torch.no_grad(): # crucial for lower memory use
    for prompt, answer in zip(prompts, answers):
        prompt_ids = tokenizer.encode(
            prompt, add_special_tokens=False, return_tensors="pt"
        )

        # logits at the last position, restricted to the four answer tokens
        logits_ans = model.forward(prompt_ids.cuda()).logits[:, -1, answer_tokens].cpu()
        # process the answer
        torch.cuda.empty_cache()
</details>

llama-cpp-python

Make sure to install a correct build of llama-cpp-python, with CUDA support if you can use it. Adjust `n_gpu_layers` if you can't offload the full model. A model's total number of layers is listed in its config.json as `num_hidden_layers`.

<details> <summary>Simplified llama-cpp-python source code</summary>
import torch
# llama_cpp_cuda_tensorcores comes from the pre-compiled wheels mentioned above;
# with a standard llama-cpp-python install, import from llama_cpp instead.
from llama_cpp_cuda_tensorcores import Llama, llama_tokenizer

model_path = "path/to/model.gguf"
tokenizer_base = "path/to/model"  # where tokenizer.json is located

llama_params = {
    "model_path": model_path,
    "n_ctx": 2048,  # Text context, 0 = from model
    "n_batch": 512,  # Prompt processing maximum batch size
    "n_gpu_layers": -1,  # -1 offloads ALL layers
    "n_threads": 8,  # Number of threads to use for generation
    "n_threads_batch": 8,  # Number of threads to use for batch processing
    "logits_all": False,  # Not needed for model.eval()
    "offload_kqv": True,  # Offload K, Q, V to GPU.
    "tokenizer": llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        tokenizer_base
    ),  # Optional tokenizer to override the default tokenizer from llama.cpp.
    "verbose": False,  # Don't print verbose output to stderr.
}

model = Llama(**llama_params)

answer_tokens = model.tokenize(" A B C D".encode(), add_bos=False)

for prompt, answer in zip(prompts, answers):
    prompt_ids = model.tokenize(prompt.encode(), add_bos=False)

    model.reset()
    model.eval(prompt_ids)
    logits = model.scores[model.n_tokens - 1]
    logits_ans = torch.tensor([logits[i] for i in answer_tokens], device="cpu")
</details>

ExLlamaV2

<details> <summary>Simplified ExLlamaV2 source code</summary>
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)

model_path = "path/to/model-exl2"
config = ExLlamaV2Config()
config.model_dir = model_path
config.prepare()
config.max_seq_len = 2048
model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=2048, lazy=True)
model.load_autosplit(cache)

# Token ids of " A", " B", " C", " D", flattened to a 1-D tensor
answer_tokens = tokenizer.encode(" A B C D").flatten()

for prompt, answer in zip(prompts, answers):
    prompt_ids = tokenizer.encode(prompt)
    logits = model.forward(prompt_ids, last_id_only=True)
    logits_ans = logits[:, :, answer_tokens].cpu()
</details>

Evaluating the results from `logits_ans` means checking whether the highest logit corresponds to the correct answer. To measure confidence, record the normalized probability of the correct answer. Here, `answer_id` is in {0, 1, 2, 3} and is the index of the correct answer token.

prob_ans = torch.softmax(logits_ans.flatten(), dim=-1)  # normalize over the 4 answer tokens
confidence = float(prob_ans[answer_id])
correct = bool(prob_ans.argmax() == answer_id)

Lastly, record the individual results per question or compute the averages, minding the varying number of questions per category.
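
A sketch of that bookkeeping, assuming each result is stored as a (category, correct, confidence) tuple:

```python
from collections import defaultdict

# Hypothetical per-question results gathered in the loops above.
results = [("astronomy", True, 0.91), ("astronomy", False, 0.40), ("virology", True, 0.62)]

per_category = defaultdict(list)
for category, correct, confidence in results:
    per_category[category].append((correct, confidence))

for category, rows in sorted(per_category.items()):
    correctness = sum(c for c, _ in rows) / len(rows)
    confidence = sum(p for _, p in rows) / len(rows)
    print(f"{category:30} {correctness:6.1%} {confidence:6.1%}")

# The overall MMLU score averages over questions, so categories with more
# questions weigh more.
n_total = sum(len(rows) for rows in per_category.values())
overall = sum(c for rows in per_category.values() for c, _ in rows) / n_total
print(f"overall: {overall:.1%}")
```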

The Tensors

<details> <summary>Size of individual model layers</summary>

This table compares the layer sizes of the 8B model in three formats: the unquantized 16-bit transformers weights, an approximately 8-bit EXL2 quant, and a Q8_0 GGUF quant:

| layer | transformers [bytes] | EXL2 [bytes] | EXL2 [bpw] | GGUF [bytes] | GGUF [bpw] |
|---|---:|---:|---:|---:|---:|
| model.embed_tokens | 1 050 673 152 | 1 050 673 152 | 16.00 | 558 170 112 | 8.50 |
| lm_head | 1 050 673 152 | 527 405 248 | 8.03 | 558 170 112 | 8.50 |
| model.norm | 8 192 | 8 192 | 16.00 | 16 384 | 32.00 |
| *.input_layernorm | 262 144 | 262 144 | 16.00 | 524 288 | 32.00 |
| *.self_attn.q_proj | 1 073 741 824 | 539 498 496 | 8.04 | 570 425 344 | 8.50 |
| *.self_attn.k_proj | 268 435 456 | 135 272 448 | 8.06 | 142 606 336 | 8.50 |
| *.self_attn.v_proj | 268 435 456 | 135 272 448 | 8.06 | 142 606 336 | 8.50 |
| *.self_attn.o_proj | 1 073 741 824 | 539 498 496 | 8.04 | 570 425 344 | 8.50 |
| *.post_attention_layernorm | 262 144 | 262 144 | 16.00 | 524 288 | 32.00 |
| *.mlp.down_proj | 3 758 096 384 | 1 875 792 896 | 7.99 | 1 996 488 704 | 8.50 |
| *.mlp.gate_proj | 3 758 096 384 | 1 874 073 600 | 7.98 | 1 996 488 704 | 8.50 |
| *.mlp.up_proj | 3 758 096 384 | 1 874 073 600 | 7.98 | 1 996 488 704 | 8.50 |
| model.layers.* | 13 959 168 000 | 6 974 006 272 | 7.99 | 7 416 578 048 | 8.50 |

All the "*." layers add up to "model.layers.*".

</details>