GreenBit LLaMA

This is GreenBitAI's research code for running 2-bit and 1-bit LLaMA models. Despite the extreme compression, the models retain strong performance; the quantized checkpoints are available in the model zoo.

This is meant as a research demo of model quality; no inference speed-up has been implemented yet.

Roadmap

Over the next few months, we will continue to release 2-bit and 1-bit versions of LLaMA models. We are also considering releasing low-bit versions of other open-source LLMs in the future.

Latest Updates

[12/14/2023] We are happy to release the near-lossless (<1% degradation) W4A16 01-Yi models (low_bit_yi branch). The 2-bit version will be released soon.

[10/04/2023] We are happy to release the W2A16 g8/32 TinyLLaMA-1.1B models.

[09/29/2023] We are happy to release the W2A16 g8 LLaMA-1 30B and LLaMA-2 70B models.

[09/12/2023] We are happy to announce the release of the 2-bit LLaMA-2 7B (W2A16 g32/g8) models.

[08/31/2023] We are happy to release the evaluation-harness benchmark results on 14 zero-shot tasks for our 2-bit models. Happy trying 😃🚀.

[08/16/2023] We are happy to release the 2-bit OpenLLaMA 3B models, which are quantized to a 2-bit representation while retaining strong performance 😃⭐.

Pretrained Model

| LLM Models | Method | Bits | Group size | Wikitext2 | C4 | Checkpoint Size (GiB) |
|---|---|---|---|---|---|---|
| LLaMA-2-70B [1] | FP16 | 16 | - | 3.31 | 5.70 | 130 |
| | Ours | 2 | 8 | 3.87 | 5.96 | 26.9 |
| LLaMA-1-30B [1] | FP16 | 16 | - | 4.10 | 5.98 | 60.5 |
| | Ours | 2 | 8 | 4.75 | 6.57 | 12.9 |
| LLaMA-2-7B [1] | FP16 | 16 | - | 5.47 | 6.97 | 12.5 |
| | GPTQ [2] | 4 | 128 | 5.61 | 7.12 | 3.6 |
| | GPTQ [2] | 2 | 128 | 2.2e5 | 1.7e5 | 2.2 |
| | OmniQuant [3] | 4 | 128 | 5.58 | 7.12 | 3.8 |
| | OmniQuant [3] | 3 | 128 | 6.03 | 7.35 | 3.2 |
| | OmniQuant [3] | 2 | 128 | 12.84 | 17.40 | 2.2 |
| | OmniQuant [3] | 2 | 64 | 10.56 | 13.77 | - |
| | Ours | 4 | 32 | 5.55 | 7.08 | 3.7 |
| | Ours | 2 | 8 | 6.09 | 7.63 | 2.9 |
| | Ours | 2 | 32 | 7.13 | 8.67 | 2.2 |
| LLaMA-1-7B [4] | FP16 | 16 | - | 5.67 | 7.07 | 12.5 |
| | GPTQ [2] | 4 | 128 | 5.85 | 7.21 | 3.6 |
| | GPTQ [2] | 3 | 128 | 6.61 | 7.85 | 3.0 |
| | OmniQuant [3] | 2 | 128 | 10.53 | 13.89 | 2.2 |
| | Ours | 2 | 32 | 7.59 | 8.96 | 2.2 |
| LLaMA 3B [5] | FP16 | 16 | - | 7.34 | 9.33 | 6.8 |
| | GPTQ [2] | 4 | 128 | 7.54 | 9.58 | 1.9 |
| | Ours | 4 | 32 | 7.43 | 9.51 | 2.0 |
| | Ours | 2 | 8 | 8.32 | 10.56 | 1.5 |
| | Ours | 2 | 16 | 8.92 | 11.29 | 1.3 |
| | Ours | 2 | 32 | 9.82 | 12.14 | 1.2 |
| TinyLLaMA 1.1B [6] | FP16 | 16 | - | 9.10 | 10.6 | 4.0 |
| | Ours | 2 | 8 | 9.99 | 11.75 | 0.6 |
| | Ours | 2 | 32 | 12.04 | 14.27 | 0.5 |
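
The naming used above and in the release notes follows the pattern W2A16 gN: 2-bit weights, 16-bit activations, and a weight quantization group size of N. Below is a minimal, purely illustrative sketch of group-wise 2-bit weight quantization with a per-group scale and zero point; it is a generic min-max quantizer, not GreenBitAI's actual algorithm or storage format, and the helper names quantize_w2/dequantize_w2 are made up for the example.

```python
# Illustrative sketch of group-wise 2-bit weight quantization ("W2A16 gN").
# NOT the GreenBitAI method or on-disk format; it only shows why a group size
# and per-group scale/zero parameters appear in the model naming.
import numpy as np

def quantize_w2(weights, group_size=32):
    """Map a 1-D float weight vector to 2-bit codes with one scale/zero per group."""
    w = weights.reshape(-1, group_size)                    # one row per group
    w_min = w.min(axis=1, keepdims=True)                   # per-group zero point
    scale = (w.max(axis=1, keepdims=True) - w_min) / 3.0   # 2 bits -> 4 levels (0..3)
    scale = np.where(scale == 0.0, 1e-8, scale)            # guard constant groups
    codes = np.clip(np.round((w - w_min) / scale), 0, 3).astype(np.uint8)
    return codes, scale, w_min

def dequantize_w2(codes, scale, w_min):
    """Reconstruct 16-bit weights from the 2-bit codes at inference time ("A16")."""
    return (codes.astype(np.float32) * scale + w_min).astype(np.float16).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
codes, scale, zero = quantize_w2(w, group_size=32)         # the "g32" setting
w_hat = dequantize_w2(codes, scale, zero)
print("mean abs reconstruction error:", float(np.abs(w - w_hat.astype(np.float32)).mean()))
```

In this picture, the group size is the main size/quality knob: smaller groups carry more per-group parameters but fit the weights more closely, which matches the table above, where the g8 checkpoints are larger than the g32 ones yet reach lower perplexity.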

Fine-tuned Model

| LLM Models | Method | Bits | Checkpoint Size (GiB) |
|---|---|---|---|
| LLaMA-2-70B-Chat [1] | FP16 | 16 | 130 |
| | Ours | 2 | 26.9 |
| CodeLLaMA-34B [7] | FP16 | 16 | 63 |
| | Ours | 2 | 13.5 |
| CodeLLaMA-34B-Python [7] | FP16 | 16 | 63 |
| | Ours | 2 | 13.5 |
| CodeLLaMA-34B-Instruction [7] | FP16 | 16 | 63 |
| | Ours | - | - |

Zero-Shot Evaluation

| Task | Metric | TinyLLaMA 1.1B q2g32 | TinyLLaMA 1.1B q2g8 | LLaMA 3B q2g32 | LLaMA 3B q2g16 | LLaMA 3B q2g8 | LLaMA-1 7B q2g32 | LLaMA-2 7B q2g32 | LLaMA-2 7B q2g8 | LLaMA 1.1B FP16 | LLaMA 3B FP16 | LLaMA-1 7B FP16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Openbookqa | acc | 0.152 | 0.192 | 0.196 | 0.238 | 0.242 | 0.224 | 0.246 | 0.296 | 0.208 | 0.27 | 0.29 |
| | acc_norm | 0.328 | 0.338 | 0.332 | 0.358 | 0.362 | 0.388 | 0.376 | 0.4 | 0.368 | 0.4 | 0.41 |
| arc_challenge | acc | 0.3268 | 0.2278 | 0.279 | 0.2978 | 0.3148 | 0.3422 | 0.3268 | 0.3618 | 0.243 | 0.34 | 0.39 |
| | acc_norm | 0.3387 | 0.273 | 0.2944 | 0.3319 | 0.3345 | 0.3387 | 0.3387 | 0.372 | 0.288 | 0.37 | 0.41 |
| hellaswag | acc | 0.34 | 0.3769 | 0.4238 | 0.444 | 0.462 | 0.4996 | 0.4961 | 0.5379 | 0.403 | 0.49 | 0.68 |
| | acc_norm | 0.4097 | 0.4711 | 0.5685 | 0.5988 | 0.6242 | 0.6447 | 0.6464 | 0.7014 | 0.503 | 0.67 | 0.73 |
| piqa | acc | 0.6518 | 0.6931 | 0.7024 | 0.716 | 0.7291 | 0.7476 | 0.7503 | 0.7715 | 0.71 | 0.75 | 0.78 |
| | acc_norm | 0.6393 | 0.6812 | 0.7116 | 0.7247 | 0.7312 | 0.7443 | 0.7421 | 0.7568 | 0.688 | 0.76 | 0.78 |
| arc_easy | acc | 0.4411 | 0.5109 | 0.5997 | 0.646 | 0.6528 | 0.6061 | 0.6174 | 0.6254 | 0.533 | 0.69 | 0.68 |
| | acc_norm | 0.3716 | 0.412 | 0.5417 | 0.58 | 0.5972 | 0.4566 | 0.4781 | 0.4958 | 0.43 | 0.65 | 0.52 |
| Winogrande | acc | 0.532 | 0.5249 | 0.5683 | 0.5888 | 0.6054 | 0.6283 | 0.6298 | 0.6582 | 0.558 | 0.62 | 0.68 |
| boolq | acc | 0.592 | 0.6174 | 0.6281 | 0.6636 | 0.6327 | 0.6425 | 0.7061 | 0.7242 | 0.583 | 0.68 | 0.75 |
| truthfulqa_mc | mc1 | 0.2338 | 0.2277 | 0.2509 | 0.2118 | 0.2252 | 0.224 | 0.2313 | 0.2399 | 0.228 | 0.22 | 0.21 |
| | mc2 | 0.4211 | 0.406 | 0.3962 | 0.3501 | 0.3625 | 0.3702 | 0.3854 | 0.3795 | 0.401 | 0.35 | 0.34 |
| anli_r1 | acc | 0.363 | 0.336 | 0.337 | 0.334 | 0.344 | 0.331 | 0.333 | 0.363 | 0.354 | 0.33 | 0.35 |
| anli_r2 | acc | 0.331 | 0.346 | 0.335 | 0.332 | 0.331 | 0.326 | 0.349 | 0.347 | 0.341 | 0.32 | 0.34 |
| anli_r3 | acc | 0.3758 | 0.3633 | 0.3358 | 0.3383 | 0.3425 | 0.3417 | 0.36 | 0.3733 | 0.358 | 0.35 | 0.37 |
| wic | acc | 0.5 | 0.5 | 0.4984 | 0.5094 | 0.4969 | 0.4984 | 0.4953 | 0.489 | 0.5 | 0.48 | 0.5 |
| rte | acc | 0.4874 | 0.4874 | 0.5596 | 0.5993 | 0.5632 | 0.639 | 0.6065 | 0.6426 | 0.516 | 0.58 | 0.56 |
| record | f1 | 0.7608 | 0.8023 | 0.8502 | 0.8625 | 0.8687 | 0.8859 | 0.8872 | 0.9037 | 0.82 | 0.88 | 0.91 |
| | em | 0.753 | 0.7934 | 0.8427 | 0.8545 | 0.8612 | 0.8781 | 0.8801 | 0.8959 | 0.818 | 0.89 | 0.91 |
| Average | | 0.438 | 0.4498 | 0.4881 | 0.5037 | 0.5087 | 0.5122 | 0.5181 | 0.5391 | 0.469 | 0.528 | 0.5519 |
| model size | GiB | 0.5 | 0.6 | 1.2 | 1.3 | 1.5 | 2.2 | 2.2 | 2.9 | 4.4 | 6.8 | 12.5 |

Requirements

Inference currently requires a machine with CUDA installed. Install the dependencies with:

pip install -r requirements.txt
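
As a quick sanity check that CUDA is actually visible before running anything (a generic snippet, assuming the pinned requirements install PyTorch, which the inference code builds on):

```python
import torch

# Inference needs a CUDA-capable GPU: this should print True and a count >= 1.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
```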

Try the model

Use the environment variable CUDA_VISIBLE_DEVICES to select the correct GPU. Multi-GPU is not supported, but the models are heavily compressed, so a single GPU should be enough. Predefined scripts for evaluation, text generation, and instruction-following chat are provided in scripts/ (a Python alternative for GPU pinning is sketched after the commands):

bash scripts/evaluate/tiny_llama_w2a16g32.sh          # for open task evaluation of the base model
bash scripts/inference/llama2_70b_w2a16g8.sh          # for text generation inference of the base model
bash scripts/instruction-chat/llama2_70b_w2a16g8.sh   # for instruction-following chat of the fine-tuned model
bash scripts/inference/codellama_34b_w2a16g8.sh       # for text generation inference of the CodeLLaMA model
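
If you drive the model from your own Python entry point instead of the shell scripts, the same GPU pinning can be done by setting CUDA_VISIBLE_DEVICES before anything initializes CUDA (a generic sketch, not part of the provided scripts):

```python
import os

# Must be set before the first CUDA initialization (i.e. before importing torch here).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0 to this process

import torch
print(torch.cuda.device_count())           # now reports a single visible device
```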

References

This code is based on:

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

Citation

If you use our approach in your research, please cite our work as follows:

@article{low_bit_llama,
  title={Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs},
  author={Guo, Nianhui and Bethge, Joseph and Hu, Ting and Meinel, Christoph and Yang, Haojin},
  journal={https://github.com/GreenBitAI/low_bit_llama},
  year={2023}
}

License

The original code was released under its respective license and copyrights, i.e.:

We release our changes and additions to these files under the Apache 2.0 License.

Footnotes

  1. LLaMA-2

  2. GPTQ

  3. OmniQuant

  4. LLaMA-1

  5. OpenLLaMA

  6. TinyLLaMA

  7. CodeLLaMA