Awesome

What Matters in Transformers? Not All Attention is Needed

Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li

This is the official implementation of the paper What Matters in Transformers? Not All Attention is Needed . We conduct extensive experiments and analysis to reveal the architecture redundancy within transformer-based Large Language Models (LLMs). Pipeline for Block Drop and Layer Drop is based on the LLaMA-Factory. The quantization is implemented based on the AutoAWQ and AutoGPTQ.

Introduction

Transformer-based large language models (LLMs) often contain architectural redundancies. In this work, we systematically investigate redundancy across different types of modules, including Blocks, Attention layers, and MLP layers. Surprisingly, we found that Attention layers, the core component of transformers, are particularly redundant. For example, in the Llama-3-70B model, half of the Attention layers can be dropped while maintaining performance. Our observations indicate that this redundancy in Attention layers persists throughout the training process, necessitating Attention Drop. Additionally, dropping Attention layers significantly enhances computational and memory efficiency. Our findings are informative for the ML community and provide insights for future architecture design.

News

Nov 2024: Updated the code to support additional language models, including Gemma2, Baichuan, DeepSeek, Yi, Solar. Yi and Solar refer to Llama, and we checked their availability.
Sep 2024: Released checkpoints for dropped models using Block Drop and Layer Drop.
Jun 2024: Published preprint on arXiv along with the related codebase.

Quick Start

Installation

conda create -n llm-drop python=3.10
conda activate llm-drop

git clone https://github.com/CASE-Lab-UMD/LLM-Drop

#For Dropping:
cd ./LLM-Drop
pip install -e .

#For Quantization:
cd ./src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd ./src/llmtuner/compression/quantization/AutoAWQ/AutoAWQ_kernels
pip install -e .

cd ./src/llmtuner/compression/quantization/AutoGPTQ
pip install -vvv --no-build-isolation -e .

Prepare Models

Download the models (e.g., Mistral-7B, Llama-2 and Llama-3) from HuggingFace. We create new config and modeling files to represent the models by layers or blocks. The key auto_map needs to be added in the config.json to utilize the new files. Take Mistral-7B as an example:

"auto_map": {
    "AutoConfig": "configuration_dropped_mistral.MistralConfig",
    "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
  },

Additionally, the key drop_attn_list and drop_mlp_list respectively mark which Attention layers and MLPs should be dropped based on their layer index. For instance,

Drop 4 Attention layers:

 "drop_mlp_list": [],
 "drop_attn_list": [25, 26, 24, 22],

Drop 4 MLPs layers:

 "drop_mlp_list": [26, 27, 25, 24],
 "drop_attn_list": [],

Drop 4 Blocks:

 "drop_mlp_list": [26, 25, 24, 27],
 "drop_attn_list": [26, 25, 24, 27],

Run Dropping

Block Drop

bash scripts/dropping/block_drop.sh

Layer Drop

bash scripts/dropping/layer_drop.sh

Joint Layer Drop

bash scripts/dropping/layer_drop_joint.sh

These bash scripts will generate the importance scores for blocks/layers, determine which blocks/layers to retain, and create new model configuration files indicating the dropped modules.

Benchmarks

Performance

Evaluate the performance of the model with dropping some modules on specific tasks:

bash scripts/benchmark/benchmark_lm_eval.sh

The evaluation code is based on EleutherAI/lm-evaluation-harness. To fully reproduce our results, please use this version. It samples few-shot based on the index of the samples, avoiding the issue of result variation with the number of processes during data parallel inference. Remember to use the modeling files in src/llmtuner/model to load the Mistral and Llama models.

SpeedUp

Evaluate the speedup ratio of the model with dropping some modules:

bash scripts/benchmark/benchmark_speed.sh

Quantization

Please refer to AutoGPTQ and AutoAWQ. Ensure you carefully install the packages that correspond to your CUDA version. For quantization, use the following scripts:

bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Citation

@misc{he2024matterstransformersattentionneeded,
      title={What Matters in Transformers? Not All Attention is Needed}, 
      author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
      year={2024},
      eprint={2406.15786},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.15786}, 
}

Contact Us

If you have any questions, please contact:

Shwai He: shwaihe@umd.edu
Guoheng Sun: ghsun@umd.edu