🚀 DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

This repository provides the codebase for DRPruning, a method that incorporates refined Distributionally Robust Optimization (DRO) into structured pruning to address uneven performance degradation across domains in large language models (LLMs). By dynamically adjusting the data distribution and hyperparameters during training, DRPruning targets underperforming areas, promoting balanced recovery and reducing bias. This approach yields efficient, smaller models with robust, balanced capabilities, outperforming similarly sized models in both monolingual and multilingual settings.

<p align="center"> <img src="pic/main.png" width="900" alt="DRPruning Overview" /> </p>

Brief Introduction

DRPruning builds upon the LLM-Shearing framework, specifically optimizing it for LLM pre-training and pruning. By integrating DRO, DRPruning more effectively targets worst-case scenarios during pruning, ensuring robust and balanced model performance across various domains. The sections below cover installation, data and model preparation, pruning, continued pre-training, model conversion, and training configuration.

Installation Requirements

To set up the environment, please follow the installation instructions from LLM-Shearing. Alternatively, you can use the provided Dockerfile to build the environment. We have updated several package versions to support newer LLMs. After setting up the environment, install the drpruning package in editable mode:

pip install -e .
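
If you take the Docker route instead, a build and interactive run could look like the following sketch (the image tag, mount path, and GPU flags are illustrative, not prescribed by the repository):

# Build the image from the provided Dockerfile and start an interactive container
docker build -t drpruning:latest .
docker run --gpus all -it -v $(pwd):/workspace drpruning:latest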

Data Preparation

We use the pruning data from LLM-Shearing, which is available on Google Drive. Alternatively, you can process your own data as follows.

We provide preprocessing code to tokenize, sample, and process RedPajama data into MDS format (the format used by Mosaic's streaming package). For the monolingual and multilingual settings, we process cerebras/SlimPajama-627B and uonlp/CulturaX, respectively. Run data/SlimPajama.py and data/CulturaX.py to obtain the processed Hugging Face dataset versions.
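
For example, the preprocessing scripts can be invoked directly; any paths or arguments they expect are defined inside the scripts themselves, so check them before running:

# Tokenize and sample the raw corpora into Hugging Face datasets
python3 data/SlimPajama.py   # monolingual setting, based on cerebras/SlimPajama-627B
python3 data/CulturaX.py     # multilingual setting, based on uonlp/CulturaX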

After that, run data/SlimPajama_save.py and data/CulturaX_save.py to obtain datasets that meet the requirements of the Composer repository. Note that the eval folder must include eval_merge, which is a single split that contains validation data from all domains. We provide a utility script data/merge_data.py to merge data from multiple splits into one split. An example of running the script is as follows:

python3 -m drpruning.data.merge_data \
        --input_dir $INPUT_DIR \
        --output_dir $OUTPUT_DIR \
        --output_split eval_merge \
        --split_names domain1 domain2

Model Preparation

To use Hugging Face transformer models with the Composer repository employed by LLM-Shearing, you need to convert the model weights into the key format expected by Composer. Below is an example of converting the weights of the Hugging Face model Llama-2-7B into a Composer-compatible format:

# Define the Hugging Face model name and the output path
HF_MODEL_NAME=meta-llama/Llama-2-7b-hf
OUTPUT_PATH=models/Llama-2-7b-composer.pt

# Create the necessary directory if it doesn't exist
mkdir -p $(dirname $OUTPUT_PATH)

# Convert the Hugging Face model to Composer key format
python3 -m drpruning.utils.hf_to_composer $HF_MODEL_NAME $OUTPUT_PATH

Our current implementation supports Pythia, LLaMA, LLaMA2, LLaMA3, and Qwen2 models. It should also be straightforward to adapt it for other models such as Mistral-7B.

Sample Scripts for Pruning and Continued Pre-training

For pruning, refer to the example script drpruning/scripts/prune.sh. Due to the higher computational cost of pruning compared to continued pre-training, we halt training with the pruning objective after a specific number of steps (typically 3,200 steps in our experiments). We then proceed with further pre-training of the pruned model. After pruning, the saved models consist of the full parameters of the source model accompanied by a set of masks.
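
A minimal launch sketch (assuming the script is a self-contained bash script and has been edited for your paths and target architecture first):

# Launch pruning; training with the pruning objective stops after ~3,200 steps
bash drpruning/scripts/prune.sh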

We process the masking variables by:

  1. Removing substructures where the masking variables are near zero.
  2. Incorporating the masking variables into the model parameters through matrix-vector multiplication, resulting in a more compact model.
  3. Renaming the weight keys so that they can be seamlessly loaded into the target model architecture, ensuring that the layer names are all consecutive.

This processing can be done using the following command:

MODEL_PATH=$MODEL_DIR/latest-rank0.pt
python3 -m drpruning.utils.post_pruning_processing prune_and_save_model $MODEL_PATH

The pruned model will be saved as $(dirname $MODEL_PATH)/pruned-latest-rank0.pt. This step is automatically performed at the end of the pruning script.

After this post-pruning processing, continue with pre-training the pruned model. The process is similar to pre-training a standard model. Refer to the example script drpruning/scripts/continue_pretrain.sh.
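
As with pruning, a minimal launch sketch (again assuming the script is edited for your paths first):

# Continue pre-training the pruned, compacted model
bash drpruning/scripts/continue_pretrain.sh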

<!-- For continued pre-training under the Hugging Face framework, we have provided an implementation that closely follows [ParroT](https://github.com/wxjiao/ParroT). This facilitates simpler implementation and broader applicability for future work. See the example script [`drpruning/scripts/continue_pretrain_hf.sh`](drpruning/scripts/continue_pretrain_hf.sh) for details. -->

Convert Composer Model to Hugging Face Model

After training, if you would like to use Hugging Face for inference or fine-tuning, you can convert your Composer model into a Hugging Face model using the script drpruning/scripts/composer_to_hf.py. Here's an example:

MODEL_PATH=$MODEL_DIR/latest-rank0.pt
OUTPUT_PATH=$MODEL_DIR/hf-latest_rank0
MODEL_CLASS=Llama2
HIDDEN_SIZE=2048
NUM_ATTENTION_HEADS=16
NUM_HIDDEN_LAYERS=24
INTERMEDIATE_SIZE=5504
MODEL_NAME=Pruned-Llama-1.3B

python3 -m drpruning.utils.composer_to_hf $MODEL_PATH $OUTPUT_PATH \
    model_class=${MODEL_CLASS} \
    hidden_size=${HIDDEN_SIZE} \
    num_attention_heads=${NUM_ATTENTION_HEADS} \
    num_hidden_layers=${NUM_HIDDEN_LAYERS} \
    intermediate_size=${INTERMEDIATE_SIZE} \
    num_key_value_heads=${NUM_ATTENTION_HEADS} \
    _name_or_path=${MODEL_NAME}

Note: The parameter names correspond to the Hugging Face configurations of most LLMs and may differ for other models.
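
As a quick sanity check (assuming the transformers package is installed and the converted config carries a model type it recognizes), the output directory should load with the standard Hugging Face API:

# Optional: verify that the converted checkpoint loads as a Hugging Face model
python3 -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('$OUTPUT_PATH')"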

Training Configurations

This section provides an in-depth guide on configuring parameters within the training scripts and the YAML configuration files for training. The configurations cover data setup, basic training settings, pruning settings, and dynamic data loading configurations.

Data Configurations
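
The data-related fields point the run at the local MDS data prepared above. A minimal sketch, assuming an LLM-Shearing-style launch in which the YAML file is passed to the training entry point and individual keys are overridden on the command line (the entry point, YAML path, and key names below are illustrative assumptions; consult the provided YAML files and scripts for the authoritative names):

# Illustrative only: entry point, YAML path, and key names are assumptions
composer drpruning/train.py configs/example.yaml \
    data_local=/path/to/mds_data \
    train_loader.dataset.split=train \
    eval_loader.dataset.split=eval_merge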

Basic Training Configurations

These configurations largely follow those of the original Composer package. For comprehensive details, refer to Composer's official documentation. Key training parameters include the run duration, evaluation interval, batch sizes, and optimizer settings; a hedged example of overriding a few of them is sketched below.

<!-- Due to computational constraints, an exhaustive hyperparameter search was not conducted; better hyperparameters may exist. For the Hugging Face implementation, we closely follow [ParroT](https://github.com/wxjiao/ParroT)'s settings and principles. Please refer to [`run_clm_llms.py`](https://github.com/wxjiao/ParroT/blob/master/transformers/examples/pytorch/language-modeling/run_clm_llms.py) for more details. -->
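
A minimal sketch of such overrides, using standard Composer-style settings (the entry point, YAML path, and concrete values are illustrative assumptions; the provided scripts define the actual ones):

# Illustrative only: entry point, YAML path, and values are assumptions
composer drpruning/train.py configs/example.yaml \
    run_name=llama2_1.3b_continue_pretrain \
    max_duration=48000ba \
    eval_interval=400ba \
    global_train_batch_size=256 \
    device_train_microbatch_size=4 \
    optimizer.lr=1e-4 \
    save_interval=3200ba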

Pruning Configurations

The pruning process prunes a source model to a specific target shape. The essential, pruning-specific arguments are grouped under model.l0_module in the YAML configuration and give precise control over the pruning process; a hedged sketch of the corresponding command-line overrides follows.
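
A minimal sketch, assuming LLM-Shearing-style key names under model.l0_module (the entry point, YAML path, and key names are assumptions; the target shape below matches the 1.3B example used elsewhere in this README):

# Illustrative only: key names under model.l0_module are assumptions
composer drpruning/train.py configs/example.yaml \
    max_duration=3200ba \
    model.l0_module.pruning_modules='[head,intermediate,layer,hidden]' \
    model.l0_module.lagrangian_warmup_steps=640ba \
    model.l0_module.target_model.d_model=2048 \
    model.l0_module.target_model.n_heads=16 \
    model.l0_module.target_model.n_layers=24 \
    model.l0_module.target_model.intermediate_size=5504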

Dynamic Data Proportion Configurations

Parameters for dynamic data proportions are defined within the DynamicLoadingCallback and DRPruningCallback and are configured in the YAML file under the callbacks.data_loading section. For DRO, additional configurations are needed in the same section; a hedged sketch follows.
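
A minimal sketch of the callbacks.data_loading overrides (key names and values are placeholders; the domain names match the merge_data example above, and the accepted values for update_type are defined by the callbacks themselves):

# Illustrative only: key names and values are placeholders
composer drpruning/train.py configs/example.yaml \
    callbacks.data_loading.dynamic=true \
    callbacks.data_loading.set_names='[domain1,domain2]' \
    callbacks.data_loading.proportion='[0.5,0.5]' \
    callbacks.data_loading.update_type=dro \
    callbacks.data_loading.target_loss='[2.0,2.0]'
# Additional DRO-specific options are configured in the same callbacks.data_loading
# section; see the provided YAML files for the exact names.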

<!-- For the Hugging Face version, the variable names only need to retain the part after the last `.`, i.e., `callbacks.data_loading.update_type` becomes `update_type`. The remaining settings remain consistent. -->

Note: The code currently supports only local data, uses a single worker for the dataloader, and does not support prefetching. In our testing, these restrictions do not add computational overhead.