# Awesome Resource-Efficient LLM Papers

<div style="display: flex; align-items: center;"> <div style="flex: 1;"> A curated list of high-quality papers on resource-efficient LLMs. </div> <div> <img src="media/clean_energy.gif" alt="Clean Energy GIF" width="80" /> </div> </div>

This is the GitHub repo for our survey paper [Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models](https://arxiv.org/abs/2401.00625).

## Table of Contents

<!-------------------------------------------------------------------------------------->

## LLM Architecture Design

### Efficient Transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Approximate attention | Simple linear attention language models balance the recall-throughput tradeoff | arXiv |
| 2024 | Hardware attention | MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | arXiv |
| 2024 | Approximate attention | LoMA: Lossless Compressed Memory Attention | arXiv |
| 2024 | Approximate attention | Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation | ICML |
| 2024 | Hardware optimization | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR |
| 2023 | Hardware optimization | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS |
| 2023 | Approximate attention | KDEformer: Accelerating Transformers via Kernel Density Estimation | ICML |
| 2023 | Approximate attention | Mega: Moving Average Equipped Gated Attention | ICLR |
| 2022 | Hardware optimization | xFormers - Toolbox to Accelerate Research on Transformers | GitHub |
| 2021 | Approximate attention | Efficient Attention: Attention with Linear Complexities | WACV |
| 2021 | Approximate attention | An Attention Free Transformer | arXiv |
| 2021 | Approximate attention | Self-attention Does Not Need O(n^2) Memory | arXiv |
| 2021 | Hardware optimization | LightSeq: A High Performance Inference Library for Transformers | NAACL |
| 2021 | Hardware optimization | FasterTransformer: A Faster Transformer Framework | GitHub |
| 2020 | Approximate attention | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ICML |
| 2019 | Approximate attention | Reformer: The Efficient Transformer | ICLR |

### Non-transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Decoder | You Only Cache Once: Decoder-Decoder Architectures for Language Models | arXiv |
| 2024 | BitLinear layer | Scalable MatMul-free Language Modeling | arXiv |
| 2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
| 2023 | MLP | Auto-Regressive Next-Token Predictors are Universal Learners | arXiv |
| 2023 | Convolutional LM | Hyena Hierarchy: Towards Larger Convolutional Language Models | ICML |
| 2023 | Sub-quadratic matrices based | Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture | NeurIPS |
| 2023 | Selective State Space Model | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv |
| 2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
| 2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
| 2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
<!-------------------------------------------------------------------------------------->

## LLM Pre-Training

### Memory Efficiency

#### Distributed Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Model Parallelism | ProTrain: Efficient LLM Training via Adaptive Memory Management | arXiv |
| 2024 | Model Parallelism | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | arXiv |
| 2023 | Data Parallelism | PaLM: Scaling Language Modeling with Pathways | GitHub |
| 2023 | Model Parallelism | BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models | JMLR |
| 2022 | Model Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2021 | Data Parallelism | FairScale: A general purpose modular PyTorch library for high performance and large scale training | JMLR |
| 2020 | Data Parallelism | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | IEEE SC20 |
| 2019 | Model Parallelism | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS |
| 2019 | Model Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | arXiv |
| 2019 | Model Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training | SOSP |
| 2018 | Model Parallelism | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |

#### Mixed Precision Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |
| 2018 | Mixed Precision Training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ACL |
| 2017 | Mixed Precision Training | Mixed Precision Training | ICLR |

### Data Efficiency

#### Importance Sampling

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Importance sampling | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | arXiv |
| 2023 | Survey on importance sampling | A Survey on Efficient Training of Transformers | IJCAI |
| 2023 | Importance sampling | Data-Juicer: A One-Stop Data Processing System for Large Language Models | arXiv |
| 2023 | Importance sampling | INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models | EMNLP |
| 2023 | Importance sampling | Machine Learning Force Fields with Data Cost Aware Training | ICML |
| 2022 | Importance sampling | Beyond neural scaling laws: beating power law scaling via data pruning | NeurIPS |
| 2021 | Importance sampling | Deep Learning on a Data Diet: Finding Important Examples Early in Training | NeurIPS |
| 2018 | Importance sampling | Training Deep Models Faster with Robust, Approximate Importance Sampling | NeurIPS |
| 2018 | Importance sampling | Not All Samples Are Created Equal: Deep Learning with Importance Sampling | ICML |

#### Data Augmentation

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Data augmentation | LLMRec: Large Language Models with Graph Augmentation for Recommendation | WSDM |
| 2024 | Data augmentation | LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | arXiv |
| 2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
| 2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
| 2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
| 2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |

#### Training Objective

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
| 2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
| 2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |
| 2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
| 2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
<!-------------------------------------------------------------------------------------->

## LLM Fine-Tuning

### Parameter-Efficient Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | LoRA-based fine-tuning | DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language Model | arXiv |
| 2024 | LoRA-based fine-tuning | SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models | arXiv |
| 2024 | LoRA-based fine-tuning | Data-efficient Fine-tuning for LLM-based Recommendation | SIGIR |
| 2024 | LoRA-based fine-tuning | MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter | ACL |
| 2023 | LoRA-based fine-tuning | DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation | EACL |
| 2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |
| 2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
| 2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
| 2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
| 2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |

### Full-Parameter Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Full-parameter fine-tuning | HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy | arXiv |
| 2024 | Study of full-parameter fine-tuning optimizations | A Study of Optimizations for Fine-tuning Large Language Models | arXiv |
| 2023 | Comparative study between full-parameter and LoRA-based fine-tuning | A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model | arXiv |
| 2023 | Comparative study between full-parameter and parameter-efficient fine-tuning | Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification | arXiv |
| 2023 | Full-parameter fine-tuning with limited resources | Full Parameter Fine-tuning for Large Language Models with Limited Resources | arXiv |
| 2023 | Memory-efficient fine-tuning | Fine-Tuning Language Models with Just Forward Passes | NeurIPS |
| 2023 | Full-parameter fine-tuning for medical applications | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | arXiv |
| 2022 | Drawback of full-parameter fine-tuning | Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution | ICLR |
<!-------------------------------------------------------------------------------------->

## LLM Inference

### Model Compression

#### Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Unstructured Pruning | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS |
| 2024 | Structured Pruning | Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models | arXiv |
| 2024 | Structured Pruning | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | arXiv |
| 2024 | Structured Pruning | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | arXiv |
| 2024 | Structured Pruning | NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models | arXiv |
| 2024 | Structured Pruning | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR |
| 2024 | Unstructured Pruning | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR |
| 2024 | Structured Pruning | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | arXiv |
| 2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
| 2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
| 2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
| 2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
| 2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
| 2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
| 2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |

#### Quantization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Weight Quantization | Evaluating Quantized Large Language Models | arXiv |
| 2024 | Weight Quantization | I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models | arXiv |
| 2024 | Weight Quantization | ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models | arXiv |
| 2024 | Weight-Activation Co-Quantization | Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs | NeurIPS |
| 2024 | Weight Quantization | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR |
| 2023 | Weight Quantization | FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization | ICML |
| 2023 | Weight Quantization | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | EMNLP |
| 2023 | Weight Quantization | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI |
| 2023 | Weight Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR |
| 2023 | Weight Quantization | Dynamic Stashing Quantization for Efficient Transformer Training | EMNLP |
| 2023 | Weight Quantization | Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding | Interspeech |
| 2023 | Weight Quantization | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS |
| 2023 | Weight Quantization | Stable and Low-Precision Training for Large-Scale Vision-Language Models | NeurIPS |
| 2023 | Weight Quantization | PreQuant: A Task-Agnostic Quantization Approach for Pre-trained Language Models | ACL |
| 2023 | Weight Quantization | OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization | ISCA |
| 2023 | Weight Quantization | AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration | arXiv |
| 2023 | Weight Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv |
| 2023 | Weight Quantization | SqueezeLLM: Dense-and-Sparse Quantization | arXiv |
| 2023 | Weight Quantization | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | arXiv |
| 2022 | Activation Quantization | GACT: Activation Compressed Training for Generic Network Architectures | ICML |
| 2022 | Fixed-point Quantization | Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | ACL |
| 2021 | Activation Quantization | AC-GC: Lossy Activation Compression with Guaranteed Convergence | NeurIPS |

### Dynamic Acceleration

#### Input Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Score-based Token Removal | Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation | COLM |
| 2024 | Score-based Token Removal | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | arXiv |
| 2024 | Learning-based Token Removal | LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression | ACL |
| 2024 | Learning-based Token Removal | Compressed Context Memory For Online Language Model Interaction | ICLR |
| 2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
| 2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
| 2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model | arXiv |
| 2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |
| 2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
| 2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
| 2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
| 2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
<!-------------------------------------------------------------------------------------->

## System Design

### Deployment Optimization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Hardware Optimization | LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration | arXiv |
| 2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
| 2023 | Hardware offloading | Fast Distributed Inference Serving for Large Language Models | arXiv |
| 2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |
| 2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |

### Support Infrastructure

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2024 | Edge devices | MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | ICML |
| 2024 | Edge devices | EdgeShard: Efficient LLM Inference via Collaborative Edge Computing | arXiv |
| 2024 | Edge devices | Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs | ICML |
| 2024 | Edge devices | The Breakthrough Memory Solutions for Improved Performance on LLM Inference | IEEE Micro |
| 2024 | Edge devices | MELTing Point: Mobile Evaluation of Language Transformers | MobiCom |
| 2024 | Edge devices | LLM as a System Service on Mobile Devices | arXiv |
| 2024 | Edge devices | LocMoE: A Low-overhead MoE for Large Language Model Training | arXiv |
| 2024 | Edge devices | JetMoE: Reaching Llama2 Performance with 0.1M Dollars | arXiv |
| 2023 | Edge devices | Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices | ICASSP |
| 2023 | Edge devices | Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly | arXiv |
| 2023 | Libraries | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training | ICPP |
| 2023 | Libraries | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | ACL |
| 2023 | Edge devices | Large Language Models Empowered Autonomous Edge AI for Connected Intelligence | arXiv |
| 2022 | Libraries | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2022 | Libraries | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2022 | Edge devices | EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation | arXiv |
| 2022 | Edge devices | ProFormer: Towards On-Device LSH Projection-Based Transformers | ACL |
| 2021 | Edge devices | Generate More Features with Cheap Operations for BERT | ACL |
| 2021 | Edge devices | SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | SustaiNLP |
| 2020 | Edge devices | Lite Transformer with Long-Short Range Attention | arXiv |
| 2019 | Libraries | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | IEEE SC22 |
| 2018 | Libraries | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |

### Other Systems

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
| 2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |
<!-------------------------------------------------------------------------------------->

## Resource-Efficiency Evaluation Metrics & Benchmarks

### 🧮 Computation Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
| Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days]<br>[hours] |
| Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds]<br>[next-token generation latency in milliseconds] |
| Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s]<br>[queries/s] |
| Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up]<br>[throughput speed-up] |
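
As a rough illustration of how the latency, throughput, and speed-up metrics above can be measured in practice, here is a minimal Python sketch. The `dummy_generate` function is a hypothetical stand-in for a real model's generation call, not part of any specific library.

```python
import time

def measure_generation(generate_fn, prompt: str) -> dict:
    """Time one generation call and derive latency / throughput metrics."""
    start = time.perf_counter()
    output_tokens = generate_fn(prompt)        # expected to return the generated tokens
    latency_s = time.perf_counter() - start    # end-to-end latency in seconds
    return {
        "latency_s": latency_s,
        "throughput_tok_per_s": len(output_tokens) / latency_s,
    }

def dummy_generate(prompt: str) -> list:
    """Toy stand-in for a real model's generate() call, used only to make the sketch runnable."""
    time.sleep(0.05)
    return prompt.split() * 4

baseline = measure_generation(dummy_generate, "resource efficient large language models")
compressed = measure_generation(dummy_generate, "resource efficient large language models")
speed_up = baseline["latency_s"] / compressed["latency_s"]  # speed-up ratio vs. the baseline
print(baseline, compressed, f"speed-up ~ {speed_up:.2f}x")
```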

### 💾 Memory Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
| Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
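
Parameter count and model size are linked through the numeric precision of the stored weights. A back-of-envelope sketch (the 7B parameter count and the listed precisions are illustrative assumptions; real peak memory is higher once activations, gradients, and optimizer states are included):

```python
# Approximate bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def model_size_gb(num_params: float, dtype: str) -> float:
    """Storage needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7B params @ {dtype}: ~{model_size_gb(7e9, dtype):.1f} GB")
# fp16 weights of a 7B-parameter model alone occupy roughly 13 GB.
```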

### ⚡️ Energy Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Energy Consumption | the electrical energy consumed during the LLM’s lifecycle | [kWh] |
| Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
<!-- software packages designed for real-time tracking of energy consumption and carbon emissions**. -->

The following are available software packages designed for real-time tracking of energy consumption and carbon emissions.
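
For illustration only, a minimal sketch using CodeCarbon, one widely used open-source tracker of this kind (not necessarily one of the packages originally curated here); `run_training_step` is a hypothetical placeholder for the workload being measured:

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

def run_training_step() -> None:
    """Placeholder for the actual training or inference workload."""
    sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="llm-demo")
tracker.start()
try:
    run_training_step()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq for the tracked interval
print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```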

<!-- tools for predicting the energy usage and carbon footprint before training**. -->

You might also find the following helpful for predicting the energy usage and carbon footprint before actual training.
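
Such predictions usually come down to multiplying expected hardware usage by power and carbon-intensity factors. A back-of-envelope sketch, where every number is an assumed, illustrative value rather than a measurement:

```python
# Illustrative pre-training estimate; all values below are assumptions.
num_gpus = 512           # accelerators in the training cluster
train_hours = 24 * 14    # two weeks of wall-clock training
gpu_power_kw = 0.4       # assumed average draw per accelerator, in kW
pue = 1.1                # assumed data-center power usage effectiveness
carbon_intensity = 0.4   # assumed grid intensity, kgCO2eq per kWh

energy_kwh = num_gpus * train_hours * gpu_power_kw * pue
emissions_kg = energy_kwh * carbon_intensity
print(f"~{energy_kwh:,.0f} kWh, ~{emissions_kg / 1000:,.1f} tCO2eq")
```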

### 💵 Financial Cost Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Dollars per parameter | the total cost of training (or running) the LLM divided by the number of parameters | |
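
A worked example of the ratio, using purely hypothetical cost and model-size figures:

```python
# Hypothetical figures chosen only to illustrate the metric.
total_training_cost_usd = 2_000_000   # assumed total compute spend
num_parameters = 7e9                  # a 7B-parameter model

dollars_per_parameter = total_training_cost_usd / num_parameters
print(f"${dollars_per_parameter:.2e} per parameter")  # ~2.9e-04 $/parameter
```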

### 📨 Network Communication Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
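
To make the metric concrete, the sketch below estimates per-step gradient synchronization traffic for a data-parallel run, assuming a ring all-reduce (each worker transfers roughly 2(N-1)/N of the gradient size); the model size, precision, and worker count are illustrative assumptions:

```python
# Illustrative per-step all-reduce volume for data-parallel training.
num_workers = 64
num_params = 7e9
bytes_per_grad = 2  # assumed fp16/bf16 gradients
grad_bytes = num_params * bytes_per_grad

# Ring all-reduce moves ~2 * (N - 1) / N of the gradient size per worker per step.
per_worker_gb = 2 * (num_workers - 1) / num_workers * grad_bytes / 1024**3
cluster_tb_per_step = per_worker_gb * num_workers / 1024
print(f"~{per_worker_gb:.1f} GB per worker per step, ~{cluster_tb_per_step:.2f} TB cluster-wide")
```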

### 💡 Other Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate]<br>[percentage of weights remaining] |
| Loyalty/Fidelity | the resemblance between the teacher and student models, in terms of both prediction consistency and alignment of predicted probability distributions | [loyalty]<br>[fidelity] |
| Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
| Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)]<br>[Pareto frontier (performance and FLOPs)] |
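
For instance, compression ratio and the fraction of weights remaining can be computed as follows; the sizes and sparsity level are hypothetical values used only for illustration:

```python
# Hypothetical before/after figures for a pruned and quantized model.
original_size_gb = 13.0    # e.g., 7B parameters stored in fp16
compressed_size_gb = 2.2   # e.g., after 4-bit quantization plus pruning
weights_remaining = 0.5    # 50% of weights kept after pruning

compression_ratio = original_size_gb / compressed_size_gb
print(f"compression ratio ~ {compression_ratio:.1f}x, "
      f"{weights_remaining:.0%} of weights remaining")
```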

### Benchmarks

| Benchmark | Description | Paper |
|-----------|-------------|-------|
| General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
| Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with a customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
| EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
| SustaiNLP 2020 Shared Task | a challenge promoting the development of energy-efficient NLP models by assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
| ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
| VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
| Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
| Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |
<!-------------------------------------------------------------------------------------->

## Reference

If you find this paper list useful in your research, please consider citing:

@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}