# Awesome Resource-Efficient LLM Papers
<div style="display: flex; align-items: center;"> <div style="flex: 1;"> A curated list of high-quality papers on resource-efficient LLMs. </div> <div> <img src="media/clean_energy.gif" alt="Clean Energy GIF" width="80" /> </div> </div>

This is the GitHub repo for our survey paper [Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models](https://arxiv.org/abs/2401.00625).
## Table of Contents
- [Awesome Resource-Efficient LLM Papers](#awesome-resource-efficient-llm-papers)
  - [LLM Architecture Design](#llm-architecture-design)
  - [LLM Pre-Training](#llm-pre-training)
  - [LLM Fine-Tuning](#llm-fine-tuning)
  - [LLM Inference](#llm-inference)
  - [System Design](#system-design)
  - [Resource-Efficiency Evaluation Metrics & Benchmarks](#resource-efficiency-evaluation-metrics--benchmarks)
  - [Reference](#reference)
## LLM Architecture Design

### Efficient Transformer Architecture

### Non-transformer Architecture

## LLM Pre-Training

### Memory Efficiency

#### Distributed Training

#### Mixed precision training
Date | Keywords | Paper | Venue |
---|---|---|---|
2022 | Mixed precision training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |
2018 | Mixed precision training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL |
2017 | Mixed precision training | Mixed Precision Training | ICLR |
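For orientation, here is a minimal sketch of the technique itself: a PyTorch training loop using `torch.cuda.amp` autocasting and gradient scaling. The model and data are placeholders (it assumes a CUDA GPU), and it is not the recipe of any specific paper above.

```python
# Minimal mixed-precision training loop with PyTorch AMP (requires CUDA).
# Model and data are placeholders, not any listed paper's setup.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards against fp16 underflow

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():   # run eligible ops in half precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscale gradients, then optimizer.step()
    scaler.update()                   # adapt the scale factor for the next step
```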
### Data Efficiency

#### Importance Sampling
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Importance sampling | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | arXiv |
2023 | Survey on importance sampling | A Survey on Efficient Training of Transformers | IJCAI |
2023 | Importance sampling | Data-Juicer: A One-Stop Data Processing System for Large Language Models | arXiv |
2023 | Importance sampling | INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models | EMNLP |
2023 | Importance sampling | Machine Learning Force Fields with Data Cost Aware Training | ICML |
2022 | Importance sampling | Beyond neural scaling laws: beating power law scaling via data pruning | NeurIPS |
2021 | Importance sampling | Deep Learning on a Data Diet: Finding Important Examples Early in Training | NeurIPS |
2018 | Importance sampling | Training Deep Models Faster with Robust, Approximate Importance Sampling | NeurIPS |
2018 | Importance sampling | Not All Samples Are Created Equal: Deep Learning with Importance Sampling | ICML |
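As a rough illustration of the idea behind these papers, the sketch below scores a candidate batch by per-example loss and trains only on the highest-loss examples. It is a simplified loss-based variant, not the exact scheme of any listed paper.

```python
# Simplified loss-based importance sampling: score a candidate batch with a
# cheap forward pass, then take a gradient step only on the hardest examples.
import torch

model = torch.nn.Linear(128, 2)
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")  # keep per-example losses
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(256, 128)            # large candidate batch (synthetic data)
y = torch.randint(0, 2, (256,))

with torch.no_grad():                # scoring pass, no gradients needed
    scores = loss_fn(model(x), y)
topk = scores.topk(k=64).indices     # keep the 64 highest-loss examples

optimizer.zero_grad()
loss_fn(model(x[topk]), y[topk]).mean().backward()
optimizer.step()
```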
#### Data Augmentation
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Data Augmentation | LLMRec: Large Language Models with Graph Augmentation for Recommendation | WSDM |
2024 | Data augmentation | LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | arXiv |
2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |
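For a flavor of classic token-level text augmentation (in the spirit of EDA-style methods, not an implementation of any paper above), consider:

```python
# Classic token-level augmentations: random deletion and random swap.
import random

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Swap n_swaps random pairs of positions (needs at least two tokens)."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = "large language models need a lot of training data".split()
print(random_deletion(sentence))
print(random_swap(sentence))
```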
#### Training Objective
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |
2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
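The masked-modeling objective that several of these papers build on can be sketched in a few lines: mask a fraction of input tokens and compute the loss only at the masked positions. The toy transformer below is a placeholder, not any paper's exact setup.

```python
# Masked-language-modeling objective in miniature: corrupt ~15% of tokens and
# compute cross-entropy only at the masked positions. Toy placeholder model.
import torch

vocab, d_model, mask_id = 1000, 64, 0
embed = torch.nn.Embedding(vocab, d_model)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = torch.nn.Linear(d_model, vocab)

tokens = torch.randint(1, vocab, (8, 32))      # batch of token ids (id 0 = [MASK])
mask = torch.rand(tokens.shape) < 0.15         # choose ~15% of positions
corrupted = tokens.masked_fill(mask, mask_id)  # replace them with [MASK]

logits = head(encoder(embed(corrupted)))
loss = torch.nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```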
## LLM Fine-Tuning

### Parameter-Efficient Fine-Tuning

### Full-Parameter Fine-Tuning

## LLM Inference

### Model Compression

#### Pruning

#### Quantization

### Dynamic Acceleration

#### Input Pruning

## System Design

### Deployment optimization
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Hardware optimization | LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration | arXiv |
2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML |
2023 | Hardware offloading | Fast distributed inference serving for large language models | arXiv |
2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |
2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
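As a concrete example of hardware offloading, the hedged sketch below uses Hugging Face Transformers with Accelerate's `device_map="auto"` to spill weights that do not fit on the GPU to CPU RAM or disk; the checkpoint name and offload folder are illustrative assumptions, not taken from the papers above.

```python
# Hedged sketch of CPU/disk offloading via Hugging Face Transformers +
# Accelerate. The checkpoint and folder names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # example checkpoint, not from the papers above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",         # let Accelerate split layers across GPU/CPU/disk
    offload_folder="offload",  # spill weights that fit nowhere else to disk
    torch_dtype=torch.float16,
)

inputs = tokenizer("Resource-efficient LLMs are", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```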
### Support Infrastructure

### Other Systems
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |
## Resource-Efficiency Evaluation Metrics & Benchmarks

### 🧮 Computation Metrics
Metric | Description | Example Usage |
---|---|---|
FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days]<br>[hours] |
Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds]<br>[next token generation latency in milliseconds] |
Throughput | the rate of output-token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s]<br>[queries/s] |
Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up]<br>[throughput speed-up] |
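A minimal sketch of how these metrics are typically collected, using `time.perf_counter` and a stand-in generation function (`generate_fn` is a placeholder for any model's generation call):

```python
# Measure mean end-to-end latency and token throughput for a generator.
import time

def benchmark(generate_fn, prompt, n_runs=10):
    """Return (mean latency in seconds, throughput in tokens/s)."""
    latencies, total_tokens = [], 0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)              # list of generated tokens
        latencies.append(time.perf_counter() - start)
        total_tokens += len(tokens)
    return sum(latencies) / n_runs, total_tokens / sum(latencies)

# Toy stand-in "model" so the sketch runs end to end.
dummy_generate = lambda prompt: prompt.split() * 4

latency, tps = benchmark(dummy_generate, "an example prompt")
print(f"latency: {latency * 1e3:.2f} ms, throughput: {tps:.0f} tokens/s")
# Speed-up ratio vs a baseline is simply baseline_latency / optimized_latency.
```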
### 💾 Memory Metrics
Metric | Description | Example Usage |
---|---|---|
Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
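Both metrics are easy to compute for a PyTorch model; a minimal sketch (the layer below is just an example):

```python
# Count parameters and estimate in-memory size of a PyTorch model.
import torch

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)  # example layer

n_params = sum(p.numel() for p in model.parameters())
size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters, ~{size_bytes / 2**20:.1f} MiB in fp32")
# Rule of thumb: fp32 stores 4 bytes/parameter; fp16/bf16 store 2; int8 stores 1.
```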
### ⚡️ Energy Metrics
Metric | Description | Example Usage |
---|---|---|
Energy Consumption | the electrical power used during the LLM’s lifecycle | [kWh] |
Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
Software packages are available for real-time tracking of energy consumption and carbon emissions; you might also find tools that predict energy usage and carbon footprint before actual training or inference helpful.
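One widely used real-time tracker is CodeCarbon; a minimal usage sketch, assuming the `EmissionsTracker` API of recent releases:

```python
# Minimal CodeCarbon usage (pip install codecarbon); assumes the
# EmissionsTracker API of recent releases.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # samples CPU/GPU power while code runs
tracker.start()

total = sum(i * i for i in range(10_000_000))  # stand-in for a training loop

emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq
print(f"estimated emissions: {emissions_kg:.6f} kgCO2eq")
```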
### 💵 Financial Cost Metric
Metric | Description | Example Usage |
---|---|---|
Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters | |
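A worked example of the metric; the cost and parameter count below are made-up illustrative numbers, not figures from any paper:

```python
# Worked example: dollars per parameter. Both numbers are hypothetical.
training_cost_usd = 4_600_000    # assumed total training cost
n_parameters = 175_000_000_000   # assumed 175B-parameter model

dollars_per_param = training_cost_usd / n_parameters
print(f"${dollars_per_param:.2e} per parameter")  # ~$2.63e-05
```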
### 📨 Network Communication Metric
Metric | Description | Example Usage |
---|---|---|
Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
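For data-parallel training with ring all-reduce, each worker transfers roughly 2(N-1)/N times the gradient size per step, so communication volume can be estimated with back-of-envelope arithmetic (all numbers below are illustrative assumptions):

```python
# Back-of-envelope estimate of per-worker communication volume for
# data-parallel training with ring all-reduce. All numbers are illustrative.
n_params = 7_000_000_000   # assumed 7B-parameter model
bytes_per_grad = 2         # fp16 gradients
n_workers = 8
n_steps = 1_000

grad_bytes = n_params * bytes_per_grad                          # 14 GB per step
per_worker_step = 2 * (n_workers - 1) / n_workers * grad_bytes  # ring all-reduce
total_tb = per_worker_step * n_steps / 1e12
print(f"~{total_tb:.1f} TB transferred per worker over {n_steps} steps")
```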
### 💡 Other Metrics
Metric | Description | Example Usage |
---|---|---|
Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate]<br>[percentage of weights remaining] |
Loyalty/Fidelity | the resemblance between teacher and student models, measured by both the consistency of their predictions and the alignment of their predicted probability distributions | [loyalty]<br>[fidelity] |
Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)]<br>[Pareto frontier (performance and FLOPs)] |
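A compression ratio and the percentage of weights remaining are straightforward to compute; a small worked example with illustrative sizes:

```python
# Worked example: compression ratio and percentage of weights remaining
# after pruning/quantization. Sizes are illustrative.
original_mb, compressed_mb = 13_000, 3_250

compression_ratio = original_mb / compressed_mb        # 4.0x
weights_remaining = compressed_mb / original_mb * 100  # 25.0%
print(f"{compression_ratio:.1f}x compression, {weights_remaining:.0f}% of weights remain")
```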
### Benchmarks
Benchmark | Description | Paper |
---|---|---|
General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
SustaiNLP 2020 Shared Task | a challenge for developing energy-efficient NLP models, assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |
## Reference
If you find this paper list useful in your research, please consider citing:
```bibtex
@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}
```