Awesome Efficient LLM and Multimodal Foundation Model Survey
This repo contains the paper list and figures for A Survey of Resource-efficient LLM and Multimodal Foundation Models.
Abstract
Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.
Scope and rationales
The scope of this survey is mainly defined by the following aspects.
- We survey only algorithm and system innovations; we exclude the large body of work on hardware design, which is outside our expertise.
- The resources considered in this survey are mainly physical ones, including compute, memory, storage, and bandwidth; we exclude training data (labels) and privacy, which can also be regarded as resources.
- We mainly survey papers published at top-tier CS conferences, i.e., those included in CSRankings. We also hand-pick related and potentially high-impact papers from arXiv.
- We mainly survey papers published after 2020, since AI is evolving rapidly and older knowledge and methods are frequently superseded.
Citation
@article{xu2024a,
title = {A Survey of Resource-efficient LLM and Multimodal Foundation Models},
author = {Xu, Mengwei and Yin, Wangsong and Cai, Dongqi and Yi, Rongjie
and Xu, Daliang and Wang, Qipeng and Wu, Bingyang and Zhao, Yihao and Yang, Chen
and Wang, Shihe and Zhang, Qiyang and Lu, Zhenyan and Zhang, Li and Wang, Shangguang
and Li, Yuanchun and Liu, Yunxin and Jin, Xin and Liu, Xuanzhe},
journal = {arXiv preprint arXiv:2401.08092},
year = {2024}
}
Contribute
If we have left out any important papers, please let us know in the Issues and we will include them in the next version.
We will actively maintain the survey and the GitHub repo.
Table of Contents
- Foundation Model Overview
- Resource-efficient Architectures
- Resource-efficient Algorithms
- Resource-efficient Systems
Foundation Model Overview
Language Foundation Models
- Attention is all you need. [arXiv'17] [Paper] [Code]
- Bert: Pre-training of deep bidirectional transformers for language understanding. [arXiv'18] [Paper] [Code]
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. [arXiv'19] [Paper] [Code]
- Roberta: A robustly optimized bert pretraining approach. [arXiv'19] [Paper] [Code]
- Sentence-bert: Sentence embeddings using siamese bert-networks. [EMNLP'19] [Paper] [Code]
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. [ACL'19] [Paper] [Code]
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. [arXiv'19] [Paper] [Code]
- Improving language understanding by generative pre-training. [URL]
- Language Models are Unsupervised Multitask Learners. [URL]
- Language Models are Few-Shot Learners. [NeurIPS'20] [Paper] [Code]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [arXiv'21] [Paper] [Code]
- Palm: Scaling language modeling with pathways. [JMLR'22] [Paper] [Code]
- Training language models to follow instructions with human feedback. [NeurIPS'22] [Paper]
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. [JMLR'22] [Paper]
- Glam: Efficient scaling of language models with mixture-of-experts. [ICML'22] [Paper]
- wav2vec 2.0: A framework for self-supervised learning of speech representations. [NeurIPS'20] [Paper] [Code]
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. [TASLP'21] [Paper] [Code]
- Robust Speech Recognition via Large-Scale Weak Supervision. [ICML'23] [Paper]
- GPT-4 Technical Report. [arXiv'23] [Paper]
- Palm 2 technical report. [URL]
- Llama 2: Open foundation and fine-tuned chat models. [arXiv'23] [Paper] [Code]
Vision Foundation Models
- End-to-End Object Detection with Transformers. [ECCV'20] [Paper] [Code]
- Generative Pretraining from Pixels. [ICML'20] [Paper] [Code]
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. [ICLR'21] [Paper] [Code]
- Training data-efficient image transformers & distillation through attention. [ICML'21] [Paper] [Code]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. [NeurIPS'21] [Paper] [Code]
- You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. [NeurIPS'21] [Paper] [Code]
- Swin Transformer V2: Scaling Up Capacity and Resolution. [CVPR'22] [Paper] [Code]
- Masked Autoencoders Are Scalable Vision Learners. [CVPR'22] [Paper] [Code]
- Exploring Plain Vision Transformer Backbones for Object Detection. [ECCV'22] [Paper] [Code]
- BEiT: BERT Pre-Training of Image Transformers. [ICLR'22] [Paper] [Code]
- DINOv2: Learning Robust Visual Features without Supervision. [arXiv'23] [Paper]
- Sequential Modeling Enables Scalable Learning for Large Vision Models. [arXiv'23] [Paper] [Code]
Multimodal Large FMs
- Learning transferable visual models from natural language supervision. [ICML'21] [Paper] [Code]
- Align before fuse: Vision and language representation learning with momentum distillation. [NeurIPS'21] [Paper] [Code]
- Scaling up visual and vision-language representation learning with noisy text supervision. [ICML'21] [Paper]
- Imagebind: One embedding space to bind them all. [CVPR'23] [Paper] [Code]
- Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. [arXiv'23] [Paper] [Code]
- Pandagpt: One model to instruction-follow them all. [arXiv'23] [Paper] [Code]
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. [arXiv'23] [Paper] [Code]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. [arXiv'23] [Paper] [Code]
- mplug-owl: Modularization empowers large language models with multi-modality. [arXiv'23] [Paper] [Code]
- Visual instruction tuning. [arXiv'23] [Paper] [Code]
- Flamingo: a visual language model for few-shot learning. [NeurIPS'22] [Paper]
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. [arXiv'23] [Paper] [Code]
- Palm-e: An embodied multimodal language model. [arXiv'23] [Paper] [Code]
- Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. [arXiv'23] [Paper] [Code]
- Any-to-any generation via composable diffusion. [arXiv'23] [Paper] [Code]
- Next-gpt: Any-to-any multimodal llm. [arXiv'23] [Paper] [Code]
- Uniter: Universal image-text representation learning. [ECCV'20] [Paper] [Code]
- Flava: A foundational language and vision alignment model. [CVPR'22] [Paper] [Code]
- Coca: Contrastive captioners are image-text foundation models. [arXiv'22] [Paper]
- Grounded language-image pre-training. [CVPR'22] [Paper] [Code]
- Segment anything. [arXiv'23] [Paper] [Code]
- Gemini: A Family of Highly Capable Multimodal Models. [arXiv'23] [Paper]
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. [arXiv'23] [Paper] [Code]
- Auto-encoding variational bayes. [arXiv'13] [Paper]
- Neural discrete representation learning. [NeurIPS'17] [Paper] [Code]
- Denoising Diffusion Probabilistic Models. [NeurIPS'20] [Paper] [Code]
- Denoising diffusion implicit models. [ICLR'21] [Paper] [Code]
- U-Net: Convolutional Networks for Biomedical Image Segmentation. [MICCAI'15] [Paper] [Code]
- High-Resolution Image Synthesis with Latent Diffusion Models. [CVPR'22] [Paper] [Code]
- Consistency models. [arXiv'23] [Paper] [Code]
- Zero-shot text-to-image generation. [ICML'21] [Paper] [Code]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. [arXiv'24] [Paper] [Code]
- SAM 2: Segment Anything in Images and Videos. [arXiv'24] [Paper] [Code]
- Mobile Foundation Model as Firmware. [MobiCom'24] [Paper] [Code]
Resource-efficient Architectures
Efficient Attention
- Longformer: The long-document transformer. [arXiv'20] [Paper] [Code]
- ETC: Encoding Long and Structured Inputs in Transformers. [ACL'20] [Paper] [Code]
- Big bird: Transformers for longer sequences. [NeurIPS'20] [Paper] [Code]
- Efficient Attentions for Long Document Summarization. [NAACL'21] [Paper] [Code]
- MATE: Multi-view Attention for Table Transformer Efficiency. [EMNLP'21] [Paper] [Code]
- LittleBird: Efficient Faster & Longer Transformer for Question Answering. [arXiv'23] [Paper] [Code]
- Albert: A lite bert for self-supervised learning of language representations. [arXiv'19] [Paper] [Code]
- An efficient encoder-decoder architecture with top-down attention for speech separation. [ICLR'23] [Paper] [Code]
- Reformer: The Efficient Transformer. [ICLR'20] [Paper] [Code]
- Transformers are rnns: Fast autoregressive transformers with linear attention. [ICML'20] [Paper] [Code]
- Linformer: Self-Attention with Linear Complexity. [arXiv'20] [Paper] [Code]
- Luna: Linear unified nested attention. [NeurIPS'21] [Paper] [Code]
- Rethinking Attention with Performers. [arXiv'20] [Paper] [Code]
- PolySketchFormer: Fast Transformers via Sketches for Polynomial Kernels. [arXiv'23] [Paper]
- Mega: Moving Average Equipped Gated Attention. [ICLR'23] [Paper] [Code]
- Vision Transformer with Deformable Attention. [arXiv'22] [Paper] [Code]
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. [arXiv'21] [Paper] [Code]
- An attention free transformer. [arXiv'21] [Paper] [Code]
- Hyena hierarchy: Towards larger convolutional language models. [arXiv'23] [Paper]
- Perceiver: General perception with iterative attention. [ICML'21] [Paper] [Code]
- Scaling transformer to 1m tokens and beyond with rmt. [arXiv'23] [Paper]
- Recurrent memory transformer. [NeurIPS'22] [Paper] [Code]
- RWKV: Reinventing RNNs for the Transformer Era. [arXiv'23] [Paper] [Code]
- Retentive Network: A Successor to Transformer for Large Language Models. [arXiv'23] [Paper] [Code]
- Efficiently modeling long sequences with structured state spaces. [ICLR'22] [Paper] [Code]
- Hungry hungry hippos: Towards language modeling with state space models. [ICLR'23] [Paper] [Code]
- Resurrecting recurrent neural networks for long sequences. [arXiv'23] [Paper] [Code]
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces. [arXiv'23] [Paper] [Code]
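Several of the linear-attention papers above (e.g., "Transformers are RNNs", Performers, Linformer) share one core idea: replace the softmax with a kernel feature map so a key-value summary can be computed once and reused by every query, avoiding the quadratic cost in sequence length. Below is a minimal NumPy sketch of that idea using the elu+1 feature map from "Transformers are RNNs"; shapes and names are illustrative, not any paper's reference implementation.

```python
# Minimal NumPy sketch of kernelized (non-causal) linear attention.
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K: (n, d_k), V: (n, d_v). Cost is O(n * d_k * d_v) instead of O(n^2 * d)."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    KV = Kp.T @ V                      # (d_k, d_v), shared across all queries
    Z = Qp @ Kp.sum(axis=0)            # (n,), per-query normalization term
    return (Qp @ KV) / Z[:, None]      # (n, d_v)

# Toy usage
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (8, 4)
```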
Dynamic Neural Network
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. [JMLR'22] [Paper] [Code]
- Scaling vision with sparse mixture of experts. [NeurIPS'21] [Paper] [Code]
- Glam: Efficient scaling of language models with mixture-of-experts. [ICML'22] [Paper] [Code]
- Multimodal contrastive learning with limoe: the language-image mixture of experts. [NeurIPS'22] [Paper] [Code]
- Mistral 7B. [arXiv'23] [Paper] [Code]
- Fast Feedforward Networks. [arXiv'23] [Paper] [Code]
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts. [ACL'22] [Paper] [Code]
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. [arXiv'23] [Paper] [Code]
- Simplifying Transformer Blocks. [arXiv'23] [Paper] [Code]
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. [arXiv'23] [Paper] [Code]
- Bert loses patience: Fast and robust inference with early exit. [NeurIPS'20] [Paper] [Code]
- DeeBERT: Dynamic early exiting for accelerating BERT inference. [arXiv'20] [Paper] [Code]
- LGViT: Dynamic Early Exiting for Accelerating Vision Transformer. [MM'23] [Paper] [Code]
- Multi-Exit Vision Transformer for Dynamic Inference. [arXiv'21] [Paper]
- Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. [arXiv'23] [Paper]
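For the early-exit line of work above (DeeBERT, "BERT loses patience", LGViT, ...), the common skeleton is simple: attach a lightweight classifier to intermediate layers and stop as soon as its prediction is confident enough. A minimal NumPy sketch of that skeleton follows; the layers, heads, and confidence threshold are all toy stand-ins.

```python
# Minimal NumPy sketch of confidence-based early exiting.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_forward(x, layers, exit_heads, threshold=0.9):
    """layers: list of callables h -> h; exit_heads: list of callables h -> logits."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:          # confident enough: stop here
            return probs, i + 1               # prediction + number of layers used
    return probs, len(layers)                 # fell through: used the full model

# Toy usage with random linear layers/heads
rng = np.random.default_rng(0)
d, n_classes, depth = 16, 3, 6
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
Hs = [rng.standard_normal((d, n_classes)) for _ in range(depth)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in Ws]
heads = [lambda h, H=H: h @ H for H in Hs]
probs, used = early_exit_forward(rng.standard_normal(d), layers, heads)
print(used, probs.round(2))
```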
Diffusion-specific Optimizations
- Improved denoising diffusion probabilistic models. [arXiv'21] [Paper] [Code]
- Accelerating diffusion models via early stop of the diffusion process. [arXiv'22] [Paper] [Code]
- Denoising diffusion implicit models. [ICLR'21] [Paper] [Code]
- gDDIM: Generalized denoising diffusion implicit models. [arXiv'22] [Paper] [Code]
- Pseudo numerical methods for diffusion models on manifolds. [arXiv'22] [Paper] [Code]
- Elucidating the design space of diffusion-based generative models. [arXiv'22] [Paper] [Code]
- Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. [NeurIPS'22] [Paper] [Code]
- Progressive distillation for fast sampling of diffusion models. [arXiv'22] [Paper] [Code]
- Fast sampling of diffusion models with exponential integrator. [arXiv'22] [Paper] [Code]
- Score-based generative modeling through stochastic differential equations. [arXiv'20] [Paper] [Code]
- Learning fast samplers for diffusion models by differentiating through sample quality. [arXiv'22] [Paper]
- Redi: efficient learning-free diffusion inference via trajectory retrieval. [ICML'23] [Paper] [Code]
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models. [NSDI'24] [Paper]
- Salad: Part-level latent diffusion for 3d shape generation and manipulation. [ICCV'23] [Paper] [Code]
- Binary Latent Diffusion. [CVPR'23] [Paper] [Code]
- LD-ZNet: A latent diffusion approach for text-based image segmentation. [ICCV'23] [Paper] [Code]
- Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. [CVPR'23] [Paper] [Code]
- High-resolution image reconstruction with latent diffusion models from human brain activity. [CVPR'23] [Paper] [Code]
- Belfusion: Latent diffusion for behavior-driven human motion prediction. [ICCV'23] [Paper] [Code]
- Unified multi-modal latent diffusion for joint subject and text conditional image generation. [arXiv'23] [Paper]
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. [arXiv'23] [Paper] [Code]
- ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. [CVPR'23] [Paper] [Code]
- ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. [arXiv'23] [Paper]
- ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models. [arXiv'23] [Paper] [Code]
- Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach. [arXiv'23] [Paper] [Code]
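Many of the sampling-acceleration papers above build on the deterministic DDIM update, which jumps between non-adjacent timesteps instead of denoising one step at a time. The sketch below shows a single DDIM step (eta = 0) in NumPy under a standard linear beta schedule; `eps_model` is a dummy stand-in for a trained noise predictor.

```python
# Minimal NumPy sketch of one deterministic DDIM update (eta = 0).
import numpy as np

def ddim_step(x_t, t, t_prev, alpha_bar, eps_model):
    eps = eps_model(x_t, t)                                   # predicted noise
    x0 = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1.0 - alpha_bar[t_prev]) * eps

# Toy usage: a few evenly spaced steps out of a 1000-step schedule
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
eps_model = lambda x, t: np.zeros_like(x)                      # dummy predictor
x = np.random.default_rng(0).standard_normal(8)
timesteps = list(range(T - 1, -1, -T // 10))
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, t, t_prev, alpha_bar, eps_model)
```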
ViT-specific Optimizations
- LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference. [ICCV'21] [Paper] [Code]
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. [ICLR'22] [Paper] [Code]
- EfficientFormer: Vision Transformers at MobileNet Speed. [NeurIPS'22] [Paper] [Code]
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. [CVPR'23] [Paper] [Code]
- MetaFormer Is Actually What You Need for Vision. [CVPR'22] [Paper] [Code]
Resource-efficient Algorithms
Pre-training Algorithms
- Deduplicating Training Data Makes Language Models Better. [ACL'22] [Paper] [Code]
- TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection. [EMNLP'22] [Paper]
- Masked autoencoders are scalable vision learners. [CVPR'22] [Paper] [Code]
- MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers. [CVPR'23] [Paper] [Code]
- COPA: Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment. [MM'23] [Paper]
- Patchdropout: Economizing vision transformers using patch dropout. [WACV'23] [Paper] [Code]
- Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers. [CVPR'23] [Paper] [Code]
- Zero-Cost Proxies for Lightweight NAS. [ICLR'21] [Paper] [Code]
- ZiCo: Zero-shot NAS via inverse Coefficient of Variation on Gradients. [ICLR'23] [Paper] [Code]
- PASHA: Efficient HPO and NAS with Progressive Resource Allocation. [ICLR'23] [Paper] [Code]
- RankNAS: Efficient Neural Architecture Search by Pairwise Ranking. [EMNLP'21] [Paper]
- PreNAS: Preferred One-Shot Learning Towards Efficient Neural Architecture Search. [ICML'23] [Paper] [Code]
- ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices. [ICCV'23] [Paper] [Code]
- Efficient training of BERT by progressively stacking. [ICML'19] [Paper] [Code]
- On the Transformer Growth for Progressive BERT Training. [NAACL'21] [Paper] [Code]
- Staged training for transformer language models. [ICML'22] [Paper] [Code]
- Knowledge Inheritance for Pre-trained Language Models. [NAACL'22] [Paper] [Code]
- Learning to Grow Pretrained Models for Efficient Transformer Training. [ICLR'23] [Paper] [Code]
- Mesa: A memory-saving training framework for transformers. [arXiv'21] [Paper] [Code]
- GACT: Activation compressed training for generic network architectures. [ICML'22] [Paper] [Code]
Finetuning Algorithms
- Memory efficient continual learning with transformers. [NeurIPS'22] [Paper]
- Metatroll: Few-shot detection of state-sponsored trolls with transformer adapters. [WWW'23] [Paper] [Code]
- St-adapter: Parameter-efficient image-to-video transfer learning. [NeurIPS'22] [Paper] [Code]
- Parameter-efficient fine-tuning without introducing new latency. [arXiv'23] [Paper] [Code]
- Adamix: Mixture-of-adaptations for parameter-efficient model tuning. [arXiv'22] [Paper] [Code]
- Residual adapters for parameter-efficient asr adaptation to atypical and accented speech. [arXiv'21] [Paper]
- Make your pre-trained model reversible: From parameter to memory efficient fine-tuning. [arXiv'23] [Paper] [Code]
- Pema: Plug-in external memory adaptation for language models. [arXiv'23] [Paper]
- The power of scale for parameter-efficient prompt tuning. [arXiv'21] [Paper]
- Attempt: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. [EMNLP'22] [Paper] [Code]
- Mprompt: Exploring multi-level prompt tuning for machine reading comprehension. [arXiv'23] [Paper] [Code]
- Bioinstruct: Instruction tuning of large language models for biomedical natural language processing. [arXiv'23] [Paper] [Code]
- Decomposed prompt tuning via low-rank reparameterization. [arXiv'23] [Paper] [Code]
- A dual prompt learning framework for few-shot dialogue state tracking. [WWW'23] [Paper] [Code]
- User-aware prefix-tuning is a good learner for personalized image captioning. [arXiv'23] [Paper]
- Prefix-diffusion: A lightweight diffusion model for diverse image captioning. [arXiv'23] [Paper]
- Domain aligned prefix averaging for domain generalization in abstractive summarization. [arXiv'23] [Paper] [Code]
- Prefix propagation: Parameter-efficient tuning for long sequences. [arXiv'23] [Paper] [Code]
- Pip: Parse-instructed prefix for syntactically controlled paraphrase generation. [arXiv'23] [Paper] [Code]
- Towards building the federated gpt: Federated instruction tuning. [arXiv'23] [Paper] [Code]
- Domain-oriented prefix-tuning: Towards efficient and generalizable fine-tuning for zero-shot dialogue summarization. [arXiv'23] [Paper]
- Unified low-resource sequence labeling by sample-aware dynamic sparse finetuning. [arXiv'23] [Paper] [Code]
- On the effectiveness of parameter-efficient fine-tuning. [AAAI'23] [Paper] [Code]
- Sensitivity-aware visual parameter-efficient fine-tuning. [ICCV'23] [Paper] [Code]
- VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control. [ICCV'23] [Paper] [Code] [Project]
- Smartfrz: An efficient training framework using attention-based layer freezing. [ICLR'23] [Paper]
- Token mixing: parameter-efficient transfer learning from image-language to video-language. [AAAI'23] [Paper] [Code]
- One-for-all: Generalized lora for parameter-efficient fine-tuning. [arXiv'23] [Paper] [Code]
- Dsee: Dually sparsity-embedded efficient tuning of pre-trained language models. [arXiv'21] [Paper] [Code]
- Longlora: Efficient fine-tuning of long-context large language models. [arXiv'23] [Paper] [Code]
- Qlora: Efficient finetuning of quantized llms. [arXiv'23] [Paper] [Code]
- Pela: Learning parameter-efficient models with low-rank approximation. [arXiv'23] [Paper] [Code]
- Efficientdm: Efficient quantization-aware fine-tuning of low-bit diffusion models. [arXiv'23] [Paper]
- Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. [arXiv'23] [Paper]
- Loftq: Lora-fine-tuning-aware quantization for large language models. [arXiv'23] [Paper] [Code]
- Full parameter fine-tuning for large language models with limited resources. [arXiv'23] [Paper] [Code]
- Fine-tuning language models with just forward passes. [arXiv'23] [Paper] [Code]
- Efficient transformers with dynamic token pooling. [arXiv'23] [Paper] [Code]
- Qa-lora: Quantization-aware low-rank adaptation of large language models. [arXiv'23] [Paper] [Code]
- Efficient low-rank backpropagation for vision transformer adaptation. [arXiv'23] [Paper]
- Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices. [arXiv'23] [Paper]
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models. [arXiv'24] [Paper] [Code]
- DoRA: Weight-Decomposed Low-Rank Adaptation. [ICML'24] [Paper] [Code]
- LoRA+: Efficient Low Rank Adaptation of Large Models. [ICML'24] [Paper]
- Towards Green AI in Fine-Tuning Large Language Models via Adaptive Backpropagation. [ICLR'24] [Paper]
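Many of the parameter-efficient fine-tuning entries above (QLoRA, LongLoRA, QA-LoRA, DoRA, LoRA+, ...) are variations on the LoRA idea: freeze the pretrained weight and learn only a low-rank additive update. Below is a minimal NumPy sketch with illustrative rank and scaling choices, not any library's API.

```python
# Minimal NumPy sketch of a LoRA-style linear layer: y = W0 x + (alpha/r) * B A x.
import numpy as np

class LoRALinear:
    def __init__(self, W0, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W0.shape
        self.W0 = W0                                    # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
        self.B = np.zeros((d_out, r))                   # trainable, zero-init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Only A and B receive gradients during fine-tuning; W0 stays fixed.
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(np.random.default_rng(1).standard_normal((32, 64)))
print(layer.forward(np.ones(64)).shape)  # (32,)
```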
Inference Algorithms
- Fast inference from transformers via speculative decoding. [ICML'23] [Paper] [Code]
- Accelerating Large Language Model Decoding with Speculative Sampling. [arXiv'23] [Paper] [Code]
- SpecTr: Fast Speculative Decoding via Optimal Transport. [NeurIPS'23] [Paper] [Code]
- ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. [EMNLP'20] [Paper] [Code]
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. [arXiv'23] [Paper] [Code]
- LLMCad: Fast and Scalable On-device Large Language Model Inference. [arXiv'23] [Paper]
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads. [URL]
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. [URL]
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification. [arXiv'23] [Paper] [Code]
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference. [arXiv'23] [Paper]
- Inference with Reference: Lossless Acceleration of Large Language Models. [arXiv'23] [Paper] [Code]
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding. [arXiv'23] [Paper] [Code]
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. [EMNLP'23] [Paper] [Code]
- Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. [EMNLP'22] [Paper] [Code]
- EntropyRank: Unsupervised keyphrase extraction via side-information optimization for language model-based text compression. [ICML'23] [Paper]
- LLMZip: Lossless Text Compression using Large Language Models. [arXiv'23] [Paper] [Code]
- In-context Autoencoder for Context Compression in a Large Language Model. [arXiv'23] [Paper] [Code]
- Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models. [arXiv'23] [Paper]
- Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning. [arXiv'23] [Paper]
- PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. [ICML'20] [Paper] [Code]
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. [ACL'21] [Paper] [Code]
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. [NAACL'21] [Paper] [Code]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. [NeurIPS'21] [Paper] [Code]
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition. [CVPR'21] [Paper]
- AdaViT: Adaptive Tokens for Efficient Vision Transformer. [CVPR'22] [Paper] [Code]
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. [ECCV'22] [Paper] [Code]
- PuMer: Pruning and Merging Tokens for Efficient Vision Language Models. [ACL'23] [Paper] [Code]
- H2o: Heavy-hitter oracle for efficient generative inference of large language models. [NeurIPS'23] [Paper] [Code]
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. [arXiv'23] [Paper] [Code]
- Landmark Attention: Random-Access Infinite Context Length for Transformers. [NeurIPS'23] [Paper] [Code]
- Train short, test long: Attention with linear biases enables input length extrapolation. [ICLR'22] [Paper] [Code]
- A Length-Extrapolatable Transformer. [ACL'22] [Paper] [Code]
- CLEX: Continuous Length Extrapolation for Large Language Models. [arXiv'23] [Paper] [Code]
- Extending Context Window of Large Language Models via Positional Interpolation. [arXiv'23] [Paper]
- YaRN: Efficient Context Window Extension of Large Language Models. [arXiv'23] [Paper] [Code]
- Functional interpolation for relative positions improves long context transformers. [arXiv'23] [Paper]
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training. [arXiv'23] [Paper] [Code]
- Recurrent Memory Transformer. [NeurIPS'22] [Paper] [Code]
- Block-Recurrent Transformers. [NeurIPS'22] [Paper] [Code]
- Memformer: A Memory-Augmented Transformer for Sequence Modeling. [ACL'22] [Paper] [Code]
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models. [arXiv'23] [Paper] [Code]
- Efficient Streaming Language Models with Attention Sinks. [arXiv'23] [Paper] [Code]
- Parallel context windows for large language models. [ACL'23] [Paper] [Code]
- LongNet: Scaling Transformers to 1,000,000,000 Tokens. [arXiv'23] [Paper] [Code]
- Efficient Long-Text Understanding with Short-Text Models. [TACL'23] [Paper] [Code]
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. [arXiv'24] [Paper] [Code]
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images. [ECCV'24] [Paper] [Code]
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [arXiv'24] [Paper] [Code]
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. [ICLR'24] [Paper]
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. [arXiv'24] [Paper] [Code]
- LLM as a System Service on Mobile Devices. [arXiv'24] [Paper]
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. [arXiv'24] [Paper] [Code]
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. [SIGCOMM'24] [Paper] [Code]
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. [OSDI'24] [Paper]
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. [ATC'24] [Paper]
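The speculative-decoding papers at the top of this list share the same accept/reject rule: a small draft model proposes several tokens, the large target model verifies them, and rejected positions are resampled from the residual distribution so the output distribution matches the target model exactly. Below is a minimal Python sketch of one speculative step; `draft_next` and `target_next` are toy stand-ins for the two models.

```python
# Minimal sketch of the speculative-decoding accept/reject rule.
import numpy as np

def speculative_step(prefix, k, draft_next, target_next, rng):
    """draft_next/target_next: token list -> probability vector over the vocabulary."""
    # 1) Draft model proposes k tokens autoregressively.
    drafted, q_list = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_next(ctx)
        tok = rng.choice(len(q), p=q)
        drafted.append(tok); q_list.append(q); ctx.append(tok)
    # 2) Target model verifies each drafted token (one batched call in practice).
    accepted = []
    ctx = list(prefix)
    for tok, q in zip(drafted, q_list):
        p = target_next(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):   # accept with prob min(1, p/q)
            accepted.append(tok); ctx.append(tok)
        else:                                          # reject: resample from max(0, p - q)
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(p), p=residual))
            return accepted
    # 3) All drafts accepted: sample one bonus token from the target.
    p = target_next(ctx)
    accepted.append(rng.choice(len(p), p=p))
    return accepted

# Toy usage with a uniform distribution standing in for both models
rng = np.random.default_rng(0)
V = 5
toy = lambda ctx: np.full(V, 1.0 / V)
print(speculative_step([0], k=3, draft_next=toy, target_next=toy, rng=rng))
```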
Model Compression
- From Dense to Sparse: Contrastive Pruning for Better Pre-Trained Language Model Compression. [AAAI'22] [Paper]
- Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models. [NeurIPS'22] [Paper] [Code]
- ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. [HPCA'23] [Paper] [Code]
- A Simple and Effective Pruning Approach for Large Language Models. [arXiv'23] [Paper] [Code]
- Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers. [ICLR'20] [Paper] [Code]
- UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. [ICML'23] [Paper] [Code]
- Sparsegpt: Massive language models can be accurately pruned in one-shot. [arXiv'23] [Paper] [Code]
- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models. [ICASSP'24] [Paper] [Code]
- BiT: Robustly Binarized Multi-distilled Transformer. [NeurIPS'22] [Paper] [Code]
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models. [ACL'23] [Paper] [Code]
- Block-Skim: Efficient Question Answering for Transformer. [AAAI'22] [Paper] [Code]
- Depgraph: Towards any structural pruning. [CVPR'23] [Paper] [Code]
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance. [ICML'22] [Paper] [Code]
- Differentiable joint pruning and quantization for hardware efficiency. [ECCV'20] [Paper]
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. [HPCA'21] [Paper] [Code]
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. [arXiv'23] [Paper] [Code]
- Accelerated sparse neural training: A provable and efficient method to find n: m transposable masks. [NeurIPS'21] [Paper] [Code]
- Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning. [arXiv'23] [Paper]
- What matters in the structured pruning of generative language models?. [arXiv'23] [Paper]
- LLM-Pruner: On the Structural Pruning of Large Language Models. [NeurIPS'23] [Paper] [Code]
- Deja vu: Contextual sparsity for efficient llms at inference time. [ICML'23] [Paper] [Code]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. [arXiv'23] [Paper] [Code]
- Distilling Large Vision-Language Model with Out-of-Distribution Generalizability. [ICCV'23] [Paper] [Code]
- DIME-FM : DIstilling Multimodal and Efficient Foundation Models. [ICCV'23] [Paper] [Code]
- MixKD: Towards Efficient Distillation of Large-scale Language Models. [arXiv'20] [Paper]
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression. [arXiv'22] [Paper] [Code]
- DISTILLM: Towards Streamlined Distillation for Large Language Models. [arXiv'24] [Paper] [Code]
- Propagating Knowledge Updates to LMs Through Distillation. [arXiv'23] [Paper]
- GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models. [arXiv'23] [Paper] [Code]
- Knowledge Distillation of Large Language Models. [arXiv'23] [Paper]
- Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. [ACL'23] [Paper] [Code]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. [ACL'23] [Paper] [Code]
- Teaching Small Language Models to Reason. [ACL'22] [Paper]
- Explanations from Large Language Models Make Small Reasoners Better. [arXiv'22] [Paper]
- Lion: Adversarial distillation of closed-source large language model. [arXiv'23] [Paper] [Code]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. [arXiv'23] [Paper] [Code]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. [arXiv'22] [Paper] [Code]
- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. [arXiv'22] [Paper]
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. [NeurIPS'22] [Paper] [Code]
- GPTQ: accurate post-training quantization for generative pre-trained transformers. [ICLR'23] [Paper] [Code]
- Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction. [ICML'22] [Paper] [Code]
- SqueezeLLM: Dense-and-Sparse Quantization. [arXiv'23] [Paper] [Code]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. [arXiv'23] [Paper] [Code]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. [arXiv'23] [Paper] [Code]
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees. [NeurIPS'23] [Paper] [Code]
- OWQ: Lessons learned from activation outliers for weight quantization in large language models. [arXiv'23] [Paper] [Code]
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs. [arXiv'23] [Paper]
- BinaryBERT: Pushing the Limit of BERT Quantization. [ACL'21] [Paper] [Code]
- I-BERT: Integer-only BERT Quantization. [ICML'21] [Paper] [Code]
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models. [NeurIPS'22] [Paper] [Code]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. [ICML'23] [Paper] [Code]
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [NeurIPS'22] [Paper] [Code]
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. [NeurIPS'22] [Paper] [Code]
- RPTQ: Reorder-based Post-training Quantization for Large Language Models. [arXiv'23] [Paper] [Code]
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. [ACL'23] [Paper] [Code]
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. [arXiv'23] [Paper]
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. [arXiv'23] [Paper]
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference. [ICCV'23] [Paper] [Code]
- Q-Diffusion: Quantizing Diffusion Models. [ICCV'23] [Paper] [Code]
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization. [ISCA'23] [Paper]
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models. [arXiv'23] [Paper]
- Integer or floating point? new outlooks for low-bit quantization on large language models. [arXiv'23] [Paper]
- Oscillation-free Quantization for Low-bit Vision Transformers. [ICML'23] [Paper] [Code]
- FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization. [ICML'23] [Paper]
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. [arXiv'23] [Paper] [Code]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. [arXiv'23] [Paper] [Code]
- Compression of generative pre-trained language models via quantization. [ACL'22] [Paper]
- BitNet: Scaling 1-bit Transformers for Large Language Models. [arXiv'23] [Paper] [Code]
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers. [arXiv'23] [Paper]
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. [arXiv'23] [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization. [NeurIPS'23] [Paper] [Code]
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition. [arXiv'23] [Paper]
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression. [arXiv'23] [Paper]
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with Linear Taylor Attention. [HPCA'23] [Paper] [Code]
- SpinQuant: LLM Quantization with Learned Rotations. [arXiv'24] [Paper]
- QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. [arXiv'24] [Paper]
- I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models. [arXiv'24] [Paper]
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. [arXiv'24] [Paper] [Code]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. [arXiv'24] [Paper] [Code]
- Achieving Sparse Activation in Small Language Models. [arXiv'24] [Paper] [Code]
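As a common baseline behind the post-training quantization papers above, per-channel symmetric weight-only quantization maps each output channel to int8 with its own scale; methods such as GPTQ, AWQ, and SmoothQuant add calibration and error compensation on top of this recipe. A minimal NumPy sketch with illustrative shapes:

```python
# Minimal NumPy sketch of per-channel symmetric weight-only INT8 quantization.
import numpy as np

def quantize_per_channel(W, n_bits=8):
    """W: (out_channels, in_features). Returns int8 weights and per-row scales."""
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for int8
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero rows
    W_q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
W_q, scale = quantize_per_channel(W)
err = np.abs(W - dequantize(W_q, scale)).max()
print(W_q.dtype, f"max abs error = {err:.4f}")
```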
Resource-efficient Systems
Distributed Training
- Optimizing Dynamic Neural Networks with Brainstorm. [OSDI'23] [Paper] [Code]
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. [SOSP'23] [Paper]
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. [SOSP'23] [Paper] [Code]
- Varuna: Scalable, Low-cost Training of Massive Deep Learning Models. [EuroSys'22] [Paper] [Code]
- HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. [ATC'20] [Paper]
- ZeRO-Offload: Democratizing Billion-Scale Model Training. [ATC'21] [Paper] [Code]
- Whale: Efficient Giant Model Training over Heterogeneous GPUs. [ATC'22] [Paper] [Code]
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. [ATC'23] [Paper] [Code]
- Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs. [FAST'21] [Paper]
- FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. [FAST'21] [Paper] [Code]
- Sequence Parallelism: Long Sequence Training from System Perspective. [ACL'23] [Paper] [Code]
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. [ASPLOS'23] [Paper]
- Mobius: Fine Tuning Large-scale Models on Commodity GPU Servers. [ASPLOS'23] [Paper]
- Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. [ASPLOS'23] [Paper] [Code]
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. [ASPLOS'22] [Paper] [Code]
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. [SIGMOD'23] [Paper] [Code]
- On Optimizing the Communication of Model Parallelism. [MLSys'23] [Paper] [Code]
- Reducing Activation Recomputation in Large Transformer Models. [MLSys'23] [Paper] [Code]
- PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices. [MLSys'23] [Paper] [Code]
- Breadth-First Pipeline Parallelism. [MLSys'23] [Paper]
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. [MLSys'23] [Paper] [Code]
- Tutel: Adaptive Mixture-of-Experts at Scale. [MLSys'23] [Paper] [Code]
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. [ICLR'20] [Paper] [Code]
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. [VLDB'23] [Paper] [Code]
- MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud. [VLDB'23] [Paper]
- Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. [ATC'21] [Paper] [Code]
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. [NSDI'23] [Paper] [Code]
- Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models. [SIGCOMM'23] [Paper]
- MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. [NSDI'22] [Paper] [Code]
- Zero: Memory optimizations toward training trillion parameter models. [SC'20] [Paper] [Code]
- Efficient large-scale language model training on gpu clusters using megatron-lm. [SC'21] [Paper] [Code]
- Alpa: Automating inter-and Intra-Operator parallelism for distributed deep learning. [OSDI'22] [Paper] [Code]
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. [ICPP'23] [Paper] [Code]
- Megatron-lm: Training multi-billion parameter language models using model parallelism. [arXiv'19] [Paper] [Code]
- Pytorch FSDP: experiences on scaling fully sharded data parallel. [arXiv'23] [Paper] [Code]
- DeepSpeed. [URL]
- Huggingface PEFT. [URL]
- FairScale. [URL]
- OpenLLM: Operating LLMs in production. [URL]
Federated Learning
- Flower: A friendly federated learning research framework. [arXiv'20] [Paper] [Code]
- Fedml: A research library and benchmark for federated machine learning. [arXiv'20] [Paper] [Code]
- FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. [NAACL'22] [Paper] [Code]
- FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models. [arXiv'23] [Paper] [Code]
- Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. [arXiv'23] [Paper] [Code]
- Federated Self-supervised Speech Representations: Are We There Yet?. [arXiv'23] [Paper]
- Towards Building the Federated GPT: Federated Instruction Tuning. [arXiv'23] [Paper] [Code]
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly. [arXiv'23] [Paper]
- Privacy-Preserving Fine-Tuning of Artificial Intelligence (AI) Foundation Models with Federated Learning, Differential Privacy, Offsite Tuning, and Parameter-Efficient Fine-Tuning (PEFT). [TechRxiv'23] [Paper]
- Efficient federated learning for modern nlp. [MobiCom'23] [Paper] [Code]
- Federated few-shot learning for mobile NLP. [MobiCom'23] [Paper] [Code]
- Low-parameter federated learning with large language models. [arXiv'23] [Paper] [Code]
- FedPrompt: Communication-Efficient and Privacy-Preserving Prompt Tuning in Federated Learning. [ICASSP'23] [Paper]
- Reducing Communication Overhead in Federated Learning for Pre-trained Language Models Using Parameter-Efficient Finetuning. [Conference on Lifelong Learning Agents'23] [Paper]
- FEDBFPT: An efficient federated learning framework for BERT further pre-training. [AAAI'23] [Paper] [Code]
- FedOBD: Opportunistic Block Dropout for Efficiently Training Large-scale Neural Networks through Federated Learning. [arXiv'22] [Paper]
- FedBERT: When federated learning meets pre-training. [TIST'22] [Paper] [Code]
- FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning. [CVPR'23] [Paper] [Code]
- Federated fine-tuning of billion-sized language models across mobile devices. [arXiv'23] [Paper] [Code]
- Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models. [arXiv'23] [Paper]
- Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes. [arXiv'23] [Paper]
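Most of the federated fine-tuning papers above start from FedAvg-style aggregation, often exchanging only adapter or LoRA parameters to cut communication. A minimal NumPy sketch of the aggregation step, with random client updates as stand-ins:

```python
# Minimal NumPy sketch of FedAvg-style weighted aggregation.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
global_w = np.zeros(10)
for rnd in range(3):                                   # a few synchronous rounds
    updates = [global_w + 0.1 * rng.standard_normal(10) for _ in range(4)]
    global_w = fedavg(updates, client_sizes=[100, 50, 200, 150])
print(global_w.round(3))
```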
Serving on Cloud
- Orca: A Distributed Serving System for Transformer-Based Generative Models. [OSDI'22] [Paper]
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. [arXiv'23] [Paper]
- Fast Distributed Inference Serving for Large Language Models. [arXiv'23] [Paper]
- FlexGen: high-throughput generative inference of large language models with a single GPU. [ICML'23] [Paper] [Code]
- DeepSpeed-FastGen. [URL]
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting. [arXiv'23] [Paper]
- Efficiently Scaling Transformer Inference. [MLSys'23] [Paper]
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [SC'22] [Paper]
- FlashDecoding++: Faster Large Language Model Inference on GPUs. [arXiv'23] [Paper]
- Flash-Decoding for long-context inference. [URL]
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. [IPDPS'23] [Paper] [Code]
- SpotServe: Serving Generative Large Language Models on Preemptible Instances. [ASPLOS'24] [Paper] [Code]
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment. [arXiv'23] [Paper] [Code]
- Punica: Multi-Tenant LoRA Serving. [arXiv'23] [Paper] [Code]
- SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models. [arXiv'23] [Paper] [Code]
- Efficient Memory Management for Large Language Model Serving with PagedAttention. [SOSP'23] [Paper] [Code]
- Efficiently Programming Large Language Models using SGLang. [arXiv'23] [Paper]
- Batched Low-Rank Adaptation of Foundation Models. [ICLR'24] [Paper]
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. [OSDI'24] [Paper] [Code]
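To make the PagedAttention idea ("Efficient Memory Management for Large Language Model Serving with PagedAttention" above) concrete, the sketch below shows a toy block allocator that maps each sequence's KV cache onto fixed-size blocks drawn from a shared pool; block and pool sizes are illustrative, and a real engine would back these blocks with GPU memory.

```python
# Toy sketch of a PagedAttention-style KV-cache block allocator.
class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))        # free list of physical block ids
        self.tables = {}                           # seq_id -> list of block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Reserve space for one more token; allocate a new block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:   # last block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free.pop())
        return table[-1]                               # physical block holding this token

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))  # blocks become reusable immediately

alloc = BlockAllocator(num_blocks=8, block_size=4)
for i in range(6):                                     # decode 6 tokens for sequence 0
    alloc.append_token(seq_id=0, num_tokens_so_far=i)
print(alloc.tables[0], len(alloc.free))                # two blocks used, six free
alloc.release(0)
```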
Serving on Edge
- EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge. [SenSys'23] [Paper]
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. [arXiv'23] [Paper]
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping. [arXiv'23] [Paper]
- LLMCad: Fast and Scalable On-device Large Language Model Inference. [arXiv'23] [Paper]
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. [ASPLOS'23] [Paper]
- Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization. [MLSys'23] [Paper]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. [arXiv'23] [Paper] [Code]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. [arXiv'24] [Paper] [Code]
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory. [arXiv'23] [Paper]
- On-Device Language Models: A Comprehensive Review. [arXiv'24] [Paper]
- LLM as a System Service on Mobile Devices. [arXiv'24] [Paper]
- ELMS: Elasticized Large Language Models On Mobile Devices. [arXiv'24] [Paper]