awesome-long-context

License: Apache 2.0

Efficient Inference, Sparse Attention, Efficient KV Cache

[2020/01] Reformer: The Efficient Transformer

[2020/06] Linformer: Self-Attention with Linear Complexity

[2022/12] Parallel Context Windows for Large Language Models

[2023/04] Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering

[2023/05] Landmark Attention: Random-Access Infinite Context Length for Transformers

[2023/05] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

[2023/06] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

[2023/06] H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

[2023/07] Scaling In-Context Demonstrations with Structured Attention

[2023/08] LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

[2023/09] Efficient Streaming Language Models with Attention Sinks

[2023/10] HyperAttention: Long-context Attention in Near-Linear Time

[2023/10] TRAMS: Training-free Memory Selection for Long-range Language Modeling
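
The papers above share one mechanic: keep only the keys/values that matter and drop the rest. As a minimal sketch (not any paper's reference implementation), the eviction policy popularized by the attention-sink work listed above keeps a few initial "sink" tokens plus a recent sliding window; the cache layout and function name below are illustrative assumptions.

```python
def evict_kv_cache(cache, n_sink=4, window=1024):
    """Trim a KV cache to `n_sink` initial tokens plus the `window` most
    recent tokens, in the spirit of attention-sink-style streaming inference.

    `cache` is assumed to be a list of per-token (key, value) pairs ordered
    by position; real caches are tensors, but the policy is the same.
    """
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]
```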

External Memory & Information Retrieval

[2023/06] Augmenting Language Models with Long-Term Memory

[2023/06] Long-range Language Modeling with Self-retrieval

[2023/07] Focused Transformer: Contrastive Training for Context Scaling
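
These papers pair a language model with an external store that is searched at inference time. As a rough illustration only (the `ChunkMemory` class and the bag-of-characters embedder are made-up placeholders, not any paper's API), a toy chunk memory with cosine-similarity retrieval might look like:

```python
import numpy as np

class ChunkMemory:
    """Toy external memory: store embedded text chunks, retrieve the top-k
    most similar ones to prepend to the prompt."""
    def __init__(self, embed):
        self.embed = embed          # callable: str -> 1-D np.ndarray
        self.chunks, self.vecs = [], []

    def add(self, chunk):
        v = self.embed(chunk)
        self.chunks.append(chunk)
        self.vecs.append(v / (np.linalg.norm(v) + 1e-8))

    def retrieve(self, query, k=3):
        q = self.embed(query)
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = np.array(self.vecs) @ q            # cosine similarities
        top = np.argsort(-sims)[:k]
        return [self.chunks[i] for i in top]

# Toy usage with a bag-of-characters "embedding" just to make the sketch runnable.
embed = lambda s: np.array([s.count(c) for c in "abcdefghijklmnopqrstuvwxyz"], dtype=float)
mem = ChunkMemory(embed)
mem.add("long documents need external memory")
mem.add("rotary embeddings encode position")
print(mem.retrieve("memory for long documents", k=1))
```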

Positional Encoding

[2021/04] RoFormer: Enhanced Transformer with Rotary Position Embedding

[2021/08] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

[2022/03] Transformer Language Models without Positional Encodings Still Learn Positional Information

[2022/05] KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

[2022/12] Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

[2023/05] The Impact of Positional Encoding on Length Generalization in Transformers

[2023/05] Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

[2023/06] Extending Context Window of Large Language Models via Positional Interpolation

[2023/07] Exploring Transformer Extrapolation

[2023/09] YaRN: Efficient Context Window Extension of Large Language Models

[2023/09] Effective Long-Context Scaling of Foundation Models

[2023/10] CLEX: Continuous Length Extrapolation for Large Language Models
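
Several of the entries above (RoFormer, Positional Interpolation, YaRN) change how rotary positions are computed so that a model trained on short sequences can read longer ones. The numpy sketch below shows rotary embeddings with linear position interpolation; it assumes the interleaved dimension pairing used in RoFormer and leaves out YaRN's frequency-dependent scaling.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary angles for each (position, frequency) pair.

    `scale < 1` implements linear positional interpolation: positions are
    shrunk so an extended context maps back into the trained range.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)     # (dim/2,)
    return np.outer(positions * scale, inv_freq)         # (seq, dim/2)

def apply_rope(x, positions, base=10000.0, scale=1.0):
    """Rotate query/key vectors x of shape (seq, dim) by their positions."""
    ang = rope_angles(positions, x.shape[1], base, scale)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Extend a model trained on 4k positions to 16k by interpolating positions.
train_len, target_len = 4096, 16384
x = np.random.randn(8, 64)
q = apply_rope(x, positions=np.arange(8) + 16000, scale=train_len / target_len)
```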

Context Compression

[2022/12] Structured Prompting: Scaling In-Context Learning to 1,000 Examples

[2023/05] Efficient Prompting via Dynamic In-Context Learning

[2023/05] Adapting Language Models to Compress Contexts

[2023/05] Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

[2023/07] In-context Autoencoder for Context Compression in a Large Language Model

[2023/10] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

[2023/10] RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation

[2023/10] Compressing Context to Enhance Inference Efficiency of Large Language Models

[2023/10] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

[2023/10] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

[2023/10] TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
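
The common thread in this section is scoring context tokens and keeping only the informative ones. The sketch below shows the self-information filter used, in various forms, by the filtering and LLMLingua-style papers above: rank tokens by -log p under a small scoring model and keep the top fraction. Obtaining the log-probabilities, and the exact granularity (tokens vs. phrases), is left out; `compress_context` is an illustrative name, not any paper's API.

```python
def compress_context(tokens, logprobs, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of tokens with the highest self-information.

    `tokens` are the context tokens and `logprobs[i]` is log p(tokens[i] | tokens[:i])
    under a small scoring language model. Self-information is -log p, so rare,
    surprising tokens are retained and highly predictable ones are dropped.
    """
    self_info = [-lp for lp in logprobs]
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: self_info[i], reverse=True)[:k])
    return [t for i, t in enumerate(tokens) if i in keep]  # preserve original order
```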

Architecture Variants

[2021/11] Efficiently Modeling Long Sequences with Structured State Spaces

[2022/12] Hungry Hungry Hippos: Towards Language Modeling with State Space Models

[2023/02] Hyena Hierarchy: Towards Larger Convolutional Language Models

[2023/04] Scaling Transformer to 1M tokens and beyond with RMT

[2023/06] Block-State Transformer

[2023/07] Retentive Network: A Successor to Transformer for Large Language Models

[2023/10] Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors

[2023/12] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
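
State-space models (S4, H3, Mamba) and related recurrent architectures replace attention with a linear recurrence that runs in time proportional to sequence length. A stripped-down, single-channel version of that recurrence is sketched below; real layers add discretization, many channels, convolutional/parallel-scan implementations, and (in Mamba) input-dependent parameters.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence shared by S4/Mamba-style layers:
        h_t = A * h_{t-1} + B * x_t,   y_t = C . h_t
    x: (seq,) scalar input channel; A, B, C: (state,) diagonal parameters.
    """
    state = np.zeros_like(A)
    ys = []
    for x_t in x:
        state = A * state + B * x_t
        ys.append(float(C @ state))
    return np.array(ys)

# Toy usage: a 16-dimensional state scanned over a length-256 input.
rng = np.random.default_rng(0)
A = np.full(16, 0.9)
B, C = rng.standard_normal(16), rng.standard_normal(16)
y = ssm_scan(rng.standard_normal(256), A, B, C)
```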

White-Box Analysis

[2019/06] Theoretical Limitations of Self-Attention in Neural Sequence Models

[2020/06] $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

[2022/02] Overcoming a Theoretical Limitation of Self-Attention

[2023/05] Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer

[2023/10] JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention

Long Context Modeling

[2023/07] LongNet: Scaling Transformers to 1,000,000,000 Tokens

[2023/08] Giraffe: Adventures in Expanding Context Lengths in LLMs

[2023/09] LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

[2023/10] Mistral 7B
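
Mistral 7B in this section relies on sliding-window attention, and LongNet dilates a similar local pattern. The sketch below builds only the basic windowed causal mask (the function name and boolean-mask representation are illustrative assumptions, not either model's code).

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean causal mask where each token attends only to itself and the
    previous `window - 1` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
```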

Benchmarks

[2020/11] Long Range Arena: A Benchmark for Efficient Transformers

[2022/01] SCROLLS: Standardized CompaRison Over Long Language Sequences

[2023/01] LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

[2023/05] ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

[2023/08] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

[2023/10] M4LE: A Multi-Ability Multi-Range Multitask Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

Data

[2023/12] Structured Packing in LLM Training Improves Long Context Utilization

[2024/01] LongAlign: A Recipe for Long Context Alignment of Large Language Models

[2024/02] Data Engineering for Scaling Language Models to 128K Context

Others

[2023/07] Zero-th Order Algorithm for Softmax Attention Optimization

[2023/10] (Dynamic) Prompting might be all you need to repair Compressed LLMs