# llm-paper-notes
Notes from the Latent Space paper club. Follow along or start your own!
- Attention Is All You Need: Query, Key, and Value are all you need* (*Also position embeddings, multiple heads, feed-forward layers, skip-connections, etc.)
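
A minimal numpy sketch of the scaled dot-product attention at the paper's core (my illustration, not code from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy usage: 4 query positions, 6 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```
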
- GPT: Improving Language Understanding by Generative Pre-Training: Decoder is all you need* (*Also, pre-training + finetuning)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: Encoder is all you need*. Left-to-right language modeling is NOT all you need. (*Also, pre-training + finetuning)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: Encoder-only or decoder-only is NOT all you need, though text-to-text is all you need* (*Also, pre-training + finetuning)
- GPT2: Language Models are Unsupervised Multitask Learners: Unsupervised pre-training is all you need?!
- GPT3: Language Models are Few-Shot Learners: Unsupervised pre-training + a few* examples is all you need. (*From 5 examples, in Conversational QA, to 50 examples in Winogrande, PhysicalQA, and TriviaQA)
- Scaling Laws for Neural Language Models: Larger models trained on less data* are what you need. (*10x more compute should be spent on a 5.5x larger model and 1.8x more tokens)
- Chinchilla: Training Compute-Optimal Large Language Models: Smaller models trained on more data* are what you need. (*10x more compute should be spent on a 3.2x larger model and 3.2x more tokens)
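
A quick sanity check on the two splits above, assuming training compute scales roughly as C ∝ N·D (parameters × tokens), so each pair of multipliers should multiply back to ~10x:

```python
# Kaplan et al. split: 10x compute -> 5.5x larger model, 1.8x more tokens.
print(5.5 * 1.8)   # 9.9, i.e. roughly 10x compute
# Chinchilla split: 10x compute -> 3.2x larger model, 3.2x more tokens.
print(3.2 * 3.2)   # 10.24, i.e. roughly 10x compute
```
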
- LLaMA: Open and Efficient Foundation Language Models: Smaller models trained longer, on public data, are all you need
- InstructGPT: Training language models to follow instructions with human feedback: 40 labelers are all you need* (*Plus supervised fine-tuning, reward modeling, and PPO)
- LoRA: Low-Rank Adaptation of Large Language Models: One rank is all you need
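
A hedged PyTorch sketch of the idea (my illustration): freeze the pretrained weight W and learn only a low-rank update BA, here with rank r=1 to match the quip.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (x A^T) B^T * (alpha / r), with W frozen and only A, B trained."""
    def __init__(self, d_in, d_out, r=1, alpha=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen pretrained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor, small init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init, so the update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T * self.scale
```
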
- QLoRA: Efficient Finetuning of Quantized LLMs: 4-bit is all you need* (*Plus double quantization and paged optimizers)
- DPR: Dense Passage Retrieval for Open-Domain Question Answering: Dense embeddings are all you need* (*Also, high precision retrieval)
- RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: Semi-parametric models* are all you need (*Dense vector retrieval as non-parametric component; pre-trained LLM as parametric component)
- RETRO: Improving language models by retrieving from trillions of tokens: Retrieval over input chunks and chunked cross-attention are all you need
- Internet-augmented language models through few-shot prompting for open-domain question answering: Google Search as retrieval is all you need
- HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels: LLM-generated, hypothetical documents are all you need
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: For-loops in SRAM are all you need
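
The "for-loops" refer to tiling: key/value blocks are streamed through fast SRAM while a running (online) softmax keeps the result exact. A simplified single-query numpy sketch of that accumulation (an illustration; the real kernel is a fused CUDA implementation):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=2):
    """Exact softmax(q K^T / sqrt(d)) V, computed one K/V block at a time."""
    d = q.shape[-1]
    m, l = -np.inf, 0.0                     # running max and softmax normalizer
    acc = np.zeros(V.shape[-1])             # running weighted sum of values
    for i in range(0, len(K), block):       # stream key/value blocks (the "for-loop")
        s = K[i:i+block] @ q / np.sqrt(d)   # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)      # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[i:i+block]
        m = m_new
    return acc / l
```
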
- ALiBi; Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: Constant bias on the query-key dot-product is all you need* (*Also hyperparameter m and cached Q, K, V representations)
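
A sketch of the bias for one head at query position `pos` (my illustration; in the paper, the slope m follows a geometric sequence across heads):

```python
import numpy as np

def alibi_scores(q, K, m, pos):
    """Attention logits with ALiBi: q·k / sqrt(d) minus m times key distance."""
    d = q.shape[-1]
    distance = pos - np.arange(len(K))        # how far back each key sits
    return K @ q / np.sqrt(d) - m * distance  # farther keys get a larger penalty
```
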
- Codex: Evaluating Large Language Models Trained on Code: Finetuning on code is all you need
- Layer Normalization: Consistent mean and variance at each layer is all you need
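
The normalization itself is short, sketched here in numpy (illustration):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each vector to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```
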
- On Layer Normalization in the Transformer Architecture: Pre-layer norm, instead of post-layer norm, is all you need
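
The difference in one sketch (illustration; `sublayer` stands for attention or the feed-forward block):

```python
def post_ln_block(x, sublayer, norm):
    # Original Transformer: normalize after the residual add.
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # This paper: normalize before the sublayer, so the residual path stays an
    # identity and gradients are better behaved (no learning-rate warmup needed).
    return x + sublayer(norm(x))
```
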
- PPO: Proximal Policy Optimization Algorithms: Clipping your surrogate function is all you need
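
A sketch of the clipped surrogate objective (illustration; `ratio` is the new/old policy probability ratio, `advantage` the advantage estimate):

```python
import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```
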
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct: Asking the model to make the question harder is all you need* (*Where do they get the responses to these harder questions though?!)
- Llama 2: Open Foundation and Fine-Tuned Chat Models: Iterative finetuning, PPO, rejection sampling, and ghost attention are all you need* (*Also, 27,540 SFT annotations and more than 1 million binary comparison preference data)
- RWKV: Reinventing RNNs for the Transformer Era: Linear attention during inference, via RNNs, is what you need
- RLAIF; Constitutional AI: Harmlessness from AI Feedback: A natural language constitution* and model feedback on harmlessness are all you need (*16 different variants of harmlessness principles)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: Noise in your softmax and expert regularization are all you need
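
A rough sketch of the noisy top-k gate (illustration; the paper also adds load-balancing regularizers, omitted here):

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gate(x, W_gate, W_noise, k=2):
    """Add learned noise to gate logits, keep top-k, softmax over the survivors."""
    clean = x @ W_gate                                   # (batch, n_experts)
    noise = torch.randn_like(clean) * F.softplus(x @ W_noise)  # per-expert noise scale
    logits = clean + noise
    topk, idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float('-inf')).scatter(-1, idx, topk)
    return F.softmax(masked, dim=-1)                     # zero weight on unselected experts
```
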
- CLIP: Learning Transferable Visual Models From Natural Language Supervision: A projection layer between text and image embeddings is all you need* (*Also, 400 million image-text pairs)
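
A sketch of the shared-space projections and symmetric contrastive loss, loosely following the paper's pseudocode (the projection matrices and feature inputs here are placeholders):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, W_img, W_txt, temperature=0.07):
    """Project both modalities into a shared space, normalize, compare all pairs."""
    img = F.normalize(image_feats @ W_img, dim=-1)   # (n, d_shared)
    txt = F.normalize(text_feats @ W_txt, dim=-1)    # (n, d_shared)
    logits = img @ txt.T / temperature               # (n, n) cosine similarities
    labels = torch.arange(len(logits))               # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```
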
- ViT; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: Flattened 2D patches are all you need
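
The patch-flattening step in a few lines (illustration):

```python
import torch

def to_patches(img, p=16):
    """(C, H, W) image -> (num_patches, p*p*C) sequence of flattened patches."""
    C, H, W = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)            # (C, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)

seq = to_patches(torch.randn(3, 224, 224))  # 196 patches of dim 768, as in ViT-Base
```
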
- Generative Agents: Interactive Simulacra of Human Behavior: Reflection, memory, and retrieval are all you need
- Out-of-Domain Finetuning to Bootstrap Hallucination Detection: Open-source, permissive-use data is what you need
- DPO; Direct Preference Optimization: Your Language Model is Secretly a Reward Model: A separate reward model is NOT what you need
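
A hedged sketch of the DPO loss (illustration): the implicit reward is the log-ratio of policy to frozen-reference likelihoods, compared between chosen and rejected responses.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed log-probs of each response under the policy / reference."""
    chosen = beta * (pi_chosen - ref_chosen)        # implicit reward, preferred response
    rejected = beta * (pi_rejected - ref_rejected)  # implicit reward, dispreferred response
    return -F.logsigmoid(chosen - rejected).mean()
```
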
- Consistency Models: Learning a direct map that reverses how diffusion adds Gaussian noise to images is all you need
- LCM; Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference: Consistency modeling in latent space is all you need* (*Also, a diffusion model to distill from)
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module: Combining LoRAs is all you need
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: Asking the LLM to reflect on retrieved documents is all you need
- Emergent Abilities of Large Language Models: The Bitter Lesson is all you need
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions: The Bellman equation and replay buffers are all you need
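
The Bellman target in one line (illustration; names are mine):

```python
import numpy as np

def td_target(r, s_next, Q, gamma=0.99, done=False):
    """Bellman target: r + gamma * max_a Q(s', a); no bootstrap at episode end."""
    return r + (0.0 if done else gamma * np.max(Q[s_next]))
```
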
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations: Classification guidelines and the multiple-choice response are all you need
- REST^EM; Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models: Synthetic data and a reward function are all you need
- Mixture Of Experts Explained: MoE is an architectural choice that routes observations to expert subnetworks within a block, letting us scale up parameter counts, and hence network capability, by adding more experts. However, this introduces new challenges: a higher parameter count at inference, training instabilities, and provisioning experts across devices at inference time.
- Self-Instruct: Aligning Language Models with Self-Generated Instructions: (GitHub Repo, LS Discord chat)
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: (Presentation) A series of open-source LLMs with fully reproducible datasets and checkpoints for LLM research. Novel studies (including negative results) on memorization, data deduplication and data order, and gender debiasing.
- Self-Rewarding Language Models: (GitHub Repo, LS Discord Chat): Instead of training reward models from human preferences, LLMs can provide their own rewards during training (i.e., WITHOUT distillation from GPT-4). Fine-tuning Llama 2 70B on three iterations of this approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
- Building Your Own Product Copilot - Challenges, Opportunities, and Needs: Prompt engineering LLMs is NOT all you need.
- Matryoshka Representation Learning: Aggregated losses across $2^n$-dim embeddings are all you need.
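
A sketch of the aggregated loss (illustration; the paper attaches a separate classifier head per prefix dimension, folded into `loss_fn` here):

```python
def matryoshka_loss(embedding, target, loss_fn, dims=(64, 128, 256, 512, 1024)):
    """Sum the same task loss over nested prefixes of one embedding."""
    return sum(loss_fn(embedding[..., :d], target) for d in dims)
```
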
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Bigger GPUs are not all you need.
- How to Generate and Use Synthetic Data for Finetuning: Synthetic data is almost all you need.
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision: 680k hours of audio and multiple tasks formulated as a single token sequence are all you need.
- Leveraging Large Language Models for NLG Evaluation: A Survey: (Slides, LS Discord Chat) A survey of model- and task-level eval techniques. Includes using Auto-J correlation instead of AlpacaEval, which we liked.