ML Papers Explained
Explanations of key concepts in ML
Language Models
Paper | Date | Description |
---|---|---|
Transformer | June 2017 | An Encoder-Decoder model that introduced the multi-head attention mechanism for the language translation task (see the attention sketch after this table). |
Elmo | February 2018 | Deep contextualized word representations that capture both intricate aspects of word usage and contextual variations across language contexts. |
Marian MT | April 2018 | A Neural Machine Translation framework written entirely in C++ with minimal dependencies, designed for high training and translation speed. |
GPT | June 2018 | A Decoder only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. |
BERT | October 2018 | Introduced pre-training for Encoder Transformers. Uses a unified architecture across different tasks. |
Transformer XL | January 2019 | Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism. |
XLM | January 2019 | Proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. |
GPT 2 | February 2019 | Demonstrates that language models begin to learn various language processing tasks without any explicit supervision. |
Sparse Transformer | April 2019 | Introduced sparse factorizations of the attention matrix to reduce the time and memory consumption to O(n√n) in terms of sequence length. |
UniLM | May 2019 | Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks. |
XLNet | June 2019 | Extension of the Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives. |
RoBERTa | July 2019 | Built upon BERT, by carefully optimizing hyperparameters and training data size to improve performance on various language tasks. |
Sentence BERT | August 2019 | A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine-similarity. |
CTRL | September 2019 | A 1.63B language model that can generate text conditioned on control codes that govern style, content, and task-specific behavior, allowing for more explicit control over text generation. |
Tiny BERT | September 2019 | Uses attention transfer and task-specific distillation to distill BERT. |
ALBERT | September 2019 | Presents certain parameter reduction techniques to lower memory consumption and increase the training speed of BERT. |
Distil BERT | October 2019 | Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective. |
Distil RoBERTa | October 2019 | Distillation of RoBERTa, using the same techniques as Distil BERT. |
T5 | October 2019 | A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format. |
BART | October 2019 | An Encoder-Decoder pretrained to reconstruct the original text from corrupted versions of it. |
XLM-Roberta | November 2019 | A multilingual masked language model pre-trained on text in 100 languages, shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of crosslingual transfer tasks. |
Pegasus | December 2019 | A self-supervised pre-training objective for abstractive text summarization, proposes removing/masking important sentences from an input document and generating them together as one output sequence. |
Reformer | January 2020 | Improves the efficiency of Transformers by replacing dot-product attention with locality-sensitive hashing (O(L log L) complexity), using reversible residual layers to store activations only once, and splitting feed-forward layer activations into chunks, allowing it to perform on par with Transformer models while being much more memory-efficient and faster on long sequences. |
mBART | January 2020 | A multilingual sequence-to-sequence denoising auto-encoder that pre-trains a complete autoregressive model on large-scale monolingual corpora across many languages using the BART objective, achieving significant performance gains in machine translation tasks. |
UniLMv2 | February 2020 | Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks. |
ELECTRA | March 2020 | Proposes a sample-efficient pre-training task called replaced token detection, which corrupts input by replacing some tokens with plausible alternatives and trains a discriminative model to predict whether each token was replaced or not. |
FastBERT | April 2020 | A speed-tunable encoder with adaptive inference time having branches at each transformer output to enable early outputs. |
MobileBERT | April 2020 | Compressed and faster version of the BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer. |
Longformer | April 2020 | Introduces a linearly scalable attention mechanism, allowing it to handle texts of extended length. |
GPT 3 | May 2020 | Demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance. |
DeBERTa | June 2020 | Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training. |
DeBERTa v2 | June 2020 | Enhanced version of the DeBERTa featuring a new vocabulary, nGiE integration, optimized attention mechanisms, additional model sizes, and improved tokenization. |
T5 v1.1 | July 2020 | An enhanced version of the original T5 model, featuring improvements such as GEGLU activation, no dropout in pre-training, exclusive pre-training on C4, no parameter sharing between embedding and classifier layers. |
mT5 | October 2020 | A multilingual variant of T5 based on T5 v1.1, pre-trained on a new Common Crawl-based dataset covering 101 languages (mC4). |
Codex | July 2021 | A GPT language model finetuned on publicly available code from GitHub. |
FLAN | September 2021 | An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions. |
T0 | October 2021 | An encoder-decoder model fine-tuned on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets. |
DeBERTa V3 | November 2021 | Enhances the DeBERTa architecture by introducing replaced token detection (RTD) instead of mask language modeling (MLM), along with a novel gradient-disentangled embedding sharing method, exhibiting superior performance across various natural language understanding tasks. |
WebGPT | December 2021 | A fine-tuned GPT-3 model utilizing text-based web browsing, trained via imitation learning and human feedback, enhancing its ability to answer long-form questions with factual accuracy. |
Gopher | December 2021 | Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B on 152 tasks. |
LaMDA | January 2022 | Transformer based models specialized for dialog, which are pre-trained on public dialog data and web text. |
BERTopic | March 2022 | Utilizes Sentence-BERT for document embeddings, UMAP, HDBSCAN (soft-clustering), and an adjusted class-based TF-IDF, addressing multiple topics per document and the linear evolution of dynamic topics. |
Instruct GPT | March 2022 | Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent. |
CodeGen | March 2022 | An LLM trained for program synthesis using input-output examples and natural language descriptions. |
Chinchilla | March 2022 | Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws). |
PaLM | April 2022 | A 540B parameter, densely activated Transformer, trained using Pathways, an ML system that enables highly efficient training across multiple TPU Pods. |
GPT-NeoX-20B | April 2022 | An autoregressive LLM trained on the Pile, and the largest dense model that had publicly available weights at the time of submission. |
OPT | May 2022 | A suite of decoder-only pre-trained transformers with parameter ranges from 125M to 175B. OPT-175B being comparable to GPT-3. |
Flan T5, Flan PaLM | October 2022 | Explores instruction fine tuning with a particular focus on scaling the number of tasks, scaling the model size, and fine tuning on chain-of-thought data. |
BLOOM | November 2022 | A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology. |
BLOOMZ, mT0 | November 2022 | Applies Multitask prompted fine tuning to the pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus. |
Galactica | November 2022 | An LLM trained on scientific data thus specializing in scientific knowledge. |
ChatGPT | November 2022 | An interactive model designed to engage in conversations, built on top of GPT 3.5. |
Self Instruct | December 2022 | A framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. |
LLaMA | February 2023 | A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained using publicly available datasets exclusively. |
Toolformer | February 2023 | An LLM trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. |
Alpaca | March 2023 | A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. |
GPT 4 | March 2023 | A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs. |
Vicuna | March 2023 | A 13B LLaMA chatbot fine tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca. |
BloombergGPT | March 2023 | A 50B language model trained on general-purpose and domain-specific data to support a wide range of tasks within the financial industry. |
Pythia | April 2023 | A suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. |
WizardLM | April 2023 | Introduces Evol-Instruct, a method to generate large amounts of instruction data with varying levels of complexity using an LLM instead of humans, used to fine-tune a Llama model. |
CodeGen2 | May 2023 | Proposes an approach to make the training of LLMs for program synthesis more efficient by unifying key components of model architectures, learning methods, infill sampling, and data distributions. |
PaLM 2 | May 2023 | Successor of PaLM, trained on a mixture of different pre-training objectives in order to understand different aspects of language. |
LIMA | May 2023 | A LLaMa model fine-tuned on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. |
Gorilla | May 2023 | A retrieve-aware finetuned LLaMA-7B model, specifically for API calls. |
Orca | June 2023 | Presents a novel approach that addresses the limitations of instruction tuning by leveraging richer imitation signals, scaling tasks and instructions, and utilizing a teacher assistant to help with progressive learning. |
Falcon | June 2023 | An Open Source LLM trained on properly filtered and deduplicated web data alone. |
Phi-1 | June 2023 | An LLM for code, trained using textbook-quality data from the web and textbooks and exercises synthetically generated with GPT-3.5. |
WizardCoder | June 2023 | Enhances the performance of the open-source Code LLM, StarCoder, through the application of Code Evol-Instruct. |
LLaMA 2 | July 2023 | Successor of LLaMA. LLaMA 2-Chat is optimized for dialogue use cases. |
Tool LLM | July 2023 | A LLaMA model finetuned on an instruction-tuning dataset for tool use, automatically created using ChatGPT. |
Humpback | August 2023 | LLaMA finetuned using instruction backtranslation. |
Code LLaMA | August 2023 | LLaMA 2 based LLM for code. |
WizardMath | August 2023 | Proposes Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method, applied to Llama-2 to enhance the mathematical reasoning abilities. |
LLaMA 2 Long | September 2023 | A series of long-context LLMs that support effective context windows of up to 32,768 tokens. |
Phi-1.5 | September 2023 | Follows the phi-1 approach, focusing this time on common sense reasoning in natural language. |
MAmmoTH | September 2023 | A series of LLMs specifically designed for general math problem-solving, trained on MathInstruct, a dataset compiled from 13 math datasets with intermediate rationales that combines chain-of-thought and program-of-thought approaches to accommodate different thought processes for various math problems. |
Mistral 7B | October 2023 | Leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with a reduced inference cost. |
Llemma | October 2023 | An LLM for mathematics, formed by continued pretraining of Code Llama on a mixture of scientific papers, web data containing mathematics, and mathematical code. |
CodeFusion | October 2023 | A diffusion code generation model that iteratively refines entire programs based on encoded natural language, overcoming the limitation of auto-regressive models in code generation by allowing reconsideration of earlier tokens. |
Zephyr 7B | October 2023 | Utilizes dDPO and AI Feedback (AIF) preference data to achieve superior intent alignment in chat-based language modeling. |
Grok 1 | November 2023 | A 314B Mixture-of-Experts model, modeled after the Hitchhiker's Guide to the Galaxy, designed to be witty. |
Orca 2 | November 2023 | Introduces Cautious Reasoning for training smaller models to select the most effective solution strategy for the problem at hand: data is crafted with task-specific system instructions corresponding to the chosen strategy to obtain teacher responses for each task, and the student's system instruction is then replaced with a generic one that omits details of how to approach the task. |
Phi-2 | December 2023 | A 2.7B model, developed to explore whether emergent abilities achieved by large-scale language models can also be achieved at a smaller scale using strategic choices for training, such as data selection. |
TinyLlama | January 2024 | A 1.1B language model built upon the architecture and tokenizer of Llama 2, pre-trained on around 1 trillion tokens for approximately 3 epochs, leveraging FlashAttention and Grouped Query Attention, to achieve better computational efficiency. |
Mixtral 8x7B | January 2024 | A Sparse Mixture of Experts language model trained with multilingual data using a context size of 32k tokens. |
H2O Danube 1.8B | January 2024 | A language model trained on 1T tokens following the core principles of LLama 2 and Mistral, leveraging and refining various techniques for pre-training large language models. |
OLMo | February 2024 | A state-of-the-art, truly open language model and framework that includes training data, code, and tools for building, studying, and advancing language models. |
MobileLLM | February 2024 | Leverages various architectures and attention mechanisms to achieve a strong baseline network, which is then improved upon by introducing an immediate block-wise weight-sharing approach, resulting in a further accuracy boost. |
Orca Math | February 2024 | A fine tuned Mistral-7B that excels at math problems without external tools, utilizing a high-quality synthetic dataset of 200K problems created through multi-agent collaboration and an iterative learning process that involves practicing problem-solving, receiving feedback, and learning from preference pairs incorporating the model's solutions and feedback. |
Gemma | February 2024 | A family of 2B and 7B, state-of-the-art language models based on Google's Gemini models, offering advancements in language understanding, reasoning, and safety. |
Aya 101 | February 2024 | A massively multilingual generative language model that follows instructions in 101 languages, trained by finetuning mT5. |
Nemotron-4 15B | February 2024 | A 15B multilingual language model trained on 8T text tokens by Nvidia. |
Hawk, Griffin | February 2024 | Introduces the Real-Gated Linear Recurrent Unit layer that forms the core of the new recurrent block, replacing Multi-Query Attention for better efficiency and scalability. |
WRAP | March 2024 | Uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles to jointly pre-train LLMs on real and synthetic rephrases. |
Command R | March 2024 | An LLM optimized for retrieval-augmented generation and tool use, across multiple languages. |
DBRX | March 2024 | A 132B open, general-purpose fine grained Sparse MoE LLM surpassing GPT-3.5 and competitive with Gemini 1.0 Pro. |
Grok 1.5 | March 2024 | An advancement over grok, capable of long context understanding up to 128k tokens and advanced reasoning. |
Command R+ | April 2024 | Successor of Command R with improved performance for retrieval-augmented generation and tool use, across multiple languages. |
Llama 3 | April 2024 | A family of 8B and 70B parameter models trained on 15T tokens with a focus on data quality, demonstrating state-of-the-art performance on various benchmarks and improved reasoning capabilities. |
Mixtral 8x22B | April 2024 | An open-weight AI model optimised for performance and efficiency, with capabilities such as fluency in multiple languages, strong mathematics and coding abilities, and precise information recall from large documents. |
CodeGemma | April 2024 | Open code models based on Gemma models by further training on over 500 billion tokens of primarily code. |
RecurrentGemma | April 2024 | Based on Griffin, uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently. |
Rho-1 | April 2024 | Introduces Selective Language Modelling that optimizes the loss only on tokens that align with a desired distribution, utilizing a reference model to score and select tokens. |
Phi-3 | April 2024 | A series of language models trained on heavily filtered web data and synthetic data, achieving performance comparable to much larger models like Mixtral 8x7B and GPT-3.5. |
Open ELM | April 2024 | A fully open language model designed to enhance accuracy while using fewer parameters and pre-training tokens. Utilizes a layer-wise scaling strategy to allocate smaller dimensions in early layers, expanding in later layers. |
H2O Danube2 1.8B | April 2024 | An updated version of the original H2O-Danube model, with improvements including removal of sliding window attention, changes to the tokenizer, and adjustments to the training data, resulting in significant performance enhancements. |
MAmmoTH 2 | May 2024 | LLMs fine tuned on a dataset curated through the proposed paradigm, which efficiently harvests 10M naturally existing instruction data points from the pre-training web corpus to enhance LLM reasoning. The paradigm involves recalling relevant documents, extracting instruction-response pairs, and refining the extracted pairs using open-source LLMs. |
Granite Code Models | May 2024 | A family of code models ranging from 3B to 34B trained on 3.5-4.5T tokens of code written in 116 programming languages. |
Codestral 22B | May 2024 | An open-weight model designed for code generation tasks, trained on over 80 programming languages, and licensed under the Mistral AI Non-Production License, allowing developers to use it for research and testing purposes. |
Aya 23 | May 2024 | A family of multilingual language models supporting 23 languages, designed to balance breadth and depth by allocating more capacity to fewer languages during pre-training. |
Gemma 2 | June 2024 | Utilizes interleaved local-global attentions and grouped-query attention, trained with knowledge distillation instead of next token prediction to achieve competitive performance comparable with larger models. |
Orca 3 (Agent Instruct) | June 2024 | A Mistral-7B fine-tuned through Generative Teaching on synthetic data generated with the proposed AgentInstruct framework, which generates both the prompts and responses, using only raw data sources like text documents and code files as seeds. |
Nemotron-4 340B | June 2024 | 340B models, along with a reward model by Nvidia, suitable for generating synthetic data to train smaller language models, with over 98% of the data used in model alignment being synthetically generated. |
Mathstral | July 2024 | A 7B model based on Mistral 7B, designed for math reasoning and scientific discovery, specializing in STEM subjects. |
Mistral Nemo | July 2024 | A 12B Language Model built in collaboration between Mistral and NVIDIA, featuring a context window of 128K, an efficient tokenizer and trained with quantization awareness, enabling FP8 inference without any performance loss. |
Smol LM | July 2024 | A family of small models with 135M, 360M, and 1.7B parameters, utilizes Grouped-Query Attention (GQA), embedding tying, and a context length of 2048 tokens, trained on a new open source high-quality dataset. |
Llama 3.1 | July 2024 | A family of multilingual language models ranging from 8B to 405B parameters, trained on a massive dataset of 15T tokens and achieving comparable performance to leading models like GPT-4 on various tasks. |
Llama 3.1 - Multimodal Experiments | July 2024 | Additional experiments of adding multimodal capabilities to Llama3. |
Mistral Large 2 | July 2024 | A 123B model, offers significant improvements in code generation, mathematics, and reasoning capabilities, advanced function calling, a 128k context window, and supports dozens of languages and over 80 coding languages. |
Minitron | July 2024 | Prunes an existing Nemotron model and re-trains it with a fraction of the original training data, achieving compression factors of 2-4×, compute cost savings of up to 40×, and improved performance on various language modeling tasks. |
H2O Danube 3 | July 2024 | A series of 4B and 500M language models, trained on high-quality Web data in three stages with different data mixes before being fine-tuned for the chat version. |
LLM Compiler | July 2024 | A suite of pre-trained models designed for code optimization tasks, built upon Code Llama, with two sizes (7B and 13B), trained on LLVM-IR and assembly code to optimize compiler intermediate representations, assemble/disassemble, and achieve high accuracy in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. |
Apple Intelligence Foundation Language Models | July 2024 | Two foundation language models, AFM-on-device (a ~3B parameter model) and AFM-server (a larger server-based model), designed to power Apple Intelligence features efficiently, accurately, and responsibly, with a focus on Responsible AI principles that prioritize user empowerment, representation, design care, and privacy protection. |
Hermes 3 | August 2024 | Neutrally aligned generalist instruct and tool-use models, created by fine-tuning Llama 3.1 models, with strong reasoning and creative abilities, designed to follow prompts neutrally without moral judgment or personal opinions. |
Smol LM v0.2 | August 2024 | An advancement over SmolLM, better at staying on topic and responding appropriately to standard prompts, such as greetings and questions about their role as AI assistants. |
Phi-3.5 | August 2024 | A family of models consisting of three variants - MoE (16x3.8B), mini (3.8B), and vision (4.2B) - which are lightweight, multilingual, and trained on synthetic and filtered publicly available documents - with a focus on very high-quality, reasoning dense data. |
Minitron Approach in Practice | August 2024 | Applies the minitron approach to Llama 3.1 8B and Mistral-Nemo 12B, additionally applies teacher correction to align with the new data distribution. |
OLMoE | September 2024 | An open source language model based on sparse Mixture-of-Experts architecture with 7B parameters, out of which only 1B parameters are active per input token. Conducted extensive experiments on MoE training, analyzing routing strategies, expert specialization, and the impact of design choices like routing algorithms and expert size. |
o1 | September 2024 | A large language model trained with reinforcement learning to think before answering, producing a long internal chain of thought before responding. |
o1-mini | September 2024 | A cost-efficient reasoning model, excelling at STEM, especially math and coding, nearly matching the performance of OpenAI o1 on evaluation benchmarks. |
Llama 3.1-Nemotron-51B | September 2024 | Uses knowledge distillation and NAS to optimize various constraints, resulting in a model that achieves 2.2x faster inference compared to the reference model while maintaining nearly the same accuracy, with an irregular block structure that reduces or prunes attention and FFN layers for better utilization of H100 and improved LLMs for inference. |
Mistral Small | September 2024 | A 22B model with significant improvements in human alignment, reasoning capabilities, and code over the previous model. |
Llama 3.2 | September 2024 | Small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B). |
Ministral | October 2024 | 3B and 8B models supporting up to 128k context length, with a special interleaved sliding-window attention pattern for faster and memory-efficient inference. |
Quantized Llama 3 | October 2024 | Optimized versions of Llama models, using techniques like Quantization-Aware Training with LoRA Adapters and SpinQuant to reduce model size and memory usage while maintaining accuracy and performance, enabling deployment on resource-constrained devices like mobile phones. |
Nemotron-Mini-Hindi | October 2024 | A bilingual language model based on Nemotron-Mini 4B, specifically trained to improve Hindi and English performance using continuous pre-training on 400B real and synthetic tokens. |
Smol LM v2 | November 2024 | A family of language models (135M, 360M, and 1.7B parameters), trained on 2T, 4T, and 11T tokens respectively from datasets including FineWeb-Edu, DCLM, The Stack, and curated math and coding datasets, with instruction-tuned versions created using Smol Talk dataset and DPO using UltraFeedback. |
Command R 7B | December 2024 | The smallest, fastest, and final model in the R series of enterprise-focused LLMs. It offers a context length of 128k and delivers a powerful combination of multilingual support, citation verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. |
ModernBERT | December 2024 | Modernized encoder-only transformer model trained on 2 trillion tokens with a native 8192 sequence length, incorporating architectural improvements like GeGLU activations, RoPE embeddings, alternating attention, and unpadding, resulting in state-of-the-art performance across diverse classification and retrieval tasks (including code) and superior inference speed and memory efficiency compared to existing encoder models. |
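The multi-head attention mechanism referenced in the Transformer entry above is compact enough to sketch directly. Below is a minimal, illustrative PyTorch version under assumed defaults (d_model=512, 8 heads, self-attention only, no masking or dropout); the class and variable names are my own for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project and split into heads: (batch, heads, seq, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = scores.softmax(dim=-1)
        out = weights @ v                          # (batch, heads, seq, d_head)
        out = out.transpose(1, 2).reshape(b, t, d)  # concatenate heads
        return self.out_proj(out)

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 512])
```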
Multi Modal Language Models
Paper | Date | Description |
---|---|---|
Florence | November 2021 | A computer vision foundation model that can be adapted to various tasks by expanding representations from coarse (scene) to fine (object), static (images) to dynamic (videos), and RGB to multiple modalities. |
BLIP | February 2022 | A Vision-Language Pre-training (VLP) framework that introduces Multimodal mixture of Encoder-Decoder (MED) and Captioning and Filtering (CapFilt), a new dataset bootstrapping method for learning from noisy image-text pairs. |
Flamingo | April 2022 | Visual Language Models enabling seamless handling of interleaved visual and textual data, and facilitating few-shot learning on large-scale web corpora. |
PaLI | September 2022 | A joint language-vision model that generates multilingual text based on visual and textual inputs, trained using large pre-trained encoder-decoder language models and Vision Transformers, specifically mT5 and ViT-e. |
BLIP 2 | January 2023 | A Vision-Language Pre-training (VLP) framework that proposes Q-Former, a trainable module to bridge the gap between a frozen image encoder and a frozen LLM to bootstrap vision-language pre-training. |
LLaVA 1 | April 2023 | A large multimodal model connecting CLIP and Vicuna, trained end-to-end on instruction-following data generated through GPT-4 from image-text pairs (see the projector sketch after this table). |
PaLI-X | May 2023 | A multilingual vision and language model with scaled-up components, specifically ViT-22B and UL2 32B, exhibits emergent properties such as complex counting and multilingual object detection, and demonstrates improved performance across various tasks. |
InstructBLIP | May 2023 | Introduces instruction-aware Query Transformer to extract informative features tailored to the given instruction to study vision-language instruction tuning based on the pretrained BLIP-2 models. |
Idefics | June 2023 | 9B and 80B multimodal models trained on Obelics, an open web-scale dataset of interleaved image-text documents, curated in this work. |
GPT-4V | September 2023 | A multimodal model that combines text and vision capabilities, allowing users to instruct it to analyze image inputs. |
PaLI-3 | October 2023 | A 5B vision language model built upon a 2B SigLIP vision model and a 3B UL2 language model; it outperforms larger models on various benchmarks and achieves SOTA on several video QA benchmarks despite not being pretrained on any video data. |
LLaVA 1.5 | October 2023 | An enhanced version of the LLaVA model that incorporates a CLIP-ViT-L-336px with an MLP projection and academic-task-oriented VQA data to set new benchmarks in large multimodal models (LMM) research. |
Florence-2 | November 2023 | A vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. |
CogVLM | November 2023 | Bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. |
Gemini 1.0 | December 2023 | A family of highly capable multi-modal models, trained jointly across image, audio, video, and text data for the purpose of building a model with strong generalist capabilities across modalities. |
MoE-LLaVA | January 2024 | A MoE-based sparse LVLM framework that activates only the top-k experts through routers during deployment, maintaining computational efficiency while achieving comparable performance to larger models. |
LLaVA 1.6 | January 2024 | An improved version of LLaVA 1.5 with enhanced reasoning, OCR, and world knowledge capabilities, featuring increased image resolution. |
Gemini 1.5 Pro | February 2024 | A highly compute-efficient multimodal mixture-of-experts model that excels in long-context retrieval tasks and understanding across text, video, and audio modalities. |
Claude 3 | March 2024 | A family of VLMs consisting of Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, sets new industry standards for cognitive tasks, offering varying levels of intelligence, speed, and cost-efficiency. |
MM1 | March 2024 | A multimodal LLM that combines a ViT-H image encoder with 378x378px resolution, pretrained on a data mix of image-text documents and text-only documents, scaled up to 3B, 7B, and 30B parameters for enhanced performance across various tasks. |
Grok 1.5 V | April 2024 | The first multimodal model in the grok series. |
Idefics2 | April 2024 | Improvement upon Idefics1 with enhanced OCR capabilities, simplified architecture, and better pre-trained backbones, trained on a mixture of openly available datasets and fine-tuned on task-oriented data. |
Phi 3 Vision | May 2024 | First multimodal model in the Phi family, bringing the ability to reason over images and extract and reason over text from images. |
An Introduction to Vision-Language Modeling | May 2024 | Provides a comprehensive introduction to VLMs, covering their definition, functionality, training methods, and evaluation approaches, aiming to help researchers and practitioners enter the field and advance the development of VLMs for various applications. |
GPT-4o | May 2024 | An omni model accepting and generating various types of inputs and outputs, including text, audio, images, and video. |
Gemini 1.5 Flash | May 2024 | A more lightweight variant of Gemini 1.5 Pro, designed for efficiency with minimal regression in quality, making it suitable for applications where compute resources are limited. |
Chameleon | May 2024 | A family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. |
Claude 3.5 Sonnet | June 2024 | Surpasses previous versions and competitors in intelligence, speed, and cost-efficiency, excelling in graduate-level reasoning, undergraduate-level knowledge, coding proficiency, and visual reasoning. |
Pali Gemma | July 2024 | Combines SigLIP vision model and the Gemma language model and follows the PaLI-3 training recipe to achieve strong performance on various vision-language tasks. |
GPT-4o mini | July 2024 | A cost-efficient small model that outperforms GPT-4 on chat preferences, enabling a broad range of tasks with low latency and supporting text, vision, and multimodal inputs and outputs. |
Grok 2 | August 2024 | A frontier language model with state-of-the-art capabilities in chat, coding, and reasoning on par with Claude 3.5 Sonnet and GPT-4-Turbo. |
BLIP-3 (xGen-MM) | August 2024 | A comprehensive system for developing Large Multimodal Models, comprising curated datasets, training recipes, model architectures, and pre-trained models that demonstrate strong in-context learning capabilities and competitive performance on various tasks. |
Idefics 3 | August 2024 | A VLM based on Llama 3.1 and SigLIP-SO400M trained efficiently, using only open datasets and a straightforward pipeline, significantly outperforming in document understanding tasks. |
CogVLM2 | August 2024 | A family of visual language models that enables image and video understanding with improved training recipes, exploring enhanced vision-language fusion, higher input resolution, and broader modalities and applications. |
Eagle | August 2024 | Provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions, and reveals several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. |
Pixtral | September 2024 | A 12B parameter natively multimodal vision-language model, trained with interleaved image and text data demonstrating strong performance on multimodal tasks, and excels in instruction following. |
NVLM | September 2024 | A family of multimodal large language models that provides a comparison between decoder-only multimodal LLMs and cross-attention-based models and proposes a hybrid architecture; it further introduces a 1-D tile-tagging design for tile-based dynamic high-resolution images. |
Molmo | September 2024 | A family of open-weight vision-language models that achieve state-of-the-art performance by leveraging a novel, human-annotated image caption dataset called PixMo. |
MM-1.5 | September 2024 | A family of multimodal large language models designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning, achieved through a data-centric approach involving diverse data mixtures, with specialized variants for video and mobile UI understanding. |
Mississippi | October 2024 | A collection of small, efficient, open-source vision-language models built on top of Danube, trained on 37 million image-text pairs, specifically designed to perform well on document analysis and OCR tasks while maintaining strong performance on general vision-language benchmarks. |
Claude 3.5 Haiku | October 2024 | A fast and affordable language model that excels in tasks such as coding, reasoning, and content creation. |
Smol VLM | November 2024 | A 2B vision-language model, built using a modified Idefics3 architecture with a smaller language backbone (SmolLM2 1.7B), aggressive pixel shuffle compression, 384x384 image patches, and a shape-optimized SigLIP vision backbone, featuring a 16k token context window. |
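Several entries in this table (LLaVA in particular) connect a frozen vision encoder to a language model through a small trainable projection. The sketch below is a hedged illustration of only that bridging step, with assumed dimensions (1024-d vision features, 4096-d LLM embeddings) and an assumed two-layer MLP projector; the encoders themselves and the exact projector design of any specific paper are omitted.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP mapping frozen vision features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen encoder.
        # Output: visual "tokens" to concatenate with text token embeddings
        # before feeding the language model.
        return self.mlp(patch_features)

vision_tokens = torch.randn(1, 576, 1024)            # placeholder patch features
print(VisionToLLMProjector()(vision_tokens).shape)   # torch.Size([1, 576, 4096])
```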
Retrieval and Representation Learning
Paper | Date | Description |
---|---|---|
SimCLR | February 2020 | A simplified framework for contrastive learning that optimizes data augmentation composition, introduces learnable nonlinear transformations, and leverages larger batch sizes and more training steps. |
Dense Passage Retriever | April 2020 | Shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual encoder framework. |
ColBERT | April 2020 | Introduces a late interaction architecture that adapts deep LMs (in particular, BERT) for efficient retrieval (see the MaxSim sketch after this table). |
SimCLRv2 | June 2020 | A semi-supervised learning framework that uses unsupervised pre-training followed by supervised fine-tuning and distillation with unlabeled examples. |
CLIP | February 2021 | A vision system that learns image representations from raw text-image pairs through pre-training, enabling zero-shot transfer to various downstream tasks. |
ColBERTv2 | December 2021 | Couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. |
Matryoshka Representation Learning | May 2022 | Encodes information at different granularities and allows a flexible representation that can adapt to multiple downstream tasks with varying computational resources using a single embedding. |
E5 | December 2022 | A family of text embeddings trained in a contrastive manner with weak supervision signals from a curated large-scale text pair dataset CCPairs. |
SigLIP | March 2023 | A simple pairwise Sigmoid loss function for Language-Image Pre-training that operates solely on image-text pairs, allowing for larger batch sizes and better performance at smaller batch sizes. |
Jina Embeddings v1 | July 2023 | A T5 encoder contrastively fine-tuned on curated high-quality pairwise and triplet data, specifically to sensitize the model to distinguish negations of statements from confirming statements. |
Jina Embeddings v2 | October 2023 | An open-source text embedding model capable of accommodating up to 8192 tokens, trained by pre-training a modified BERT from scratch before fine tuning for embeddings objectives. |
SynCLR | December 2023 | A visual representation learning method that leverages generative models to synthesize large-scale curated datasets without relying on any real data. |
E5 Mistral 7B | December 2023 | Leverages proprietary LLMs to generate diverse synthetic data to fine tune open-source decoder-only LLMs for hundreds of thousands of text embedding tasks. |
Nomic Embed Text v1 | February 2024 | A 137M parameter, open-source English text embedding model with an 8192 context length that outperforms OpenAI's models on both short and long-context tasks. |
Nomic Embed Text v1.5 | February 2024 | An advanced text embedding model that utilizes Matryoshka Representation Learning to offer flexible embedding sizes with minimal performance trade-offs. |
Jina Bilingual Embeddings | February 2024 | A suite of bilingual text embedding models that support up to 8192 tokens, trained by pre-training a modified bilingual BERT from scratch before fine tuning for embeddings objectives. |
Jina Reranker | February 2024 | A neural reranking model that enhances search and RAG systems by reordering retrieved documents for better alignment with search query terms. |
Gecko | March 2024 | A 1.2B versatile text embedding model achieving strong retrieval performance by distilling knowledge from LLMs into a retriever. |
NV Embed | May 2024 | Introduces architectural innovations and training recipe to significantly enhance LLMs performance in general-purpose text embedding tasks. |
Nomic Embed Vision v1 and v1.5 | June 2024 | Aligns a Vision Encoder with the existing text encoders without destroying the downstream performance of the text encoder, to attain a unified multimodal latent space. |
ColPali | June 2024 | A retrieval model based on PaliGemma to produce high-quality contextualized embeddings solely from images of document pages, and employs late interaction, allowing for efficient and effective visually rich document retrieval. |
Jina Reranker v2 | June 2024 | Builds upon Jina Reranker v1 by adding multilingual support, function-calling capabilities, structured data querying, code retrieval, and ultra-fast inference. |
E5-V | July 2024 | A framework that adapts Multimodal Large Language Models for achieving universal multimodal embeddings by leveraging prompts and single modality training on text pairs, which demonstrates strong performance in multimodal embeddings without fine-tuning and eliminates the need for costly multimodal training data collection. |
Matryoshka Adaptor | July 2024 | A framework designed for the customization of LLM embeddings, facilitating substantial dimensionality reduction while maintaining comparable performance levels. |
Jina Embeddings v3 | September 2024 | A text embedding model with 570 million parameters that supports long-context retrieval tasks up to 8192 tokens, includes LoRA adapters for various NLP tasks, and allows flexible output dimension reduction from 1024 down to 32 using Matryoshka Representation Learning. |
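The "late interaction" scoring used by ColBERT and ColPali above can be summarized in a few lines: every query token embedding is matched against its most similar document token embedding (MaxSim) and the similarities are summed. The sketch below is a minimal illustration with random placeholder embeddings; a real system would produce them with a BERT- or PaliGemma-based encoder.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                        # cosine similarity of every token pair
    return sim.max(dim=-1).values.sum()  # MaxSim per query token, then sum

query = torch.randn(8, 128)    # placeholder query token embeddings
doc = torch.randn(200, 128)    # placeholder document token embeddings
print(late_interaction_score(query, doc))
```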
Parameter Efficient Fine Tuning
Paper | Date | Description |
---|---|---|
LoRA | July 2021 | Introduces trainable rank decomposition matrices into each layer of a pre-trained Transformer model, significantly reducing the number of trainable parameters for downstream tasks (see the LoRA sketch after this table). |
DyLoRA | October 2022 | Allows for flexible rank size by randomly truncating low-rank matrices during training, enabling adaptation to different rank values without retraining. |
AdaLoRA | March 2023 | Dynamically allocates a parameter budget based on an importance metric to prune less important singular values during training. |
QLoRA | May 2023 | Allows efficient training of large models on limited GPU memory, through innovations like 4-bit NormalFloat (NF4), double quantization, and paged optimizers. |
LoRA-FA | August 2023 | Freezes the projection-down matrix A of each LoRA layer and trains only the projection-up matrix B, further reducing the number of trainable parameters compared to standard LoRA. |
Delta-LoRA | September 2023 | Utilizes the delta of the low-rank matrix updates to refine the pre-trained weights directly, removing the Dropout layer for accurate backpropagation. |
LongLoRA | September 2023 | Enables context extension for large language models, achieving significant computation savings through sparse local attention and parameter-efficient fine-tuning. |
VeRA | October 2023 | Utilizes frozen, shared random matrices across all layers and trains scaling vectors to adapt those matrices for each layer, reducing the number of trainable parameters compared to LoRA. |
LoRA+ | February 2024 | Enhances LoRA by setting different learning rates for the A and B matrices based on a fixed ratio, promoting better feature learning and improved performance. |
MoRA | May 2024 | Introduces a square matrix and non-parameterized operators to achieve high-rank updating with the same number of trainable parameters as LoRA, improving knowledge memorization capabilities. |
DoRA | May 2024 | Decomposes the high-rank LoRA matrix into multiple single-rank components, allowing dynamic pruning of less important components during training for a more efficient parameter budget allocation. |
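The core LoRA idea from the first row of this table fits in a short sketch: the pre-trained weight is frozen and a low-rank update B·A is learned in its place. The snippet below is a minimal, assumed implementation (rank, scaling, and class names are illustrative choices, not any library's API).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # freeze the pre-trained layer
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x ; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```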
LLM Evaluation
Paper | Date | Description |
---|---|---|
Prometheus | October 2023 | A 13B fully open source evaluation LLM trained on the Feedback Collection, a dataset curated using GPT-4 in this work. |
Prometheus 2 | May 2024 | 7B & 8x7B evaluation LLMs that score high correlations with both human evaluators and proprietary LM-based judges on both direct assessment and pairwise ranking, obtained by merging Mistral models trained on the Feedback Collection and Preference Collection (curated in this work). |
Compression, Pruning, Quantization
Paper | Date | Description |
---|---|---|
LLMLingua | October 2023 | A novel coarse-to-fine prompt compression method, incorporating a budget controller, an iterative token-level compression algorithm, and distribution alignment, achieving up to 20x compression with minimal performance loss. |
LongLLMLingua | October 2023 | A novel approach for prompt compression to enhance performance in long context scenarios using question-aware compression and document reordering. |
LLMLingua2 | March 2024 | A novel approach to task-agnostic prompt compression, aiming to enhance generalizability, using data distillation and leveraging a Transformer encoder for token classification. |
Vision Transformers
Paper | Date | Description |
---|---|---|
Vision Transformer | October 2020 | Images are segmented into patches, which are treated as tokens, and a sequence of linear embeddings of these patches is input to a Transformer (see the patch-embedding sketch after this table). |
DeiT | December 2020 | A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens. |
Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
Convolutional Vision Transformer | March 2021 | Improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions, to yield the best of both designs. |
LeViT | April 2021 | A hybrid neural network built upon the ViT architecture and DeiT training method, for fast inference image classification. |
DINO | April 2021 | Investigates whether self-supervised learning provides new properties to Vision Transformer that stand out compared to convolutional networks and finds that self-supervised ViT features contain explicit information about the semantic segmentation of an image, and are also excellent k-NN classifiers. |
BEiT | June 2021 | Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens, to pretrain vision Transformers. |
MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
DINOv2 | April 2023 | Demonstrates that existing self-supervised pre-training methods can produce general-purpose visual features by training on curated data from diverse sources, and proposes a new approach that combines techniques to scale pre-training with larger models and datasets. |
MaxViT | April 2022 | Introduces multi-axis attention, allowing global-local spatial interactions on arbitrary input resolutions with only linear complexity. |
Swin Transformer V2 | April 2022 | A successor to Swin Transformer, addressing challenges like training stability, resolution gaps, and labeled data scarcity. |
EfficientFormer | June 2022 | Revisits the design principles of ViT and its variants through latency analysis and identifies inefficient designs and operators in ViT to propose a new dimension consistent design paradigm for vision transformers and a simple yet effective latency-driven slimming method to optimize for inference speed. |
FastViT | March 2023 | A hybrid vision transformer architecture featuring a novel token mixing operator called RepMixer, which significantly improves model efficiency. |
Efficient Vit | May 2023 | Employs a single memory-bound MHSA between efficient FFN layers, improving memory efficiency while enhancing channel communication. |
SoViT | May 2023 | A shape-optimized vision transformer that achieves competitive results with models twice its size, while being pre-trained with an equivalent amount of compute. |
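The patch-embedding step described in the Vision Transformer row is the part most easily shown in code: the image is split into fixed-size patches, each patch is linearly projected, and the resulting sequence is fed to a Transformer. The sketch below is an illustrative assumption (ViT-Base-like sizes, no class token or position embeddings), not any paper's reference code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size: int = 224, patch: int = 16,
                 in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A strided convolution is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                 # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, dim)

imgs = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(imgs).shape)  # torch.Size([2, 196, 768])
```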
Convolutional Neural Networks
Paper | Date | Description |
---|---|---|
Lenet | December 1998 | Introduced Convolutions. |
Alex Net | September 2012 | Introduced ReLU activation and Dropout to CNNs. Winner ILSVRC 2012. |
VGG | September 2014 | Used a large number of small-size filters in each layer to learn complex features. Achieved SOTA in ILSVRC 2014. |
Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
Inception Net v2 / Inception Net v3 | December 2015 | Design Optimizations of the Inception Modules which improved performance and accuracy. |
Res Net | December 2015 | Introduced residual connections, which are shortcuts that bypass one or more layers in the network (see the residual-block sketch after this table). Winner ILSVRC 2015. |
Inception Net v4 / Inception ResNet | February 2016 | Hybrid approach combining Inception Net and ResNet. |
Dense Net | August 2016 | Each layer receives input from all the previous layers, creating a dense network of connections between the layers, allowing to learn more diverse features. |
Xception | October 2016 | Based on InceptionV3 but uses depthwise separable convolutions instead of inception modules. |
Res Next | November 2016 | Built over ResNet, introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups. |
Mobile Net V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and computation required. |
Mobile Net V2 | January 2018 | Built upon the MobileNetv1 architecture, uses inverted residuals and linear bottlenecks. |
Mobile Net V3 | May 2019 | Uses AutoML to find the best possible neural network architecture for a given problem. |
Efficient Net | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve a high accuracy with a relatively low computational cost. |
NF Net | February 2021 | An improved class of Normalizer-Free ResNets that match the performance of batch-normalized networks, offer faster training times, and introduce an adaptive gradient clipping technique to overcome instabilities associated with deep ResNets. |
Conv Mixer | January 2022 | Processes image patches using standard convolutions for mixing spatial and channel dimensions. |
ConvNeXt | January 2022 | A pure ConvNet model, evolved from standard ResNet design, that competes well with Transformers in accuracy and scalability. |
ConvNeXt V2 | January 2023 | Incorporates a fully convolutional MAE framework and a Global Response Normalization (GRN) layer, boosting performance across multiple benchmarks. |
Efficient Net V2 | April 2021 | A new family of convolutional networks, achieves faster training speed and better parameter efficiency than previous models through neural architecture search and scaling, with progressive learning allowing for improved accuracy on various datasets while training up to 11x faster. |
MobileNetV4 | April 2024 | Features a universally efficient architecture design, including the Universal Inverted Bottleneck (UIB) search block, Mobile MQA attention block, and an optimized neural architecture search recipe, which enables it to achieve high accuracy and efficiency on various mobile devices and accelerators. |
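The residual connection introduced by ResNet (referenced above) is simple to sketch: a block learns a residual F(x) and adds the shortcut x back. The snippet below is a minimal illustrative block with assumed channel counts and layer choices, not the exact ResNet building block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)   # shortcut connection bypasses the block

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 56, 56])
```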
Object Detection
Paper | Date | Description |
---|---|---|
SSD | December 2015 | Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map. |
Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples (see the focal-loss sketch after this table). |
DETR | May 2020 | A novel object detection model that treats object detection as a set prediction problem, eliminating the need for hand-designed components. |
OWL ViT | May 2022 | Employs Vision Transformers, CLIP-based contrastive pre-training, and bipartite matching loss for open-vocabulary detection, utilizing image-level pre-training, multihead attention pooling, and mosaic image augmentation. |
Segment Anything Model | April 2023 | Introduces a novel image segmentation task, model, and dataset, aiming to enable prompt-able, zero-shot transfer learning in computer vision. |
SAM 2 | July 2024 | A foundation model towards solving promptable visual segmentation in images and videos based on a simple transformer architecture with streaming memory for real-time video processing. |
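The focal loss row above has a compact form: the standard cross-entropy term is down-weighted by (1 - p_t)^gamma so that easy, well-classified examples contribute little. The sketch below is a binary-classification illustration; the gamma and alpha values are the commonly used defaults, assumed here rather than taken from the entry.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Easy examples (p_t close to 1) are down-weighted by (1 - p_t)^gamma.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(16)
targets = torch.randint(0, 2, (16,)).float()
print(focal_loss(logits, targets))
```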
Region-based Convolutional Neural Networks
Paper | Date | Description |
---|---|---|
RCNN | November 2013 | Uses selective search for region proposals, CNNs for feature extraction, SVM for classification followed by box offset regression. |
Fast RCNN | April 2015 | Processes entire image through CNN, employs RoI Pooling to extract feature vectors from ROIs, followed by classification and BBox regression. |
Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector, collaboratively predict object regions by sharing convolutional features. |
Mask RCNN | March 2017 | Extends Faster R-CNN to solve instance segmentation tasks, by adding a branch for predicting an object mask in parallel with the existing branch. |
Cascade RCNN | December 2017 | Proposes a multi-stage approach where detectors are trained with progressively higher IoU thresholds, improving selectivity against false positives. |
Document AI
Paper | Date | Description |
---|---|---|
Table Net | January 2020 | An end-to-end deep learning model designed for both table detection and structure recognition. |
SPADE | May 2020 | Formulates Information Extraction (IE) as a spatial dependency parsing problem. |
Layout Parser | March 2021 | A library integrating Detectron2, CNN-RNN OCR, layout structures, TensorFlow/PyTorch, and a Model Zoo. The toolkit features Tesseract and Google Cloud Vision for OCR, active learning tools, and a community platform, ensuring efficiency and adaptability. |
Layout Reader | August 2021 | A seq2seq model that accurately predicts reading order, text, and layout information from document images. |
Donut | November 2021 | An OCR-free Encoder-Decoder Transformer model. The encoder takes in images, decoder takes in prompts & encoded images to generate the required text. |
DiT | March 2022 | An Image Transformer pre-trained (self-supervised) on document images. |
Pix2Struct | October 2022 | A pretrained image-to-text model designed for visual language understanding, particularly in tasks involving visually-situated language. |
Matcha | December 2022 | Leverages Pix2Struct, and introduces pretraining tasks focused on math reasoning and chart derendering to improve chart and plot comprehension, enhancing understanding in diverse visual language tasks. |
DePlot | December 2022 | Built upon MatCha, standardises plot to table task, translating plots into linearized tables (markdown) for processing by LLMs. |
UDoP | December 2022 | Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation. |
GeoLayoutLM | April 2023 | Explicitly models geometric relations in pre-training and enhances feature representation. |
Nougat | August 2023 | A Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language. |
LMDX | September 2023 | A methodology to adapt arbitrary LLMs for document information extraction. |
DocLLM | January 2024 | A lightweight extension to traditional LLMs that focuses on reasoning over visual documents, by incorporating textual semantics and spatial layout without expensive image encoders. |
Layout Transformers
Paper | Date | Description |
---|---|---|
Layout LM | December 2019 | Utilises BERT as the backbone, adds two new input embeddings: 2-D position embedding and image embedding (the latter only for downstream tasks); see the sketch after this table. |
LamBERT | February 2020 | Utilises RoBERTa as the backbone and adds Layout embeddings along with relative bias. |
Layout LM v2 | December 2020 | Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage and learn end-to-end cross-modal interaction. |
Structural LM | May 2021 | Utilises BERT as the backbone and feeds text, 1D position, and 2D cell-level embeddings to the transformer model. |
Doc Former | June 2021 | Encoder-only transformer with a CNN backbone for visual feature extraction, combines text, vision, and spatial features through a multi-modal self-attention layer. |
BROS | August 2021 | Built upon BERT, encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy. |
LiLT | February 2022 | Introduced Bi-directional attention complementation mechanism (BiACM) to accomplish the cross-modal interaction of text and layout. |
Layout LM V3 | April 2022 | A unified text-image multimodal Transformer to learn cross-modal representations, that takes the concatenation of text embeddings and image embeddings as input. |
ERNIE Layout | October 2022 | Reorganizes tokens using layout information, combines text and visual embeddings, utilizes multi-modal transformers with spatial aware disentangled attention. |
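As referenced in the Layout LM row, a minimal PyTorch sketch of the 2-D position embeddings: each token's embedding is summed with embeddings looked up from its quantized bounding-box coordinates. The hidden size and the 0-1000 coordinate range follow common convention and are assumptions here.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1001):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x_emb = nn.Embedding(max_coord, hidden)  # shared for x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)  # shared for y0 and y1

    def forward(self, input_ids, bboxes):
        # bboxes: (batch, seq, 4) with integer (x0, y0, x1, y1) in [0, 1000]
        x0, y0, x1, y1 = bboxes.unbind(-1)
        return (
            self.tok(input_ids)
            + self.x_emb(x0) + self.y_emb(y0)
            + self.x_emb(x1) + self.y_emb(y1)
        )

emb = LayoutEmbedding()
ids = torch.randint(0, 30522, (1, 6))
boxes = torch.randint(0, 1001, (1, 6, 4))
print(emb(ids, boxes).shape)  # torch.Size([1, 6, 768])
```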
Generative Adversarial Networks
Paper | Date | Description |
---|---|---|
Generative Adversarial Networks | June 2014 | Introduces a framework in which a generative model and a discriminative model are trained simultaneously in a minimax game (see the training-loop sketch after this table). |
Conditional Generative Adversarial Networks | November 2014 | A method for training GANs that enables generation conditioned on specific inputs, by feeding the condition to both the generator and discriminator networks. |
Deep Convolutional Generative Adversarial Networks | November 2015 | Demonstrates the ability of CNNs for unsupervised learning using specific architectural constraints. |
Improved GAN | June 2016 | Presents a variety of new architectural features and training procedures that can be applied to the generative adversarial networks (GANs) framework. |
Wasserstein Generative Adversarial Networks | January 2017 | An alternative GAN training algorithm that enhances learning stability, mitigates issues like mode collapse. |
Cycle GAN | March 2017 | An approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples by leveraging adversarial losses and cycle consistency constraints, using two GANs. |
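As referenced in the Generative Adversarial Networks row, a bare-bones PyTorch sketch of the minimax training loop on 1-D toy data: the discriminator is trained to separate real from generated samples, while the generator is trained to fool it (using the usual non-saturating variant). Network sizes and the toy data distribution are illustrative choices.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # noise z -> fake sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # toy "real" data drawn from N(3, 0.5)
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real samples toward label 1 and fakes toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into predicting 1 on fakes.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```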
Tabular Deep Learning
Paper | Date | Description |
---|---|---|
Entity Embeddings | April 2016 | Maps categorical variables into continuous vector spaces through neural network learning, revealing the intrinsic properties of the categorical variables. |
Wide and Deep Learning | June 2016 | Jointly trains a wide linear model (for memorization of specific feature interactions) and a deep neural network (for generalization to unseen combinations). |
Deep and Cross Network | August 2017 | Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. |
Tab Transformer | December 2020 | Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings. |
Tabular ResNet | June 2021 | An MLP with skip connections. |
Feature Tokenizer Transformer | June 2021 | Transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. |
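A compact sketch of the Feature Tokenizer Transformer entry above: each column, categorical or numerical, becomes its own d-dimensional token, a [CLS] token is prepended, and a standard Transformer encoder runs over the tokens. Dimensions and the numerical/categorical split are illustrative.

```python
import torch
import torch.nn as nn

class FTTransformer(nn.Module):
    def __init__(self, n_num, cat_cardinalities, d=64, n_layers=3, n_classes=2):
        super().__init__()
        self.num_w = nn.Parameter(torch.randn(n_num, d))   # per-feature scaling vectors
        self.num_b = nn.Parameter(torch.randn(n_num, d))   # per-feature biases
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d) for c in cat_cardinalities])
        self.cls = nn.Parameter(torch.randn(1, 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d, n_classes)

    def forward(self, x_num, x_cat):
        # x_num: (B, n_num) floats; x_cat: (B, n_cat) integer category ids
        num_tokens = x_num.unsqueeze(-1) * self.num_w + self.num_b          # (B, n_num, d)
        cat_tokens = torch.stack([e(x_cat[:, i]) for i, e in enumerate(self.cat_emb)], dim=1)
        tokens = torch.cat([self.cls.expand(x_num.size(0), -1, -1), num_tokens, cat_tokens], dim=1)
        return self.head(self.encoder(tokens)[:, 0])                        # predict from [CLS]

model = FTTransformer(n_num=3, cat_cardinalities=[10, 4])
logits = model(torch.randn(8, 3), torch.randint(0, 4, (8, 2)))
print(logits.shape)  # torch.Size([8, 2])
```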
Datasets
Paper | Date | Description |
---|---|---|
Obelics | June 2023 | An open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages, 353M associated images, and 115B text tokens, extracted from CommonCrawl. |
Dolma | January 2024 | An open corpus of three trillion tokens designed to support language model pretraining research. |
Aya Dataset | February 2024 | A human-curated instruction-following dataset that spans 65 languages, created to bridge the language gap in datasets for natural language processing. |
WebSight | March 2024 | A synthetic dataset consisting of 2M pairs of HTML codes and their corresponding screenshots, generated through LLMs, aimed to accelerate research for converting a screenshot into a corresponding HTML. |
Cosmopedia | March 2024 | Synthetic data containing over 30M files and 25B tokens, generated by Mixtral-8x7B-Instruct-v0.1, aimed at reproducing the training data of Phi-1.5. |
RewardBench | March 2024 | A benchmark dataset and code-base designed to evaluate reward models used in RLHF. |
Fine Web | May 2024 | A large-scale dataset for pretraining LLMs, consisting of 15T tokens, shown to produce better-performing models than other open pretraining datasets. |
Cosmopedia v2 | July 2024 | An enhanced version of Cosmopedia, with increased emphasis on prompt optimization. |
Docmatix | July 2024 | A massive dataset for DocVQA containing 2.4M images, 9.5M question-answer pairs, and 1.3M PDF documents, generated by taking transcriptions from the PDFA OCR dataset and using a Phi-3-small model to generate Q/A pairs. |
Pixmo | September 2024 | A high-quality dataset of detailed image descriptions collected through speech-based annotations, enabling the creation of more robust and accurate VLMs. |
Smol Talk | November 2024 | A synthetic instruction-following dataset comprising 1 million samples, built using a fine-tuned LLM on a diverse range of instruction-following datasets and then generating synthetic conversations using various prompts and instructions to improve instruction following, chat, and reasoning capabilities. |
LLM Training
Paper | Date | Description |
---|---|---|
Direct Preference Optimization | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes LLMs to align with human preferences without the need for reinforcement learning, by directly optimizing for the policy best satisfying the preferences with a simple classification objective (see the loss sketch after this table). |
RAFT | March 2024 | A training method that enhances the performance of LLMs for open-book in-domain question answering by training them to ignore irrelevant documents, cite verbatim relevant passages, and promote logical reasoning. |
RLHF Workflow | May 2024 | Provides a detailed recipe for online iterative RLHF and achieves state-of-the-art performance on various benchmarks using fully open-source datasets. |
Magpie | June 2024 | A self-synthesis method that extracts high-quality instruction data at scale by prompting an aligned LLM with left-side templates, generating 4M instructions and their corresponding responses. |
Instruction Pre-Training | June 2024 | A framework to augment massive raw corpora with instruction-response pairs enabling supervised multitask pretraining of LMs. |
Self-Taught Evaluators | August 2024 | An iterative training scheme that uses only synthetically generated preference data, without human annotations, to improve an LLM's ability to judge the quality of model responses by iteratively generating contrasting model outputs, training an LLM-as-a-Judge to produce reasoning traces and judgments, and using the improved predictions in subsequent iterations. |
Direct Judgement Preference Optimization | September 2024 | Proposes learning through preference optimization to enhance the evaluation capabilities of LLM judges which are trained on three approaches: Chain-of-Thought Critique, Standard Judgement, and Response Deduction across various use cases, including single rating, pairwise comparison, and classification. |
LongCite | October 2024 | A system comprising LongBench-Cite benchmark, CoF pipeline for generating cited QA instances, LongCite-45k dataset, and LongCite-8B/9B models trained on this dataset to improve the trustworthiness of long-context LLMs by enabling them to generate responses with fine-grained sentence-level citations. |
Thought Preference Optimization | October 2024 | Iteratively trains LLMs to generate useful "thoughts" that improve response quality by prompting the model to produce thought-response pairs, scoring the responses with a judge model, creating preference pairs from the highest and lowest-scoring responses and their associated thoughts, and then using these pairs with DPO or IRPO loss to optimize the thought generation process while mitigating judge model length bias through score normalization. |
Self-Consistency Preference Optimization | November 2024 | An unsupervised iterative training method for LLMs that leverages the concept of self-consistency to create preference pairs by selecting the most consistent response as the chosen response and the least consistent one as the rejected response, and then optimizes a weighted loss function that prioritizes pairs with larger vote margins, reflecting the model's confidence in the preference. |
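As referenced in the Direct Preference Optimization row, a minimal sketch of the DPO objective: a logistic loss on the difference between the policy/reference log-ratios of the chosen and rejected responses. The toy log-probabilities in the example call are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are (batch,) tensors of summed sequence log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps          # how much the policy favours the chosen response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps    # ... and the rejected one
    # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.0]), torch.tensor([-14.0]))
print(loss.item())
```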
Miscellaneous
Paper | Date | Description |
---|---|---|
ColD Fusion | December 2022 | A method enabling the benefits of multitask learning through distributed computation without data sharing and improving model performance. |
Are Emergent Abilities of Large Language Models a Mirage? | April 2023 | This paper presents an alternative explanation for emergent abilities, i.e. emergent abilities are created by the researcher’s choice of metrics, not fundamental changes in model family behaviour on specific tasks with scale. |
Scaling Data-Constrained Language Models | May 2023 | This study investigates scaling language models in data-constrained regimes. |
RAGAS | September 2023 | A framework for reference-free evaluation of RAG systems, assessing the retrieval system's ability to find relevant context, the LLM's faithfulness in using that context, and the overall quality of the generated response. |
DSPy | October 2023 | A programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computation graphs where LMs are invoked through declarative modules, optimizing their use through a structured framework of signatures, modules, and teleprompters to automate and enhance text transformation tasks. |
An In-depth Look at Gemini's Language Abilities | December 2023 | A third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results. |
STORM | February 2024 | A writing system that addresses the prewriting stage of long-form article generation by researching diverse perspectives, simulating multi-perspective question-asking, and curating information to create an outline, ultimately leading to more organized and comprehensive articles compared to baseline methods. |
PromptWizard | May 2024 | A framework that leverages LLMs to iteratively synthesize and refine prompts tailored to specific tasks by optimizing both prompt instructions and in-context examples, maximizing model performance. |
Monte Carlo Tree Self-refine | June 2024 | Integrates LLMs with Monte Carlo Tree Search to enhance performance in complex mathematical reasoning tasks, leveraging systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks. |
Proofread | June 2024 | A Gboard feature powered by a server-side LLM, enabling seamless sentence-level and paragraph-level corrections with a single tap. |
CriticGPT | June 2024 | A model based on GPT-4 trained with RLHF to catch errors in ChatGPT's code output, accepts a question-answer pair as input and outputs a structured critique that highlights potential problems in the answer. |
Gemma APS | June 2024 | Proposes a scalable yet accurate proposition segmentation model by framing proposition segmentation as a supervised task and training LLMs on existing annotated datasets. |
ShieldGemma | July 2024 | A comprehensive suite of LLM-based safety content moderation models ranging from 2B to 27B parameters built upon Gemma2 that provide predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. |
Spreadsheet LLM | July 2024 | An efficient encoding method that utilizes SheetCompressor, a framework comprising structural anchor based compression, inverse index translation, and data format aware aggregation, to effectively compress spreadsheets for LLMs, and Chain of Spreadsheet for spreadsheet understanding and spreadsheet QA task. |
OmniParser | August 2024 | A method for parsing user interface screenshots into structured elements, enhancing the ability of GPT-4V to generate actions grounded in the interface by accurately identifying interactable icons and understanding element semantics. |
Reader-LM | September 2024 | Small multilingual models specifically trained to generate clean markdown directly from noisy raw HTML, with a context length of up to 256K tokens. |
DataGemma | September 2024 | A set of models that aims to reduce hallucinations in LLMs by grounding them in the factual data of Google's Data Commons, allowing users to ask questions in natural language and receive responses based on verified information from trusted sources. |
GSM-Symbolic | October 2024 | Investigates the true mathematical reasoning capabilities of LLMs by introducing GSM-Symbolic, a new benchmark based on symbolic templates, revealing that LLMs exhibit inconsistent performance, struggle with complex questions, and appear to rely on pattern recognition rather than genuine logical reasoning. |
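An illustrative sketch of the symbolic-template idea behind the GSM-Symbolic entry above: a grade-school word problem is written as a template over names and numbers, and many instances are sampled so accuracy can be measured across surface variations of the same underlying question. The template below is a made-up example, not one taken from the benchmark.

```python
import random

TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{name} then gives away {c} apples. How many apples are left?")

def sample_instance(rng):
    a, b = rng.randint(5, 30), rng.randint(5, 30)
    c = rng.randint(1, a + b - 1)          # keep the gold answer positive
    question = TEMPLATE.format(name=rng.choice(["Ava", "Liam", "Noah"]), a=a, b=b, c=c)
    return question, a + b - c             # (question text, gold answer)

rng = random.Random(0)
for _ in range(3):
    q, gold = sample_instance(rng)
    print(q, "->", gold)
```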
Literature Reviewed
- Convolutional Neural Networks
- Layout Transformers
- Region-based Convolutional Neural Networks
- Tabular Deep Learning
- Generative Adversarial Networks
- Parameter Efficient Fine Tuning
Reading Lists
- Language Models
- Encoder Only Language Transformers
- Decoder Only Language Transformers
- Language Models for Retrieval
- Small LLMs
- LLMs for Code
- GPT Models
- LLaMA Models
- Gemini / Gemma Models
- Wizard Models
- Orca Series
- BLIP Series
- LLM Lingua Series
- Multi Task Language Models
- Layout Aware Transformers
- Retrieval and Representation Learning
- LLM Evaluation
- Vision Transformers
- Multi Modal Transformers
- Convolutional Neural Networks
- Object Detection
- Region Based Convolutional Neural Networks
- Document Information Processing
Reach out to Ritvik or Elvis if you have any questions.
If you are interested in contributing, feel free to open a PR.