ML Papers Explained

Explanations of key concepts in ML

Language Models

| Paper | Date | Description |
| --- | --- | --- |
| Transformer | June 2017 | An encoder-decoder model that introduced the multi-head attention mechanism for the language translation task (see the attention sketch after this table). |
| Elmo | February 2018 | Deep contextualized word representations that capture both intricate aspects of word usage and contextual variations across language contexts. |
| Marian MT | April 2018 | A neural machine translation framework written entirely in C++ with minimal dependencies, designed for high training and translation speed. |
| GPT | June 2018 | A decoder-only transformer that is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. |
| BERT | October 2018 | Introduced pre-training for encoder transformers. Uses a unified architecture across different tasks. |
| Transformer XL | January 2019 | Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism. |
| XLM | January 2019 | Proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that relies only on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. |
| GPT 2 | February 2019 | Demonstrates that language models begin to learn various language processing tasks without any explicit supervision. |
| Sparse Transformer | April 2019 | Introduced sparse factorizations of the attention matrix to reduce the time and memory consumption to O(n√n) in terms of sequence length. |
| UniLM | May 2019 | Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks. |
| XLNet | June 2019 | Extension of Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives. |
| RoBERTa | July 2019 | Builds upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks. |
| Sentence BERT | August 2019 | A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine similarity. |
| CTRL | September 2019 | A 1.63B language model that can generate text conditioned on control codes that govern style, content, and task-specific behavior, allowing for more explicit control over text generation. |
| Tiny BERT | September 2019 | Uses attention transfer and task-specific distillation for distilling BERT. |
| ALBERT | September 2019 | Presents parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. |
| Distil BERT | October 2019 | Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective. |
| T5 | October 2019 | A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format. |
| BART | October 2019 | An encoder-decoder model pretrained to reconstruct the original text from corrupted versions of it. |
| XLM-Roberta | November 2019 | A multilingual masked language model pre-trained on text in 100 languages; shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. |
| Pegasus | December 2019 | A self-supervised pre-training objective for abstractive text summarization that proposes removing/masking important sentences from an input document and generating them together as one output sequence. |
| Reformer | January 2020 | Improves the efficiency of Transformers by replacing dot-product attention with locality-sensitive hashing (O(L log L) complexity), using reversible residual layers to store activations only once, and splitting feed-forward layer activations into chunks, allowing it to perform on par with Transformer models while being much more memory-efficient and faster on long sequences. |
| mBART | January 2020 | A multilingual sequence-to-sequence denoising auto-encoder that pre-trains a complete autoregressive model on large-scale monolingual corpora across many languages using the BART objective, achieving significant performance gains in machine translation tasks. |
| UniLMv2 | February 2020 | Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks. |
| ELECTRA | March 2020 | Proposes a sample-efficient pre-training task called replaced token detection, which corrupts input by replacing some tokens with plausible alternatives and trains a discriminative model to predict whether each token was replaced or not. |
| FastBERT | April 2020 | A speed-tunable encoder with adaptive inference time, having branches at each transformer output to enable early exits. |
| MobileBERT | April 2020 | A compressed and faster version of BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer. |
| Longformer | April 2020 | Introduces a linearly scalable attention mechanism, allowing it to handle texts of extended length. |
| GPT 3 | May 2020 | Demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance. |
| DeBERTa | June 2020 | Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training. |
| DeBERTa v2 | June 2020 | An enhanced version of DeBERTa featuring a new vocabulary, nGiE integration, optimized attention mechanisms, additional model sizes, and improved tokenization. |
| T5 v1.1 | July 2020 | An enhanced version of the original T5 model, featuring improvements such as GEGLU activation, no dropout in pre-training, exclusive pre-training on C4, and no parameter sharing between embedding and classifier layers. |
| mT5 | October 2020 | A multilingual variant of T5 based on T5 v1.1, pre-trained on a new Common Crawl-based dataset covering 101 languages (mC4). |
| Codex | July 2021 | A GPT language model finetuned on publicly available code from GitHub. |
| FLAN | September 2021 | An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions. |
| T0 | October 2021 | An encoder-decoder model fine-tuned on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets. |
| DeBERTa V3 | November 2021 | Enhances the DeBERTa architecture by introducing replaced token detection (RTD) instead of masked language modeling (MLM), along with a novel gradient-disentangled embedding sharing method, exhibiting superior performance across various natural language understanding tasks. |
| WebGPT | December 2021 | A fine-tuned GPT-3 model utilizing text-based web browsing, trained via imitation learning and human feedback, enhancing its ability to answer long-form questions with factual accuracy. |
| Gopher | December 2021 | Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B on 152 tasks. |
| LaMDA | January 2022 | Transformer-based models specialized for dialog, pre-trained on public dialog data and web text. |
| Instruct GPT | March 2022 | Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent. |
| CodeGen | March 2022 | An LLM trained for program synthesis using input-output examples and natural language descriptions. |
| Chinchilla | March 2022 | Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (scaling laws). |
| PaLM | April 2022 | A 540B-parameter, densely activated Transformer, trained using Pathways, an ML system that enables highly efficient training across multiple TPU Pods. |
| GPT-NeoX-20B | April 2022 | An autoregressive LLM trained on the Pile, and the largest dense model with publicly available weights at the time of submission. |
| OPT | May 2022 | A suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, with OPT-175B being comparable to GPT-3. |
| Flan T5, Flan PaLM | October 2022 | Explores instruction finetuning with a particular focus on scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought data. |
| BLOOM | November 2022 | A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology. |
| BLOOMZ, mT0 | November 2022 | Applies multitask prompted finetuning to pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus. |
| Galactica | November 2022 | An LLM trained on scientific data, thus specializing in scientific knowledge. |
| ChatGPT | November 2022 | An interactive model designed to engage in conversations, built on top of GPT-3.5. |
| Self Instruct | December 2022 | A framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. |
| LLaMA | February 2023 | A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained exclusively on publicly available datasets. |
| Toolformer | February 2023 | An LLM trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. |
| Alpaca | March 2023 | A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of Self-Instruct using text-davinci-003. |
| GPT 4 | March 2023 | A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs. |
| Vicuna | March 2023 | A 13B LLaMA chatbot fine-tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca. |
| BloombergGPT | March 2023 | A 50B language model trained on general-purpose and domain-specific data to support a wide range of tasks within the financial industry. |
| Pythia | April 2023 | A suite of 16 LLMs, all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. |
| WizardLM | April 2023 | Introduces Evol-Instruct, a method to generate large amounts of instruction data with varying levels of complexity using an LLM instead of humans, to finetune a LLaMA model. |
| CodeGen2 | May 2023 | Proposes an approach to make the training of LLMs for program synthesis more efficient by unifying key components of model architectures, learning methods, infill sampling, and data distributions. |
| PaLM 2 | May 2023 | Successor of PaLM, trained on a mixture of different pre-training objectives in order to understand different aspects of language. |
| LIMA | May 2023 | A LLaMA model fine-tuned on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. |
| Gorilla | May 2023 | A retrieve-aware finetuned LLaMA-7B model, specifically for API calls. |
| Orca | June 2023 | Presents a novel approach that addresses the limitations of instruction tuning by leveraging richer imitation signals, scaling tasks and instructions, and utilizing a teacher assistant to help with progressive learning. |
| Falcon | June 2023 | An open-source LLM trained on properly filtered and deduplicated web data alone. |
| Phi-1 | June 2023 | An LLM for code, trained using textbook-quality data from the web and synthetically generated textbooks and exercises with GPT-3.5. |
| WizardCoder | June 2023 | Enhances the performance of the open-source code LLM StarCoder through the application of Code Evol-Instruct. |
| LLaMA 2 | July 2023 | Successor of LLaMA. LLaMA 2-Chat is optimized for dialogue use cases. |
| Tool LLM | July 2023 | A LLaMA model finetuned on an instruction-tuning dataset for tool use, automatically created using ChatGPT. |
| Humpback | August 2023 | LLaMA finetuned using instruction backtranslation. |
| Code LLaMA | August 2023 | A LLaMA 2-based LLM for code. |
| WizardMath | August 2023 | Proposes the Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method, applied to Llama-2 to enhance its mathematical reasoning abilities. |
| LLaMA 2 Long | September 2023 | A series of long-context LLMs that support effective context windows of up to 32,768 tokens. |
| Phi-1.5 | September 2023 | Follows the phi-1 approach, focusing this time on common sense reasoning in natural language. |
| Mistral 7B | October 2023 | Leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with reduced inference cost. |
| Llemma | October 2023 | An LLM for mathematics, formed by continued pretraining of Code Llama on a mixture of scientific papers, web data containing mathematics, and mathematical code. |
| CodeFusion | October 2023 | A diffusion code generation model that iteratively refines entire programs based on encoded natural language, overcoming the limitation of auto-regressive models in code generation by allowing reconsideration of earlier tokens. |
| Zephyr 7B | October 2023 | Utilizes dDPO and AI Feedback (AIF) preference data to achieve superior intent alignment in chat-based language modeling. |
| Grok 1 | November 2023 | A 314B Mixture-of-Experts model, modeled after the Hitchhiker's Guide to the Galaxy, designed to be witty. |
| Orca 2 | November 2023 | Introduces Cautious Reasoning for training smaller models to select the most effective solution strategy based on the problem at hand, by crafting data with task-specific system instruction(s) corresponding to the chosen strategy in order to obtain teacher responses for each task, and replacing the student's system instruction with a generic one vacated of details of how to approach the task. |
| Phi-2 | December 2023 | A 2.7B model, developed to explore whether emergent abilities achieved by large-scale language models can also be achieved at a smaller scale using strategic choices for training, such as data selection. |
| TinyLlama | January 2024 | A 1.1B language model built upon the architecture and tokenizer of Llama 2, pre-trained on around 1 trillion tokens for approximately 3 epochs, leveraging FlashAttention and Grouped Query Attention to achieve better computational efficiency. |
| Mixtral 8x7B | January 2024 | A sparse Mixture of Experts language model trained with multilingual data using a context size of 32k tokens. |
| H2O Danube 1.8B | January 2024 | A language model trained on 1T tokens following the core principles of LLama 2 and Mistral, leveraging and refining various techniques for pre-training large language models. |
| OLMo | February 2024 | A state-of-the-art, truly open language model and framework that includes training data, code, and tools for building, studying, and advancing language models. |
| Orca Math | February 2024 | A fine-tuned Mistral-7B that excels at math problems without external tools, utilizing a high-quality synthetic dataset of 200K problems created through multi-agent collaboration and an iterative learning process that involves practicing problem-solving, receiving feedback, and learning from preference pairs incorporating the model's solutions and feedback. |
| Gemma | February 2024 | A family of 2B and 7B state-of-the-art language models based on Google's Gemini models, offering advancements in language understanding, reasoning, and safety. |
| Aya 101 | February 2024 | A massively multilingual generative language model that follows instructions in 101 languages, trained by finetuning mT5. |
| Hawk, Griffin | February 2024 | Introduces the Real-Gated Linear Recurrent Unit layer that forms the core of a new recurrent block, replacing Multi-Query Attention for better efficiency and scalability. |
| WRAP | March 2024 | Uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles to jointly pre-train LLMs on real and synthetic rephrases. |
| Command R | March 2024 | An LLM optimized for retrieval-augmented generation and tool use, across multiple languages. |
| DBRX | March 2024 | A 132B open, general-purpose, fine-grained sparse MoE LLM surpassing GPT-3.5 and competitive with Gemini 1.0 Pro. |
| Grok 1.5 | March 2024 | An advancement over Grok, capable of long-context understanding up to 128k tokens and advanced reasoning. |
| Command R+ | April 2024 | Successor of Command R with improved performance for retrieval-augmented generation and tool use, across multiple languages. |
| Mixtral 8x22B | April 2024 | An open-weight AI model optimised for performance and efficiency, with capabilities such as fluency in multiple languages, strong mathematics and coding abilities, and precise information recall from large documents. |
| CodeGemma | April 2024 | Open code models based on Gemma models, further trained on over 500 billion tokens of primarily code. |
| RecurrentGemma | April 2024 | Based on Griffin, uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently. |
| Rho-1 | April 2024 | Introduces Selective Language Modelling, which optimizes the loss only on tokens that align with a desired distribution, utilizing a reference model to score and select tokens. |
| Phi-3 | April 2024 | A series of language models trained on heavily filtered web and synthetic data, achieving performance comparable to much larger models like Mixtral 8x7B and GPT-3.5. |
| Open ELM | April 2024 | A fully open language model designed to enhance accuracy while using fewer parameters and pre-training tokens. Utilizes a layer-wise scaling strategy to allocate smaller dimensions in early layers, expanding in later layers. |
| H2O Danube2 1.8B | April 2024 | An updated version of the original H2O-Danube model, with improvements including removal of sliding window attention, changes to the tokenizer, and adjustments to the training data, resulting in significant performance enhancements. |
| Granite Code Models | May 2024 | A family of code models ranging from 3B to 34B, trained on 3.5-4.5T tokens of code written in 116 programming languages. |
| Codestral 22B | May 2024 | An open-weight model designed for code generation tasks, trained on over 80 programming languages, and licensed under the Mistral AI Non-Production License, allowing developers to use it for research and testing purposes. |
| Aya 23 | May 2024 | A family of multilingual language models supporting 23 languages, designed to balance breadth and depth by allocating more capacity to fewer languages during pre-training. |
| Gemma 2 | June 2024 | Utilizes interleaved local-global attention and grouped-query attention, trained with knowledge distillation instead of next-token prediction to achieve performance competitive with larger models. |
| Orca 3 (Agent Instruct) | June 2024 | A fine-tuned Mistral-7B trained through Generative Teaching on synthetic data generated with the proposed AgentInstruct framework, which generates both the prompts and responses, using only raw data sources like text documents and code files as seeds. |
| Mathstral | July 2024 | A 7B model designed for math reasoning and scientific discovery, based on Mistral 7B and specializing in STEM subjects. |
| Mistral Nemo | July 2024 | A 12B language model built in collaboration between Mistral and NVIDIA, featuring a context window of 128K, an efficient tokenizer, and quantization-aware training, enabling FP8 inference without any performance loss. |
| Smol LM | July 2024 | A family of small models with 135M, 360M, and 1.7B parameters, utilizing Grouped-Query Attention (GQA), embedding tying, and a context length of 2048 tokens, trained on a new open-source high-quality dataset. |
| Mistral Large 2 | July 2024 | A 123B model that offers significant improvements in code generation, mathematics, and reasoning capabilities, advanced function calling, a 128k context window, and support for dozens of languages and over 80 coding languages. |
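
The multi-head attention mechanism introduced by the Transformer underlies most of the models in the table above. Below is a minimal PyTorch sketch of scaled dot-product attention with multiple heads; the layer sizes and tensor names are illustrative assumptions, not code from any particular paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative sizes, not a reference implementation)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Joint projection for queries, keys, and values, plus an output projection.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # Reshape to (batch, heads, seq, d_head).
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        context = scores.softmax(dim=-1) @ v
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(context)

x = torch.randn(2, 10, 512)           # (batch, sequence length, model dim)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 512])
```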

Multi Modal Language Models

| Paper | Date | Description |
| --- | --- | --- |
| BLIP | February 2022 | A Vision-Language Pre-training (VLP) framework that introduces Multimodal mixture of Encoder-Decoder (MED) and Captioning and Filtering (CapFilt), a new dataset bootstrapping method for learning from noisy image-text pairs. |
| Flamingo | April 2022 | Visual language models enabling seamless handling of interleaved visual and textual data, and facilitating few-shot learning on large-scale web corpora. |
| BLIP 2 | January 2023 | A Vision-Language Pre-training (VLP) framework that proposes Q-Former, a trainable module to bridge the gap between a frozen image encoder and a frozen LLM to bootstrap vision-language pre-training. |
| LLaVA 1 | April 2023 | A large multimodal model connecting CLIP and Vicuna, trained end-to-end on instruction-following data generated through GPT-4 from image-text pairs. |
| InstructBLIP | May 2023 | Introduces an instruction-aware Query Transformer to extract informative features tailored to the given instruction, studying vision-language instruction tuning based on the pretrained BLIP-2 models. |
| Idefics | June 2023 | 9B and 80B multimodal models trained on Obelics, an open web-scale dataset of interleaved image-text documents curated in this work. |
| GPT-4V | September 2023 | A multimodal model that combines text and vision capabilities, allowing users to instruct it to analyze image inputs. |
| LLaVA 1.5 | October 2023 | An enhanced version of the LLaVA model that incorporates a CLIP-ViT-L-336px with an MLP projection and academic-task-oriented VQA data to set new benchmarks in large multimodal model (LMM) research. |
| Gemini 1.0 | December 2023 | A family of highly capable multimodal models, trained jointly across image, audio, video, and text data for the purpose of building a model with strong generalist capabilities across modalities. |
| MoE-LLaVA | January 2024 | A MoE-based sparse LVLM framework that activates only the top-k experts through routers during deployment, maintaining computational efficiency while achieving performance comparable to larger models. |
| LLaVA 1.6 | January 2024 | An improved version of LLaVA 1.5 with enhanced reasoning, OCR, and world knowledge capabilities, featuring increased image resolution. |
| Gemini 1.5 Pro | February 2024 | A highly compute-efficient multimodal mixture-of-experts model that excels in long-context retrieval tasks and understanding across text, video, and audio modalities. |
| Claude 3 | March 2024 | A family of VLMs consisting of Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, setting new industry standards for cognitive tasks and offering varying levels of intelligence, speed, and cost-efficiency. |
| MM1 | March 2024 | A multimodal LLM that combines a ViT-H image encoder with 378x378px resolution, pretrained on a data mix of image-text documents and text-only documents, scaled up to 3B, 7B, and 30B parameters for enhanced performance across various tasks. |
| Grok 1.5 V | April 2024 | The first multimodal model in the Grok series. |
| Idefics2 | April 2024 | Improvement upon Idefics1 with enhanced OCR capabilities, simplified architecture, and better pre-trained backbones, trained on a mixture of openly available datasets and fine-tuned on task-oriented data. |
| Phi 3 Vision | May 2024 | The first multimodal model in the Phi family, bringing the ability to reason over images and extract and reason over text from images. |
| GPT-4o | May 2024 | An omni model accepting and generating various types of inputs and outputs, including text, audio, images, and video. |
| Gemini 1.5 Flash | May 2024 | A more lightweight variant of Gemini 1.5 Pro, designed for efficiency with minimal regression in quality, making it suitable for applications where compute resources are limited. |
| Chameleon | May 2024 | A family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. |
| Claude 3.5 Sonnet | June 2024 | Surpasses previous versions and competitors in intelligence, speed, and cost-efficiency, excelling in graduate-level reasoning, undergraduate-level knowledge, coding proficiency, and visual reasoning. |
| GPT-4o mini | July 2024 | A cost-efficient small model that outperforms GPT-4 on chat preferences, enabling a broad range of tasks with low latency and supporting text, vision, and multimodal inputs and outputs. |
| Grok 2 | August 2024 | A frontier language model with state-of-the-art capabilities in chat, coding, and reasoning, on par with Claude 3.5 Sonnet and GPT-4-Turbo. |

Retrieval and Representation Learning

| Paper | Date | Description |
| --- | --- | --- |
| Dense Passage Retriever | April 2020 | Shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. |
| ColBERT | April 2020 | Introduces a late interaction architecture that adapts deep LMs (in particular, BERT) for efficient retrieval. |
| CLIP | February 2021 | A vision system that learns image representations from raw text-image pairs through pre-training, enabling zero-shot transfer to various downstream tasks. |
| ColBERTv2 | December 2021 | Couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. |
| Matryoshka Representation Learning | May 2022 | Encodes information at different granularities and allows a flexible representation that can adapt to multiple downstream tasks with varying computational resources using a single embedding (see the sketch after this table). |
| E5 | December 2022 | A family of text embeddings trained in a contrastive manner with weak supervision signals from a curated large-scale text pair dataset, CCPairs. |
| SigLip | March 2023 | A simple pairwise sigmoid loss function for language-image pre-training that operates solely on image-text pairs, allowing for larger batch sizes and better performance at smaller batch sizes. |
| E5 Mistral 7B | December 2023 | Leverages proprietary LLMs to generate diverse synthetic data to fine-tune open-source decoder-only LLMs for hundreds of thousands of text embedding tasks. |
| Nomic Embed Text v1 | February 2024 | A 137M-parameter, open-source English text embedding model with an 8192 context length that outperforms OpenAI's models on both short and long-context tasks. |
| Nomic Embed Text v1.5 | February 2024 | An advanced text embedding model that utilizes Matryoshka Representation Learning to offer flexible embedding sizes with minimal performance trade-offs. |
| NV Embed | May 2024 | Introduces architectural innovations and a training recipe to significantly enhance LLM performance in general-purpose text embedding tasks. |
| Nomic Embed Vision v1 and v1.5 | June 2024 | Aligns a vision encoder with the existing text encoders without destroying the downstream performance of the text encoder, to attain a unified multimodal latent space. |
| E5-V | July 2024 | A framework that adapts multimodal large language models to achieve universal multimodal embeddings by leveraging prompts and single-modality training on text pairs, demonstrating strong performance in multimodal embeddings without fine-tuning and eliminating the need for costly multimodal training data collection. |
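
The Matryoshka Representation Learning entry above describes embeddings whose leading dimensions remain useful on their own. Below is a minimal sketch of how such embeddings are typically consumed at inference time (truncate, re-normalize, compare by cosine similarity); the dimensions and random vectors are made up for illustration.

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` coordinates of a Matryoshka-style embedding
    and re-normalize so cosine similarity remains meaningful."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 768-dimensional embeddings standing in for a query and a document.
rng = np.random.default_rng(0)
query_emb, doc_emb = rng.normal(size=768), rng.normal(size=768)

# Compare at full size and at smaller, cheaper granularities.
for dim in (768, 256, 64):
    q = truncate_and_normalize(query_emb, dim)
    d = truncate_and_normalize(doc_emb, dim)
    print(dim, round(cosine_similarity(q, d), 4))
```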

Parameter Efficient Fine Tuning

| Paper | Date | Description |
| --- | --- | --- |
| LoRA | July 2021 | Introduces trainable rank decomposition matrices into each layer of a pre-trained Transformer model, significantly reducing the number of trainable parameters for downstream tasks (see the sketch after this table). |
| QLoRA | May 2023 | Allows efficient training of large models on limited GPU memory, through innovations like 4-bit NormalFloat (NF4), double quantization and paged optimisers. |
| LongLoRA | September 2023 | Enables context extension for large language models, achieving significant computation savings through sparse local attention and parameter-efficient fine-tuning. |
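
As a concrete illustration of the LoRA entry above, the sketch below wraps a frozen linear layer with a trainable low-rank update, W + (alpha/r)·BA. The rank, scaling, and initialization follow common convention but should be treated as illustrative assumptions rather than code from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.scale = alpha / rank
        # Low-rank factors: A projects down, B projects back up. B starts at zero,
        # so training begins exactly at the pretrained behaviour.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the two low-rank factors are trainable
```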

LLM Evaluation

| Paper | Date | Description |
| --- | --- | --- |
| Prometheus | October 2023 | A 13B fully open-source evaluation LLM trained on Feedback Collection, curated using GPT-4 (in this work). |
| Prometheus 2 | May 2024 | 7B and 8x7B evaluation LLMs that score high correlations with both human evaluators and proprietary LM-based judges on both direct assessment and pairwise ranking, obtained by merging Mistral models trained on Feedback Collection and Preference Collection (curated in this work). |

Compression, Pruning, Quantization

| Paper | Date | Description |
| --- | --- | --- |
| LLMLingua | October 2023 | A novel coarse-to-fine prompt compression method, incorporating a budget controller, an iterative token-level compression algorithm, and distribution alignment, achieving up to 20x compression with minimal performance loss. |
| LongLLMLingua | October 2023 | A novel approach for prompt compression to enhance performance in long-context scenarios using question-aware compression and document reordering. |
| LLMLingua2 | March 2024 | A novel approach to task-agnostic prompt compression, aiming to enhance generalizability, using data distillation and leveraging a Transformer encoder for token classification. |

Vision Models

| Paper | Date | Description |
| --- | --- | --- |
| Vision Transformer | October 2020 | Images are segmented into patches, which are treated as tokens, and a sequence of linear embeddings of these patches is input to a Transformer (see the sketch after this table). |
| DeiT | December 2020 | A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens. |
| Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
| BEiT | June 2021 | Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens to pretrain vision Transformers. |
| MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
| Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
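
The Vision Transformer row above describes turning an image into a sequence of patch tokens. Below is a minimal sketch assuming a 224x224 RGB input and 16x16 patches; it implements the patchify step with a strided convolution (a common trick equivalent to a per-patch linear projection) plus a learned class token and position embeddings, all as illustrative choices.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly embed them (ViT-style sketch)."""

    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 in_channels: int = 3, d_model: int = 768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to a linear
        # projection of each non-overlapping patch.
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        patches = self.proj(images)                  # (batch, d_model, H/16, W/16)
        tokens = patches.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

print(PatchEmbedding()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```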

Convolutional Neural Networks

| Paper | Date | Description |
| --- | --- | --- |
| Lenet | December 1998 | Introduced convolutions. |
| Alex Net | September 2012 | Introduced ReLU activation and Dropout to CNNs. Winner of ILSVRC 2012. |
| VGG | September 2014 | Used a large number of small-size filters in each layer to learn complex features. Achieved SOTA in ILSVRC 2014. |
| Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
| Inception Net v2 / Inception Net v3 | December 2015 | Design optimizations of the Inception Modules which improved performance and accuracy. |
| Res Net | December 2015 | Introduced residual connections, which are shortcuts that bypass one or more layers in the network. Winner of ILSVRC 2015. |
| Inception Net v4 / Inception ResNet | February 2016 | A hybrid approach combining Inception Net and ResNet. |
| Dense Net | August 2016 | Each layer receives input from all the previous layers, creating a dense network of connections between the layers and allowing it to learn more diverse features. |
| Xception | October 2016 | Based on Inception v3, but uses depthwise separable convolutions instead of Inception modules. |
| Res Next | November 2016 | Built over ResNet; introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups. |
| Mobile Net V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and the computation required (see the sketch after this table). |
| Mobile Net V2 | January 2018 | Built upon the MobileNetV1 architecture; uses inverted residuals and linear bottlenecks. |
| Mobile Net V3 | May 2019 | Uses AutoML to find the best possible neural network architecture for a given problem. |
| Efficient Net | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve high accuracy with a relatively low computational cost. |
| NF Net | February 2021 | An improved class of Normalizer-Free ResNets that match the performance of batch-normalized networks, offer faster training times, and introduce an adaptive gradient clipping technique to overcome instabilities associated with deep ResNets. |
| Conv Mixer | January 2022 | Processes image patches using standard convolutions for mixing spatial and channel dimensions. |
| ConvNeXt | January 2022 | A pure ConvNet model, evolved from the standard ResNet design, that competes well with Transformers in accuracy and scalability. |
| ConvNeXt V2 | January 2023 | Incorporates a fully convolutional MAE framework and a Global Response Normalization (GRN) layer, boosting performance across multiple benchmarks. |
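
Several rows above (Xception, the MobileNet family) rely on depthwise separable convolutions. The sketch below contrasts a standard convolution with its depthwise-plus-pointwise factorization; the channel counts and kernel size are arbitrary examples, not values from any of the papers.

```python
import torch
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch, k = 64, 128, 3

# Standard convolution: one k x k filter bank that mixes all input channels.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

# Depthwise separable convolution: a per-channel spatial filter (groups=in_ch)
# followed by a 1x1 pointwise convolution that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)

x = torch.randn(1, in_ch, 32, 32)
assert standard(x).shape == separable(x).shape
print(param_count(standard), param_count(separable))  # far fewer parameters in the separable version
```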

Object Detection

| Paper | Date | Description |
| --- | --- | --- |
| SSD | December 2015 | Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map. |
| Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
| Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples (see the sketch after this table). |
| DETR | May 2020 | A novel object detection model that treats object detection as a set prediction problem, eliminating the need for hand-designed components. |
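
The Focal Loss row above down-weights easy examples. Below is a minimal binary focal loss sketch, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), computed from raw logits; the alpha and gamma defaults follow values commonly quoted from the paper, and the toy logits are placeholders.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), computed from raw logits."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([4.0, -3.0, 0.2])    # two easy predictions, one hard one
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets))  # easy examples contribute almost nothing
```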

Region-based Convolutional Neural Networks

| Paper | Date | Description |
| --- | --- | --- |
| RCNN | November 2013 | Uses selective search for region proposals, CNNs for feature extraction, and an SVM for classification, followed by box-offset regression. |
| Fast RCNN | April 2015 | Processes the entire image through a CNN and employs RoI Pooling to extract feature vectors from RoIs, followed by classification and bounding-box regression. |
| Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features. |
| Mask RCNN | March 2017 | Extends Faster R-CNN to solve instance segmentation tasks by adding a branch for predicting an object mask in parallel with the existing branch. |
| Cascade RCNN | December 2017 | Proposes a multi-stage approach where detectors are trained with progressively higher IoU thresholds, improving selectivity against false positives. |

Document AI

| Paper | Date | Description |
| --- | --- | --- |
| Table Net | January 2020 | An end-to-end deep learning model designed for both table detection and structure recognition. |
| Donut | November 2021 | An OCR-free encoder-decoder Transformer model. The encoder takes in images; the decoder takes in prompts and encoded images to generate the required text. |
| DiT | March 2022 | An image Transformer pre-trained (self-supervised) on document images. |
| UDoP | December 2022 | Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation. |
| DocLLM | January 2024 | A lightweight extension to traditional LLMs that focuses on reasoning over visual documents by incorporating textual semantics and spatial layout without expensive image encoders. |

Layout Transformers

| Paper | Date | Description |
| --- | --- | --- |
| Layout LM | December 2019 | Utilises BERT as the backbone and adds two new input embeddings: a 2-D position embedding and an image embedding (the latter only for downstream tasks). |
| LamBERT | February 2020 | Utilises RoBERTa as the backbone and adds layout embeddings along with relative bias. |
| Layout LM v2 | December 2020 | Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage, learning end-to-end cross-modal interaction. |
| Structural LM | May 2021 | Utilises BERT as the backbone and feeds text, 1D, and (2D cell-level) embeddings to the transformer model. |
| Doc Former | June 2021 | An encoder-only transformer with a CNN backbone for visual feature extraction; combines text, vision, and spatial features through a multi-modal self-attention layer. |
| LiLT | February 2022 | Introduced the bi-directional attention complementation mechanism (BiACM) to accomplish the cross-modal interaction of text and layout. |
| Layout LM V3 | April 2022 | A unified text-image multimodal Transformer that learns cross-modal representations, taking the concatenation of text embeddings and image embeddings as input. |
| ERNIE Layout | October 2022 | Reorganizes tokens using layout information, combines text and visual embeddings, and utilizes multi-modal transformers with spatial-aware disentangled attention. |

Generative Adversarial Networks

| Paper | Date | Description |
| --- | --- | --- |
| Generative Adversarial Networks | June 2014 | Introduces a framework where a generative and a discriminative model are trained simultaneously in a minimax game (see the sketch after this table). |
| Conditional Generative Adversarial Networks | November 2014 | A method for training GANs that enables generation based on specific conditions, by feeding them to both the generator and discriminator networks. |
| Deep Convolutional Generative Adversarial Networks | November 2015 | Demonstrates the ability of CNNs for unsupervised learning using specific architectural constraints. |
| Improved GAN | June 2016 | Presents a variety of new architectural features and training procedures that can be applied to the generative adversarial networks (GANs) framework. |
| Wasserstein Generative Adversarial Networks | January 2017 | An alternative GAN training algorithm that enhances learning stability and mitigates issues like mode collapse. |
| Cycle GAN | March 2017 | An approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples, by leveraging adversarial losses and cycle consistency constraints using two GANs. |
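
The first GAN row above describes a minimax game between a generator and a discriminator. Below is a compact sketch of one alternating training step on toy 1-D data, using the widely used non-saturating form of the generator loss; the tiny MLPs, learning rates, and data distribution are placeholders for illustration.

```python
import torch
import torch.nn as nn

latent_dim = 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 1) * 0.5 + 2.0  # toy "real" data drawn from N(2, 0.5)
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

# Discriminator step: label real samples 1 and generated samples 0.
fake = generator(torch.randn(64, latent_dim)).detach()
d_loss = bce(discriminator(real), ones) + bce(discriminator(fake), zeros)
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator output 1 on generated samples.
fake = generator(torch.randn(64, latent_dim))
g_loss = bce(discriminator(fake), ones)
g_opt.zero_grad()
g_loss.backward()
g_opt.step()

print(float(d_loss), float(g_loss))
```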

Tabular Deep Learning

| Paper | Date | Description |
| --- | --- | --- |
| Entity Embeddings | April 2016 | Maps categorical variables into continuous vector spaces through neural network learning, revealing intrinsic properties. |
| Wide and Deep Learning | June 2016 | Combines memorization of specific patterns with generalization of similarities. |
| Deep and Cross Network | August 2017 | Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. |
| Tab Transformer | December 2020 | Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings. |
| Tabular ResNet | June 2021 | An MLP with skip connections. |
| Feature Tokenizer Transformer | June 2021 | Transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. |

Datasets

| Paper | Date | Description |
| --- | --- | --- |
| Obelics | June 2023 | An open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages, 353M associated images, and 115B text tokens, extracted from CommonCrawl. |
| Dolma | January 2024 | An open corpus of three trillion tokens designed to support language model pretraining research. |
| Aya Dataset | February 2024 | A human-curated instruction-following dataset that spans 65 languages, created to bridge the language gap in datasets for natural language processing. |
| WebSight | March 2024 | A synthetic dataset consisting of 2M pairs of HTML code and corresponding screenshots, generated through LLMs, aimed at accelerating research on converting a screenshot into corresponding HTML. |
| Cosmopedia | March 2024 | Synthetic data containing over 30M files and 25B tokens, generated by Mixtral-8x7B-Instruct-v0.1, aimed at reproducing the training data for Phi-1.5. |
| Fine Web | May 2024 | A large-scale dataset for pretraining LLMs, consisting of 15T tokens, shown to produce better-performing models than other open pretraining datasets. |
| Cosmopedia v2 | July 2024 | An enhanced version of Cosmopedia, with a strong emphasis on prompt optimization. |
| Docmatix | July 2024 | A massive dataset for DocVQA containing 2.4M images, 9.5M question-answer pairs, and 1.3M PDF documents, generated by taking transcriptions from the PDFA OCR dataset and using a Phi-3-small model to generate Q/A pairs. |

Miscellaneous

| Paper | Date | Description |
| --- | --- | --- |
| ColD Fusion | December 2022 | A method enabling the benefits of multitask learning through distributed computation without data sharing, improving model performance. |
| Are Emergent Abilities of Large Language Models a Mirage? | April 2023 | Presents an alternative explanation for emergent abilities: they are created by the researcher's choice of metrics, not by fundamental changes in model family behaviour on specific tasks with scale. |
| Scaling Data-Constrained Language Models | May 2023 | Investigates scaling language models in data-constrained regimes. |
| An In-depth Look at Gemini's Language Abilities | December 2023 | A third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models, with reproducible code and fully transparent results. |
| DSPy | October 2023 | A programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computation graphs where LMs are invoked through declarative modules, optimizing their use through a structured framework of signatures, modules, and teleprompters to automate and enhance text transformation tasks. |
| Direct Preference Optimization | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes LLMs to align with human preferences without the need for reinforcement learning, by directly optimizing for the policy that best satisfies the preferences with a simple classification objective (see the sketch after this table). |
| RLHF Workflow | May 2024 | Provides a detailed recipe for online iterative RLHF and achieves state-of-the-art performance on various benchmarks using fully open-source datasets. |
| Monte Carlo Tree Self-refine | June 2024 | Integrates LLMs with Monte Carlo Tree Search to enhance performance in complex mathematical reasoning tasks, leveraging systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks. |
| Magpie | June 2024 | A self-synthesis method that extracts high-quality instruction data at scale by prompting an aligned LLM with left-side templates, generating 4M instructions and their corresponding responses. |
| Instruction Pre-Training | June 2024 | A framework that augments massive raw corpora with instruction-response pairs, enabling supervised multitask pretraining of LMs. |
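
The Direct Preference Optimization row above replaces the RL step of RLHF with a classification-style loss over preference pairs: L = -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]). Below is a minimal sketch of that loss computed from per-example sequence log-probabilities; the tensors are placeholders rather than outputs of a real model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """-log sigmoid(beta * ((log-ratio of chosen) - (log-ratio of rejected)))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Placeholder sequence log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.0, -20.5, -8.3, -15.0])
policy_rejected = torch.tensor([-14.0, -19.0, -11.2, -15.5])
ref_chosen = torch.tensor([-13.0, -20.0, -9.0, -15.2])
ref_rejected = torch.tensor([-13.5, -19.5, -10.8, -15.1])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```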

Literature Reviewed

Reading Lists


Reach out to Ritvik or Elvis if you have any questions.

If you are interested in contributing, feel free to open a PR.

Join our Discord