# Awesome-LLM-Large-Language-Models-Notes


## Known LLM models classified by year

A small introduction, the paper, and code for each model, where available.

| Year | Name | Paper | Info | Implementation |
| --- | --- | --- | --- | --- |
| 2017 | Transformer | Attention Is All You Need | The focus of the original research was on translation tasks. | TensorFlow + article |
| 2018 | GPT | Improving Language Understanding by Generative Pre-Training | The first pretrained Transformer model, used for fine-tuning on various NLP tasks; obtained state-of-the-art results. | |
| 2018 | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Another large pretrained model, this one designed to produce better representations of sentences. | PyTorch |
| 2019 | GPT-2 | Language Models are Unsupervised Multitask Learners | An improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns. | |
| 2019 | DistilBERT | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | A distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT's performance. | |
| 2019 | BART | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | A large pretrained model using the same encoder-decoder architecture as the original Transformer. | |
| 2019 | T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | A large pretrained model using the same encoder-decoder architecture as the original Transformer. | |
| 2019 | ALBERT | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | | |
| 2019 | RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Approach | | |
| 2019 | CTRL | CTRL: A Conditional Transformer Language Model for Controllable Generation | | |
| 2019 | Transformer XL | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Adopts a recurrence mechanism over past states, coupled with relative positional encoding, to capture longer-term dependencies. | |
| 2019 | DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation | Trained on 147M conversation-like exchanges extracted from Reddit comment chains spanning 2005 through 2017. | PyTorch |
| 2019 | ERNIE | ERNIE: Enhanced Language Representation with Informative Entities | Uses both large-scale textual corpora and knowledge graphs (KGs) to train an enhanced language representation model that exploits lexical, syntactic, and knowledge information simultaneously. | |
| 2020 | GPT-3 | Language Models are Few-Shot Learners | An even bigger version of GPT-2 that performs well on a variety of tasks without the need for fine-tuning (zero-shot learning). | |
| 2020 | ELECTRA | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | | |
| 2020 | mBART | Multilingual Denoising Pre-training for Neural Machine Translation | | |
| 2021 | CLIP (Contrastive Language-Image Pre-Training) | Learning Transferable Visual Models From Natural Language Supervision | A neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3 (see the sketch below this table). | PyTorch |
| 2021 | DALL-E | Zero-Shot Text-to-Image Generation | | PyTorch |
| 2021 | Gopher | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | | |
| 2021 | Decision Transformer | Decision Transformer: Reinforcement Learning via Sequence Modeling | An architecture that casts the problem of RL as conditional sequence modeling. | PyTorch |
| 2021 | GLaM (Generalist Language Model) | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | A family of language models that uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost compared to dense variants. | |
| 2022 | ChatGPT/InstructGPT | Training language models to follow instructions with human feedback | A language model that is much better at following user intentions than GPT-3. It is fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to achieve conversational dialogue, using training data written by humans so that responses sound human-like. | |
| 2022 | Chinchilla | Training Compute-Optimal Large Language Models | Uses the same compute budget as Gopher but with 70B parameters and roughly 4x more data. | |
| 2022 | LaMDA | LaMDA: Language Models for Dialog Applications | A family of Transformer-based neural language models specialized for dialog. | |
| 2022 | DQ-BART | DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization | Proposes to jointly distill and quantize the model, transferring knowledge from the full-precision teacher model to a quantized and distilled low-precision student model. | |
| 2022 | Flamingo | Flamingo: a Visual Language Model for Few-Shot Learning | A family of Visual Language Models (VLMs) that can be rapidly adapted to novel tasks using only a handful of annotated examples. | |
| 2022 | Gato | A Generalist Agent | Inspired by progress in large-scale language modeling, Gato is a single generalist agent beyond the realm of text outputs: a multi-modal, multi-task, multi-embodiment generalist policy. | |
| 2022 | GODEL | GODEL: Large-Scale Pre-Training for Goal-Directed Dialog | In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting the model to a wide range of downstream dialog tasks that require information external to the current conversation (e.g. a database or document) to produce good responses. | PyTorch |
| 2023 | GPT-4 | GPT-4 Technical Report | The model now accepts multimodal inputs: images and text. | |
| 2023 | BloombergGPT | BloombergGPT: A Large Language Model for Finance | An LLM specialised in the financial domain, trained on Bloomberg's extensive data sources. | |
| 2023 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). | |
| 2023 | Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | | PyTorch #1, PyTorch #2 |
| 2023 | Claude | Claude | Claude can analyze 75k words (100k tokens); GPT-4 can handle just 32.7k tokens. | |
| 2023 | SelfCheckGPT | SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models | A simple sampling-based approach that can be used to fact-check black-box models in a zero-resource fashion, i.e. without an external database. | |
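
Many of the models above can be loaded through the Hugging Face `transformers` library. As an illustration of the zero-shot behaviour described in the CLIP row, here is a minimal sketch (assuming `transformers`, `torch` and `Pillow` are installed; the `openai/clip-vit-base-patch32` checkpoint is just an example) that scores a few candidate captions against an image:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint (an assumption; any CLIP checkpoint works the same way).
checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Placeholder image; in practice load your own, e.g. Image.open("photo.jpg").
image = Image.new("RGB", (224, 224), color="gray")

# Candidate captions: CLIP ranks them against the image with no task-specific training.
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image has shape (num_images, num_captions); softmax turns it into scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, score in zip(captions, probs[0].tolist()):
    print(f"{score:.3f}  {caption}")
```

The caption with the highest probability is CLIP's zero-shot guess; no fine-tuning is involved.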

## LLM models classified by size

| Name | Size (# Parameters) | Training Tokens | Training data |
| --- | --- | --- | --- |
| GLaM | 1.2T | | |
| Gopher | 280B | 300B | |
| BLOOM | 176B | | ROOTS corpus |
| GPT-3 | 175B | | |
| LaMDA | 137B | 168B | 1.56T words of public dialog data and web text |
| Chinchilla | 70B | 1.4T | |
| Llama 2 | 7B, 13B and 70B | | |
| BloombergGPT | 50B | 363B + 345B | |
| Falcon 40B | 40B | 1T | 1,000B tokens of RefinedWeb |
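
A quick back-of-the-envelope check on the Gopher and Chinchilla rows above, using the common rule-of-thumb approximation that training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D training tokens (the parameter and token counts are taken straight from the table; the formula is only an estimate):

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate of training compute: C ≈ 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

# Parameter and token counts from the table above.
gopher = approx_training_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = approx_training_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"Chinchilla sees {1.4e12 / 300e9:.1f}x more training tokens")
```

Both land in the same order of magnitude, consistent with the "same compute budget as Gopher but with 70B parameters and roughly 4x more data" description in the table of models by year.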

## LLM models classified by name


## Classification by architecture

| Architecture | Models | Tasks |
| --- | --- | --- |
| Encoder-only, aka auto-encoding Transformer models | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
| Decoder-only, aka auto-regressive (or causal) Transformer models | CTRL, GPT, GPT-2, Transformer XL | Text generation given a prompt |
| Encoder-Decoder, aka sequence-to-sequence Transformer models | BART, T5, Marian, mBART | Summarisation, translation, generative question answering |
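
As a quick illustration of the three families, here is a minimal sketch using the Hugging Face `pipeline` API (the specific checkpoints are just examples; any model of the corresponding family works): an encoder-only model for masked-token prediction, a decoder-only model for text generation, and an encoder-decoder model for summarisation.

```python
from transformers import pipeline

# Encoder-only (auto-encoding): masked-token prediction with DistilBERT.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill_mask("Large language models are trained on [MASK] amounts of text.")[0])

# Decoder-only (auto-regressive): text generation given a prompt with GPT-2.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=30)[0]["generated_text"])

# Encoder-decoder (sequence-to-sequence): summarisation with BART.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "The Transformer was introduced in 2017 for machine translation. Since then, "
    "encoder-only, decoder-only and encoder-decoder variants have been pretrained "
    "at ever larger scale and adapted to a wide range of NLP tasks."
)
print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])
```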

## What's so special about HuggingFace?


## Must-Read Papers on LLM


## 🚀 Recap | Bring me up to speed!


## Blog articles


## Know their limitations!


## Start-up funding landscape


## Available tutorials


## A small note on the notebook rendering


## How to run the notebook in Google Colab


## Implementations from scratch