Awesome

Awesome-LLM-Large-Language-Models-Notes

Known LLM models classified by year

Small introduction, paper, code etc.

Year	Name	Paper	Info	Implementation
2017	Transformer	Attention is All you Need	The focus of the original research was on translation tasks.	TensorFlow + article
2018	GPT	Improving Language Understanding by Generative Pre-Training	The first pretrained Transformer model; it was fine-tuned on various NLP tasks and obtained state-of-the-art results
2018	BERT	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	Another large pretrained model, this one designed to produce better summaries of sentences	PyTorch
2019	GPT-2	Language Models are Unsupervised Multitask Learners	An improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
2019	DistilBERT - Distilled BERT	DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter	A distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
2019	BART	BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension	Large pretrained models using the same architecture as the original Transformer model.
2019	T5	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	Large pretrained models using the same architecture as the original Transformer model.
2019	ALBERT	ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
2019	RoBERTa - A Robustly Optimized BERT Pretraining Approach	RoBERTa: A Robustly Optimized BERT Pretraining Approach
2019	CTRL	CTRL: A Conditional Transformer Language Model for Controllable Generation
2019	Transformer XL	Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context	Adopts a recurrence methodology over past state coupled with relative positional encoding enabling longer term dependencies
2019	DialoGPT	DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation	Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017	PyTorch
2019	ERNIE	ERNIE: Enhanced Language Representation with Informative Entities	In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously.
2020	GPT-3	Language Models are Few-Shot Learners	An even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)
2020	ELECTRA	ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
2020	mBART	Multilingual Denoising Pre-training for Neural Machine Translation
2021	CLIP (Contrastive Language-Image Pre-Training)	Learning Transferable Visual Models From Natural Language Supervision	CLIP is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.	PyTorch
2021	DALL-E	Zero-Shot Text-to-Image Generation		PyTorch
2021	Gopher	Scaling Language Models: Methods, Analysis & Insights from Training Gopher
2021	Decision Transformer	Decision Transformer: Reinforcement Learning via Sequence Modeling	An architecture that casts the problem of RL as conditional sequence modeling.	PyTorch
2021	GLaM (Generalist Language Model)	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
2022	ChatGPT/InstructGPT	Training language models to follow instructions with human feedback	This trained language model is much better at following user intentions than GPT-3. The model is optimised (fine-tuned) using Reinforcement Learning from Human Feedback (RLHF) to achieve conversational dialogue. It was trained on a variety of data written by people so that its responses sound human-like.
2022	Chinchilla	Training Compute-Optimal Large Language Models	Uses the same compute budget as Gopher but with 70B parameters and 4x more data.
2022	LaMDA - Language Models for Dialog Applications	LaMDA	It is a family of Transformer-based neural language models specialized for dialog
2022	DQ-BART	DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization	Proposes to jointly distill and quantize the model, with knowledge transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
2022	Flamingo	Flamingo: a Visual Language Model for Few-Shot Learning	Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability.
2022	Gato	A Generalist Agent	Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy.
2022	GODEL: Large-Scale Pre-Training for Goal-Directed Dialog	GODEL: Large-Scale Pre-Training for Goal-Directed Dialog	In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses.	PyTorch
2023	GPT-4	GPT-4 Technical Report	The model now accepts multimodal inputs: images and text
2023	BloombergGPT	BloombergGPT: A Large Language Model for Finance	An LLM specialised in the financial domain, trained on Bloomberg's extensive data sources
2023	BLOOM	BLOOM: A 176B-Parameter Open-Access Multilingual Language Model	BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total)
2023	Llama 2	Llama 2: Open Foundation and Fine-Tuned Chat Models		PyTorch #1 PyTorch #2
2023	Claude	Claude	Claude can analyze 100k tokens (about 75k words) in a single context, whereas GPT-4 handles at most 32.7k tokens.
2023	SelfCheckGPT	SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models	A simple sampling-based approach that can be used to fact-check black-box models in a zero-resource fashion, i.e. without an external database.
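
The sampling-based idea in the SelfCheckGPT row can be illustrated with a toy sketch. The working assumption is that content a model is confident about tends to reappear across independently sampled answers, while hallucinated content does not. The scoring below uses plain word overlap as a simplified stand-in, not the scoring methods from the paper; `inconsistency_score` and the sample strings are hypothetical, and no model is actually called.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def inconsistency_score(sentence, samples):
    """Return a score in [0, 1]; higher means less support from the samples.

    For each sampled answer, measure the fraction of the sentence's words
    that also appear in that sample, then average and invert.
    """
    words = tokens(sentence)
    if not words or not samples:
        return 1.0
    support = [len(words & tokens(s)) / len(words) for s in samples]
    return 1.0 - sum(support) / len(support)

# Hypothetical answers sampled from the same black-box model for one prompt.
samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France has Paris as its capital city.",
]

supported = inconsistency_score("Paris is the capital of France.", samples)
unsupported = inconsistency_score("Lyon was founded by Martians.", samples)
```

Here `supported` comes out much lower than `unsupported`, because the first sentence shares most of its words with every sample and the second shares none. No external database is consulted, which is the "zero-resource" part of the description.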

LLM models classified by size

Name	Size (# Parameters)	Training Tokens	Training data
GLaM	1.2T
Gopher	280B	300B
BLOOM	176B		ROOTS corpus
GPT-3	175B
LaMDA	137B	168B	1.56T words of public dialog data and web text
Chinchilla	70B	1.4T
Llama 2	7B, 13B and 70B
BloombergGPT	50B	363B+345B
Falcon40B	40B	1T	1,000B tokens of RefinedWeb

M = million | B = billion | T = trillion
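
As a rough illustration of how to read the table, the sketch below converts the M/B/T shorthand into numbers and computes training tokens per parameter for the rows that list both values. The helper names are hypothetical, and the `6 * N * D` compute estimate is a common rule of thumb that is assumed here rather than taken from this document.

```python
UNITS = {"M": 1e6, "B": 1e9, "T": 1e12}

def parse_count(text):
    """Convert a string such as '280B' or '1.4T' to a float."""
    return float(text[:-1]) * UNITS[text[-1]]

# (parameters, training tokens) as listed in the table above.
models = {
    "Gopher": ("280B", "300B"),
    "Chinchilla": ("70B", "1.4T"),
    "Falcon40B": ("40B", "1T"),
}

def tokens_per_parameter(name):
    """Training tokens divided by parameter count."""
    params, tokens = (parse_count(x) for x in models[name])
    return tokens / params

def approx_train_flops(name):
    """Assumption: the C ~ 6 * N * D rule of thumb for training compute."""
    params, tokens = (parse_count(x) for x in models[name])
    return 6 * params * tokens

for name in models:
    print(name, round(tokens_per_parameter(name), 2), f"{approx_train_flops(name):.2e}")
```

Under these assumptions Chinchilla works out to 20 tokens per parameter against roughly 1.07 for Gopher, while the two estimated compute budgets land within about 20% of each other.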

LLM models classified by name

ALBERT | Alpaca
BART | BERT | Big Bird | BLOOM
Chinchilla | CLIP | CTRL | ChatGPT | Claude
DALL-E | DALL-E-2 | Decision Transformers | DialoGPT | DistilBERT | DQ-BART
ELECTRA | ERNIE
Flamingo | Falcon40B
Gato | Gopher | GLaM | GLIDE | GPT | GPT-2 | GPT-3 | GPT-4 | GPT-Neo | GODEL | GPT-J
Imagen | InstructGPT
Jurassic-1
LaMDA | Llama 2
mBART | Megatron | Minerva | MT-NLG
OPT
PaLM | Pegasus
RoBERTa
SeeKer | Swin Transformer | Switch | SelfCheckGPT
Transformer | T5 | Trajectory Transformers | Transformer XL | Turing-NLG
ViT
Wu Dao 2.0
XLM-RoBERTa | XLNet

Classification by architecture

Architecture	Models	Tasks
Encoder-only, also called auto-encoding Transformer models	ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa	Sentence classification, named entity recognition, extractive question answering
Decoder-only, aka auto-regressive (or causal) Transformer models	CTRL, GPT, GPT-2, Transformer XL	Text generation given a prompt
Encoder-Decoder, aka sequence-to-sequence Transformer models	BART, T5, Marian, mBART	Summarisation, translation, generative question answering
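
One way to picture the difference between the encoder-only and decoder-only rows is the self-attention mask. The sketch below is a minimal, framework-free illustration: `attention_mask` is a hypothetical helper, and real implementations build the same pattern as tensors inside the model.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    mask[i][j] == 1 means position i may attend to position j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Encoder-only style: every token sees the whole sentence.
encoder_mask = attention_mask(4, causal=False)

# Decoder-only style: each token sees only itself and earlier tokens.
decoder_mask = attention_mask(4, causal=True)

for row in decoder_mask:
    print(row)
```

The full mask suits tasks that classify or tag an existing sentence, while the lower-triangular mask is what allows text to be generated one token at a time. Encoder-decoder models combine the two: a fully visible encoder feeds a causally masked decoder.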

What's so special about HuggingFace?

HuggingFace is a popular NLP library, and it also offers an easy way to deploy models via its Inference API. When you build a model using the HuggingFace library, you can train it and then upload it to the Model Hub. Read more about this here.
List of notebooks
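
As a hedged sketch of what calling a hosted model can look like, the snippet below builds (but does not send) an HTTP request using only the standard library. The URL pattern, the `{"inputs": ...}` payload shape, and the model id are assumptions based on the commonly documented hosted Inference API and may have changed; `build_inference_request` is a hypothetical helper and the token is a placeholder.

```python
import json
import urllib.request

# Assumption: the hosted Inference API URL pattern; check the current docs.
API_ROOT = "https://api-inference.huggingface.co/models/"

def build_inference_request(model_id, text, token):
    """Build (but do not send) a POST request for a hosted model."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_ROOT + model_id,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_inference_request(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example Hub model id
    "I love this library!",
    "hf_xxx",  # placeholder; substitute a real access token
)
# To send it: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes the example safe to run offline; the commented last line shows where the network call would go.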

Must-Read Papers on LLM

🚀Recap | Bring me up to speed!

Catching up on the weird world of LLMs

Blog articles

Know their limitations!

Start-up funding landscape

NLP Startup Funding in 2022

Available tutorials

A small note on the notebook rendering

Two notebooks are available:
- One with coloured boxes, located outside the folder GitHub_MD_rendering
- One in black-and-white, located inside the folder GitHub_MD_rendering

How to run the notebook in Google Colab

The easiest option is to clone this repository.
Then navigate to Google Colab and open the notebook directly from there.
You can also write it back to GitHub, provided Colab has been granted permission. The whole procedure is automated.
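
For a notebook in a public GitHub repository, Colab can also be opened through a direct link. The sketch below assembles such a link; `colab_url` is a hypothetical helper, and the account, repository, and notebook names are placeholders rather than this repository's real paths.

```python
def colab_url(user, repo, notebook_path, branch="main"):
    """Build the Colab link for a notebook stored in a public GitHub repo."""
    return (
        "https://colab.research.google.com/github/"
        f"{user}/{repo}/blob/{branch}/{notebook_path}"
    )

# Placeholder names; substitute the real account, repository and notebook.
url = colab_url("some-user", "some-repo", "notebooks/example.ipynb")
print(url)
```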

Implementations from scratch