Awesome

Awesome Transformers

A curated list of awesome transformer models.

If you want to contribute to this list, send a pull request or reach out to me on twitter: @abacaj. Let's make this list useful.

There are a number of models available that are not entirely open source (non-commercial, etc), this repository should serve to also make you aware of that. Tracking the original source/company of the model will help.

I would also eventually like to add model use cases. So it is easier for others to find the right one to fine-tune.

Format:

Model name: short description, usually from paper
- Model link (usually huggingface or github)
- Paper link
- Source as company or group
- Model license

Encoder (autoencoder) models
- ALBERT
- BERT
- DistilBERT
- DeBERTaV3
- Electra
- RoBERTa
Decoder (autoregressive) models
- BioGPT
- CodeGen
- LLaMa
- GPT
- GPT-2
- GPT-J
- GPT-NEO
- GPT-NEOX
- NeMo Megatron-GPT
- OPT
- BLOOM
- GLM
- YaLM
Encoder+decoder (seq2seq) models
- T5
- FLAN-T5
- Code-T5
- Bart
- Pegasus
- MT5
- UL2
- FLAN-UL2
- EdgeFormer
Multimodal models
- Donut
- LayoutLMv3
- TrOCR
- CLIP
- Unified-IO
Vision models
- DiT
- DETR
- EfficientFormer
Audio models
- Whisper
- VALL-E
Recommendation models
- P5
Grounded Situation Recognition models
- GSRTR
- CoFormer

Encoder models

ALBERT: "A Lite" version of BERT
- Model
- Paper
- Google
- Apache v2
BERT: Bidirectional Encoder Representations from Transformers <a name="bert"></a>
- Model
- Paper
- Google
- Apache v2
DistilBERT: Distilled version of BERT smaller, faster, cheaper and lighter <a name="distilbert"></a>
- Model
- Paper
- HuggingFace
- Apache v2
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing <a name="debertav3"></a>
- Model
- Paper
- Microsoft
- MIT
Electra: Pre-training Text Encoders as Discriminators Rather Than Generators <a name="electra"></a>
- Model
- Paper
- Google
- Apache v2
RoBERTa: Robustly Optimized BERT Pretraining Approach <a name="roberta"></a>
- Model
- Paper
- Facebook
- MIT

Decoder models

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Model
- Paper
- Microsoft
- MIT
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis <a name="codegen"></a>
- Model
- Paper
- Salesforce
- BSD 3-Clause
LLaMa: Open and Efficient Foundation Language Models <a name="llama"></a>
- Model
- Paper
- Facebook
- Requires approval, non-commercial
GPT: Improving Language Understanding by Generative Pre-Training <a name="gpt"></a>
- Model
- Paper
- OpenAI
- MIT
GPT-2: Language Models are Unsupervised Multitask Learners <a name="gpt-2"></a>
- Model
- Paper
- OpenAI
- MIT
GPT-J: A 6 Billion Parameter Autoregressive Language Model <a name="gpt-j"></a>
- Model
- Paper
- EleutherAI
- Apache v2
GPT-NEO: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow <a name="gpt-neo"></a>
- Model
- Paper
- EleutherAI
- MIT
GPT-NEOX-20B: An Open-Source Autoregressive Language Model <a name="gpt-neox"></a>
- Model
- Paper
- EleutherAI
- Apache v2
NeMo Megatron-GPT: Megatron-GPT 20B is a transformer-based language model. <a name="nemo"></a>
- Model
- Paper
- NVidia
- CC BY 4.0
OPT: Open Pre-trained Transformer Language Models <a name="opt"></a>
- Model
- Paper
- Facebook
- Requires approval, non-commercial
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model <a name="bloom"></a>
- Model
- Paper
- BigScience
- OpenRAIL, use-based restrictions
GLM: An Open Bilingual Pre-Trained Model <a name="glm"></a>
- Model
- Paper
- Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University
- Custom license, see restrictions
YaLM: Pretrained language model with 100B parameters <a name="yalm"></a>
- Model
- Paper
- Yandex
- Apache v2

Encoder+decoder (seq2seq) models

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <a name="t5"></a>
- Model
- Paper
- Google
- Apache v2
FLAN-T5: Scaling Instruction-Finetuned Language Models <a name="flan-t5"></a>
- Model
- Paper
- Google
- Apache v2
Code-T5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation <a name="code-t5"></a>
- Model
- Paper
- Salesforce
- BSD 3-Clause
Bart: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension <a name="bart"></a>
- Model
- Paper
- Facebook
- Apache v2
Pegasus: Pre-training with Extracted Gap-sentences for Abstractive Summarization <a name="pegasus"></a>
- Model
- Paper
- Google
- Apache v2
MT5: A Massively Multilingual Pre-trained Text-to-Text Transformer <a name="mt5"></a>
- Model
- Paper
- Google
- Apache v2
UL2: Unifying Language Learning Paradigms <a name="ul2"></a>
- Model
- Paper
- Google
- Apache v2
FLAN-UL2: A New Open Source Flan 20B with UL2 <a name="flanul2"></a>
- Model
- Paper
- Google
- Apache v2
EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation <a name="edgeformer"></a>
- Model
- Paper
- Microsoft
- MIT

Multimodal models

Donut: OCR-free Document Understanding Transformer
- Model
- Paper
- ClovaAI
- MIT
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking <a name="layoutlmv3"></a>
- Model
- Paper
- Microsoft
- CC BY-NC-SA 4.0 (non-commercial)
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models <a name="trocr"></a>
- Model
- Paper
- Microsoft
- Inherits MIT license
CLIP: Learning Transferable Visual Models From Natural Language Supervision <a name="clip"></a>
- Model
- Paper
- OpenAI
- MIT
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks <a name="unifiedio"></a>
- Model
- Paper
- allenai
- Apache v2

Vision models

DiT: Self-supervised Pre-training for Document Image Transformer
- Model
- Paper
- Microsoft
- Inherits MIT license
DETR: End-to-End Object Detection with Transformers <a name="detr"></a>
- Model
- Paper
- Facebook
- Apache v2
EfficientFormer: Vision Transformers at MobileNet Speed <a name="efficientformer"></a>
- Model
- Paper
- Snap
- Apache v2

Audio models

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- Model
- Paper
- OpenAI
- MIT
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers <a name="valle"></a>
- Model (unofficial)
  - MIT but has a dependency on a CC-BY-NC library
- Model (unofficial)
  - Apache v2
- Paper
- Microsoft

Recommendation models

Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
- Model
- Paper
- Rutgers
- Apache v2

Grounded Situation Recognition models

Grounded Situation Recognition with Transformers
- Model
- Paper
- POSTECH
- Apache v2
Collaborative Transformers for Grounded Situation Recognition <a name="coformer"></a>
- Model
- Paper
- POSTECH
- Apache v2