Introduction to Transformers: an NLP Perspective
Transformers, proposed by Vaswani et al. (2017), have come to dominate empirical machine learning models in natural language processing (NLP). Here we introduce the basic concepts of Transformers and present the key techniques behind recent advances in these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques are evolving in ways we have never seen before, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on the concepts that are most helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that have shaped this field, thereby yielding some insight into the strengths and limitations of these models.
You can find the PDF here: Introduction to Transformers: an NLP Perspective.
This work is intended for students and researchers in NLP who have a basic knowledge of linear algebra and probability. Although some familiarity with machine learning (in particular, neural networks and deep learning) is advantageous, readers can still gain a general understanding of Transformers by skipping the sections or sub-sections that require specialized background knowledge.
Here is a Chinese version of this webpage: https://github.com/NiuTrans/Introduction-to-Transformers/blob/main/README-zh.md
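For readers who prefer code to equations, the short NumPy sketch below illustrates scaled dot-product attention, the operation at the core of the standard Transformer architecture discussed in this work. The function names, shapes, and toy data are ours and purely illustrative; the code is not taken from any of the systems listed below.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: [n, d], k: [m, d], v: [m, d_v]; returns one output vector per query."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # query-key similarities, scaled by sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block disallowed positions before softmax
    return softmax(scores) @ v                       # attention-weighted sum of values

# Toy usage: 4 positions, model width 8.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)          # shape (4, 8)
```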
Selected Papers
For reference, we select a few papers for each topic. The body of research is so vast that a complete list of related work is impossible. Rather than attempt a comprehensive survey of every research area, we provide a very short list of papers to help readers quickly grasp the key issues.
Background Knowledge for Learning Transformers
- Neural Machine Translation by Jointly Learning to Align and Translate ICLR 2015 paper Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
- Sequence to Sequence Learning with Neural Networks NeurIPS 2014 paper Ilya Sutskever, Oriol Vinyals, Quoc V. Le.
- Distributed Representations of Words and Phrases and their Compositionality NeurIPS 2013 paper Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean.
- A Neural Probabilistic Language Model NeurIPS 2000 paper Yoshua Bengio, Réjean Ducharme, Pascal Vincent.
- Layer Normalization NeurIPS 2016 paper Lei Jimmy Ba, Jamie Ryan Kiros, Geoffrey E. Hinton.
- Deep Residual Learning for Image Recognition CVPR 2016 paper Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
- Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors arXiv 2012 paper Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov.
- Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation arXiv 2016 paper Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean.
- Adam: A Method for Stochastic Optimization ICLR 2015 paper Diederik P. Kingma, Jimmy Ba.
Positional Encoding
- Self-Attention with Relative Position Representations NAACL 2018 paper Peter Shaw, Jakob Uszkoreit, Ashish Vaswani.
- Transformer-XL: Attentive Language Models beyond a Fixed-Length Context ACL 2019 paper Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, Ruslan Salakhutdinov.
- DeBERTa: Decoding-Enhanced BERT with Disentangled Attention ICLR 2021 paper Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
- RoFormer: Enhanced Transformer with Rotary Position Embedding arXiv 2021 paper Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. (A minimal code sketch of this method is given after this list.)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation ICLR 2022 paper Ofir Press, Noah Smith, Mike Lewis.
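To give one concrete example from this list, the sketch below illustrates the rotary position embedding of RoFormer in NumPy: each adjacent pair of feature dimensions in a query or key vector is rotated by an angle proportional to its position, so the resulting dot products depend on relative offsets. This is a simplified, interleaved-pair illustration under our own naming; practical implementations differ in memory layout and caching.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embedding to x of shape [seq_len, dim] (dim must be even)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)   # one rotation frequency per feature pair
    angles = np.outer(np.arange(seq_len), freqs)     # angle = position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # interleaved feature pairs (x1, x2)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the dot product, so the score between
# positions i and j depends on the relative offset i - j, not on absolute positions.
q = np.random.default_rng(0).normal(size=(5, 8))
q_rot = rotary_embedding(q)
```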
Syntax-aware Attention & Probing
- What does BERT Look at? An Analysis of BERT's Attention ACL 2019 paper Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning.
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned ACL 2019 paper Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov.
- A Structural Probe for Finding Syntax in Word Representations ACL 2019 paper John Hewitt, Christopher D. Manning.
- What do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations ICLR 2019 paper Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, Ellie Pavlick.
- Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision PNAS 2020 paper Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, Omer Levy.
Sparse Attention
- Longformer: The Long-Document Transformer arXiv 2020 paper Iz Beltagy, Matthew E. Peters, Arman Cohan.
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting AAAI 2021 paper Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang.
- Generating Long Sequences with Sparse Transformers arXiv 2019 paper Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever.
- Big Bird: Transformers for Longer Sequences NeurIPS 2020 paper Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
- Image Transformer ICML 2018 paper Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran.
- Efficient Content-Based Sparse Attention with Routing Transformers TACL 2021 paper Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier.
- Adaptively Sparse Transformers ACL 2019 paper Gonçalo M. Correia, Vlad Niculae, André F. T. Martins.
- ETC: Encoding Long and Structured Inputs in Transformers EMNLP 2020 paper Joshua Ainslie, Santiago Ontañón, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang.
- Efficiently Modeling Long Sequences with Structured State Spaces ICLR 2022 paper Albert Gu, Karan Goel, Christopher Ré.
- Sparse Sinkhorn Attention ICML 2020 paper Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan.
- Adaptive Attention Span in Transformers ACL 2019 paper Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin.
- Efficient Transformers: A Survey arXiv 2020 paper Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler.
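Many of the works above let each position attend to only a small subset of the other positions. As a minimal illustration, the sketch below builds the simplest such pattern, a fixed sliding window (the local-attention component of models such as Longformer, listed above); the window size is arbitrary, and efficient implementations never materialize the full score matrix as this toy mask would.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to position j only if |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=8, window=2)
# Each row has at most 2 * window + 1 True entries, so the attention cost grows
# linearly with sequence length once the masked-out scores are simply not computed.
```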
Alternatives to Self-Attention
- Pay Less Attention with Lightweight and Dynamic Convolutions ICLR 2019 paper Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli.
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention ICML 2020 paper Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret.
- Rethinking Attention with Performers ICLR 2021 paper Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller.
- Random Feature Attention ICLR 2021 paper Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, Lingpeng Kong.
- Synthesizer: Rethinking Self-Attention for Transformer Models ICML 2021 paper Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng.
- On the Parameterization and Initialization of Diagonal State Space Models NeurIPS 2022 paper Albert Gu, Karan Goel, Ankit Gupta, Christopher Ré.
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models ICLR 2023 paper Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré.
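As one concrete example of replacing softmax attention, the sketch below follows the linear attention idea of Katharopoulos et al. (2020), listed above: a positive feature map (here elu(x) + 1) stands in for the softmax kernel, so the key-value product can be computed once and reused for every query, reducing the cost from quadratic to linear in sequence length. Only the non-causal case is shown, and the NumPy code is an illustration rather than a faithful reimplementation.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))       # elu(x) + 1, always positive

def linear_attention(q, k, v):
    """Non-causal linear attention: attention computed via a feature map instead of softmax."""
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                                     # [d, d_v], computed once for all queries
    z = q @ k.sum(axis=0)                            # per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(q, k, v)                      # shape (6, 4)
```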
Architecture Improvement
- Universal Transformers ICLR 2019 paper Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser.
- ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation ACL 2022 paper Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu, Xuebo Liu, Min Zhang.
- Multiscale Vision Transformers ICCV 2021 paper Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer.
- Lite Transformer with Long-Short Range Attention ICLR 2020 paper Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han.
- The Evolved Transformer ICML 2019 paper David So, Quoc Le, Chen Liang.
Deep Models
- Learning Deep Transformer Models for Machine Translation ACL 2019 paper Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao.
- Understanding the Difficulty of Training Transformers EMNLP 2020 paper Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Jiawei Han.
- Lipschitz Constrained Parameter Initialization for Deep Transformers ACL 2020 paper Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong, Jingyi Zhang.
- Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention EMNLP 2019 paper Biao Zhang, Ivan Titov, Rico Sennrich.
- DeepNet: Scaling Transformers to 1,000 Layers arXiv 2022 paper Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei.
- Learning Light-Weight Translation Models from Deep Transformer AAAI 2021 paper Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu.
Wide Models
- PaLM: Scaling Language Modeling with Pathways JMLR 2023 paper Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity JMLR 2022 paper William Fedus, Barret Zoph, Noam Shazeer.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding ICLR 2021 paper Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen.
- Scaling Laws for Neural Language Models arXiv 2020 paper Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei.
Recurrent & Memory & Retrieval-Augmented Models
- Compressive Transformers for Long-Range Sequence Modelling ICLR 2020 paper Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap.
- Accelerating Neural Transformer via an Average Attention Network ACL 2018 paper Biao Zhang, Deyi Xiong, Jinsong Su.
- The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation ACL 2018 paper Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, Macduff Hughes.
- ∞-Former: Infinite Memory Transformer ACL 2022 paper Pedro Henrique Martins, Zita Marinho, André Martins.
- Retrieval Augmented Language Model Pre-training ICML 2020 paper Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Mingwei Chang.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks NeurIPS 2020 paper Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
- Memorizing Transformers ICLR 2022 paper Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy.
Quantization
- A White Paper on Neural Network Quantization arXiv 2021 paper Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort.
- Training with Quantization Noise for Extreme Model Compression ICLR 2021 paper Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, Armand Joulin.
- Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model ICML 2019 paper Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, Vikram Saletore.
- Fully Quantized Transformer for Machine Translation EMNLP 2020 paper Gabriele Prato, Ella Charlaix, Mehdi Rezagholizadeh.
- Towards Fully 8-Bit Integer Inference for the Transformer Model IJCAI 2020 paper Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu.
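To make the basic idea behind these papers concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization of a weight matrix. Real systems add per-channel scales, calibration data, and often quantization-aware training, so this should be read as an illustration of the storage format only; the names are ours.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q stored as int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
max_error = np.abs(w - dequantize(q, scale)).max()   # small reconstruction error
```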
Parameter & Activation Sharing
- Reformer: The Efficient Transformer ICLR 2020 paper Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya.
- Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser ACL 2015 paper Long Duong, Trevor Cohn, Steven Bird, Paul Cook.
- Sharing Attention Weights for Fast Transformer IJCAI 2019 paper Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, Tongran Liu.
- Fast Transformer Decoding: One Write-Head is All You Need arXiv 2019 paper Noam Shazeer.
- Parameter Sharing between Dependency Parsers for Related Languages EMNLP 2018 paper Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, Anders Søgaard.
Compression
- Sequence-Level Knowledge Distillation EMNLP 2016 paper Yoon Kim, Alexander M. Rush.
- Relational Knowledge Distillation CVPR 2019 paper Wonpyo Park, Dongju Kim, Yan Lu, Minsu Cho.
- Improved Knowledge Distillation via Teacher Assistant AAAI 2020 paper Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, Hassan Ghasemzadeh.
- BPE-Dropout: Simple and Effective Subword Regularization ACL 2020 paper Ivan Provilkov, Dmitrii Emelianenko, Elena Voita.
- Block Pruning for Faster Transformers EMNLP 2021 paper François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush.
- Structured Pruning of Large Language Models EMNLP 2020 paper Ziheng Wang, Jeremy Wohlwend, Tao Lei.
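Several of the papers above build on knowledge distillation. The sketch below shows the basic logit-matching objective (temperature-softened cross-entropy between teacher and student predictions); the sequence-level and relational variants listed above replace or extend this loss, so treat the code as a generic illustration under our own naming.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()            # T^2 keeps gradients on a comparable scale

rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(3, 10)), rng.normal(size=(3, 10)))
```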
Theoretical Analysis
- Theoretical Limitations of Self-Attention in Neural Sequence Models TACL 2020 paper Michael Hahn.
- Are Transformers Universal Approximators of Sequence-to-Sequence Functions? ICLR 2020 paper Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar.
- On the Turing Completeness of Modern Neural Network Architectures ICLR 2019 paper Jorge Pérez, Javier Marinkovic, Pablo Barceló.
- A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity ICLR 2023 paper Hongkang Li, Meng Wang, Sijia Liu, Pin-Yu Chen.
- Saturated Transformers are Constant-Depth Threshold Circuits TACL 2022 paper William Merrill, Ashish Sabharwal, Noah A. Smith.
- Transformers as Recognizers of Formal Languages: A Survey on Expressivity arXiv 2023 paper Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin.
- Low-Rank Bottleneck in Multi-head Attention Models ICML 2020 paper Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar.
Pre-trained Transformers for Language Understanding
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding NAACL 2019 paper Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.
- SpanBERT: Improving Pre-training by Representing and Predicting Spans TACL 2020 paper Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy.
- ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations ICLR 2020 paper Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach arXiv 2019 paper Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
- XLNet: Generalized Autoregressive Pretraining for Language Understanding NeurIPS 2019 paper Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- Unsupervised Cross-Lingual Representation Learning at Scale ACL 2020 paper Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov.
Pre-trained Transformers for Language Generation
- Language Models are Few-Shot Learners NeurIPS 2020 paper Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei.
- LLaMA: Open and Efficient Foundation Language Models arXiv 2023 paper Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model arXiv 2022 paper Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, et al.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer JMLR 2020 paper Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism arXiv 2019 paper Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro.
- MASS: Masked Sequence to Sequence Pre-training for Language Generation ICML 2019 paper Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension ACL 2020 paper Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer.
Other Applications
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows ICCV 2021 paper Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale ICLR 2021 paper Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
- Conformer: Convolution-Augmented Transformer for Speech Recognition INTERSPEECH 2020 paper Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang.
- Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations NeurIPS 2020 paper Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli.
- Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language ICML 2022 paper Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks NeurIPS 2019 paper Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee.
Useful Resources
- [system] Fairseq
- [system] Tensor2Tensor
- [system] Huggingface Transformers
- [system] FasterTransformer
- [system] BERT
- [system] LLaMA
- [tutorial] The Illustrated Transformer
- [tutorial] The Annotated Transformer
- [tutorial] Attention Mechanisms and Transformers (Dive into Deep Learning)
- [tutorial] Transformers and Multi-Head Attention (UvA Deep Learning Tutorials)
Acknowledgements
We would like to thank Yongyu Mu, Chenglong Wang, Bei Li, Weiqiao Shan, Yuchun Fan, Kaiyan Chang, Tong Zheng, and Huiwen Bao for their contributions to this work.
For any questions or comments, please email us at xiaotong [at] mail.neu.edu.cn or heshengmo [at] foxmail.com.