<p align="center"> This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online. </p>
Note Quick legend on available resource types:
⭐ - open source project, usually a GitHub repository with its number of stars
📙 - resource you can read, usually a blog post or a paper
🗂️ - a collection of additional resources
🔱 - non-open source tool, framework or paid service
🎥️ - a resource you can watch
🎙️ - a resource you can listen to
<p align="center"><b>Table of Contents</b></p>
Note Section keywords: paper summaries, compendium, awesome list
Compendiums and awesome lists on the topic of NLP:
- 🗂️ The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
- ⭐ Awesome NLP by keon [GitHub, 16528 stars]
- ⭐ Speech and Natural Language Processing Awesome List by edobashira [GitHub, 2189 stars]
- ⭐ Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
- ⭐ Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
- 🗂️ Brainsources for #NLP enthusiasts by Philip Vollet
- ⭐ Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
- 🗂️ NLP articles by Devopedia
NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
- ⭐ 100 Must-Read NLP Papers [GitHub, 3732 stars]
- ⭐ NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
- ⭐ Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
- ⭐ Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
- ⭐ Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
- ⭐ A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
- ⭐ A Paper List for Style Transfer in Text [GitHub, 1609 stars]
- 🎥 Video recordings index for papers
Conference Summaries
- ⭐ NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
- 📙 ICLR 2020 Trends
- 📙 SpacyIRL 2019 Conference in Overview
- 📙 Paper Digest - Conferences and Papers in Overview
NLP Progress and NLP Tasks:
- ⭐ NLP Progress by sebastianruder [GitHub, 22568 stars]
- ⭐ NLP Tasks by Kyubyong [GitHub, 3017 stars]
NLP Datasets:
- ⭐ NLP Datasets by niderhoff [GitHub, 5741 stars]
- ⭐ Datasets by Huggingface [GitHub, 19096 stars]
- 🗂️ Big Bad NLP Database
- ⭐ UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- ⭐ MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]
Word and Sentence embeddings:
- ⭐ Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
- ⭐ Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
- ⭐ Awesome BERT by Jiakui [GitHub, 1846 stars]
Notebooks, Scripts and Repositories
- ⭐ The Super Duper NLP Repo [Website, 2020]
Non-English resources and Compendiums
- ⭐ NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
- ⭐ Indic NLP Catalog [GitHub, 552 stars]
- ⭐ Pre-trained language models for Vietnamese [GitHub, 653 stars]
- ⭐ Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
- ⭐ Indic NLP Library [GitHub, 550 stars]
- ⭐ AI4Bharat-IndicNLP Portal
- ⭐ ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
- ⭐ zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
- ⭐ TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
- ⭐ KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
- ⭐ Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in the Persian language [GitHub, 73 stars]
- ⭐ nlp-greek - Greek language sources [GitHub, 5 stars]
- ⭐ Awesome NLP Resources for Hungarian [GitHub, 221 stars]
Pre-trained NLP models
- ⭐ List of pre-trained NLP models [GitHub, 170 stars]
- ⭐ Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
- ⭐ Spanish Language Models and resources [GitHub, 251 stars]
NLP History
General
- ⭐ Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
- 📙 A Review of the Neural History of Natural Language Processing [Blog, October 2018]
2020 Year in Review
- 📙 Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- 📙 ML and NLP Research Highlights of 2020 [Blog, January 2021]
🔙 Back to the Table of Contents
NLP-only podcasts
- 🎙️ NLP Highlights [Years: 2017 - now, Status: active]
- 🎙️ The NLP Zone Episodes [Years: 2021 - now, Status: active]
Many NLP episodes
- 🎙️ TWIML AI [Years: 2016 - now, Status: active]
- 🎙️ Practical AI [Years: 2018 - now, Status: active]
- 🎙️ The Data Exchange [Years: 2019 - now, Status: active]
- 🎙️ Gradient Dissent [Years: 2020 - now, Status: active]
- 🎙️ Machine Learning Street Talk [Years: 2020 - now, Status: active]
- 🎙️ DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]
Some NLP episodes
- 🎙️ The Super Data Science Podcast [Years: 2016 - now, Status: active]
- 🎙️ Data Hack Radio [Years: 2018 - now, Status: active]
- 🎙️ AI Game Changers [Years: 2020, Status: active]
- 🎙️ The Analytics Show [Years: 2019 - now, Status: active]
- 📙 NLP News by Sebastian Ruder
- 📙 This Week in NLP by Robert Dale
- 📙 Papers with Code
- 📙 The Batch by deeplearning.ai
- 📙 Paper Digest by PaperDigest
- 📙 NLP Cypher by QuantumStat
- 🎥 NLP Zurich [YouTube Recordings]
- 🎥 Hacking-Machine-Learning [YouTube Recordings]
- 🎥 NY-NLP (New York)
- 🎥 Yannic Kilcher
- 🎥 HuggingFace
- 🎥 Kaggle Reading Group
- 🎥 Rasa Paper Reading
- 🎥 Stanford CS224N: NLP with Deep Learning
- 🎥 NLPxing
- 🎥 ML Explained - A.I. Socratic Circles - AISC
- 🎥 Deeplearning.ai
- 🎥 Machine Learning Street Talk
🔙 Back to the Table of Contents
General NLU
- ⭐ GLUE - General Language Understanding Evaluation (GLUE) benchmark
- ⭐ SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- ⭐ decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- ⭐ dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
- ⭐ DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- ⭐ Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]
Summarization
- ⭐ WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
- ⭐ WikiLingua - A Multilingual Abstractive Summarization Dataset
Question Answering
- ⭐ SQuAD - Stanford Question Answering Dataset (SQuAD)
- ⭐ XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- ⭐ GrailQA - Strongly Generalizable Question Answering (GrailQA)
- ⭐ CSQA - Complex Sequential Question Answering
Multilingual and Non-English Benchmarks
- 📙 XTREME - Massively Multilingual Multi-task Benchmark
- ⭐ GLUECoS - A benchmark for code-switched NLP
- ⭐ IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- ⭐ LinCE - Linguistic Code-Switching Evaluation Benchmark
- ⭐ Russian SuperGlue - Russian SuperGlue Benchmark
Bio, Law, and other scientific domains
- ⭐ BLURB - Biomedical Language Understanding and Reasoning Benchmark
- ⭐ BLUE - Biomedical Language Understanding Evaluation benchmark
- ⭐ LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
Transformer Efficiency
- ⭐ Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]
Speech Processing
- ⭐ SUPERB - Speech processing Universal PERformance Benchmark
Other
- ⭐ CodeXGLUE - A benchmark dataset for code intelligence
- ⭐ CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- ⭐ MultiNLI - Multi-Genre Natural Language Inference corpus
- ⭐ iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic
🔙 Back to the Table of Contents
General
- 📙 A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- 📙 Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]
Embeddings
Repositories
- ⭐ Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
- ⭐ sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
- ⭐ wikipedia2vec [GitHub, 935 stars]
- ⭐ StarSpace [GitHub, 3938 stars]
- ⭐ fastText [GitHub, 25871 stars]
Blogs
- 📙 Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- 📙 An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- 📙 Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- 📙 The Illustrated Word2vec by Jay Alammar [Blog, 2019]
Cross-lingual Word and Sentence Embeddings
- ⭐ vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Byte Pair Encoding
- ⭐ bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
- ⭐ subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
- ⭐ python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]
Transformer-based Architectures
General
- 📙 The Transformer Family by Lilian Weng [Blog, 2020]
- 📙 Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- 📙 Attention? Attention! by Lilian Weng [Blog, 2018]
- 📙 the transformer … “explained”? [Blog, 2019]
- 🎥️ Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- 📙 Attention Is Off By One [July, 2023]
- 🎥️ Understanding and Applying Self-Attention for NLP [Talk, 2018]
- 📙 The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- 📙 Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- 📙 A Survey of Transformers [Paper, June 2021]
Transformer
- 📙 The Annotated Transformer by Harvard NLP [Blog, 2018]
- 📙 The Illustrated Transformer by Jay Alammar [Blog, 2018]
- 📙 Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- 📙 Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- 📙 Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- 📙 Reformer: The Efficient Transformer [Blog, 2020]
- 📙 Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- 📙 TRANSFORMERS FROM SCRATCH [Blog, 2019]
- 📙 Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- ⭐ Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
- 📙 Transformers from Scratch [Blog, Oct 2021]
BERT
- 📙 A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- 📙 The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- 📙 Understanding searches better than ever before [Blog, 2019]
- 📙 Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- ⭐ SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
- ⭐ BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
- ⭐ Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
- ⭐ CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
- 📙 When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- ⭐ BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]
Other Transformer Variants
T5
- 📙 T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- 📙 T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- ⭐ multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 1245 stars]
BigBird
- 📙 Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]
Reformer / Linformer / Longformer / Performers
- 🎥️ Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- 🎥️ Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- 🎥️ Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- 🎥️ Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- ⭐ performer-pytorch - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 1084 stars]
Switch Transformer
- 📙 Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]
GPT-family
General
- 📙 The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- 📙 The Annotated GPT-2 by Aman Arora
- 📙 OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- 📙 How to generate text by Patrick von Platen [Blog, 2020]
GPT-3
Learning Resources
- 📙 Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- 📙 GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- 📙 GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
- 📙 GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- 📙 Is it possible for language models to achieve language understanding? by Christopher Potts
Applications
- ⭐ Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
- 🗂️ GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- 🗂️ GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
- 🔱 OpenAI API - API Demo to use OpenAI GPT for commercial applications
Open-source Efforts
- 📙 GPT-Neo - in-progress GPT-3 open source replication [HuggingFace Hub]
- ⭐ GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- 📙 Effectively using GPT-J with few-shot learning [Blog, July 2021]
Other
- 📙 What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
- 📙 Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- 📙 Turing NLG by Microsoft
- 📙 Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
- ⭐ ELECTRA [GitHub, 2326 stars]
- ⭐ Performer - implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 1084 stars]
Distillation, Pruning and Quantization
Reading Material
- 📙 Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- 📙 Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]
Tools
- ⭐ Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
- ⭐ XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]
Automated Summarization
- 📙 PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- ⭐ CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
- ⭐ XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
- ⭐ SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
- ⭐ PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
- ⭐ summarus - Models for automatic abstractive summarization [GitHub, 170 stars]
Knowledge Graphs and NLP
- 📙 Fusing Knowledge into Language Model [Presentation, Oct 2021]
Note Section keywords: best practices, MLOps
🔙 Back to the Table of Contents
Best Practices for building NLP Projects
- 🎥 In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- 🎥 EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
- 📙 Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- 📙 How to Structure and Manage NLP Projects [Blog, May 2021]
- 📙 Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- 🎥 Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
- 📙 Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]
MLOps for NLP
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:
- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally also with zero-downtime deploys like Blue/Green, Canary deploys etc.
- Data and Model Observability - track data drift, model accuracy drift etc.
Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
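Two of the processes above, data versioning and experiment tracking, can be illustrated with a minimal sketch that uses only the Python standard library. It is not how DVC, mlflow or any other tool listed in this section works internally; the function names (`dataset_fingerprint`, `log_run`, `best_run`) and the JSON-lines log format are assumptions made for this example only.

```python
import hashlib
import json
import time
from pathlib import Path


def dataset_fingerprint(path: Path) -> str:
    """Return a SHA-256 hash of the file contents, used here as a data version id."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large corpora do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_file: Path, data_path: Path, params: dict, metrics: dict) -> dict:
    """Append one experiment record (data version, params, metrics) as a JSON line."""
    record = {
        "timestamp": time.time(),
        "data_version": dataset_fingerprint(data_path),
        "params": params,
        "metrics": metrics,
    }
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_file: Path, metric: str) -> dict:
    """Read all logged runs back and return the one with the highest value of `metric`."""
    with log_file.open(encoding="utf-8") as handle:
        runs = [json.loads(line) for line in handle if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Because every record stores the hash of the exact training file, a run can later be matched to the data it was trained on, which is the property that dedicated data versioning and experiment tracking tools provide at scale.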
MLOps Compilations & Awesome Lists
- ⭐ awesome-mlops [GitHub, 12526 stars]
- ⭐ best-of-ml-python [GitHub, 16309 stars]
- 🗂️ MLOps.Toys - a curated list of MLOps projects
Reading Material
- 📙 Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- 📙 Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- 📙 MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
- 📙 Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- 📙 State of MLOps 2021 by Valohai [Blog, August 2021]
- 📙 The MLOps Stack by Valohai [Blog, October 2020]
- 📙 Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- 📙 The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- 📙 MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
- 📙 What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
- 📙 DataRobot Challenger Models - MLOps Champion/Challenger Models
- 📙 State of MLOps Blog by Dr. Ori Cohen
- 📙 MLOps Ecosystem Overview [Blog, 2021]
Learning Material
- 🗂 MLOps course by Made With ML
- 🗂 GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub
- 🗂 ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models
MLOps Communities
- The MLOps Community - blogs, slack group, newsletter and more all about MLOps
Data Versioning
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Experiment Tracking
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- ⭐ Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Model Registry
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- ⭐ ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
Automated Testing and Behavioral Testing
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 76 stars]
- ⭐ Great Expectations - Write tests for your data [GitHub, 9874 stars]
- ⭐ Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]
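As a rough illustration of what a behavioral test looks like, the sketch below checks invariance: swapping a person's name in a sentence should not change the predicted label. It does not use the API of CheckList or any other library listed above; `toy_sentiment_model` is a hypothetical keyword-based stand-in for a real classifier, and `invariance_failures` is a helper name invented for this example.

```python
from typing import Callable, Iterable, List, Tuple


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for a real classifier: keyword-based sentiment."""
    negative_words = {"bad", "terrible", "awful"}
    tokens = text.lower().replace(".", " ").split()
    return "negative" if any(token in negative_words for token in tokens) else "positive"


def invariance_failures(
    model: Callable[[str], str],
    template: str,
    fillers: Iterable[str],
) -> List[Tuple[str, str]]:
    """Fill `template` with each filler and return the (text, prediction) pairs
    whose prediction differs from the prediction for the first filler."""
    texts = [template.format(name=filler) for filler in fillers]
    if not texts:
        return []
    expected = model(texts[0])
    failures = []
    for text in texts[1:]:
        prediction = model(text)
        if prediction != expected:
            failures.append((text, prediction))
    return failures


# The label should depend on "terrible", not on who is speaking.
failures = invariance_failures(
    toy_sentiment_model,
    "{name} said the service was terrible.",
    ["Anna", "Mehmet", "Li"],
)
```

An empty `failures` list means the model passed this particular invariance check; a non-empty list points at the inputs where the prediction changed with the name, which is one way a bias can show up.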
Model Deployment and Serving
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Amazon SageMaker [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 NLP Cloud - Production-ready NLP API [Paid Service]
- 🔱 Saturn Cloud [Paid Service]
- 🔱 SELDON - machine learning deployment for enterprise [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- ⭐ TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 4174 stars]
- ⭐ Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- ⭐ KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
- 🔱 TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 Cortex - containers as a service on AWS [Paid Service]
- 🔱 Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- ⭐ End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
- ⭐ NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- ⭐ Dagster - data orchestrator for machine learning [Free and Open Source]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 5525 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Model Debugging
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
- ⭐ Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]
Model Accuracy Prediction
- ⭐ WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]
Data and Model Observability
General
- ⭐ Arize AI - embedding drift monitoring for NLP models
- ⭐ Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
- ⭐ whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
- ⭐ Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- 🔱 Cortex - containers as a service on AWS [Paid Service]
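To make the idea of embedding drift more concrete, here is a minimal sketch of one simple heuristic: compare the centroid of embeddings collected at training time with the centroid of embeddings seen in production. This is an assumption chosen for illustration, not the method used by any of the tools listed above, and the function names are hypothetical. What counts as "too much" drift is left open, since any alerting threshold depends on the model and the data.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def centroid_drift(
    baseline: Sequence[Sequence[float]],
    production: Sequence[Sequence[float]],
) -> float:
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(baseline), centroid(production))


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
baseline_embeddings = [[0.0, 0.0], [2.0, 0.0]]
production_embeddings = [[1.0, 3.0], [1.0, 5.0]]
drift = centroid_drift(baseline_embeddings, production_embeddings)
```

A value near zero suggests production inputs still resemble the training data on average, while a growing value over time is a signal to inspect recent inputs more closely.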
Model Centric
- 🔱 Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- 🔱 Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- ⭐ Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- 🔱 Fiddler - ML Model Performance Management Tool [Paid Service]
- 🔱 Hydrosphere - open-source platform for managing ML models [Paid Service]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- 🔱 Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
Data Centric
- 🔱 Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- 🔱 acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- 🔱 Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
- 🔱 datakin - end-to-end, real-time data lineage solution [Paid Service]
- 🔱 Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- 🔱 SODA - data monitoring, testing and validation [Paid Service]
Feature Stores
- 🔱 Tecton - enterprise feature store for machine learning [Paid Service]
- ⭐ FEAST - open source feature store for machine learning Website [GitHub, 5525 stars]
- 🔱 Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
Metadata Management
- ⭐ ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
MLOps Frameworks
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
- ⭐ Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
- ⭐ ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
- 🔱 Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- ⭐ Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
- 🔱 Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]
Transformer-based Architectures
🔙 Back to the Table of Contents
General
- 📙 Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- 📙 Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
- ⭐ Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
- 🎥️ Practical NLP for the Real World [Presentation, 2019]
- 🎥️ From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]
Multi-GPU Transformers
- ⭐ Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]
Training Transformers Effectively
- ⭐ Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]
Embeddings as a Service
- ⭐ embedding-as-service [GitHub, 204 stars]
- ⭐ Bert-as-service [GitHub, 12399 stars]
NLP Recipes Industrial Applications:
- ⭐ NLP Recipes by microsoft [GitHub, 6367 stars]
- ⭐ NLP with Python by susanli2016 [GitHub, 2721 stars]
- ⭐ Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]
NLP Applications in Bio, Finance, Legal and other industries
- ⭐ Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
- ⭐ Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
- ⭐ FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 197 stars]
- ⭐ LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
- ⭐ NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- ⭐ Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
- ⭐ BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]
Note Section keywords: speech recognition
🔙 Back to the Table of Contents
General Speech Recognition
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ⭐ DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
- 📙 Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- ⭐ kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
- ⭐ awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
- ⭐ ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
- 📙 HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
Text to Speech / Speech Generation
- ⭐ FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 857 stars]
- ⭐ TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
- 🔱 NotebookLM - Google Gemini powered personal assistant / podcast generator
Speech to Text
- ⭐ whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
- ⭐ vibe - GUI tool to work with whisper, multilingual and cuda support included [GitHub, 931 stars]
Datasets
- ⭐ VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]
Note Section keywords: topic modeling
🔙 Back to the Table of Contents
Blogs
- 📙 Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- 📙 A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]
Frameworks for Topic Modeling
Repositories
- ⭐ Top2Vec [GitHub, 2924 stars]
- ⭐ Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]
- ⭐ Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
- ⭐ TopicNet - A high-level interface for BigARTM library [GitHub, 140 stars]
- ⭐ BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
- ⭐ OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
- ⭐ Contextualized Topic Models [GitHub, 1196 stars]
- ⭐ GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]
Note Section keywords: keyword extraction
🔙 Back to the Table of Contents
Text Rank
- ⭐ PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
- ⭐ textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]
RAKE - Rapid Automatic Keyword Extraction
- ⭐ rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
- ⭐ yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
- ⭐ RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]
Other Approaches
- ⭐ flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
- ⭐ BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
- ⭐ keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
- ⭐ KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]
Further Reading
- 📙 Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
- 📙 How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]
Note Section keywords: ethics, responsible NLP
🔙 Back to the Table of Contents
NLP and ML Interpretability
NLP-centric
- Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
- ⭐ ecco - Tools to visualize and explore NLP language models [GitHub, 1974 stars]
- ⭐ NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 243 stars]
- ⭐ transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1278 stars]
- ⭐ Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1400 stars]
- ⭐ LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1346 stars]
General
- ⭐ Language Interpretability Tool (LIT) [GitHub, 3474 stars]
- ⭐ WhatLies - Toolkit to help visualise what lies in word embeddings [GitHub, 468 stars]
- ⭐ Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
- ⭐ InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
- ⭐ thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
- ⭐ Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Ethics, Bias, and Equality in NLP
- 📙 Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
- 🎥️ Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
- 🗂️ Ethics in NLP - resources from ACL's Ethics in NLP track
- 🗂️ The Institute for Ethical AI & Machine Learning
- 📙 Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
- ⭐ Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 77 stars]
- ⭐ nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 65 stars]
- 🗂️ bias-in-nlp - list of papers related to bias in NLP [GitHub, 9 stars]
Adversarial Attacks for NLP
- 📙 Privacy Considerations in Large Language Models [Blog, Dec 2020]
- ⭐ DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 73 stars]
- ⭐ Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 62 stars]
Hate Speech Analysis
- ⭐ HateXplain - BERT for detecting abusive language [GitHub, 187 stars]
Note Section keywords: frameworks
🔙 Back to the Table of Contents
General Purpose
- ⭐ spaCy by Explosion AI [GitHub, 29784 stars]
- ⭐ flair by Zalando [GitHub, 13855 stars]
- ⭐ AllenNLP by AI2 [GitHub, 11740 stars]
- ⭐ stanza (formerly Stanford NLP) [GitHub, 7253 stars]
- ⭐ spaCy stanza [GitHub, 723 stars]
- ⭐ nltk [GitHub, 13489 stars]
- ⭐ gensim - framework for topic modeling [GitHub, 15597 stars]
- ⭐ pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
- ⭐ NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
- ⭐ FARM [GitHub, 1734 stars]
- ⭐ gobbli by RTI International [GitHub, 275 stars]
- ⭐ headliner - training and deployment of seq2seq models [GitHub, 229 stars]
- ⭐ SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
- ⭐ DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
- ⭐ TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
- ⭐ textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
- ⭐ AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
- ⭐ textacy - NLP, before and after spaCy [GitHub, 2209 stars]
- ⭐ texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
- ⭐ jiant - jiant is an NLP toolkit [GitHub, 1639 stars]
Data Augmentation
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5791 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- ⭐ faker - Python package that generates fake data for you [GitHub, 17648 stars]
- ⭐ textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
- ⭐ Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
- ⭐ AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
- ⭐ TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]
Adversarial NLP Attacks & Behavioral Testing
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
Transformer-oriented
- ⭐ transformers by HuggingFace [GitHub, 132974 stars]
- ⭐ Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
- ⭐ haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]
Dialogue Systems and Speech
- ⭐ DeepPavlov by MIPT [GitHub, 6676 stars]
- ⭐ ParlAI by FAIR [GitHub, 10477 stars]
- ⭐ rasa - Framework for Conversational Agents [GitHub, 18726 stars]
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ⭐ ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
- ⭐ SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Word/Sentence-embeddings oriented
- ⭐ MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
- ⭐ vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Social Media Oriented
- ⭐ Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]
Phonetics
- ⭐ DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]
Morphology
- ⭐ LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
- ⭐ Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
- ⭐ simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]
Multi-lingual tools
- ⭐ polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
- ⭐ trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]
Distributed NLP / Multi-GPU NLP
- ⭐ Spark NLP [GitHub, 3826 stars]
- ⭐ Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]
Machine Translation
- ⭐ COMET - A Neural Framework for MT Evaluation [GitHub, 493 stars]
- ⭐ marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
- ⭐ argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
- ⭐ Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
- ⭐ dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
- ⭐ CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]
Entity and String Matching
- ⭐ PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
- ⭐ fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
- ⭐ jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
- ⭐ textdistance - Compute distance between sequences [GitHub, 3367 stars]
- ⭐ DeepMatcher - Python package for entity and text matching using deep learning [GitHub, 555 stars]
- ⭐ RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
- ⭐ Machamp - A Generalized Entity Matching Benchmark [GitHub, 17 stars]
Discourse Analysis
- ⭐ ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]
PII scrubbing
- ⭐ scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]
Hashtag Segmentation
- ⭐ hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]
Books Analysis / Literary Analysis / Semantic Search
- ⭐ booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
- ⭐ bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
- ⭐ SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]
Non-English oriented
Japanese
- ⭐ fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
- ⭐ SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
- ⭐ Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
- ⭐ jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
- ⭐ Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
- ⭐ kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
- ⭐ nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
- ⭐ KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
- ⭐ Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
- ⭐ Juman++ - a Morphological Analyzer Toolkit [GitHub, 376 stars]
- ⭐ RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
- ⭐ toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]
Thai
- ⭐ AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
- ⭐ ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]
Chinese
- ⭐ Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]
Ukrainian
- ⭐ recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)
Other
- ⭐ textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
- ⭐ Kashgari - Transfer Learning with focus on Chinese [GitHub, 2389 stars]
- ⭐ Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
- ⭐ PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]
Text Data Labelling & Classification
- ⭐ Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
- ⭐ Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
- ⭐ Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- 🔱 Prodigy - annotation tool powered by active learning [Paid Service]
Note Section keywords: learn NLP
🔙 Back to the Table of Contents
General
- 📙 Learn NLP the practical way [Blog, Nov. 2019]
- 📙 Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
- 📙 Choosing the right course for a Practical NLP Engineer
- 📙 12 Best Natural Language Processing Courses & Tutorials to Learn Online
- ⭐ Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 912 stars]
- 🎥️ Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
- 🎥️ ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spaCy and apply it to NLP
Courses
- 🎥️ CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
- 📙 NLP Course | For You - Great and interactive course on NLP
- 📙 Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
- 📙 Transformer models for NLP by HuggingFace
- 🎥️ Stanford NLP Seminar - slides from the Stanford NLP course
Books
- 📙 Natural Language Processing with Transformers - [Book, February 2022]
- 📙 Applied Natural Language Processing in the Enterprise - [Book, May 2021]
- 📙 Practical Natural Language Processing - [Book, June 2020]
- 📙 Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- 📙 Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
- 📙 Top NLP Books to Read 2020 - Blog post by Raymond Cheng [Blog, Sep 2020]
Tutorials
- ⭐ nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1366 stars]
- ⭐ nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14110 stars]
- ⭐ Hands-On NLTK Tutorial [GitHub, 540 stars]
- ⭐ Modern Practical Natural Language Processing [GitHub, 266 stars]
- ⭐ Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 9176 stars]
- 🗂️ CalmCode Tutorials - Set of Python Data Science Tutorials
- r/LanguageTechnology - NLP Reddit forum
🔙 Back to the Table of Contents
Tokenization
- ⭐ tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
- ⭐ SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
- ⭐ SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
- ⭐ NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5791 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Reading Material and Tutorials
- ⭐ A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
- 📙 A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- 📙 Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]
Named Entity Recognition (NER)
- ⭐ Datasets for Entity Recognition [GitHub, 1497 stars]
- ⭐ Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 338 stars]
- ⭐ Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 212 stars]
- ⭐ Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 385 stars]
Relation Extraction
- ⭐ tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
- ⭐ tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
- ⭐ tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 64 stars]
- ⭐ Re-TACRED - Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]
Coreference Resolution
- ⭐ NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2850 stars]
- ⭐ coref - BERT and SpanBERT for Coreference Resolution [GitHub, 443 stars]
Sentiment Analysis
- ⭐ Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 517 stars]
- ⭐ Awesome Sentiment Analysis by xiamx [GitHub, 913 stars]
Domain Adaptation
- ⭐ Neural Adaptation in Natural Language Processing - curated list [GitHub, 261 stars]
Low Resource NLP
- ⭐ CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 597 stars]
Spell Correction / Error Correction
- ⭐ Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
- ⭐ NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
- ⭐ SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
- 📙 Speller100 by Microsoft [Blog, Feb 2021]
- ⭐ JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
- ⭐ pycorrector - spell correction for Chinese [GitHub, 5517 stars]
- ⭐ contractions - Fixes contractions such as `you're` to `you are` [GitHub, 308 stars]
- 📙 Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]
Style Transfer for NLP
- ⭐ Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
- ⭐ StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]
Automata Theory for NLP
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
Obscene words detection
- ⭐ LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]
Reddit Analysis
- ⭐ Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]
Skill Detection
- ⭐ SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]
Reinforcement Learning for NLP
- ⭐ nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]
AutoML / AutoNLP
- ⭐ AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
- ⭐ TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
- ⭐ Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
- ⭐ HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
- 🔱 AutoML Natural Language - Google's paid AutoML NLP service
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- ⭐ FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
- ⭐ Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]
OCR - Optical Character Recognition
- 📙 A framework for designing document processing solutions [Blog, June 2022]
Document AI
Text Generation
- ⭐ keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 445 stars]
- 📙 Controllable Neural Text Generation [Blog, Jan 2021]
- ⭐ BARTScore - Evaluating Generated Text as Text Generation [GitHub, 317 stars]
Title / Headlines Generation
- ⭐ TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 76 stars]
NLP research reproducibility
- 📙 A Systematic Review of Reproducibility Research in Natural Language Processing [Paper, March 2021]
License CC0
Attributions
Resources
- All linked resources belong to original authors
Icons
- Akropolis by parkjisun from the Noun Project
- Book of Ester by Gilad Sotil from the Noun Project
- quill by Juan Pablo Bravo from the Noun Project
- acting by Flatart from the Noun Project
- olympic by supalerk laipawat from the Noun Project
- aristocracy by Eucalyp from the Noun Project
- Horn by Eucalyp from the Noun Project
- temple by Eucalyp from the Noun Project
- constellation by Eucalyp from the Noun Project
- ancient greek round pattern by Olena Panasovska from the Noun Project
- Harp by Vectors Point from the Noun Project
- Atlas by parkjisun from the Noun Project
- Parthenon by Eucalyp from the Noun Project
- papyrus by IconMark from the Noun Project
- papyrus by Smalllike from the Noun Project
- pegasus by Saeful Muslim from the Noun Project
Fonts
<h3 align="center">The Pandect Series also includes</h3> <p align="middle"> <a href="https://github.com/ivan-bilan/The-Microservices-Pandect"> <img src="https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/microservices_pandect_promo.png" width="390" /> </a> <a href="https://github.com/ivan-bilan/The-Engineering-Manager-Pandect"> <img src="https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/em_pandect_promo.png" width="370" /> </a> </p>