Awesome

Natural Language Processing

This repository collects basic concepts of Natural Language Processing, reputable textbooks and blogs, popular papers, and more.

This is also the Natural Language Processing part of Machine Learning Resources created by a group of people including jindongwang.

Contributors are welcome to work together and make it better!

Textbooks and Lectures

Mathematical and Statistical Foundations

Linear Algebra
- 18.06 MIT (Gilbert Strang) [pdf] [video]
Matrix Analysis
Convex Optimization
- EE364A Stanford (Stephen Boyd) [pdf] [website]
- Introductory Lectures on Convex Programming (Yu. Nesterov) [pdf]

Machine Learning

Deep Learning

Natural Language Processing

Foundations of Statistical Natural Language Processing - Chris Manning
Speech and Language Processing - Daniel Jurafsky and James H. Martin
Statistical Learning Methods (统计学习方法) - Hang Li (李航)
Advanced Natural Language Processing - MIT
CS 224n Natural Language Processing with Deep Learning - Stanford
Deep Learning for NLP at Oxford with Deepmind - Oxford
11-747 NN4NLP
11-737 Multilingual NLP
Some Knowledge about Machine Learning
A list of datasets

Models and Applications

Probabilistic Graphical Model
- Hidden Markov Model
- Conditional Random Fields
Topic Model
- Latent Dirichlet Allocation (paper)
Deep Learning Model
- Long Short-Term Memory (LSTM), Sepp Hochreiter, 1997
  - Interpretation Omer Levy, UWashington, 2018
- Recurrent Neural Network - Seq2Seq (TensorFlow tutorial) - Machine Translation (TensorFlow implementation)
- Convolutional Neural Network
- Attention Model
- Overview(Chinese)
- Generative Adversarial Network (GAN)
- Transformer
- Training Tips
- Bidirectional Encoder Representations from Transformers (BERT), Jacob Devlin, Google, 2018

Blog and Tutorials

Topics and Tasks

Areas are categorized by the tracks of ACL 2018, ACL 2020, and EMNLP 2020.

Summarization

Task
- Summarization
- Opinion Summarization
- Evaluation
Model
- Extractive
- Generative
- Hybrid
Dataset
- XSum, EMNLP2018 [paper]
- CNN/DailyMail
- NEWSROOM
- Multi-News
- Gigaword
- arXiv
- PubMed
- BIGPATENT
- WikiHow
- Reddit TIFU (long, short)
- AESLC
- BillSum

Embedding

Model
- Word2Vec
Pre-trained Embedding
- GloVe
- word2vec
- FastText
Contextual Word Embedding
- ELMo
- GPT
- BERT
- XLNet
- BART
- T5

Sentiment Analysis and Argument Mining

Named Entity Recognition

Tagging, Chunking

Task
- Word Segmentation
- Syntactic Parsing
Model
- Hidden Markov Model (HMM)
- Conditional Random Fields (CRFs)
- Finetuned Language Models

Syntax, Parsing

Task
- Constituency Parsing
- Dependency Parsing
- Visually Grounded Syntactic Acquisition
Model
Dataset
- PennTreeBank (PTB)

Document Analysis

Sentence-level Semantics

Tasks
- Semantic Parsing
  - AMR-to-text
  - Text-to-AMR
  - Table-to-text
  - Code Generation
Model
- TRANX
Dataset

Semantics: Lexical

Tasks
- Word Sense Disambiguation

Information Extraction and Text Mining

Tasks
- Topic Extraction
- Sentiment Extraction
- Aspect Extraction

Machine Translation

Task
- Machine Translation
- Non-autoregressive Machine Translation
- Word Alignment
Model
Dataset
- WMT

Text Generation

Text Classification

Task
- Spam Classification
- Sentiment Analysis
Model
- CNN-sentence, EMNLP2014
- CharCNN, NeurIPS2015
Dataset

Dialogue and Interactive Systems

Question Answering

Task
Dataset
- CNN/DailyMail
- SQuAD
  - Benchmark: F1-86.967 BERT + Synthetic Self-Training (ensemble) Jan 10, 2019
- RACE
  - Benchmark: RACE-83.2, RACE-M-86.5, RACE-H-81.3, RoBERTa, July 2019

Resources and Evaluation

Linguistic Theories and Cognitive Modeling

Multilinguality

Task
- Code-Switching
- Multilingual Translation
Model
Dataset

Phonology, Morphology and Word Segmentation

Textual Inference

Vision, Robotics, Speech, Multimodal

Language Modeling

Tasks
Model
- N-gram
- ELMo, NAACL2018
- GPT
- GPT-2, arXiv2019
- GPT-3, NeurIPS2020
- BERT, NAACL2019
  - RoBERTa, arXiv 2019
  - SpanBERT, TACL 2020
  - Efficient
    - ALBERT, arXiv 2020
    - SqueezeBERT, SustainNLP@EMNLP 2020
  - Domain Specific
  - Language Specific [Latin BERT, German BERT, Italian BERT, Chinese BERT]
  - BERTology, TACL 2020
- XLNet, NeurIPS2019
- MASS, ICML2019 [code]
- ELECTRA, ICLR2020 [code]
- T5, JMLR2020
- BART, ACL2020
Finetuning
- Invasive (LM not fixed)
  - Regular finetuning
  - Re-initialization for few-shot learning, ICLR2021
- Non-invasive (LM fixed)
  - Prefix-tuning, arXiv2021
Language Model as
- BERTScore, ICLR2020
- Few-shot learner
  - Bias in few-shot examples, arXiv2021
- Knowledge base, EMNLP2019, Tutorial@AAAI2021
Dataset
- CommonCrawl
- Wiki-Text
- STORIES
- C4 [huggingface]

Computational Social Science and Social Media

Discourse and Pragmatics

Information Retrieval and Text Mining

Language Grounding to Vision, Robotics and Beyond

Papers
- Experience Grounds Language, EMNLP2020

Machine Learning for NLP

Theory and Formalism in NLP

Ethics in NLP

Commonsense Knowledge

Tasks
- Fact Verification
- Commonsense Reasoning
- Word-level Rationales
- Factually Consistent Generation
Model
- ConceptNet, AAAI2017
- COMET, ACL2019 [paper]
Dataset

Interpretability

NLP Applications

Tasks
- Grammatical Error Correction (GEC) [BEA@NAACL2018, BEA@ACL2019, BEA@ACL2020, BEA@EACL2021]
- Lexical Substitution
- Lexical Simplification
Model
- BERT-based Lexical Substitution/GEC, [ACL2019, AAAI2020]
Dataset
- ETS

Resources and Benchmarks

Huggingface Dataset
GLUE
SuperGLUE
Leaderboards
- NLP-Progress
- Papers with Code

Interesting NLP

Google Books Ngram Viewer

Package

Machine Learning Package and Framework
- scikit-learn
- TensorFlow
- Caffe2
- PyTorch
- MXNet
NLTK
gensim
jieba
Stanford NLP
Transformers (huggingface)

How to contribute

If you are interested in this project, you are very welcome to join!

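Regular contributions: simply fork the repository and open a pull request.
Uploading files: please do not upload files directly into the project, as this makes the git repository too large. The right approach is to add a hyperlink to the file instead. If the file is already available online (papers, for example, always have a link), just add that link. If it is a file or dataset of your own that you want to share, then given the state of cloud storage services in China, please upload it as follows:
- (Inside mainland China) No good option has been found so far; share a direct link, or a link from your own cloud drive.
- (Outside mainland China) First upload the file directly via UPLOAD (no account registration needed); once the upload succeeds, find the file you just uploaded under DOWNLOAD and share its link.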

How to start collaborating on the project

Learn how to collaborate through GitHub

Keep your fork up to date

How to submit with git

Fetch and Merge in Git