Awesome
Natural Language Processing
This repository collects the basic concepts of Natural Language Processing, along with reputable textbooks and blogs, popular papers, and related resources.
It is also the Natural Language Processing part of Machine Learning Resources, a project created by a group of people including jindongwang.
Contributors are welcome to work together and make it BETTER!
Textbook and Lecture Resources
Mathematical and Statistical Foundations
- Linear Algebra
- Matrix Analysis
- Convex Optimization
Machine Learning
- The Elements of Statistical Learning (ESL) - Hastie, Tibshirani, Friedman
- CS228 Probabilistic Graphical Models - Stanford
- 10708 Probabilistic Graphical Models - CMU
Deep Learning
- Deep Learning - Ian Goodfellow, Yoshua Bengio, Aaron Courville
- CS231n Convolutional Neural Networks for Visual Recognition - Stanford
Natural Language Processing
- Foundations of Statistical Natural Language Processing - Christopher Manning and Hinrich Schütze
- Speech and Language Processing - Daniel Jurafsky and James H. Martin
- Statistical Learning Methods (统计学习方法) - Hang Li (李航)
- Advanced Natural Language Processing - MIT
- CS 224n Natural Language Processing with Deep Learning - Stanford
- Deep Learning for NLP at Oxford with DeepMind - Oxford
- 11-747 Neural Networks for NLP - CMU
- 11-737 Multilingual NLP - CMU
- Some Knowledge about Machine Learning
- A list of datasets
Models and Applications
- Probabilistic Graphical Model
- Hidden Markov Model
- Conditional Random Fields
- Topic Model
- Latent Dirichlet Allocation (paper)
- Deep Learning Model
- Long Short-Term Memory (LSTM), Sepp Hochreiter, 1997
- Interpretation, Omer Levy, University of Washington, 2018
- Recurrent Neural Network
- Seq2Seq (TensorFlow tutorial)
- Machine Translation, TensorFlow implementation
- Convolutional Neural Network
- Attention Model
- Overview (Chinese)
- Generative Adversarial Network (GAN)
- Transformer
- Training Tips
- Bidirectional Encoder Representations from Transformers (BERT), Jacob Devlin, Google, 2018
Blogs and Tutorials
- TensorFlow implementation of RNNs and undocumented features
- The Unreasonable Effectiveness of Recurrent Neural Networks
Topics and Tasks
Areas are categorized by the tracks of ACL 2018, ACL 2020, and EMNLP 2020.
Summarization
- Task
- Summarization
- Opinion Summarization
- Evaluation
- Model
- Extractive
- Generative
- Hybrid
- Dataset
- XSum, EMNLP2018 [paper]
- CNN/DailyMail
- NEWSROOM
- Multi-News
- Gigaword
- arXiv
- PubMed
- BIGPATENT
- WikiHow
- Reddit TIFU (long, short)
- AESLC
- BillSum
Embedding
- Model
- Word2Vec
- Pre-trained Embedding
- GloVe
- word2vec
- FastText
- Contextual Word Embedding
- ELMo
- GPT
- BERT
- XLNet
- BART
- T5
Sentiment Analysis and Argument Mining
Named Entity Recognition
Tagging, Chunking
- Task
- Word Segmentation
- Syntactic Parsing
- Model
- Hidden Markov Model (HMM)
- Conditional Random Fields (CRFs)
- Finetuned Language Models
Syntax, Parsing
- Task
- Constituency Parsing
- Dependency Parsing
- Visually Grounded Syntactic Acquisition
- Model
- Dataset
Document Analysis
Sentence-level Semantics
- Tasks
- Semantic Parsing
- AMR-to-text
- Text-to-AMR
- Table-to-text
- Code Generation
- Model
- Dataset
Semantics: Lexical
- Tasks
- Word Sense Disambiguation
Information Extraction and Text Mining
- Tasks
- Topic Extraction
- Sentiment Extraction
- Aspect Extraction
Machine Translation
- Task
- Machine Translation
- Non-autoregressive Machine Translation
- Word Alignment
- Model
- Dataset
- WMT
Text Generation
Text Classification
- Task
- Spam Classification
- Sentiment Analysis
- Model
- Dataset
Dialogue and Interactive Systems
Question Answering
- Task
- Dataset
- CNN/DailyMail
- SQuAD
- Benchmark: F1 86.967, BERT + Synthetic Self-Training (ensemble), Jan 10, 2019
- RACE
- Benchmark: RACE 83.2, RACE-M 86.5, RACE-H 81.3, RoBERTa, July 2019
Resources and Evaluation
Linguistic Theories and Cognitive Modeling
Multilinguality
- Task
- Code-Switching
- Multilingual Translation
- Model
- Dataset
Phonology, Morphology and Word Segmentation
Textual Inference
Vision, Robotics, Speech, Multimodal
Language Modeling
- Tasks
- Model
- N-gram
- ELMo, NAACL2018
- GPT
- GPT-2, arXiv2019
- GPT-3, NeurIPS2020
- BERT, NAACL2019
- RoBERTa, arXiv 2019
- SpanBERT, TACL 2020
- Efficient
- Domain Specific
- Language Specific [Latin BERT, German BERT, Italian BERT, Chinese BERT]
- BERTology, TACL 2020
- XLNet, NeurIPS2019
- MASS, ICML2019 [code]
- ELECTRA, ICLR2020 [code]
- T5, JMLR2020
- BART, ACL2020
- Finetuning
- Invasive (LM not fixed)
- Regular finetuning
- Re-initialization for few-shot learning, ICLR2021
- Non-invasive (LM fixed)
- Prefix-tuning, arXiv2021
- Language Model as
- BERTScore, ICLR2020
- Few-shot learner
- Bias in few-shot examples, arXiv2021
- Knowledge base EMNLP2019, Tutorial@AAAI2021
- Dataset
- CommonCrawl
- Wiki-Text
- STORIES
- C4 [huggingface]
Computational Social Science and Social Media
Discourse and Pragmatics
Information Retrieval and Text Mining
Language Grounding to Vision, Robotics and Beyond
Machine Learning for NLP
Theory and Formalism in NLP
Ethics in NLP
Commonsense Knowledge
- Tasks
- Fact Verification
- Commonsense Reasoning
- Word-level Rationales
- Factually Consistent Generation
- Model
- Dataset
Interpretability
NLP Applications
- Tasks
- Grammatical Error Correction (GEC) [BEA@NAACL2018, BEA@ACL2019, BEA@ACL2020, BEA@EACL2021]
- Lexical Substitution
- Lexical Simplification
- Model
- Dataset
Resources and Benchmarks
- Huggingface Dataset
- GLUE
- SuperGLUE
- Leaderboards
Interesting NLP
Package
- Machine Learning Package and Framework
- scikit-learn
- TensorFlow
- Caffe2
- PyTorch
- MXNet
- NLTK
- gensim
- jieba
- Stanford NLP
- Transformers (huggingface)
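How to contribute
If you are interested in this project, you are very welcome to join!
- Regular contributions: fork the repository or open a pull request directly; either is fine.
- If you want to upload files: please do not upload them directly into the project, as this makes the git repository too large. The correct approach is to add a hyperlink to the file instead. If the file is already available online (papers, for example, always have a link), add that link directly; if it is a file or dataset of your own that you want to share, then, given the state of cloud storage services in China, please upload it as follows:
How to start collaborating on the project
Learn how to collaborate through GitHub
Keep your fork up to date
How to submit with git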