Awesome Resource for NLP
New update: Capsule Networks, Sarcasm Detection
Table of Contents
- Table of Contents
- Libraries
- Essential Mathematics
- Dictionary
- Lexicon
- Parsing
- Discourse
- Language Model
- Sarcasm Detection
- Machine Translation
- Text Generation
- Text Classification
- Text Summarization
- Sentiment
- Word/Document Embeddings
- Word Representation
- Question Answer
- Information Extraction
- Natural Language Inference
- Capsule Networks
- Commonsense
- Other
- Contribute
<span id='libraries'>Useful Libraries</span>
- NumPy Stanford's lecture CS231N deals with NumPy, which is fundamental in machine learning calculations.
- NLTK It's a suite of libraries and programs for symbolic and statistical natural language processing
- TensorFlow A tutorial provided by TensorFlow. It gives great explanations of the basics with visual aids. Useful in deep NLP.
- PyTorch An awesome, high-quality tutorial on PyTorch provided by Facebook.
- tensor2tensor Sequence-to-sequence toolkit by Google, written in TensorFlow.
- fairseq Sequence-to-sequence toolkit by Facebook, written in PyTorch.
- Hugging Face Transformers A Transformer-based library from Hugging Face that gives easy access to pre-trained models. One of the key NLP libraries for researchers as well as developers.
- Hugging Face Tokenizers A tokenizer library maintained by Hugging Face. It is fast because the key functions are written in Rust. Recent subword tokenization methods such as BPE can be tried out with it.
- spaCy A tutorial written by Ines, the core developer of the noteworthy spaCy.
- torchtext A tutorial on torchtext, a package that makes data preprocessing handy. Has more details than the official documentation.
- SentencePiece Google's open-source library that builds subword vocabularies, for example with BPE.
- Gensim Python library for topic modelling, document indexing and similarity retrieval with large corpora.
- polyglot A natural language pipeline which supports massive multilingual applications.
- Textblob Provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, WordNet integration, parsing, and word inflection.
- Quepy A Python framework to transform natural language questions to queries in a database query language.
- Pattern Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
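Several of the libraries above (Hugging Face Tokenizers, SentencePiece) build vocabularies with byte-pair encoding (BPE). As a rough illustration of the idea only, the following pure-Python sketch learns merge rules on a toy word list. The function name `learn_bpe` and the toy corpus are made up for this example, and it is not the API or the implementation of either library.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version, hypothetical helper)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["low", "low", "lower", "lowest", "newer", "newer", "wider"]
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(vocab)
```

Production tokenizers add details this sketch omits, such as end-of-word markers, byte-level fallbacks, and fast pair counting.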
<span id='maths'>Essential Mathematics</span>
- Statistics and Probabilities
- Statistics 110 A lecture on Probability that can be easily understood by non-engineering major students.
- Brandon Foltz's Statistics Brandon Foltz's Probability and Statistics lectures are posted on YouTube and are rather short, so they are easy to watch during a daily commute.
- Linear Algebra
- Linear Algebra Awesome lectures by Professor Gilbert Strang.
- Essence of Linear Algebra Linear algebra lecture series on the YouTube channel 3Blue1Brown.
- Basics
- Mathematics for Machine Learning A book covering the mathematical background needed for machine learning.
- Essence of calculus Calculus lecture series by the channel 3Blue1Brown mentioned above, likewise helpful for those who want an overview of calculus.
<span id='dictionary'>Dictionary</span>
- Bilingual Dictionary
- CC-CEDICT A bilingual dictionary between English and Chinese.
- Pronouncing Dictionary
- CMUdict The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.
<span id='lexicon'>Lexicon</span>
- PDEV Pattern Dictionary of English Verbs.
- VerbNet A lexicon that groups verbs based on their semantic/syntactic linking behavior.
- FrameNet A lexicon based on frame semantics.
- WordNet A lexicon that describes semantic relationships (such as synonymy and hyperonymy) between individual words.
- PropBank A corpus of one million words of English text, annotated with argument role labels for verbs; and a lexicon defining those argument roles on a per-verb basis.
- NomBank A dataset that marks the sets of arguments that co-occur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
- SemLink A project that aims to link different lexical resources (VerbNet, PropBank, FrameNet, WordNet) via a set of mappings.
- Framester Framester is a hub between FrameNet, WordNet, VerbNet, BabelNet, DBpedia, Yago, DOLCE-Zero, as well as other resources. Framester does not simply create a strongly connected knowledge graph, but also applies a rigorous formal treatment of Fillmore's frame semantics, enabling full-fledged OWL querying and reasoning on the created joint frame-based knowledge graph.
<span id='parsing'>Parsing</span>
- PTB The Penn Treebank (PTB).
- Universal Dependencies Universal Dependencies (UD) is a framework for cross-linguistically consistent grammatical annotation and an open community effort with over 200 contributors producing more than 100 treebanks in over 60 languages.
- Tweebank Tweebank v2 is a collection of English tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.
- SemEval-2016 Task 9 SemEval-2016 Task 9 (Chinese Semantic Dependency Parsing) Datasets.
<span id='discourse'>Discourse</span>
- PDTB2.0 PDTB version 2.0 annotates 40,600 discourse relations, distributed into five types, including Explicit and Implicit.
- PDTB3.0 In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included and the corpus was subject to a series of consistency checks.
- Back-translation Annotated Implicit Discourse Relations This resource contains annotated implicit discourse relation instances. These sentences are annotated automatically by the back-translation of parallel corpora.
- DiscourseChineseTEDTalks This dataset includes annotation for 16 TED Talks in Chinese.
<span id='lm'>Language Model</span>
- PTB Penn Treebank Corpus in LM Version.
- Google Billion Word dataset 1 billion word language modeling benchmark.
- WikiText The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger.
<span id='sarcasm'>Sarcasm Detection</span>
- CASCADE ContextuAl SarCasm DEtector (CASCADE) adopts a hybrid approach of both content- and context-driven modeling for sarcasm detection in online social media discussions. The authors also use content-based feature extractors such as convolutional neural networks.
- A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks International Journal of Advanced Research in Computer Engineering & Technology, Volume 6, Issue 1, Jan 2017. The authors propose an automated system for detection of sarcasm on Twitter by using features related to sentiment.
- AdaRNN Adaptive Recursive Neural Network (AdaRNN) for target-dependent Twitter sentiment classification. It adaptively propagates the sentiments of words to the target depending on the context and syntactic relationships between them.
- Detecting Sarcasm with Deep Convolutional Neural Networks Related Medium article. It proposes first training a sentiment model (based on a CNN) to learn sentiment-specific feature extraction. The model learns local features in lower layers, which are then converted into global features in the higher layers.
<span id='mt'>Machine Translation</span>
- Europarl The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
- UNCorpus The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain.
- CWMT The Zh-En data collected and shared by the China Workshop on Machine Translation (CWMT) community. There are three types of data for Chinese-English machine translation: monolingual Chinese text, parallel Chinese-English text, and multiple-reference text.
- WMT Monolingual language model training data, such as Common Crawl/News Crawl in CS/DE/EN/FI/RO/RU/TR, and parallel data.
- OPUS OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
<span id='textgeneration'>Text Generation</span>
- Tencent Automatic Article Commenting A large-scale Chinese dataset with millions of real comments and a human-annotated subset characterizing the comments’ varying quality. This dataset consists of around 200K news articles and 4.5M human comments along with rich meta data for article categories and user votes of comments.
- Summarization
- BigPatent A summarization dataset consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries.
- Data-to-Text
- Wikipedia Person and Animal Dataset This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on a Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).
- WikiBio This dataset gathers 728,321 biographies from Wikipedia. It aims at evaluating text generation algorithms. For each article, it provides the first paragraph and the infobox (both tokenized).
- Rotowire This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores.
- MLB Details in Data-to-text Generation with Entity Modeling, ACL 2019
<span id='text_classification'>Text Classification</span>
- 20Newsgroups The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
- AG's corpus of news articles AG is a collection of more than 1 million news articles.
- Yahoo-Answers-Topic-Classification This corpus contains 4,483,032 questions and their corresponding answers from Yahoo! Answers service.
- Google-Snippets This dataset contains the web search results related to 8 different domains such as business, computers and engineering.
- BenchmarkingZeroShot This repository contains the code and the data for the EMNLP2019 paper "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach".
<span id ='text_summarization'>Text Summarization</span>
- Text Summarization with Gensim The gensim implementation is based on the popular "TextRank" algorithm.
- Unsupervised Text Summarization Awesome article describing text summarization using sentence embeddings.
- Improving Abstraction in Text Summarization Proposes two techniques for improvement.
- Text Summarization and Categorization Focused on scientific and health-related data.
- Text summarization with TensorFlow. A basic 2016 study on text summarization.
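The TextRank idea mentioned above ranks sentences by running a PageRank-style update on a sentence similarity graph. Below is a minimal pure-Python sketch of that idea, not the gensim implementation. The function name `textrank_summary`, the simplified word-overlap similarity, and the sample text are all assumptions made for illustration.

```python
import re


def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Extractive summary: rank sentences with a TextRank-style score (toy sketch)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Edge weight: shared words between two sentences (a simplified similarity).
    def similarity(a, b):
        if not a or not b:
            return 0.0
        return len(a & b) / (len(a) + len(b))

    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    # Power iteration of the weighted PageRank update.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]

    # Keep the top-scoring sentences, returned in their original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


text = ("TextRank builds a graph of sentences. Each sentence is a node in the graph. "
        "Edges connect sentences that share words. I like tea.")
summary = textrank_summary(text, num_sentences=2)
print(summary)
```

The unrelated last sentence shares no words with the others, so it receives only the baseline score and is left out of the summary.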
<span id='sentiment'>Sentiment</span>
- MPQA 3.0 This corpus contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). The main changes in this version of the MPQA corpus are the additions of new eTarget (entity/event) annotations.
- SentiWordNet SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity.
- NRC Word-Emotion Association Lexicon The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
- Stanford Sentiment TreeBank SST is the dataset of the paper: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)
- SemEval-2013 Twitter SemEval 2013 Twitter dataset, which contains phrase-level sentiment annotation.
- Sentihood SentiHood is a dataset for the task of targeted aspect-based sentiment analysis, which contains 5215 sentences. SentiHood: Targeted Aspect Based Sentiment Analysis Dataset for Urban Neighbourhoods, COLING 2016.
- SemEval-2014 Task 4 This task is concerned with aspect based sentiment analysis (ABSA). Two domain-specific datasets for laptops and restaurants, consisting of over 6K sentences with fine-grained aspect-level human annotations have been provided for training.
<span id='embedding'> Word/Document Embeddings</span>
- The Current Best of Universal Word/Sentence Embeddings. These embeddings encode words and sentences in fixed-length dense vectors to drastically improve the processing of textual data.
- Document Embedding with Paragraph Vectors 2015. From Google.
- GloVe Word Embeddings Demo Demo of how to use GloVe Word Embeddings
- FastText A Library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab with many pretrained models
- Text Classification With Word2Vec Practical implementation of text classification with word2vec, using GloVe.
- Document Embedding Introduction to the basics and importance of document embeddings.
- From Word Embeddings To Document Distances Introduces Word Mover’s Distance (WMD), which measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document.
- Doc2Vec Tutorial on the Lee Dataset
- Word Embeddings in Python with SpaCy and Gensim
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Dec 2018.
- Deep Contextualized Word Representations. ELMo. PyTorch implementation. TF implementation.
- Fine-tuning for Text Classification. Implementation code.
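- Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. Shows how universal sentence representations trained on the supervised data of the Stanford Natural Language Inference dataset can outperform unsupervised methods such as SkipThought vectors on a range of transfer tasks.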
- Learned in Translation: Contextualized Word Vectors. CoVe uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors
- Distributed Representations of Sentences and Documents. Paragraph vectors. See doc2vec tutorial at gensim
- sense2vec. A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings
- Skip Thought Vectors. An encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage
- Sequence to Sequence Learning with Neural Networks. It uses a multilayered LSTM to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector
- The Amazing Power of Word Vectors. Material related to word2vec from five different research papers.
- Contextual String Embeddings for Sequence Labeling. Properties include that they (a) are trained without any explicit notion of words, and (b) are contextualized by their surrounding text
- BERT Explained - State of the art language model for NLP. A great explanation of the fundamentals of how BERT works.
- Review of BERT based models. And some recent clues/insights into what makes BERT so effective
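The Word Mover's Distance entry above describes documents "traveling" through embedding space. Exact WMD needs an optimal-transport solver, so the sketch below swaps in a simpler one-sided relaxation in which every word moves entirely to its nearest word in the other document, which gives a lower bound on WMD. The function name `relaxed_wmd` and the hand-picked 2-d vectors are hypothetical and serve only as an illustration.

```python
import math


def relaxed_wmd(doc_a, doc_b, embeddings):
    """One-sided relaxed WMD: each word in doc_a moves to its nearest word in doc_b."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    words_a = [w for w in doc_a if w in embeddings]
    words_b = [w for w in doc_b if w in embeddings]
    if not words_a or not words_b:
        raise ValueError("both documents need at least one embedded word")
    # Uniform weight per token (normalized bag of words).
    weight = 1.0 / len(words_a)
    return sum(weight * min(dist(embeddings[a], embeddings[b]) for b in words_b)
               for a in words_a)


# Hypothetical 2-d vectors, chosen by hand for illustration only.
toy_embeddings = {
    "obama": (1.0, 0.9), "president": (0.9, 1.0),
    "speaks": (0.1, 0.5), "greets": (0.2, 0.5),
    "media": (0.5, 0.1), "press": (0.5, 0.2),
    "banana": (-1.0, -1.0),
}
d_close = relaxed_wmd(["obama", "speaks", "media"],
                      ["president", "greets", "press"], toy_embeddings)
d_far = relaxed_wmd(["obama", "speaks", "media"], ["banana"], toy_embeddings)
print(d_close, d_far)
```

With real pre-trained vectors, the two documents that share no words but have similar meaning would still come out close, which is the point of the measure.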
<span id='wordrepresentation'>Word Representation</span>
- Word Embedding
- Google News Word2vec The model contains 300-dimensional vectors for 3 million words and phrases, trained on part of the Google News dataset (about 100 billion words).
- GloVe Pre-trained Pre-trained word vectors using GloVe. Wikipedia + Gigaword 5, Common Crawl, Twitter.
- fastText Pre-trained Pre-trained word vectors for 294 languages, trained on Wikipedia using fastText.
- BPEmb BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.
- Dependency-based Word Embedding Pre-trained word embeddings based on dependency information, from Dependency-Based Word Embeddings, ACL 2014.
- Meta-Embeddings Ensembles of several pretrained word embedding sets, from Meta-Embeddings: Higher-quality word embeddings via ensembles of Embedding Sets, ACL 2016.
- LexVec Pre-trained Vectors based on the LexVec word embedding model. Common Crawl, English Wikipedia and NewsCrawl.
- MUSE MUSE is a Python library for multilingual word embeddings, which provides multilingual embeddings for 30 languages and 110 large-scale ground-truth bilingual dictionaries.
- CWV This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora.
- charNgram2vec This repository provides the re-implemented code for pre-training character n-gram embeddings presented in the Joint Many-Task (JMT) paper, A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks, EMNLP2017.
- Word Representation with Context
- ELMo Pre-trained contextual representations from large scale bidirectional language models provide large improvements for nearly all supervised NLP tasks.
- BERT BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. (2018.10)
- OpenGPT GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text.
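The next-word objective described for GPT-2 scores a text as the sum of log-probabilities of each word given what came before. As a toy stand-in, the sketch below uses a count-based model that conditions on the previous word only, whereas GPT-2 conditions on all previous words with a Transformer. The names `train_bigram` and `sequence_log_prob` and the two-sentence corpus are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count-based next-word model conditioned on the previous word only."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def sequence_log_prob(tokens, counts):
    """Sum of log P(next word | context), the quantity next-word training maximizes."""
    padded = ["<s>"] + tokens
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        seen = counts[prev]
        if seen[nxt] == 0:
            return float("-inf")  # unseen continuation: this toy has no smoothing
        total += math.log(seen[nxt] / sum(seen.values()))
    return total


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(sequence_log_prob(["the", "cat", "sat"], model))
print(sequence_log_prob(["cat", "the"], model))
```

A neural language model replaces the count table with a network that outputs a probability distribution over the vocabulary at every position, but the summed log-probability is computed the same way.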
<span id="qa">Question Answer</span>
- Machine Reading Comprehension
- SQuAD Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
- CMRC2018 CMRC2018 is released by the Second Evaluation Workshop on Chinese Machine Reading Comprehension. The dataset is composed of nearly 20,000 real questions annotated by humans on Wikipedia paragraphs.
- DRCD The Delta Reading Comprehension Dataset is an open-domain traditional Chinese machine reading comprehension (MRC) dataset. It contains 10,014 paragraphs from 2,108 Wikipedia articles and 30,000+ questions generated by annotators.
- TriviaQA TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. This dataset is from the Wikipedia domain and Web domain.
- NewsQA NewsQA is a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.
- HarvestingQA This folder contains the one million paragraph-level QA-pairs dataset (split into Train, Dev and Test set) described in: Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia (ACL 2018).
- ProPara ProPara aims to promote the research in natural language understanding in the context of procedural text. This requires identifying the actions described in the paragraph and tracking state changes happening to the entities involved.
- MCScript MCScript is a new dataset for the task of machine comprehension focussing on commonsense knowledge. It comprises 13,939 questions on 2,119 narrative texts and covers 110 different everyday scenarios. Each text is annotated with one of 110 scenarios.
- MCScript2.0 MCScript2.0 is a machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced based on a new collection process that results in challenging questions. Half of the questions cannot be answered from the reading texts, but require the use of commonsense and, in particular, script knowledge.
- CommonsenseQA CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.
- NarrativeQA NarrativeQA includes the list of documents with Wikipedia summaries, links to full stories, and questions and answers. For a detailed description of this see the paper "The NarrativeQA Reading Comprehension Challenge".
- HotpotQA HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
- <span id="SimilarQuestionIden">Duplicate/Similar Question Identification</span>
- Quora Question Pairs Quora Question Pairs dataset consists of over 400,000 lines of potential question duplicate pairs. [Kaggle Version Format]
- Ask Ubuntu This repo contains a preprocessed collection of questions taken from the AskUbuntu.com 2014 corpus dump. It also comes with 400*20 manual annotations, marking pairs of questions as "similar" or "non-similar", from Semi-supervised Question Retrieval with Gated Convolutions, NAACL2016.
<span id="ie">Information Extraction</span>
- Entity
- Shimaoka Fine-grained This dataset contains two standard and publicly available datasets for Fine-grained Entity Classification, provided in a preprocessed tokenized format, details in Neural architectures for fine-grained entity type classification, EACL 2017.
- Ultra-Fine Entity Typing A new entity typing task: given a sentence with an entity mention, the goal is to predict a set of free-form phrases (e.g. skyscraper, songwriter, or criminal) that describe appropriate types for the target entity.
- Nested Named Entity Corpus A fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB), whose annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting.
- Named Entity Recognition on Code-switched Data Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. It contains the training and development data for tuning and testing systems in the following language pairs: Spanish-English (SPA-ENG), and Modern Standard Arabic-Egyptian (MSA-EGY).
- MIT Movie Corpus The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. The eng corpus are simple queries, and the trivia10k13 corpus are more complex queries.
- MIT Restaurant Corpus The MIT Restaurant Corpus is a semantically tagged training and test corpus in BIO format.
- Relation Extraction
- Datasets of Annotated Semantic Relationships RECOMMEND This repository contains annotated datasets which can be used to train supervised models for the task of semantic relationship extraction.
- TACRED TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Details in Position-aware Attention and Supervised Data Improve Slot Filling, EMNLP 2017.
- FewRel FewRel is a Few-shot Relation classification dataset, which features 70,000 natural language sentences expressing 100 relations annotated by crowdworkers.
- SemEval 2018 Task7 The training data and evaluation script for SemEval 2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers.
- Chinese-Literature-NER-RE A discourse-level Named Entity Recognition and Relation Extraction dataset for Chinese literature text. It contains 726 articles, 29,096 sentences and over 100,000 characters in total.
- Event
- ACE 2005 Training Data The corpus consists of data of various types annotated for entities, relations and events. It was created by the Linguistic Data Consortium with support from the ACE Program and covers three languages: English, Chinese, Arabic.
- Chinese Emergency Corpus (CEC) Chinese Emergency Corpus (CEC) is built by the Data Semantic Laboratory at Shanghai University. This corpus is divided into 5 categories: earthquake, fire, traffic accident, terrorist attack and food poisoning.
- TAC-KBP Event Evaluation is a sub-track in TAC Knowledge Base Population (KBP), which started from 2015. The goal of TAC Knowledge Base Population (KBP) is to develop and evaluate technologies for populating knowledge bases (KBs) from unstructured text.
- Narrative Cloze Evaluation Data Evaluate understanding of a script by predicting the next event given several context events. Details in Unsupervised Learning of Narrative Schemas and their Participants, ACL 2009.
- Event Tensor An evaluation dataset for Schema Generation/Sentence Similarity/Narrative Cloze, proposed in Event Representations with Tensor-based Compositions, AAAI 2018.
- SemEval-2015 Task 4 TimeLine: Cross-Document Event Ordering. Given a set of documents and a target entity, the task is to build an event TimeLine related to that entity, i.e. to detect, anchor in time and order the events involving the target entity.
- RED Richer Event Description consists of coreference, bridging and event-event relations (temporal, causal, subevent and reporting relations) annotations over 95 English newswire, discussion forum and narrative text documents, covering all events, times and non-eventive entities within each document.
- InScript The InScript corpus contains a total of 1000 narrative texts crowdsourced via Amazon Mechanical Turk. It is annotated with script information in the form of scenario-specific events and participants labels.
- AutoLabelEvent The data of the work in Automatically Labeled Data Generation for Large Scale Event Extraction, ACL2017.
- EventInFrameNet The data of the work in Leveraging FrameNet to Improve Automatic Event Detection, ACL2016.
- MEANTIME The MEANTIME Corpus (the NewsReader Multilingual Event ANd TIME Corpus) consists of a total of 480 news articles: 120 English Wikinews articles on four topics and their translations in Spanish, Italian, and Dutch. It has been annotated manually at multiple levels, including entities, events, temporal information, semantic roles, and intra-document and cross-document event and entity coreference.
- BioNLP-ST 2013 BioNLP-ST 2013 features the six event extraction tasks: Genia Event Extraction for NFkB knowledge base construction, Cancer Genetics, Pathway Curation, Corpus Annotation with Gene Regulation Ontology, Gene Regulation Network in Bacteria, and Bacteria Biotopes (semantic annotation by an ontology).
- Event Temporal and Causal Relations
- CaTeRS Causal and Temporal Relation Scheme (CaTeRS), which is unique in simultaneously capturing a comprehensive set of temporal and causal relations between events. CaTeRS contains a total of 1,600 sentences in the context of 320 five-sentence short stories sampled from the ROCStories corpus.
- Causal-TimeBank Causal-TimeBank is the TimeBank corpus taken from the TempEval-3 task, with new information about causality added in the form of C-SIGNAL and CLINK annotations. It contains 6,811 EVENTs (only events instantiated by the MAKEINSTANCE tag of TimeML), 5,118 TLINKs (temporal links), 171 CSIGNALs (causal signals), and 318 CLINKs (causal links).
- EventCausalityData The EventCausality dataset provides relatively dense causal annotations on 25 newswire articles collected from CNN in 2010.
- EventStoryLine A benchmark dataset for the temporal and causal relation detection.
- TempEval-3 The TempEval-3 shared task aims to advance research on temporal information processing.
- TemporalCausalReasoning A dataset with both temporal and causal relations annotation. The temporal relations were annotated based on the scheme proposed in "A Multi-Axis Annotation Scheme for Event Temporal Relations" using CrowdFlower; the causal relations were mapped from the "EventCausalityData".
- TimeBank TimeBank 1.2 contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links (TLINKs) between events and times.
- TimeBank-EventTime Corpus This dataset is a subset of the TimeBank Corpus with a new annotation scheme to anchor events in time. Detailed description.
- Event Factuality
- UW Event Factuality Dataset This dataset contains annotations of text from the TempEval-3 corpus with factuality assessment labels.
- FactBank 1.0 FactBank 1.0 consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality.
- CommitmentBank The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
- UDS The Universal Decompositional Semantics It Happened dataset is a large event factuality dataset that covers the entirety of the English Universal Dependencies v1.2 (EUD1.2) treebank.
- DLEF A document level event factuality (DLEF) dataset, which includes the source (English and Chinese), detailed guidelines for both document- and sentence-level event factuality.
- Event Coreference
- ECB 1.0 This corpus consists of a collection of Google News documents annotated with within- and cross-document event coreference information. The documents are grouped according to the Google News Cluster, each group of documents representing the same seminal event (or topic).
- EECB 1.0 Compared to ECB 1.0, this dataset is extended in two directions: (i) fully annotated sentences, and (ii) entity coreference relations. In addition, annotators removed relations other than coreference (e.g., subevent, purpose, related, etc.).
- ECB+ The ECB+ corpus is an extension to the ECB 1.0. A newly added corpus component consists of 502 documents that belong to the 43 topics of the ECB but that describe different seminal events than those already captured in the ECB.
- Open Information Extraction
- oie-benchmark This repository contains code for converting QA-SRL annotations to Open-IE extractions and comparing Open-IE parsers against a converted benchmark corpus.
- NeuralOpenIE A training dataset from Neural Open Information Extraction, ACL 2018. There are a total of 36,247,584 ⟨sentence, tuple⟩ pairs extracted from a Wikipedia dump using OPENIE4.
- Other
- WikilinksNED A large-scale Named Entity Disambiguation dataset of text fragments from the web, which is significantly noisier and more challenging than existing news-based datasets.
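Some corpora in this section, such as the MIT Movie and MIT Restaurant corpora, are distributed in BIO format, where `B-X` opens a span of type `X`, `I-X` continues it, and `O` marks tokens outside any span. The sketch below shows one way to turn such tags back into labeled spans. The helper name `bio_to_spans`, the example query, and the `DIRECTOR` and `YEAR` labels are assumptions for illustration and are not taken from the corpus files.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/tag lists in BIO format to (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and label == tag[2:]:
            # Continue the currently open span.
            words.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if words:
            spans.append((label, " ".join(words)))
        if tag.startswith("B-"):
            label, words = tag[2:], [token]
        else:
            # "O", or a stray I- tag with no matching open span (dropped here).
            label, words = None, []
    if words:
        spans.append((label, " ".join(words)))
    return spans


tokens = "show me films directed by steven spielberg from 1993".split()
tags = ["O", "O", "O", "O", "O", "B-DIRECTOR", "I-DIRECTOR", "O", "B-YEAR"]
print(bio_to_spans(tokens, tags))
```

Dropping stray `I-` tags is a design choice made for this sketch; some evaluation scripts instead treat them as the start of a new span.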
<span id="nli">Natural Language Inference</span>
- SNLI The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).
- MultiNLI The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.
- Scitail The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. The domain makes this dataset different in nature from previous datasets, and it consists of more factual sentences rather than scene descriptions.
- PAWS A dataset with 108,463 well-formed paraphrase and non-paraphrase pairs with high lexical overlap, introduced in PAWS: Paraphrase Adversaries from Word Scrambling.
<span id ="capsule">Capsule Networks</span>
- Investigating Capsule Networks with Dynamic Routing for Text Classification. It shows that capsule networks exhibit significant improvement over competitors when transferring from single-label to multi-label text classification.
- Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction. The authors explore capsule networks for relation extraction in a multi-instance multi-label learning framework and propose a novel neural approach based on capsule networks with attention mechanisms.
- Identifying Aggression and Toxicity in Comments using Capsule Network. 2018. It is early days for Capsule Networks, which were introduced by Geoffrey Hinton et al. in 2017 as an attempt at an NN architecture superior to classical CNNs. The idea aims to capture hierarchical relationships in the input layer through dynamic routing between "capsules" of neurons. Likely because of the shared theme of addressing hierarchical complexity, the idea's extension to the NLP field has since been a subject of active research, as in the papers listed above.
- Dynamic Routing Between Capsules. They propose an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.
- Matrix Capsules With Expectation-Maximization Routing. The transformation matrices of the capsule net are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers.
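The routing-by-agreement idea described for Dynamic Routing Between Capsules can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the names `squash`, `softmax`, `route`, `u_hat`, and `num_iters` are illustrative, the default of three iterations is an assumption, and plain Python lists are used only to keep the example self-contained.

```python
import math

def squash(s):
    """Shrink a vector's length into [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """Route predictions u_hat[i][j] (lower capsule i -> upper capsule j).

    Returns the upper-capsule activity vectors and the coupling
    coefficients used to compute them.
    """
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]  # routing logits
    for _ in range(num_iters):
        # Each lower capsule distributes its output over the upper capsules.
        c = [softmax(row) for row in b]
        # Upper capsule activity: squashed weighted sum of the predictions.
        v = []
        for j in range(n_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement: scalar product between each prediction and the activity.
        for i in range(n_lower):
            for j in range(n_upper):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Two lower capsules agree about upper capsule 0 and disagree about capsule 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, c = route(u_hat)
```

In this toy input the coupling coefficients shift toward upper capsule 0, where the two predictions agree, while the conflicting predictions for capsule 1 cancel out.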
<span id="commonsense">Commonsense</span>
- ConceptNet ConceptNet is a multilingual knowledge base, representing words and phrases that people use and the common-sense relationships between them.
- Commonsense Knowledge Representation ConceptNet-related resources. Details in Commonsense Knowledge Base Completion. Proc. of ACL, 2016
- ATOMIC An atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables.
- SenticNet SenticNet provides a set of semantics, sentics, and polarity associated with 100,000 natural language concepts. SenticNet consists of a set of tools and techniques for sentiment analysis combining commonsense reasoning, psychology, linguistics, and machine learning.
<span id="other">Other</span>
- QA-SRL This dataset uses question-answer pairs to model verbal predicate-argument structure. The questions start with wh-words (Who, What, Where, etc.) and contain a verb predicate in the sentence; the answers are phrases in the sentence.
- QA-SRL 2.0 This repository is the reference point for QA-SRL Bank 2.0, the dataset described in the paper Large-Scale QA-SRL Parsing, ACL 2018.
- NEWSROOM CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications.
- CoNLL 2010 Uncertainty Detection The aim of this task is to identify sentences in texts which contain unreliable or uncertain information. The training data contains biological abstracts and full articles from the BioScope (biomedical domain) corpus and paragraphs from Wikipedia possibly containing weasel information.
- COLING 2018 automatic identification of verbal MWE Corpora were annotated by human annotators with occurrences of verbal multiword expressions (VMWEs) according to common annotation guidelines. For example, "He picked one up."
- Scientific NLP
- PubMed 200k RCT PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences.
- Automatic Academic Paper Rating A dataset for automatic academic paper rating (AAPR), the task of automatically determining whether to accept academic papers. The dataset consists of 19,218 academic papers in the field of artificial intelligence collected from arXiv.
- ACL Title and Abstract Dataset This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016).
- SCIERC A dataset that includes annotations for entities, relations, and coreference clusters in scientific articles.
- SciBERT SciBERT is a BERT model trained on scientific text. The repository includes a broad set of scientific NLP datasets under the data/ directory, covering NER, parsing, PICO, and text classification.
- 5AbstractsGroup The dataset contains academic papers from five different domains collected from the Web of Science, namely business, artificial intelligence, sociology, transport and law.
- SciCite A large dataset of citation intents, introduced in Structural Scaffolds for Citation Intent Classification in Scientific Publications.
- ACL-ARC A dataset of citation intents in the computational linguistics domain (ACL-ARC) introduced by Measuring the Evolution of a Scientific Field through Citation Frames.
- GASP The dataset consists of lists of cited abstracts associated with the corresponding source abstract. The goal is to generate the abstract of a target paper given the abstracts of cited papers.
<span id="contribute">Contribute</span> Contributions welcome!