Data Augmentation Techniques for NLP

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by text classification, translation, summarization, question-answering, sequence tagging, parsing, grammatical-error-correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, adversarial examples, compositionality, and automated augmentation.

This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:

@inproceedings{feng-etal-2021-survey,
    title = "A Survey of Data Augmentation Approaches for {NLP}",
    author = "Feng, Steven Y.  and
      Gangal, Varun  and
      Wei, Jason  and
      Chandar, Sarath  and
      Vosoughi, Soroush  and
      Mitamura, Teruko  and
      Hovy, Eduard",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.84",
    doi = "10.18653/v1/2021.findings-acl.84",
    pages = "968--988",
}

Authors: <a href="https://scholar.google.ca/citations?hl=en&user=zwiszZIAAAAJ">Steven Y. Feng</a>, <a href="https://scholar.google.com/citations?user=rWZq2nQAAAAJ&hl=en">Varun Gangal</a>, <a href="https://scholar.google.com/citations?user=wA5TK_0AAAAJ&hl=en">Jason Wei</a>, <a href="https://scholar.google.co.in/citations?user=yxWtZLAAAAAJ&hl=en">Sarath Chandar</a>, <a href="https://scholar.google.ca/citations?user=45DAXkwAAAAJ&hl=en">Soroush Vosoughi</a>, <a href="https://scholar.google.com/citations?user=gjsxBCkAAAAJ&hl=en">Teruko Mitamura</a>, <a href="https://scholar.google.com/citations?user=PUFxrroAAAAJ&hl=en">Eduard Hovy</a>

Special thanks to Ryan Shentu, Fiona Feng, Karen Liu, Emily Nie, Tanya Lu, and Bonnie Ma for helping out with this repo. Note: WIP. More papers from our survey paper will be added to this repo soon. Direct inquiries to stevenyfeng@gmail.com or open an issue here.

Also, check out our talk for Google Research (Steven Feng and Varun Gangal) here, and our podcast episode (Steven Feng and Eduard Hovy) here and here.

Text Classification

| Paper | Datasets |
| --- | --- |
| Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (ACL '95) | Paper-Specific/Legacy Corpus |
| Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) | AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon |
| That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) | Twitter |
| Robust Training under Linguistic Adversity (EACL '17) code | Movie review, customer review, SUBJ, SST |
| Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code | SST, SUBJ, MPQA, RT, TREC |
| Variational Pretraining for Semi-supervised Text Classification (ACL '19) code | IMDB, AG News, Yahoo, hatespeech |
| EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code | SST, CR, SUBJ, TREC, PC |
| A Closer Look At Feature Space Data Augmentation For Few-Shot Intent Classification (DeepLo @ EMNLP '19) | SNIPS |
| Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) | TREC, SST, Subj, MR |
| MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code | AG News, DBpedia, Yahoo, IMDb |
| Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code | Yelp, IMDb, Amazon, DBpedia |
| Not Enough Data? Deep Learning to the Rescue! (AAAI '20) | ATIS, TREC, WVA |
| Data Augmentation using Pre-trained Transformer Models (LifeLongNLP @ AACL '20) code | SNIPS, TREC, SST2 |
| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT'14 |
| Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) | ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |
| Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) | SST2, TREC |
| Text Augmentation in a Multi-Task View (EACL '21) | SST2, TREC, SUBJ |
| GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation (arXiv '21) | SST2, CR, TREC, SUBJ, MPQA, CoLA |
| Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code | HUFF, COV-Q, AMZN, FEWREL |
| Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP '21) code | IMDB, SST2, SST5, TREC, YELP2, YELP5 |
| AEDA: An Easier Data Augmentation Technique for Text Classification (EMNLP '21) code | SST, CR, SUBJ, TREC, PC |

Translation

| Paper | Datasets |
| --- | --- |
| Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) | WMT '15 en-de, IWSLT '15 en-tr |
| Adapting Neural Machine Translation with Parallel Synthetic Data (WMT '17) | COMMON, 1 Billion Words, dev2013, XRCE, IT, E-Com |
| Data Augmentation for Low-Resource Neural Machine Translation (ACL '17) code | WMT '14/'15/'16 en-de/de-en |
| Synthetic Data for Neural Machine Translation of Spoken-Dialects (arXiv '17) | LDC2012T09, OpenSubtitles-2013 |
| Multi-Source Neural Machine Translation with Data Augmentation (IWSLT '18) | TED Talks |
| SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
| Generalizing Back-Translation in Neural Machine Translation (WMT '19) | ed NewsCrawl2, WMT '18 de-en |
| Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation (ACL '19) | DGT-TM en-ml/en-hu |
| Augmenting Neural Machine Translation with Knowledge Graphs (arXiv '19) | WMT '14-'18 |
| Generalized Data Augmentation for Low-Resource Translation (ACL '19) code | ENG-HRL-LRL, HRL-LRL |
| Improving Robustness of Machine Translation with Synthetic Noise (NAACL '19) code | EP, TED, MTNT en-fr en-jpn |
| Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code | IWSLT '14 de/es/he-en, WMT '14 en-de |
| Data augmentation using back-translation for context-aware neural machine translation (DiscoMT @ EMNLP '19) code | IWSLT '17 en-ja/en-fr, BookCorpus, Europarl v7, National Diet of Japan |
| Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation (W-NUT @ EMNLP '19) | WMT '15/'19 en/fr, MTNT, IWSLT '17, MuST-C |
| Data augmentation for pipeline-based speech translation (Baltic HLT '20) | WMT '17 |
| Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation (IJCAI '20) code | WMT '16 de-en, NIST zh-en |
| A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation (Information '20) | IWSLT '14 en-de |
| Syntax-aware Data Augmentation for Neural Machine Translation (arXiv '20) | WMT '14 en-de, IWSLT '14 de-en |
| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT '14 |
| Data diversification: A simple strategy for neural machine translation (NeurIPS '20) code | WMT '14 en-de/en-fr, IWSLT '13/'14/'15 en-de/de-en/en-fr |
| AdvAug: Robust Adversarial Augmentation for Neural Machine Translation (ACL '20) | NIST zh-en, WMT '14 en-de |
| Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation (arXiv '20) | WMT '14/'19 |
| Sentence Boundary Augmentation For Neural Machine Translation Robustness (arXiv '20) | IWSLT '14/'15/'18 en-de, WMT '18 en-de |
| Valar NMT: Vastly Lacking Resources Neural Machine Translation (Stanford CS224N) | Bible, Misc, Europarl v8, Newstest '18 |

Summarization

| Paper | Datasets |
| --- | --- |
| Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arXiv '19) | DUC |
| Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) | SwissText, Common Crawl |
| Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) | CNN-DailyMail |
| Data Augmentation for Abstractive Query-Focused Multi-Document Summarization (AAAI '21) code | QMDSCNN, QMDSIR, WikiSum, DUC 2006, DUC 2007 |

Question Answering

| Paper | Datasets |
| --- | --- |
| QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (ICLR '18) | SQuAD, TriviaQA |
| An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) | MRQA |
| Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arXiv '19) | SQuAD, Trivia-QA, CMRC, DRCD |
| XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arXiv '19) | XNLI, SQuAD |
| Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arXiv '20) | MLQA, XQuAD, SQuAD-it, PIAF |
| Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code | WIQA, QuaRel, HotpotQA |

Sequence Tagging

| Paper | Datasets |
| --- | --- |
| Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code | Universal Dependencies project |
| DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks (EMNLP '20) code | CoNLL2002/2003 |
| An Analysis of Simple Data Augmentation for Named Entity Recognition (COLING '20) | MaSciP, i2b2-2010 |
| SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup (EMNLP '20) code | CoNLL-03, ACE05, Webpage |

Parsing

| Paper | Datasets |
| --- | --- |
| Data Recombination for Neural Semantic Parsing (ACL '16) code | GeoQuery, ATIS, Overnight |
| A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages (EMNLP '19) | Universal Dependencies treebanks version 2.2 |
| Named Entity Recognition for Social Media Texts with Semantic Augmentation (EMNLP '20) code | WNUT16, WNUT17, Weibo |
| Good-Enough Compositional Data Augmentation (ACL '20) code | SCAN |
| GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing (ICLR '21) | SPIDER, WIKISQL, WIKITABLEQUESTIONS |

Grammatical Error Correction

| Paper | Datasets |
| --- | --- |
| GenERRate: Generating Errors for Use in Grammatical Error Detection (BEA '09) | Ungram-BNC |
| Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners (IJCNLP '11) code | Lang-8 |
| Artificial error generation for translation-based grammatical error correction (University of Cambridge Technical Report '16) | Several Datasets |
| Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction (NAACL '18) | Lang-8, CoNLL-2014, CoNLL-2013, JFLEG |
| Using Wikipedia Edits in Low Resource Grammatical Error Correction (WNUT @ EMNLP '18) | Falko-MERLIN GEC Corpus |
| Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arXiv '19) | CoNLL-2014, JFLEG |
| Controllable Data Synthesis Method for Grammatical Error Correction (arXiv '19) code | NUCLE, Lang-8, One-Billion, CoNLL2013, CoNLL2014 |
| Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 |
| Corpora Generation for Grammatical Error Correction (NAACL '19) | CoNLL-2014, JFLEG, Lang-8 |
| Erroneous data generation for Grammatical Error Correction (BEA @ ACL '19) | Lang-8,n CoNLL, JFLEG, CoNLL-2014, ABCN, FCE |
| Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arXiv '19) code | GYAFC, WMT14, WMT18 |
| A neural grammatical error correction system built on better pre-training and sequential transfer learning (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8, Gutenberg, Tatoeba, WikiText-103 |
| Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) | FCE, NUCLE, W&I+LOCNESS, Lang-8 |
| A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction (BEA @ ACL '20) | W&I+LOCNESS, FCE, News Crawl 2, W&I+L train, FCE-train, NUCLE, Lang-8, W&I+L dev, FCE-test, Tatoeba, WikiText-103 |
| A syntactic rule-based framework for parallel data synthesis in Japanese GEC (MIT Thesis '20) | Lang-8 |

Generation

| Paper | Datasets |
| --- | --- |
| TNT-NLG, System 2: Data repetition and meaning representation manipulation to improve neural generation (E2E NLG Challenge System Descriptions) | TODO |
| Findings of the Third Workshop on Neural Generation and Translation (WNGT @ EMNLP '19) | RotoWire English-German |
| A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models (INLG '19) code | E2E Challenge Dataset, Laptops, TVs |
| GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code | Yelp |
| Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers (WebNLG+ @ INLG '20) | WebNLG |

Dialogue

| Paper | Datasets |
| --- | --- |
| Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding (COLING '18) code | ATIS, Dec94, Stanford dialogue |
| Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context (arXiv '19) code | MultiWOZ |
| Data Augmentation by Data Noising for Open-vocabulary Slots in Spoken Language Understanding (Student Research Workshop @ NAACL '19) | ATIS, Snips, MR |
| Data Augmentation with Atomic Templates for Spoken Language Understanding (EMNLP '19) code | DSTC 2&3, DSTC2 |
| Data Augmentation for Spoken Language Understanding via Joint Variational Generation (AAAI '19) | ATIS, Snips, MIT |
| Effective Data Augmentation Approaches to End-to-End Task-Oriented Dialogue (IALP '19) | CamRest676, KVRET |
| Paraphrase Augmented Task-Oriented Dialog Generation (ACL '20) code | CamRest676, MultiWOZ |
| Dialog State Tracking with Reinforced Data Augmentation (AAAI '20) | WoZ, MultiWoZ |
| Data Augmentation for Copy-Mechanism in Dialogue State Tracking (arXiv '20) | WoZ, DSTC2, Multi |
| Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent Classification (PACLIC '20) code | ATIS, SNIPS, FB |
| Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management (TACL '21) | M2M, MultiWOZ |
| GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation (EMNLP '21) code | SMCalFlow, ROSTD |
| Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation (ACL '21 Findings) code | DailyDialog |

Multimodal

| Paper | Datasets |
| --- | --- |
| Data Augmentation for Visual Question Answering (INLG '17) | COCO-VQA, COCO-QA |
| Low Resource Multi-modal Data Augmentation for End-to-end ASR (CoRR '18) | TODO |
| Multi-Modal Data Augmentation for End-to-end ASR (Interspeech '18) | Voxforge, HUB4 |
| Augmenting Image Question Answering Dataset by Exploiting Image Captions (LREC '18) | IQA |
| Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks (AVEC '18) | TODO |
| Multimodal Dialogue State Tracking By QA Approach with Data Augmentation (DSTC8 @ AAAI '20) | DSTC7-AVSD |
| Data augmentation techniques for the Video Question Answering task (arXiv '20) | TGIF-QA, MSVD-QA |
| Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors (NLP for ConvAI @ ACL '20) | DSTC2 |
| Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering (ECCV '20) | TODO |
| Text Augmentation Using BERT for Image Captioning (Applied Sciences '20) | MSCOCO |
| MDA: Multimodal Data Augmentation Framework for Boosting Performance on Image-Text Sentiment/Emotion Classification Tasks (IEEE Intelligent Systems '20) | TODO |

Mitigating Bias

| Paper | Datasets |
| --- | --- |
| Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods (NAACL '18) code | WinoBias, OntoNotes |
| Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology (ACL '19) code | TODO |
| CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (ACL '19) Dataset | New Dataset Created |
| It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution (EMNLP '19) code | SSA, Stanford Large Movie Review, SimLex-999 |
| Gender Bias in Neural Natural Language Processing (Springer '20) | Wikitext-2, CoNLL-2012 |
| Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures (arXiv '20) | SWAG, CoNLL2009, MultiNLI, HANS |

Mitigating Class Imbalance

| Paper | Datasets |
| --- | --- |
| SMOTE: Synthetic Minority Over-sampling Technique (Journal of Artificial Intelligence Research '02) | Pima, Phoneme, Adult, E-state, Satimage, Forest Cover, Oil, Mammography, Can |
| Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem (EMNLP '07) | TODO |
| MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation (Knowledge-Based Systems '15) | bibtex, cal500, corel5k, slashdot, tmc2007, mediamill, medical, scene, enron, emotions |
| SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (Journal of Artificial Intelligence Research '18) | TODO |

Adversarial Examples

| Paper | Datasets |
| --- | --- |
| Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) code | SST, SICK |
| AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples (ACL '18) code | WordNet, PPDB, SICK, SNLI, SciTail |
| Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (ACL '18) | SNLI, SciTail, MultiNLI |
| Certified Robustness to Adversarial Word Substitutions (EMNLP '19) code | IMDB, SNLI |
| PAWS: Paraphrase Adversaries from Word Scrambling (NAACL '19) code | PAWS (QQP + Wikipedia) |
| Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency (ACL '19) code | IMDB, AG’s News, Yahoo Answers |

Compositionality

| Paper | Datasets |
| --- | --- |
| Good-Enough Compositional Data Augmentation (ACL '20) code | SCAN |
| Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code | IWSLT '14, WMT '14 |

Automated Augmentation

| Paper | Datasets |
| --- | --- |
| Learning Data Manipulation for Augmentation and Weighting (NeurIPS '19) code | SST, IMDB, TREC, CIFAR-10 |
| Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight (ACL '20) | DailyDialog, OpenSubtitles |
| Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP '21) code | IMDB, SST2, SST5, TREC, YELP2, YELP5 |

Popular Resources