<p align="center"> <br> <img src="https://github.com/makcedward/nlpaug/blob/master/res/logo_small.png"/> <br> </p> <p align="center"> <a href="https://travis-ci.org/makcedward/nlpaug"> <img alt="Build" src="https://travis-ci.org/makcedward/nlpaug.svg?branch=master"> </a> <a href="https://www.codacy.com/app/makcedward/nlpaug?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=makcedward/nlpaug&amp;utm_campaign=Badge_Grade"> <img alt="Code Quality" src="https://api.codacy.com/project/badge/Grade/2d6d1d08016a4f78818161a89a2dfbfb"> </a> <a href="https://pepy.tech/badge/nlpaug"> <img alt="Downloads" src="https://pepy.tech/badge/nlpaug"> </a> </p>

nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Features

<h3 align="center">Textual Data Augmentation Example</h3> <br><p align="center"><img src="https://github.com/makcedward/nlpaug/blob/master/res/textual_example.png"/></p> <h3 align="center">Acoustic Data Augmentation Example</h3> <br><p align="center"><img src="https://github.com/makcedward/nlpaug/blob/master/res/audio_example.png"/></p>

| Section | Description |
|:---|:---|
| Quick Demo | How to use this library |
| Augmenter | Introduction to all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancements |
| Extension Reading | More real-life examples and research |
| Reference | References to external resources such as data or models |

Quick Demo
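
A minimal usage sketch, assuming `nlpaug` and `numpy` are installed as described under Installation. The classes used here (`KeyboardAug`, `RandomCharAug`, `RandomWordAug`, `Sequential`) live in nlpaug's `char`, `word` and `flow` modules; note that the character-level random augmenter is exposed as `RandomCharAug`. The helper `quick_demo` and the sample sentence are illustrative only, and the import guard exists just so the snippet degrades gracefully when the library is missing.

```python
def quick_demo(text):
    """Augment `text` with one augmenter and with a two-step flow.

    Returns None when nlpaug is not installed (illustrative guard only).
    """
    try:
        import nlpaug.augmenter.char as nac
        import nlpaug.augmenter.word as naw
        import nlpaug.flow as naf
    except ImportError:
        return None

    # A single augmenter: simulate keyboard typos.
    keyboard_aug = nac.KeyboardAug()
    typo_text = keyboard_aug.augment(text)

    # A flow: insert random characters, then delete random words.
    flow = naf.Sequential([
        nac.RandomCharAug(action="insert"),
        naw.RandomWordAug(action="delete"),
    ])
    flow_text = flow.augment(text)
    return typo_text, flow_text


result = quick_demo("The quick brown fox jumps over the lazy dog")
if result is None:
    print("nlpaug is not installed")
else:
    print("KeyboardAug:", result[0])
    print("Sequential: ", result[1])
```

The output is random, so it differs between runs. Depending on the installed release, `augment` may return a list of strings rather than a single string.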

Augmenter

| Type | Target | Augmenter | Action | Description |
|:---:|:---:|:---:|:---:|:---|
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |
| Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
| Textual | Word | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
| Textual | Word | SpellingAug | substitute | Substitute words according to a spelling mistake dictionary |
| Textual | Word | SplitAug | split | Split one word into two words randomly |
| Textual | Word | SynonymAug | substitute | Substitute similar words according to WordNet/PPDB synonyms |
| Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fastText embeddings to apply augmentation |
| Textual | Word | BackTranslationAug | substitute | Leverage two translation models for augmentation |
| Textual | Word | ReservedAug | substitute | Replace reserved words |
| Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
| Textual | Sentence | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
| Textual | Sentence | LambadaAug | substitute | Use a language model to generate text, then a classification model to retain high-quality results |
| Signal | Audio | CropAug | delete | Delete a segment of audio |
| Signal | Audio | LoudnessAug | substitute | Adjust the volume of audio |
| Signal | Audio | MaskAug | substitute | Mask a segment of audio |
| Signal | Audio | NoiseAug | substitute | Inject noise |
| Signal | Audio | PitchAug | substitute | Adjust the pitch of audio |
| Signal | Audio | ShiftAug | substitute | Shift the time dimension forward or backward |
| Signal | Audio | SpeedAug | substitute | Adjust the speed of audio |
| Signal | Audio | VtlpAug | substitute | Change vocal tract |
| Signal | Audio | NormalizeAug | substitute | Normalize audio |
| Signal | Audio | PolarityInverseAug | substitute | Swap positive and negative values of audio |
| Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Signal | Spectrogram | LoudnessAug | substitute | Adjust volume |
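
To make "keyboard distance error" concrete, the idea behind KeyboardAug can be sketched in plain Python. This is not nlpaug's implementation: the neighbour map `NEIGHBOURS` covers only a handful of QWERTY keys, and the function `keyboard_substitute` with its `prob` and `seed` parameters is hypothetical, chosen only to show a character-level substitute action.

```python
import random

# Partial QWERTY neighbour map (assumption: a real map would cover every key).
NEIGHBOURS = {
    "a": "qwsz",
    "e": "wrsd",
    "i": "uojk",
    "o": "ipkl",
    "s": "awedxz",
    "t": "rygf",
}


def keyboard_substitute(text, prob=0.3, seed=None):
    """Replace some characters with an adjacent key to mimic typing slips."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = NEIGHBOURS.get(ch.lower())
        # Only keys present in the map can be substituted.
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


print(keyboard_substitute("data augmentation is easy", prob=0.5, seed=0))
```

The augmented string keeps the original length, which is what distinguishes a substitute action from insert or delete.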

Flow

| Type | Flow | Description |
|:---:|:---:|:---|
| Pipeline | Sequential | Apply a list of augmentation functions sequentially |
| Pipeline | Sometimes | Apply some augmentation functions randomly |
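
The difference between the two flows can be sketched without the library. This is a plain-Python illustration, not nlpaug's code: the functions `sequential` and `sometimes`, the toy augmenters, and the assumption that "randomly" means each step is applied independently with probability `p` are all introduced here for the example.

```python
import random


def sequential(augmenters):
    """Build a flow that applies every augmenter in order."""
    def flow(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return flow


def sometimes(augmenters, p=0.5, seed=None):
    """Build a flow that applies each augmenter with probability p."""
    rng = random.Random(seed)

    def flow(text):
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return flow


# Two toy augmenters standing in for real ones.
def drop_last_word(text):
    return " ".join(text.split()[:-1])


def shout(text):
    return text.upper()


always = sequential([drop_last_word, shout])
maybe = sometimes([drop_last_word, shout], p=0.5, seed=42)

print(always("the quick brown fox"))  # THE QUICK BROWN
print(maybe("the quick brown fox"))   # depends on the random draws
```

In nlpaug itself the corresponding classes are `Sequential` and `Sometimes` in `nlpaug.flow`, which take a list of augmenter objects rather than plain functions.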

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install numpy requests nlpaug

Or install the latest version (including beta features) directly from GitHub:

pip install numpy git+https://github.com/makcedward/nlpaug.git

Or install with conda:

conda install -c makcedward nlpaug

If you use BackTranslationAug, ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well:

pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece

If you use LambadaAug, install the following dependency as well:

pip install "simpletransformers>=0.61.10"

If you use AntonymAug or SynonymAug, install the following dependency as well:

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fastText), download a pre-trained model first and install the following dependency as well:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fastText model

pip install "gensim>=4.1.2"

If you use SynonymAug (PPDB), download the file from the following URL. The augmenter may not run if you obtain the PPDB file from another website:

http://paraphrase.org/#/download

If you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well:

pip install "librosa>=0.9.1" matplotlib
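
Because the optional dependencies differ per augmenter, it can help to check which ones are already present before choosing what to install. The sketch below uses only the standard library. The mapping `OPTIONAL_DEPS` restates the groups listed in this section, and the helper `missing_packages` is a hypothetical name. It checks only that a package can be found, not that its version meets the minimum stated above.

```python
import importlib.util

# Optional packages per augmenter group, as listed in this section.
OPTIONAL_DEPS = {
    "BackTranslationAug, ContextualWordEmbsAug, "
    "ContextualWordEmbsForSentenceAug, AbstSummAug":
        ["torch", "transformers", "sentencepiece"],
    "LambadaAug": ["simpletransformers"],
    "AntonymAug, SynonymAug": ["nltk"],
    "WordEmbsAug": ["gensim"],
    "PitchAug, SpeedAug, VtlpAug": ["librosa", "matplotlib"],
}


def missing_packages(packages):
    """Return the packages that cannot be found on the import path."""
    return [name for name in packages
            if importlib.util.find_spec(name) is None]


for augmenters, packages in OPTIONAL_DEPS.items():
    missing = missing_packages(packages)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{augmenters}: {status}")
```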

Recent Changes

1.1.11 Jul 6, 2022

See changelog for more details.

Extension Reading

Reference

This library uses data (e.g. captured from the internet), research (e.g. the ideas behind each augmenter) and models (e.g. pre-trained models). See data source for more details.

Citation

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={https://github.com/makcedward/nlpaug},
  year={2019}
}

This package is cited by many books, workshops and academic research papers (70+). Here are some examples; you may visit here to get the full list.

Workshops citing nlpaug

Books citing nlpaug

Research papers citing nlpaug

Contributions

<table> <tr> <td align="center"><a href="https://github.com/sakares"><img src="https://avatars.githubusercontent.com/u/1306031" width="100px;" alt=""/><br /><sub><b>sakares saengkaew</b></sub></a><br /></td> <td align="center"><a href="https://github.com/bdalal"><img src="https://avatars.githubusercontent.com/u/3478378?s=400&v=4" width="100px;" alt=""/><br /><sub><b>Binoy Dalal</b></sub></a><br /></td> <td align="center"><a href="https://github.com/emrecncelik"><img src="https://avatars.githubusercontent.com/u/20845117?v=4" width="100px;" alt=""/><br /><sub><b>Emrecan Çelik</b></sub></a><br /></td> </tr> </table>