Corrupt an input text to test NLP models' robustness.
For details refer to https://nlp-demo.readthedocs.io
Installation
pip install wild-nlp
Supported aspects
Altogether, we defined and implemented 11 aspects of text corruption.
-
Articles
Randomly removes articles or replaces them with incorrect ones.
-
Digits2Words
Converts numbers into words. Handles floating-point numbers as well.
-
Misspellings
Misspells words appearing in the Wikipedia lists of:
- commonly misspelled English words
- homophones
-
Punctuation
Randomly adds or removes specified punctuation marks.
-
QWERTY
Simulates errors made while writing on a QWERTY-type keyboard.
-
RemoveChar
Randomly removes:
- characters from words or
- white spaces from sentences
-
SentimentMasking
Replaces a single random character with, for example, an asterisk in:
- negative or
- positive words from Opinion Lexicon:
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
-
Swap
Randomly swaps two characters within a word, excluding punctuation.
-
Change char
Randomly changes characters according to a chosen dictionary. The default is 'ocr', which simulates simple OCR errors.
-
White spaces
Randomly adds or removes white spaces (listed as a parameter).
-
Sub string
Randomly adds a substring to simulate more complex signs.

All aspects can be chained together with the wildnlp.aspects.utils.compose function.
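To make the idea of an aspect more concrete, here is a minimal, self-contained sketch of a Swap-style corruption in plain Python. It does not use or reproduce the wildnlp implementation: the names `swap_two_chars` and `corrupt_sentence`, and the `seed` parameter, are hypothetical and exist only for this illustration.

```python
import random
import string


def swap_two_chars(word, rng):
    """Swap two randomly chosen non-punctuation characters in `word`.

    Hypothetical helper for illustration only; not part of wildnlp.
    """
    # Indices of characters that are allowed to move (punctuation stays put).
    movable = [i for i, ch in enumerate(word) if ch not in string.punctuation]
    if len(movable) < 2:
        return word  # too short to swap anything
    i, j = rng.sample(movable, 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def corrupt_sentence(sentence, seed=None):
    """Apply the swap to every whitespace-separated token of `sentence`."""
    rng = random.Random(seed)  # a fixed seed makes the corruption repeatable
    return " ".join(swap_two_chars(token, rng) for token in sentence.split())


print(corrupt_sentence("Hello, world!", seed=0))
```

The output has the same length as the input and keeps punctuation in place, so only the letters inside each token are disturbed. Passing a fixed `seed` is a design choice of this sketch that keeps the corrupted test set reproducible between runs.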
Supported datasets
Aspects can be applied to any text. Below is the list of datasets for which we have already implemented processing pipelines.
-
CoNLL
The CoNLL-2003 shared task data for language-independent named entity recognition.
-
IMDB
The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50 000 reviews in two classes, negative and positive.
-
SNLI
The SNLI dataset supporting the task of natural language inference.
-
SQuAD
The SQuAD dataset for the Machine Comprehension problem.
Usage
from wildnlp.aspects.dummy import Reverser, PigLatin
from wildnlp.aspects.utils import compose
from wildnlp.datasets import SampleDataset
# Create a dataset object and load the dataset
dataset = SampleDataset()
dataset.load()
# Create a composed corruptor function.
# Functions will be applied in the order in which they appear.
composed = compose(Reverser(), PigLatin())
# Apply the function to the dataset
modified = dataset.apply(composed)
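The ordering rule stated in the comments above (functions are applied in the order they appear) can be illustrated with a small stand-alone sketch. `compose_in_order` is a hypothetical stand-in written for this example and is not the source of wildnlp.aspects.utils.compose. The two toy functions `reverse` and `shout` are also made up.

```python
from functools import reduce


def compose_in_order(*functions):
    """Return a function that applies `functions` from left to right."""
    def composed(text):
        # Feed the result of each function into the next one.
        return reduce(lambda value, fn: fn(value), functions, text)
    return composed


def reverse(text):
    """Toy corruption: reverse the string."""
    return text[::-1]


def shout(text):
    """Toy corruption: upper-case the string and append '!'."""
    return text.upper() + "!"


reverse_then_shout = compose_in_order(reverse, shout)
shout_then_reverse = compose_in_order(shout, reverse)

print(reverse_then_shout("abc"))  # CBA!
print(shout_then_reverse("abc"))  # !CBA
```

The two results differ, which shows that the order of the arguments matters when corruptions are chained.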
Acknowledgments
Adam Slucki and Dominika Basaj were financially supported by grant POIR.01.01.01-00-0328/17-01. Przemyslaw Biecek was financially supported by NCN Opus grant 2017/27/B/ST6/0130.