PreNLP


Preprocessing Library for Natural Language Processing

Installation

Requirements

With pip

prenlp can be installed using pip as follows:

pip install prenlp

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets are stored in the ./.data directory.

|Dataset|Language|Articles|Sentences|Tokens|Vocab|Size|
|---|---|---|---|---|---|---|
|WikiText-2|English|720|-|2,551,843|33,278|13.3MB|
|WikiText-103|English|28,595|-|103,690,236|267,735|517.4MB|
|WikiText-ko|Korean|477,946|2,333,930|131,184,780|662,949|667MB|
|NamuWiki-ko|Korean|661,032|16,288,639|715,535,778|1,130,008|3.3GB|
|WikiText-ko+NamuWiki-ko|Korean|1,138,978|18,622,569|846,720,558|1,360,538|3.95GB|

General use cases are as follows:

WikiText-2 / WikiText-103
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='
IMDB
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']
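The examples above show that a prenlp dataset object reports its number of splits via `len()` and can be unpacked directly into those splits. The toy class below is an illustrative sketch of that interface shape only (the class and example strings are made up, not prenlp's actual implementation):

```python
# Minimal sketch of a dataset object that can be unpacked into splits,
# mirroring the WikiText2 usage above. Names and data are illustrative.
class ToyDataset:
    def __init__(self):
        # Each split is simply a list of examples.
        self.splits = [
            ['= Heading =', 'a training sentence'],  # train
            ['a validation sentence'],               # valid
            ['a test sentence'],                     # test
        ]

    def __len__(self):
        # Number of splits, matching len(wikitext2) == 3 above.
        return len(self.splits)

    def __iter__(self):
        # Iterating over the splits enables tuple unpacking.
        return iter(self.splits)

dataset = ToyDataset()
train, valid, test = dataset
print(len(dataset))  # 3
print(train[0])      # = Heading =
```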

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

URL, HTML tag, emoji, email address, phone number, image file name, etc.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello šŸ¤©, I love you šŸ’“ !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'
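Under the hood, this style of normalization amounts to replacing regex matches with placeholder tokens. The sketch below illustrates the idea for URLs and email addresses only; the patterns are deliberate simplifications and not prenlp's actual regexes:

```python
import re

# Illustrative, simplified patterns -- not prenlp's actual ones.
URL_RE = re.compile(r'https?://\S+')
EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b')

def normalize(text, url_repl='[URL]', email_repl='[EMAIL]'):
    # Substitute each match with its placeholder token.
    text = URL_RE.sub(url_repl, text)
    text = EMAIL_RE.sub(email_repl, text)
    return text

print(normalize('Visit this link for more details: https://github.com/'))
# Visit this link for more details: [URL]
print(normalize('Contact me at lyeoni.g@gmail.com'))
# Contact me at [EMAIL]
```

Replacing such spans with fixed placeholders keeps rare, high-entropy strings (URLs, addresses) from bloating the vocabulary while preserving the fact that something of that type occurred.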

Tokenizer

Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.

SentencePiece, NLTKMosesTokenizer, Mecab

SentencePiece

>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['ā–Time', 'ā–is', 'ā–the', 'ā–most', 'ā–valuable', 'ā–thing', 'ā–a', 'ā–man', 'ā–can', 'ā–spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['ā–Time', 'ā–is', 'ā–the', 'ā–most', 'ā–valuable', 'ā–thing', 'ā–a', 'ā–man', 'ā–can', 'ā–spend', '.']
>>> tokenizer.detokenize(['ā–Time', 'ā–is', 'ā–the', 'ā–most', 'ā–valuable', 'ā–thing', 'ā–a', 'ā–man', 'ā–can', 'ā–spend', '.'])
'Time is the most valuable thing a man can spend.'
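The `ā–` prefix in the tokens above is SentencePiece's word-boundary marker (U+2581), which encodes a preceding space. Detokenization is therefore just concatenation followed by marker replacement, as this sketch shows (illustrative only, not prenlp's code):

```python
# SentencePiece marks a preceding space with U+2581 ('\u2581'), so
# detokenization is: join tokens, turn markers back into spaces, strip.
def detokenize(tokens, marker='\u2581'):
    return ''.join(tokens).replace(marker, ' ').strip()

tokens = ['\u2581Time', '\u2581is', '\u2581the', '\u2581most', '\u2581valuable',
          '\u2581thing', '\u2581a', '\u2581man', '\u2581can', '\u2581spend', '.']
print(detokenize(tokens))  # Time is the most valuable thing a man can spend.
```

Because spaces are encoded in the tokens themselves, this round trip is lossless, which is why subword tokenizers like SentencePiece can detokenize without language-specific rules.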

Moses tokenizer

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']
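In contrast to subword tokenization, the Moses tokenizer is rule-based: it splits on whitespace and separates punctuation into its own tokens. The one-liner below is a rough sketch of that behavior, not NLTK's actual Moses implementation:

```python
import re

# Rough sketch of rule-based word tokenization: runs of word characters,
# or single non-space punctuation characters, become tokens.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize('Time is the most valuable thing a man can spend.'))
# ['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']
```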

Comparisons with tokenizers on IMDb

The figure below shows the classification accuracy obtained with the various tokenizers.

<p align="center"> <img width="700" src="https://raw.githubusercontent.com/lyeoni/prenlp/master/images/tokenizer_comparison_IMDb.png" align="middle"> </p>

Comparisons with tokenizers on NSMC (Korean IMDb)

The figure below shows the classification accuracy obtained with the various tokenizers.

<p align="center"> <img width="700" src="https://raw.githubusercontent.com/lyeoni/prenlp/master/images/tokenizer_comparison_NSMC.png" align="middle"> </p>

Author