ru_punkt
Russian language support for NLTK's PunktSentenceTokenizer
ru_punkt has been part of nltk_data since 2019-07-04.
Installation
- Install the NLTK Python package:
pip install nltk
- Download the Punkt data:
import nltk
nltk.download('punkt')
<!--
3. Download _ru_punkt_:
```bash
git clone https://github.com/Mottl/ru_punkt.git
```
4. Copy Russian tokenizer into `nltk_data` folder (ensure the appropriate location for your OS):
```bash
cp -r ru_punkt/nltk_data/* ~/nltk_data
```
-->
Usage
import nltk
text = "Ай да А.С. Пушкин! Ай да сукин сын!"
print("Before:", nltk.sent_tokenize(text))
print("After:", nltk.sent_tokenize(text, language="russian"))
Output:
Before: ['Ай да А.С.', 'Пушкин!', 'Ай да сукин сын!']
After: ['Ай да А.С. Пушкин!', 'Ай да сукин сын!']
Training data
Training data for the sentence tokenizer came from three sources:
- Articles from Russian Wikipedia (about 1 million sentences);
- Common Russian abbreviations from the Russian orthographic dictionary edited by V. V. Lopatin;
- Generated name initials.
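The README does not say how the name initials were generated. As a rough illustration only, the sketch below builds one- and two-letter initials in the form Punkt expects for abbreviation types (lowercased, without the final period). The `LETTERS` alphabet and the `generate_initials` helper are assumptions made for this example, not part of ru_punkt.

```python
from itertools import product

# Assumption: every Cyrillic letter except ъ, ы and ь may appear as an initial.
LETTERS = "абвгдеёжзийклмнопрстуфхцчшщэюя"


def generate_initials(letters=LETTERS):
    """Return abbreviation types for one- and two-letter name initials."""
    # Punkt looks up a period-final token lowercased and without its last
    # period, so "А.С." has to be stored as "а.с" and "А." as "а".
    singles = set(letters)
    pairs = {f"{first}.{second}" for first, second in product(letters, repeat=2)}
    return singles | pairs


initials = generate_initials()
print(len(initials), "а.с" in initials)
```

A set like this could then be merged with abbreviations collected from other sources before being assigned to the tokenizer parameters.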
Implementation notes
After some experimentation, the tokenizer was found to perform better with `params.abbrev_types` alone than with `params.abbrev_types` combined with `params.collocations` and `params.ortho_context`, so the latter two were removed from the trained tokenizer.
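To make the abbreviation-only setup concrete, here is a minimal sketch that builds a Punkt tokenizer from empty parameters and fills in nothing but `abbrev_types`. The single-entry abbreviation set is hypothetical and chosen to match the Usage example; the shipped Russian model contains a far larger list.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Empty parameters: collocations, sentence starters and orthographic
# context are all left unpopulated.
params = PunktParameters()

# Hypothetical one-entry abbreviation set. Punkt compares a period-final
# token lowercased and without its last period, so "А.С." is stored as "а.с".
params.abbrev_types = {"а.с"}

# Passing a PunktParameters object instead of training text uses it as-is.
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize("Ай да А.С. Пушкин! Ай да сукин сын!")
print(sentences)
```

With this one abbreviation in place, the initials should no longer be treated as a sentence end, which is the same effect the Usage section shows for `language="russian"`.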