AI4Bharat-IndicNLP Dataset

The AI4Bharat-IndicNLP dataset is an ongoing effort to create a collection of large-scale, general-domain corpora for Indian languages. Currently, it contains 2.7 billion words for 10 Indian languages from two language families. We also share pre-trained word embeddings trained on these corpora, along with news article category classification datasets for 9 languages to evaluate them. The IndicNLP embeddings are evaluated on multiple tasks.

You can read details regarding the corpus and other resources HERE. We showcased the AI4Bharat-IndicNLP dataset at RepL4NLP 2020 (co-located with ACL 2020) as a non-archival extended abstract. You can watch the talk here: VIDEO.

You can use the IndicNLP corpus and embeddings for multiple Indian language tasks. A comprehensive list of Indian language NLP resources can be found in the IndicNLP Catalog. For processing Indian-language text, you can use the Indic NLP Library.

Text Corpora

Text corpora for 12 languages:

| Language | # News Articles* | Sentences | Tokens | Link |
| -------- | ---------------- | --------- | ------ | ---- |
| as       | 0.60M            | 1.39M     | 32.6M  | link |
| bn       | 3.83M            | 39.9M     | 836M   | link |
| en       | 3.49M            | 54.3M     | 1.22B  | link |
| gu       | 2.63M            | 41.1M     | 719M   | link |
| hi       | 4.95M            | 63.1M     | 1.86B  | link |
| kn       | 3.76M            | 53.3M     | 713M   | link |
| ml       | 4.75M            | 50.2M     | 721M   | link |
| mr       | 2.31M            | 34.0M     | 551M   | link |
| or       | 0.69M            | 6.94M     | 107M   | link |
| pa       | 2.64M            | 29.2M     | 773M   | link |
| ta       | 4.41M            | 31.5M     | 582M   | link |
| te       | 3.98M            | 47.9M     | 674M   | link |

Pre-requisites

To replicate the results reported in the paper, training and evaluation scripts are provided.

To run these scripts, the following tools/packages are required:

For the Python packages to install, see requirements.txt.

Word Embeddings

DOWNLOAD

Version 1

| language | pa   | hi   | bn   | or   | gu   | mr   | kn   | te   | ml   | ta   |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| vectors  | link | link | link | link | link | link | link | link | link | link |
| model    | link | link | link | link | link | link | link | link | link | link |

Training word embeddings

$FASTTEXT_HOME/build/fasttext skipgram \
	-epoch 10 -thread 30 -ws 5 -neg 10    -minCount 5 -dim 300 \
	-input $mono_path \
	-output $output_emb_prefix 
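
The resulting `.vec` file is plain text: a header line with the vocabulary size and dimension, then one word per line followed by its vector components. A minimal sketch of parsing that format (the sample content here is invented for illustration, not taken from the real embeddings):

```python
def load_vec(lines):
    """Parse the fastText .vec text format: header '<n> <dim>',
    then '<word> <v1> ... <vdim>' on each subsequent line."""
    n, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim  # sanity-check against the header
        vectors[word] = vec
    assert len(vectors) == n
    return vectors

# Toy 3-dimensional example in the same layout as a real .vec file.
sample = [
    "2 3",
    "नमस्ते 0.1 0.2 0.3",
    "धन्यवाद 0.4 0.5 0.6",
]
emb = load_vec(sample)
```

In practice you would read the lines from the downloaded (possibly gzipped) vectors file instead of an in-memory list.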

Evaluation on word similarity task

Evaluate on the IIIT-H Word Similarity Database: DOWNLOAD

The link above points to a cleaned version of the same database, originally found HERE.

Evaluation Command

python scripts/word_similarity/wordsim.py \
	<embedding_file_path> \
	<word_sim_db_path> \
	<max_vocab>
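
The standard protocol behind this evaluation is to compute the cosine similarity between the vectors of each word pair and report the Spearman correlation against the human similarity judgments. A self-contained sketch of that protocol (the vectors and scores below are toy values, not the IIIT-H database):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, enough for a sketch)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy embeddings and human-annotated pair scores (0-10 scale).
emb = {"राजा": [1.0, 0.1], "रानी": [0.9, 0.2], "सेब": [0.0, 1.0]}
pairs = [("राजा", "रानी", 9.0), ("राजा", "सेब", 1.5), ("रानी", "सेब", 2.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho = spearman(model_scores, human_scores)
```

Here the model ranks the pairs in the same order as the annotators, so the correlation is 1.0.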

Evaluation on word analogy task

Evaluate on the Facebook word analogy dataset.

Evaluation Command

First, add the MUSE root directory to the Python path:

export PYTHONPATH=$PYTHONPATH:$MUSE_PATH
python  scripts/word_analogy/word_analogy.py \
    --analogy_fname <analogy_fname> \
    --embeddings_path <embedding_file_path> \
    --lang 'hi' \
    --emb_dim 300 \
    --cuda
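
Word analogy benchmarks of this kind are commonly scored with the 3CosAdd rule: for a query "a is to b as c is to ?", return the vocabulary word closest to b - a + c, excluding the three query words. A toy sketch of that rule (illustrative vectors only; the actual `word_analogy.py` implementation may differ in details):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(emb, a, b, c):
    """3CosAdd: the word maximizing cos(d, b - a + c), excluding a, b, c."""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = cosine(vec, target)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-d embeddings arranged so the gender offset is consistent.
emb = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, -1.0],
}
result = analogy(emb, "man", "woman", "king")  # expect "queen"
```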

IndicNLP News Article Classification Dataset

We used the IndicNLP text corpora to create classification datasets comprising news articles and their categories for 9 languages. The dataset is balanced across classes. The following table contains the statistics of our dataset:

| Language  | Classes                                     | Articles per Class |
| --------- | ------------------------------------------- | ------------------ |
| Bengali   | entertainment, sports                       | 7K                 |
| Gujarati  | business, entertainment, sports             | 680                |
| Kannada   | entertainment, lifestyle, sports            | 10K                |
| Malayalam | business, entertainment, sports, technology | 1.5K               |
| Marathi   | entertainment, lifestyle, sports            | 1.5K               |
| Oriya     | business, crime, entertainment, sports      | 7.5K               |
| Punjabi   | business, entertainment, sports, politics   | 780                |
| Tamil     | entertainment, politics, sport              | 3.9K               |
| Telugu    | entertainment, business, sports             | 8K                 |

DOWNLOAD

Evaluation Command

python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>
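
A common recipe for evaluating word embeddings on such datasets is to represent each article by the average of its word vectors and train a simple classifier on top. A toy nearest-centroid sketch of that recipe (all embeddings, tokens, and labels below are invented; `txtcls.py` itself may use a different classifier):

```python
import math

def avg_vector(tokens, emb):
    """Average the embeddings of in-vocabulary tokens."""
    dim = len(next(iter(emb.values())))
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def fit_centroids(train, emb):
    """train: list of (tokens, label). Returns label -> centroid of doc vectors."""
    by_label = {}
    for tokens, label in train:
        by_label.setdefault(label, []).append(avg_vector(tokens, emb))
    return {lab: [sum(v[i] for v in vs) / len(vs) for i in range(len(vs[0]))]
            for lab, vs in by_label.items()}

def predict(tokens, centroids, emb):
    doc = avg_vector(tokens, emb)
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda lab: dist(doc, centroids[lab]))

# Toy embeddings: one axis for sports words, one for entertainment words.
emb = {"goal": [1.0, 0.0], "match": [0.9, 0.1], "film": [0.0, 1.0], "actor": [0.1, 0.9]}
train = [(["goal", "match"], "sports"), (["film", "actor"], "entertainment")]
centroids = fit_centroids(train, emb)
pred = predict(["match", "goal", "goal"], centroids, emb)  # expect "sports"
```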

Publicly available Classification Datasets

We also evaluated the IndicNLP embeddings on many publicly available classification datasets.

We have created standard train, validation, and test splits for the above-mentioned datasets. You can download them to evaluate your embeddings.

DOWNLOAD

Evaluation Command

To evaluate your embeddings on the above-mentioned datasets, first download them and then run the following command:

python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>

License

Each of these datasets is available under the original license of the corresponding public dataset.

Morphanalyzers

IndicNLP Morphanalyzers are unsupervised morphological analyzers trained with Morfessor.

DOWNLOAD

Version 1

pa, hi, bn, or, gu, mr, kn, te, ml, ta

Training Command

## extract vocabulary from embeddings file
zcat $embedding_vectors_path |  \
    tail -n +2 | \
    cut -f 1 -d ' '  > $vocab_file_path

## train morfessor 
morfessor-train -d ones \
        -S $model_file_path \
        --logfile  $log_file_path \
        --traindata-list $vocab_file_path \
        --max-epoch 10 
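
The vocabulary-extraction step above (`zcat | tail | cut`) can also be done in Python. A minimal sketch, using an in-memory gzipped file in place of the real embeddings archive:

```python
import gzip
import io

def extract_vocab(fileobj):
    """Skip the fastText .vec header line ('<vocab_size> <dim>'),
    then keep the first space-separated field of each remaining line."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        next(f)  # drop the header, like `tail -n +2`
        return [line.split(" ", 1)[0] for line in f]

# Build a tiny gzipped .vec file in memory for illustration.
raw = "2 3\nनमस्ते 0.1 0.2 0.3\nधन्यवाद 0.4 0.5 0.6\n".encode("utf-8")
vocab = extract_vocab(io.BytesIO(gzip.compress(raw)))
```

With a real archive you would pass the path of `$embedding_vectors_path` to `gzip.open` directly and write the result to `$vocab_file_path`.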

Citing

If you are using any of the resources, please cite the following article:

@article{kunchukuttan2020indicnlpcorpus,
    title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
    author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    journal={arXiv preprint arXiv:2005.00085},
}

We would like to hear from you if:

License

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Dataset" property="dct:title" rel="dct:type">IndicNLP Corpus</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Contributors

This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.

Contact