:bookmark: The Indic NLP Catalog

A Collaborative Catalog of Resources for Indic Language NLP

The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.

:+1: Featured Resources

Indian language NLP has come a long way. We feature a few resources that illustrate recent trends along various axes and point to a bright future.

Browse the entire catalog...

:raising_hand: Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

<!-- vscode-markdown-toc --> <!-- vscode-markdown-toc-config numbering=false autoSave=true /vscode-markdown-toc-config --> <!-- /vscode-markdown-toc -->

<a name='MajorIndicLanguageNLPRepositories'></a>Major Indic Language NLP Repositories

<a name='Libraries'></a>Libraries and Tools

<a name='Benchmarks'></a>Evaluation Benchmarks

Benchmarks spanning multiple tasks.

<a name='Standards'></a>Standards

<a name='TextCorpora'></a>Text Corpora

<a name='MonolingualCorpus'></a>Monolingual Corpus

<a name='LanguageIdentification'></a>Language Identification

<a name='LexicalResources'></a>Lexical Resources and Semantic Similarity

<a name='NERCorpora'></a>NER Corpora

<a name='ParallelTranslationCorpus'></a>Parallel Translation Corpus

<a name='MTEvaluation'></a>MT Evaluation

<a name='ParallelTransliterationCorpus'></a>Parallel Transliteration Corpus

<a name='TextualClassification'></a>Text Classification

<a name='TextualEntailment'></a>Textual Entailment/Natural Language Inference

<a name='Paraphrase'></a>Paraphrase

<a name='SentimentAnalysis'></a>Sentiment, Sarcasm, Emotion Analysis

<a name='HateSpeech'></a>Hate Speech and Offensive Comments

<a name='QuestionAnswering'></a>Question Answering

<a name='Dialog'></a>Dialog

<a name='Discourse'></a>Discourse

<a name='InformationExtraction'></a>Information Extraction

<a name='POSTaggedcorpus'></a>POS Tagged Corpus

<a name='ChunkCorpus'></a>Chunk Corpus

<a name='DependencyParseCorpus'></a>Dependency Parse Corpus

<a name='CoreferenceCorpus'></a>Coreference Corpus

<a name='Summarization'></a>Summarization

<a name='DatatoText'></a>Data to Text

<a name='Models'></a>Models

<a name='LIDModels'></a>Language Identification

<a name='WordEmbeddings'></a>Word Embeddings

<a name='PreTrainedLanguageModels'></a>Pre-trained Language Models

<a name='MultilingualWordEmbeddings'></a>Multilingual Word Embeddings

<a name='Morphanalyzers'></a>Morphanalyzers

<a name='TranslationModels'></a>Translation Models

<a name='TransliterationModels'></a>Transliteration Models

<a name='SpeechModels'></a>Speech Models

<a name='NER'></a>NER

<a name='SpeechCorpora'></a>Speech Corpora

<a name='OCRCorpora'></a>OCR Corpora

<a name='MultimodalCorpora'></a>Multimodal Corpora

<a name='LanguageSpecificCatalogs'></a>Language Specific Catalogs

Pointers to language-specific NLP resource catalogs.