Home

Awesome

IndicTransToolkit

The goal of this repository is to provide a simple, modular, and extendable toolkit for IndicTrans2 and be compatible with the HuggingFace models released.

Minor Update (v1.0.2)

Major Update (v1.0.0)

Pre-requisites

Configuration

git clone https://github.com/VarunGumma/IndicTransToolkit
cd IndicTransToolkit

pip install --editable ./

Examples

For the training usecase, please refer here.

PreTainedTokenizer

import torch
from IndicTransToolkit import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on newemail123@xyz.com by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['рдпрд╣ рдПрдХ рдкрд░реАрдХреНрд╖рдг рд╡рд╛рдХреНрдп рд╣реИред', 'рдпрд╣ рдПрдХ рдФрд░ рд▓рдВрдмрд╛ рдЕрд▓рдЧ рдкрд░реАрдХреНрд╖рдг рд╡рд╛рдХреНрдп рд╣реИред', 'рдХреГрдкрдпрд╛ 9876543210 рдкрд░ рдПрдХ рдПрд╕. рдПрдо. рдПрд╕. рднреЗрдЬреЗрдВ рдФрд░ 15 рдЕрдХреНрдЯреВрдмрд░, 2023 рддрдХ newemail123@xyz.com рдкрд░ рдПрдХ рдИрдореЗрд▓ рднреЗрдЬреЗрдВред']

Evaluation

from IndicTransToolkit import IndicEvaluator

# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file) 

# alternatively, you can pass the list of predictions and references instead of files 
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)

Batching

ip = IndicProcessor(inference=True)

for batch in ip.get_batches(source_sentences, batch_size=32):
    # perform necessary operations on the batch
    # ... pre-processing
    # ... tokenization 
    # ... generation 
    # ... decoding

Authors

Bugs and Contribution

Since this a bleeding-edge module, you may encounter broken stuff and import issues once in a while. In case you encounter any bugs or want additional functionalities, please feel free to raise Issues/Pull Requests or contact the authors.

Citation

If you use our codebase, or models, please do cite the following paper:

@article{
    gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA},
    note={}
}