<br />
<div align="center">
  <h2 align="center">PLUE: Portuguese Language Understanding Evaluation</h2>
  <img src="https://user-images.githubusercontent.com/28462295/140660705-e39c001f-e311-4024-aa7a-a7e1c69268fc.png" alt="https://fairytail.fandom.com/wiki/Plue" width="250">
  <br />
  <img alt="GitHub release (latest by date)" src="https://img.shields.io/github/v/release/ju-resplande/PLUE">
  <img alt="GitHub" src="https://img.shields.io/github/license/ju-resplande/PLUE">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ju-resplande/PLUE?style=social">
  <p align="center">
    <b>
      Portuguese translation of the <a href="https://gluebenchmark.com/">GLUE benchmark</a>, <a href=https://nlp.stanford.edu/projects/snli/>SNLI</a>, and <a href=https://allenai.org/data/scitail>SciTail</a>
      <br />
      using the <a href=https://github.com/Helsinki-NLP/OPUS-MT>OPUS-MT model</a> and <a href=https://cloud.google.com/translate/docs>Google Cloud Translation</a>.
    </b>
  </p>
</div>

## Getting Started
| Datasets | Translation Tool |
| --- | --- |
| CoLA, MRPC, RTE, SST-2, STS-B, and WNLI | Google Cloud Translation |
| SNLI, MNLI, QNLI, QQP, and SciTail | OPUS-MT |
## Usage

### Datasets :hugs:
```python
from datasets import load_dataset

data = load_dataset("dlb/plue", "cola")

# Available configurations:
# ['cola', 'sst2', 'mrpc', 'qqp_v2', 'stsb', 'snli', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'qnli_v2', 'rte', 'wnli', 'scitail']
```
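Once loaded, the object is a regular :hugs: Datasets `DatasetDict`, so it can be inspected directly. A minimal sketch, assuming the splits follow the usual GLUE naming (`train`, `validation`, `test`):

```python
# Inspect the loaded DatasetDict (split names assumed to follow the GLUE convention)
print(data)                    # available splits and number of rows
print(data["train"][0])        # first training example as a plain dict
print(data["train"].features)  # column names and feature types
```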
### Manual download (for large files)
Larger files are not hosted in the GitHub repository and must be downloaded via one of the options below.
- DVC integration

  ```bash
  $ pip install dvc
  $ dvc pull datasets/SNLI/train_raw.tsv
  $ dvc pull datasets/SNLI/train.tsv
  $ dvc pull datasets/MNLI/train.tsv
  $ dvc pull pairs/QQP.json
  ```

- ZIP links
## Structure
```
├── code ____________ # translation code and dependency parsing
├── datasets
│   ├── CoLA
│   ├── MNLI
│   ├── MRPC
│   ├── QNLI
│   ├── QNLI_v2
│   ├── QQP_v2
│   ├── RTE
│   ├── SciTail
│   │   └── tsv_format
│   ├── SNLI
│   ├── SST-2
│   ├── STS-B
│   └── WNLI
└── pairs ____________ # translation pairs as JSON dictionary
```
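The translation pairs can be read as a plain Python dictionary. A minimal sketch, assuming `pairs/QQP.json` maps source sentence to translated sentence (the key/value direction is an assumption, so inspect a few entries first):

```python
import json

# Load the QQP translation pairs (pulled via DVC, see above).
# Mapping direction (source sentence -> Portuguese translation) is assumed here.
with open("pairs/QQP.json", encoding="utf-8") as f:
    pairs = json.load(f)

for i, (source, translation) in enumerate(pairs.items()):
    print(source, "->", translation)
    if i == 2:  # print only the first three pairs
        break
```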
## Observations
- GLUE provides its data in two versions (a first and a second release). We noticed that the two versions differ only in the QNLI and QQP datasets, so we make QNLI available in both versions and QQP only in the newer one.
- The LX parser, the Binarizer code, and the NLTK word tokenizer were used to produce the dependency parses for the SNLI and MNLI datasets.
- The SNLI train split is a ragged matrix (some rows have an irregular number of fields), so we provide two versions of the data: train_raw.tsv keeps the irregular lines and train.tsv excludes them (see the loading sketch after this list).
- Twelve sentences were translated manually due to translation errors.
- Our translation code is outdated; we recommend using other translation tools instead.
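A minimal loading sketch for the two SNLI train files with pandas, assuming tab-separated values with a header row; the exact column layout is not assumed here, only that train_raw.tsv contains rows with a mismatched number of fields:

```python
import pandas as pd

# train.tsv already excludes the irregular lines, so a plain read works.
train = pd.read_csv("datasets/SNLI/train.tsv", sep="\t", quoting=3)  # quoting=3 -> csv.QUOTE_NONE

# train_raw.tsv keeps the ragged rows; skip them explicitly when loading.
# (on_bad_lines="skip" requires pandas >= 1.3; older versions use error_bad_lines=False)
train_raw = pd.read_csv(
    "datasets/SNLI/train_raw.tsv",
    sep="\t",
    quoting=3,
    on_bad_lines="skip",
)

print(len(train), len(train_raw))
```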
## Citing
```bibtex
@misc{Gomes2020,
  author       = {GOMES, J. R. S.},
  title        = {PLUE: Portuguese Language Understanding Evaluation},
  year         = {2020},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ju-resplande/PLUE}},
  commit       = {e7d01cb17173fe54deddd421dd735920964eb26f}
}
```
## Acknowledgments
- Deep Learning Brasil/CEIA
- Cyberlabs