Awesome
PorSimplesSent
A Portuguese corpus of aligned sentences pairs to investigate sentence readability assessment
NILC
This corpus was created during my master's degree at ICMC-USP, and made possible thanks to the Interinstitutional Center for Computational Linguistics - NILC (Núcleo Interinstitucional de Linguística Computacional), represented by my advisor Dra. Sandra Maria Aluísio and the linguistics specialist Dra. Magali Sanches Duran.
http://www.nilc.icmc.usp.br/nilc/index.php
License
Citation
@inproceedings{leal2018pss,
author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
year = {2018},
pages = {401–413},
month = {August},
date = {20-26},
address = {Santa Fe, New Mexico, USA},
}
TSV format
All files are in Tab Separated Values (TSV) format, it means that fields are separated by tab (Also knows as char(9)
or \t
), and newline (char(10)
or \n
) for the rows.
PorSimples
In this folder you'll find the source corpus used to extract the sentence pairs, already exportaded in TSV format:
porsimples_sentences.tsv
- production_id: Each triplet of texts (original, natural, strong) has an unique id, called production_id.
- level: ORI (1 - Original), level NAT (2 - Natural) or STR (3 - Strong)
- text_id: Unique id for each text.
- sentence_id: Unique id for each sentence.
- paragraph: Sequential id for the paragraph in text.
- sentence_text: The raw text from the sentence.
porsimples_aligns.tsv
- production_id: See porsimples_sentences.tsv.
- level: Simplification level ORI->NAT or NAT->STR.
- text_id_from: Text id from source side of simplification.
- sentence_id_from: sentence id from source side of simplification.
- text_id_to: Text id for target side of simplification.
- sentence_id_to: Sentence id for target side of simplification.
PorSimplesSent (pss)
In this folder are the files with aligned pairs from pss0 to pss3, it all have the same layout:
- production_id: See porsimples_sentences.tsv.
- level: Simplification level ORI->NAT, NAT->STR or ORI->STR.
- changed: If the sentence has changes in this simplification level.
- split: If the sentence suffers split in this simplification level.
- sentence_text_from: The raw text of the source sentence.
- sentence_text_to: The raw text of the target sentence.
pss0 - Split sentences concatenated
Concatenate all resulting split sentences on the right side, may be usefull to study the simplification process.
- pss0_align_concat_ori_nat.tsv
- pss0_align_concat_nat_str.tsv
pss1 - All splits (1 to n)
Repeats left side sentence to each one resulting split
- pss1_align_all_splits_ori_nat.tsv
- pss1_align_all_splits_nat_str.tsv
- pss1_align_all_splits_ori_str.tsv
pss2 - Major Length splits (1 to major(n))
Only the sentence with bigger length and most overlap of tokens. Repeats left side sentence when two resulting split sentences has the same size and overlap.
- pss2_align_length_ori_nat.tsv
- pss2_align_length_nat_str.tsv
- pss2_align_length_ori_str.tsv
pss3 - No split sentences (1 to 1)
Only the sentences that not suffered split.
- pss3_align_no_splits_ori_nat.tsv
- pss3_align_no_splits_nat_str.tsv
- pss3_align_no_splits_ori_str.tsv
PorSimplesSent - Triplets
In the file triplets_length.tsv, are sentences from the 3 levels, generated from the pss2_length pairs, in the following layout:
- production_id: See porsimples_sentences.tsv.
- level: Fixed - ORI->NAT->STR.
- changed_ori_nat: If the sentence has changes from the original to the natural level.
- changed_nat_str: If the sentence has changes from the natural to the strong level.
- original_text: The raw text of the original sentence.
- natural_text: The raw text of the natural sentence.
- strong_text: The raw text of the strong sentence.
Statistics
Total sentences Original: 2907
Zero Hora: 2067
Caderno Ciencia FSP: 840
Total sentences Natural: 4066
Total sentences Strong: 4971
Total sentences ALL: 11944
Total sentences NO SIMPLIFICATION Original->Natural: 565
Total sentences NO SIMPLIFICATION Natural->Strong: 2619
Total sentences SPLIT Original->Natural: 826
Total sentences SPLIT Natural->Strong: 721
Total sentences Natural from split: 1990
Total sentences Strong from split: 1625
Total sentences SIMPLIFIED (no split) Original->Natural: 1515
Total sentences SIMPLIFIED (no split) Natural->Strong: 729
Total pairs simplified Original->Natural: 2340
Total pairs simplified Natural->Strong: 1450
Total pairs simplified Original->Strong: 1101
Total all pairs simplified: 4891
Total triplets NO SIMPLIFICATION 3 Levels: 393
Total triplets Simplified Only Original->Natural: 1297
Total triplets Simplified Only Natural->Strong: 181
Total triplets Simplified 3 Levels: 1099
Total triplets: 2970
Mean token size of sentences - simplified (no split) - Ori->Nat: 20
Min token size of sentences - simplified (no split) - Ori->Nat: 3
Max token size tokens of sentences - simplified (no split) - Ori->Nat: 69
Mean token size of sentences - simplified (with split) - Ori->Nat: 33
Min token size of sentences - simplified (with split) - Ori->Nat: 6
Max token size tokens of sentences - simplified (with split) - Ori->Nat: 54
Mean token size of sentences - simplified (no split) - Nat->Str: 22
Min token size of sentences - simplified (no split) - Nat->Str: 4
Max token size tokens of sentences - simplified (no split) - Nat->Str: 57
Mean token size of sentences - simplified (with split) - Nat->Str: 24
Min token size of sentences - simplified (with split) - Nat->Str: 5
Max token size tokens of sentences - simplified (with split) - Nat->Str: 49
Mean tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 6
Min tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 26
Mean tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 9
Min tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 64
Total PSS1 Original->Natural: 3504
Total PSS1 Natural->Strong: 2353
Total PSS1 Original->Strong: 2052
Total geral PSS1: 7909
Total PSS2 Original->Natural: 2370
Total PSS2 Natural->Strong: 1491
Total PSS2 Original->Strong: 1101
Total geral PSS2: 4962
Total PSS3 Original->Natural: 1515
Total PSS3 Natural->Strong: 729
Total PSS3 Original->Strong: 264
Total geral PSS3: 2508