PorSimplesSent

A Portuguese corpus of aligned sentence pairs to investigate sentence readability assessment

NILC

This corpus was created during my master's degree at ICMC-USP, and made possible thanks to the Interinstitutional Center for Computational Linguistics - NILC (Núcleo Interinstitucional de Linguística Computacional), represented by my advisor Dra. Sandra Maria Aluísio and the linguistics specialist Dra. Magali Sanches Duran.

http://www.nilc.icmc.usp.br/nilc/index.php

License

CC BY 4.0

Citation

@inproceedings{leal2018pss,
    author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
    title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
    year = {2018},
    pages = {401–413},
    month = {August},
    date = {20-26},
    address = {Santa Fe, New Mexico, USA},
}

TSV format

All files are in Tab-Separated Values (TSV) format: fields are separated by a tab (also known as char(9) or \t), and rows are separated by a newline (char(10) or \n).
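As a minimal sketch of reading this format, the snippet below parses tab-separated rows with Python's standard csv module. The helper name read_tsv and the sample values are hypothetical, and it assumes that fields are not wrapped in quotes, which this README does not state.

```python
import csv
import io


def read_tsv(stream):
    """Return the rows of a TSV stream as lists of string fields."""
    # QUOTE_NONE treats quote characters as ordinary text
    # (assumption: the corpus files do not quote their fields).
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Tiny in-memory example; the values are made up for illustration.
sample = 'a\tb "quoted" c\n1\t2\n'
rows = read_tsv(io.StringIO(sample))
print(rows)
```

For a file on disk, the same helper could be called on a handle opened with open(path, encoding="utf-8", newline=""); the UTF-8 encoding is also an assumption.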

PorSimples

In this folder you'll find the source corpus used to extract the sentence pairs, already exported in TSV format:

porsimples_sentences.tsv

porsimples_aligns.tsv

PorSimplesSent (pss)

This folder contains the files with aligned pairs, from pss0 to pss3; they all have the same layout:

pss0 - Split sentences concatenated

Concatenates all resulting split sentences on the right side; may be useful for studying the simplification process.

pss1 - All splits (1 to n)

Repeats the left-side sentence for each resulting split sentence.

pss2 - Major Length splits (1 to major(n))

Keeps only the split sentence with the greatest length and the most token overlap. Repeats the left-side sentence when two resulting split sentences have the same size and overlap.

pss3 - No split sentences (1 to 1)

Only the sentences that were not split.
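The four variants above can be sketched for a single aligned sentence as follows. This is an illustration under stated assumptions, not the script used to build the corpus: the function names are hypothetical, tokens are obtained by whitespace splitting, and pss2 ranks split sentences by the tuple (length, overlap), which is only one possible reading of the description. It also does not filter out sentences that were left unchanged.

```python
def token_overlap(a, b):
    """Number of distinct lowercase tokens shared by two sentences."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def build_pairs(source, splits):
    """Build pss0..pss3 pairs for one left-side sentence.

    source: the left-side sentence.
    splits: the right-side sentences aligned to it (one item if not split).
    """
    def score(s):
        # Assumed ranking: length in tokens first, then token overlap.
        return (len(s.split()), token_overlap(source, s))

    best = max(score(s) for s in splits)
    return {
        # pss0: all split sentences concatenated on the right side.
        "pss0": [(source, " ".join(splits))],
        # pss1: the left-side sentence repeated for every split sentence.
        "pss1": [(source, s) for s in splits],
        # pss2: only the top-ranked split sentence(s); ties are all kept.
        "pss2": [(source, s) for s in splits if score(s) == best],
        # pss3: only sentences that were not split (1 to 1).
        "pss3": [(source, splits[0])] if len(splits) == 1 else [],
    }


# Made-up example sentences, for illustration only.
source = "The cat , which was black , sat on the mat ."
splits = ["The cat sat on the mat .", "The cat was black ."]
pairs = build_pairs(source, splits)
for name, items in pairs.items():
    print(name, len(items))
```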

PorSimplesSent - Triplets

The file triplets_length.tsv contains sentences from the 3 levels, generated from the pss2_length pairs, in the following layout:

Statistics

Total sentences Original: 2907
      Zero Hora: 2067
      Caderno Ciencia FSP: 840
Total sentences Natural: 4066
Total sentences Strong: 4971
Total sentences ALL: 11944

Total sentences NO SIMPLIFICATION Original->Natural: 565
Total sentences NO SIMPLIFICATION Natural->Strong: 2619

Total sentences SPLIT Original->Natural: 826
Total sentences SPLIT Natural->Strong: 721

Total sentences Natural from split: 1990
Total sentences Strong from split: 1625

Total sentences SIMPLIFIED (no split) Original->Natural: 1515
Total sentences SIMPLIFIED (no split) Natural->Strong: 729

Total pairs simplified Original->Natural: 2340
Total pairs simplified Natural->Strong: 1450
Total pairs simplified Original->Strong: 1101
Total all pairs simplified: 4891

Total triplets NO SIMPLIFICATION 3 Levels: 393
Total triplets Simplified Only Original->Natural: 1297
Total triplets Simplified Only Natural->Strong: 181
Total triplets Simplified 3 Levels: 1099
Total triplets: 2970

Mean token size of sentences - simplified (no split) - Ori->Nat: 20
Min token size of sentences - simplified (no split) - Ori->Nat: 3
Max token size of sentences - simplified (no split) - Ori->Nat: 69

Mean token size of sentences - simplified (with split) - Ori->Nat: 33
Min token size of sentences - simplified (with split) - Ori->Nat: 6
Max token size of sentences - simplified (with split) - Ori->Nat: 54

Mean token size of sentences - simplified (no split) - Nat->Str: 22
Min token size of sentences - simplified (no split) - Nat->Str: 4
Max token size of sentences - simplified (no split) - Nat->Str: 57

Mean token size of sentences - simplified (with split) - Nat->Str: 24
Min token size of sentences - simplified (with split) - Nat->Str: 5
Max token size of sentences - simplified (with split) - Nat->Str: 49

Mean tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 6
Min tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 26

Mean tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 9
Min tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 64

Total PSS1 Original->Natural: 3504
Total PSS1 Natural->Strong: 2353
Total PSS1 Original->Strong: 2052
Total overall PSS1: 7909

Total PSS2 Original->Natural: 2370
Total PSS2 Natural->Strong: 1491
Total PSS2 Original->Strong: 1101
Total overall PSS2: 4962

Total PSS3 Original->Natural: 1515
Total PSS3 Natural->Strong: 729
Total PSS3 Original->Strong: 264
Total overall PSS3: 2508
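As a quick sanity check, the totals reported above can be compared against the sum of their parts. The sketch below copies the figures from the statistics; the dictionary name and its keys are hypothetical labels chosen for illustration.

```python
# Each entry: (parts, reported total), copied from the statistics above.
reported = {
    "sentences": ([2907, 4066, 4971], 11944),
    "simplified pairs": ([2340, 1450, 1101], 4891),
    "triplets": ([393, 1297, 181, 1099], 2970),
    "PSS1": ([3504, 2353, 2052], 7909),
    "PSS2": ([2370, 1491, 1101], 4962),
    "PSS3": ([1515, 729, 264], 2508),
}

# Collect every group whose parts do not add up to the reported total.
mismatches = {
    name: (sum(parts), total)
    for name, (parts, total) in reported.items()
    if sum(parts) != total
}
print(mismatches)
```

An empty dictionary means every reported total matches the sum of its parts.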