NLP for 18th-century Portuguese medical texts
This repository accompanies the paper: Zilio, L., Lazzari, R.R., Finatto, M.J.B. (2024) NLP for historical Portuguese: Analysing 18th-century medical texts. In Proceedings of PROPOR 2024.
Repository content:
This is just an overview. Please refer to the paper above to get more information about the content of each folder.
TMX: this folder contains the original and normalised versions of the texts described in the paper, in TMX format (Translation Memory eXchange, an XML-based format)
aligned: this folder contains the results of semi-automatic alignments between original and normalised versions of each file of the corpus
keywords: this folder contains the results from the keyword analysis presented in the paper
parsed: this folder contains the automatically parsed versions of the files (parsing done with Stanza)
variants: this folder contains the variants found by comparing the semi-automatically aligned files
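As a rough illustration of how the files in the TMX folder might be read, the sketch below collects the segments of each translation unit with Python's standard library. It assumes only the generic TMX layout (tu, tuv and seg elements). The function name, the language codes and the one-word sample are made up for the example and are not taken from the corpus; check the actual files for the codes they use.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_units(source):
    """Return one dict per <tu>, mapping language code to segment text.

    `source` is a file path or a binary file object.
    """
    root = ET.parse(source).getroot()
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            # itertext() also picks up text nested inside inline markup.
            unit[lang] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units


# Hypothetical sample, NOT from the repository: the language codes are assumptions.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="pt-x-original"><seg>hum</seg></tuv>
      <tuv xml:lang="pt"><seg>um</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

units = read_tmx_units(io.BytesIO(SAMPLE))
print(units)
```

To read a real file, pass its path instead of the in-memory sample, for example read_tmx_units("TMX/some_file.tmx"), where the file name is a placeholder.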
The file aligned_parsed_modern_PT_NLTK_tokenizer_stanza.tsv contains the automatic parsing (done with Stanza) combined with the alignments for the whole corpus. The second column of the parsed output is the semi-automatically aligned original spelling of each token.
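A minimal sketch of how the original spellings could be pulled out of aligned_parsed_modern_PT_NLTK_tokenizer_stanza.tsv is given below. The only fact it relies on is that the second column holds the original spelling. The function name, the handling of blank lines, and the sample rows are assumptions made for illustration, and the file may contain header or comment lines that need extra handling.

```python
import csv


def original_spellings(lines):
    """Return the second column of every tab-separated row that has one.

    `lines` is any iterable of strings, such as an open text file.
    """
    # QUOTE_NONE keeps quote characters as ordinary tokens instead of
    # letting the csv module treat them as field delimiters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    spellings = []
    for row in reader:
        if len(row) < 2:
            # Assumption: blank lines or one-column lines carry no token.
            continue
        spellings.append(row[1])
    return spellings


# Hypothetical rows, NOT copied from the file: only the position of the
# original spelling (second column) comes from the description above.
demo_rows = ["1\thum\tum", "", "2\tpharmacia\tfarmacia"]
demo_result = original_spellings(demo_rows)
print(demo_result)
```

For the real file, open it in text mode and pass the handle, for example with open(path, encoding="utf-8", newline="") as handle: original_spellings(handle). The UTF-8 encoding is also an assumption.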