<h2>Genocide Transcript Corpus (GTC)</h2>

The Genocide Transcript Corpus (GTC) provides transcript data from three different genocide tribunals: the Extraordinary Chambers in the Courts of Cambodia (ECCC), the International Criminal Tribunal for Rwanda (ICTR), and the International Criminal Tribunal for the Former Yugoslavia (ICTY).

GTC Version 2 - June 2023

In addition to metadata on the respective tribunal and the transcript annotation, this version also includes annotations of text segments that include potentially traumatic witness experiences.

The updated version of the GTC contains 52,845 text segments from a total of 90 transcripts. Each segment can be attributed to an individual person or to court proceedings. The final data set includes the following variables:

Codebook V2

| Variable Name | Description |
| --- | --- |
| tribunal | Name of the tribunal (ICTY, ICTR, or ECCC) |
| id_transcript | Document/transcript number/ID |
| case | Case number/ID |
| accused | Name of the accused |
| date | Date of the respective trial day corresponding to the transcript |
| text | Transcript segment separated by speaker |
| trauma | Potentially trauma-related content: <br> <ul> <li>not containing trauma-related content = 0</li> <li>containing trauma-related content = 1</li> </ul> |
| role | Legal role of the person speaking: <br> <ul> <li>Witness</li> <li>Accused</li> <li>JudgeProc (Judge talking about procedural matters)</li> <li>JudgeQA (Judge examining a witness)</li> <li>LawyerProc (Lawyer talking about procedural matters)</li> <li>LawyerQA (Lawyer examining a witness)</li> <li>Proceedings (procedural matters)</li> </ul> |
| witnesses | Names or pseudonyms of witnesses |
| n_witnesses | Number of witnesses examined in the hearing |
| start | Starting point of the respective segment annotation (useful for ordering the data chronologically) |
| id_annotation | Annotation ID of the segment |
| id_document | Document ID in reference to the annotation process |
| url | Link to transcript |
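
As a minimal sketch of how these variables might be combined, the snippet below keeps witness segments flagged as trauma-related and orders them chronologically within each transcript. It assumes each record is available as a dictionary keyed by the variable names above, that `start` is a numeric offset, and that `trauma` is stored as 0 or 1. The helper `witness_trauma_segments` and the toy rows are hypothetical and not part of the corpus or its tooling.

```python
from typing import Iterable


def witness_trauma_segments(rows: Iterable[dict]) -> list[dict]:
    """Keep witness segments flagged as trauma-related, in transcript order."""
    selected = [
        row for row in rows
        if row["role"] == "Witness" and int(row["trauma"]) == 1
    ]
    # Sort by transcript first, then by the annotation start offset within it
    # (assumption: `start` is numeric and comparable within one transcript).
    return sorted(selected, key=lambda row: (row["id_transcript"], int(row["start"])))


# Placeholder rows for illustration only; they are not taken from the corpus.
toy_rows = [
    {"id_transcript": "T2", "role": "Witness", "trauma": 1, "start": 40, "text": "segment c"},
    {"id_transcript": "T1", "role": "Witness", "trauma": 1, "start": 90, "text": "segment b"},
    {"id_transcript": "T1", "role": "JudgeQA", "trauma": 0, "start": 10, "text": "segment q"},
    {"id_transcript": "T1", "role": "Witness", "trauma": 1, "start": 20, "text": "segment a"},
    {"id_transcript": "T1", "role": "Witness", "trauma": 0, "start": 50, "text": "segment d"},
]

ordered = witness_trauma_segments(toy_rows)
print([row["text"] for row in ordered])
```

If the data is loaded from a CSV file, values typically arrive as strings, which is why the sketch converts `trauma` and `start` with `int()` before comparing.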

Please refer to the corresponding paper for further context, including details on the labeling process:

Miriam Schirmer, Isaac Misael Olguín Nolasco, Edoardo Mosca, Shanshan Xu, and Jürgen Pfeffer. 2023. Uncovering Trauma in Genocide Tribunals: An NLP Approach Using the Genocide Transcript Corpus. In Nineteenth International Conference on Artificial Intelligence and Law (ICAIL 2023), June 19–23, 2023, Braga, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3594536.3595147

GTC Version 1 - June 2022

All samples were labeled according to whether they contain a witness’s description of experienced violence. Violence in this context includes accounts of experienced torture, interrogation, death, beating, psychological violence, experienced military attacks, destruction of villages, and looting.

The transcript data was divided into equal-sized text chunks of 250 words each. Numbers and punctuation were removed.
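
The preprocessing described above could look roughly like the following sketch. The exact procedure used for the GTC is not specified here, so several details are assumptions: digits and ASCII punctuation are deleted before splitting on whitespace, and a trailing chunk shorter than 250 words is dropped by default so that all chunks are equally long. The function name `chunk_transcript` and its parameters are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation and digits (assumption:
# typographic quotes and other non-ASCII symbols are left untouched).
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def chunk_transcript(text: str, chunk_size: int = 250,
                     keep_remainder: bool = False) -> list[str]:
    """Remove numbers and punctuation, then split into fixed-size word chunks."""
    words = text.translate(_DROP).split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    # Drop a short final chunk unless the caller asks to keep it.
    if chunks and not keep_remainder and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return [" ".join(chunk) for chunk in chunks]


# Placeholder text for illustration only; each repetition yields five words
# after cleaning, so 100 repetitions give 500 words.
sample = "On 12 April, the witness arrived. " * 100
chunks = chunk_transcript(sample)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Passing `keep_remainder=True` keeps the final, shorter chunk instead of discarding it.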

Codebook V1

| Variable Name | Description |
| --- | --- |
| paragraph | A text passage from a genocide tribunal transcript (250 words each). |
| label | Violence-related content: <br> <ul> <li>not containing violence = 0</li> <li>containing violence = 1</li> </ul> |
| tribunal | The specific tribunal the transcript data is from: <br> <ul> <li>ECCC = 1</li> <li>ICTY = 2</li> <li>ICTR = 3</li> </ul> |
| witness | The witness's name or a pseudonym. |
| document | The document number / ID. |
| case | The case number / ID. |
| date | The trial date. |
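
Because Version 1 encodes the tribunal as an integer, a small lookup can make summaries easier to read. The sketch below decodes the codes listed in the codebook and counts violence-labeled paragraphs per tribunal. It assumes records are dictionaries keyed by the variable names above; the helper `violence_counts` and the toy rows are hypothetical.

```python
from collections import Counter

# Codes as listed in Codebook V1.
TRIBUNAL_NAMES = {1: "ECCC", 2: "ICTY", 3: "ICTR"}


def violence_counts(rows) -> dict:
    """Count paragraphs labeled as containing violence, per tribunal name."""
    counts = Counter()
    for row in rows:
        if int(row["label"]) == 1:
            counts[TRIBUNAL_NAMES[int(row["tribunal"])]] += 1
    return dict(counts)


# Placeholder rows for illustration only; they are not taken from the corpus.
toy_v1_rows = [
    {"tribunal": 1, "label": 1},
    {"tribunal": 1, "label": 0},
    {"tribunal": 2, "label": 1},
    {"tribunal": 3, "label": 1},
    {"tribunal": 3, "label": 1},
]

v1_counts = violence_counts(toy_v1_rows)
print(v1_counts)
```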

General Note: All transcripts used are openly accessible on the respective courts' websites.