LaCour! Corpus

Companion dataset to the arXiv preprint presenting the LaCour! corpus.

Please use the following citation:

@article{held2023lacour,
    author = {Held, Lena and Habernal, Ivan},
    title = {{LaCour!: Enabling Research on Argumentation in Hearings of the European Court of Human Rights}},
    journal = {arXiv preprint},
    year = {2023},
    doi = {10.48550/arXiv.2312.05061},
}

Abstract

Why does an argument end up in the final court decision? Was it deliberated or questioned during the oral hearings? Was there something in the hearings that triggered a particular judge to write a dissenting opinion? Despite the availability of the final judgments of the European Court of Human Rights (ECHR), none of these legal research questions can currently be answered as the ECHR's multilingual oral hearings are not transcribed, structured, or speaker-attributed. We address this fundamental gap by presenting LaCour!, the first corpus of textual oral arguments of the ECHR, consisting of 154 full hearings (2.1 million tokens from over 267 hours of video footage) in English, French, and other court languages, each linked to the corresponding final judgment documents. In addition to the transcribed and partially manually corrected text from the video, we provide sentence-level timestamps and manually annotated role and language labels. We also showcase LaCour! in a set of preliminary experiments that explore the interplay between questions and dissenting opinions. Apart from the use cases in legal NLP, we hope that law students or other interested parties will also use LaCour! as a learning resource, as it is freely available in various formats at https://huggingface.co/datasets/TrustHLT/LaCour.

Contact person: Lena Held, lena.held@tu-darmstadt.de

tl;dr

:book: | Reading some ECHR hearing transcripts? | LaCour! Preview
:hugs: | Convenient, easy-to-use dataset | Hugging Face dataset
:arrow_down_small: | Download the individual transcript files | .txt, .xml
:arrow_down_small: | Download the document metadata | documents
:woman_technologist: | Creation code for reproduction | trusthlt/lacour-generation
:interrobang: | Questions and opinions dataset | trusthlt/lacour-qando

Data

The dataset consists of two subsets: transcripts and documents.

Subset transcripts

The first subset, transcripts, contains the 154 transcripts of court hearings. It is provided in two formats, .xml and .txt. The text and all accompanying information are identical in both formats.

Files in .txt format have the following structure:

[[Announcer;UNK]]

<<22.32;23.16;fr>>
La Cour!

[[Role;Name]] opens a speaker segment and gives the role and name of the speaker. <<begin;end;language>> opens a snippet and gives its begin timestamp, end timestamp, and language tag; the text of the snippet follows on the next line.
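The .txt structure described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the loader shipped in load_lacour.py: the function name parse_txt_transcript and the output keys are assumptions made for illustration, and it assumes that role and name never contain a semicolon.

```python
import re

# Patterns for the two marker types described above.
SEGMENT_RE = re.compile(r"^\[\[(.*?);(.*?)\]\]$")          # [[Role;Name]]
SNIPPET_RE = re.compile(r"^<<([\d.]+);([\d.]+);(.*?)>>$")  # <<begin;end;language>>


def parse_txt_transcript(content):
    """Parse .txt transcript content into a flat list of snippet dicts.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    rows = []
    role = name = None
    current = None  # snippet that is currently collecting text lines
    for raw in content.splitlines():
        line = raw.strip()
        if not line:
            continue
        segment = SEGMENT_RE.match(line)
        snippet = SNIPPET_RE.match(line)
        if segment:
            # A new speaker segment applies to all snippets that follow.
            role, name = segment.group(1), segment.group(2)
            current = None
        elif snippet:
            current = {
                "role": role,
                "name": name,
                "begin": float(snippet.group(1)),
                "end": float(snippet.group(2)),
                "language": snippet.group(3),
                "text": "",
            }
            rows.append(current)
        elif current is not None:
            # Any other non-empty line is text of the current snippet.
            current["text"] = (current["text"] + " " + line).strip()
    return rows


sample = """[[Announcer;UNK]]

<<22.32;23.16;fr>>
La Cour!
"""
rows = parse_txt_transcript(sample)
print(rows)
```

Each snippet becomes one row that repeats the role and name of its enclosing segment, which keeps the result easy to pass to pandas.DataFrame.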

Files in .xml format have the following structure:

<?xml version='1.0' encoding='utf-8'?>
<Transcript>
  <WebcastID>2438419_29092021</WebcastID>
  <SpeakerSegment>
    <Role>Announcer</Role>
    <Name>UNK</Name>
    <Snippet>
      <Language>fr</Language>
      <TimestampBegin>16.5</TimestampBegin>
      <TimestampEnd>17.1</TimestampEnd>
      <Text>La Cour!</Text>
    </Snippet>
    ...
  </SpeakerSegment>
  ...
</Transcript>

We provide this nested format to make potential annotation tasks easier.
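If the nesting is not needed, the XML can be flattened to one row per snippet with xml.etree.ElementTree from the standard library. This is a sketch under the assumption that every file follows the element names shown above; flatten_xml_transcript and the output keys are hypothetical names, not part of load_lacour.py.

```python
import xml.etree.ElementTree as ET


def flatten_xml_transcript(xml_bytes):
    """Return (webcast_id, rows) with one row per <Snippet>.

    Hypothetical helper; element names are taken from the example above.
    """
    root = ET.fromstring(xml_bytes)
    webcast_id = root.findtext("WebcastID")
    rows = []
    for segment in root.iter("SpeakerSegment"):
        role = segment.findtext("Role")
        name = segment.findtext("Name")
        for snippet in segment.iter("Snippet"):
            rows.append({
                "role": role,
                "name": name,
                "language": snippet.findtext("Language"),
                "begin": float(snippet.findtext("TimestampBegin")),
                "end": float(snippet.findtext("TimestampEnd")),
                "text": snippet.findtext("Text"),
            })
    return webcast_id, rows


# Bytes input, so the encoding declaration is handled by the parser.
sample = b"""<?xml version='1.0' encoding='utf-8'?>
<Transcript>
  <WebcastID>2438419_29092021</WebcastID>
  <SpeakerSegment>
    <Role>Announcer</Role>
    <Name>UNK</Name>
    <Snippet>
      <Language>fr</Language>
      <TimestampBegin>16.5</TimestampBegin>
      <TimestampEnd>17.1</TimestampEnd>
      <Text>La Cour!</Text>
    </Snippet>
  </SpeakerSegment>
</Transcript>
"""
webcast_id, rows = flatten_xml_transcript(sample)
print(webcast_id, rows)
```

For a file on disk, the same function could be called with the result of open(path, 'rb').read().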

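Both file formats contain the same information: for each speaker segment, the role and name of the speaker; for each snippet, the language, the begin and end timestamps, and the text.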

Subset documents

The second subset, documents, contains information on all relevant documents in the HUDOC database that are linked to a webcast hearing. The link is established through the application number shared by the hearing and the case. To link transcripts with these documents, use the webcast_id. Each instance in documents describes one HUDOC document associated with a hearing, together with the metadata of that hearing. Note: hearing_type states the type of the hearing, whereas type states the type of the document. If the hearing is a "Grand Chamber hearing", the "CHAMBER" document refers to a different hearing.

 '4': {
    'webcast_id': '2438419_29092021',
    'hearing_date': '2021-09-29 00:00:00',
    'hearing_title': 'H.F. and M.F. v. France and J.D. and A.D. v. France (nos. 24384/19 and 44234/20)',
    'hearing_type': 'Grand Chamber hearing',
    'appno': '44234/20',
    'case_id': '001-219333',
    'case_name': 'CASE OF H.F. AND OTHERS v. FRANCE',
    'case_url': 'https://hudoc.echr.coe.int/eng?i=001-219333',
    'type': 'GRANDCHAMBER',
    'typedescription': 15,
    'document_date': '2022-09-14 00:00:00',
    'collection': 'CASELAW;JUDGMENTS;GRANDCHAMBER;ENG',
    'importance': 1,
    'court': '8',
    'issue': 'Inter-ministerial instruction no. 5995/SG of 23 February 2018 on “Provisions to be made for minors on their return from areas of terrorist group operations (in particular the Syria-Iraq border area)”',
    'represented_by': 'DOSÉ M.',
    'respondent': 'FRA',
    'articles': '1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2',
    'strasbourg_caselaw': 'Abdi Ibrahim v. Norway [GC], no. 15379/16, § 180, 10 December 2021;Abdul Wahab Khan v. the United Kingdom (dec.), no. 11987/11, §§ 27-28, 28 January 2014;Airey v. Ireland, 9 October 1979, §§ 24-25, Series A no. 32;Al-Dulimi and Montana Management Inc. v. Switzerland [GC], no. 5809/08, §§ 134 and 145-146, 21 June 2016;[...]',
    'external_sources': 'Article 12 § 4 of the International Covenant on Civil and Political Rights (ICCPR);United Nations Human Rights Committee’s (UNCCPR) General Comment no. 27 on the Freedom of Movement under Article 12 of the ICCPR, adopted on 1 November 1999 (UN Documents CCPR/C/21/Rev.1/Add.9);Article 19 of the International Law Commission (ILC) Draft Articles on Diplomatic Protection and commentary;[...]',
    'conclusion': 'Preliminary objection dismissed (Art. 34) Individual applications;(Art. 34) Locus standi;Remainder inadmissible (Art. 35) Admissibility criteria;(Art. 35-3-a) Ratione loci;(Art. 35-3-a) Ratione personae;Violation of Article 3 of Protocol No. 4 - Prohibition of expulsion of nationals (Article 3 para. 2 of Protocol No. 4 - Enter own country);Respondent State to take individual measures (Article 46-2 - Individual measures);Non-pecuniary damage - finding of violation sufficient (Article 41 - Non-pecuniary damage;Just satisfaction)',
    'separate_opinion': 'TRUE',
    'judges': "Ganna Yudkivska;Jon Fridrik Kjølbro;Krzysztof Wojtyczek;Mārtiņš Mits;Robert Spano;Síofra O'Leary;Stéphanie Mourou-Vikström;Yonko Grozev;Georges Ravarani;Ksenija Turković;Lorraine Schembri Orland",
    'ecli': 'ECLI:CE:ECHR:2022:0914JUD002438419'
    }
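Because several documents can share one webcast_id, a lookup from webcast_id to its documents is a convenient way to link them to a transcript. The sketch below also applies the note on hearing_type and type. The function names are assumptions, and record '5' is a hypothetical CHAMBER entry invented purely for illustration; only record '4' copies values from the example above.

```python
from collections import defaultdict


def group_by_webcast(documents):
    """Map each webcast_id to the list of document records linked to it."""
    grouped = defaultdict(list)
    for record in documents.values():
        grouped[record["webcast_id"]].append(record)
    return dict(grouped)


def drop_other_hearing(records):
    """Drop CHAMBER documents attached to a Grand Chamber hearing.

    Per the note above, such a document refers to a different hearing.
    """
    return [
        r for r in records
        if not (r["hearing_type"] == "Grand Chamber hearing"
                and r["type"] == "CHAMBER")
    ]


# Abbreviated records. '4' copies fields from the example above;
# '5' is a HYPOTHETICAL record, its case_id is a placeholder.
documents = {
    "4": {"webcast_id": "2438419_29092021",
          "hearing_type": "Grand Chamber hearing",
          "type": "GRANDCHAMBER",
          "case_id": "001-219333"},
    "5": {"webcast_id": "2438419_29092021",
          "hearing_type": "Grand Chamber hearing",
          "type": "CHAMBER",
          "case_id": "hypothetical-id"},
}

linked = group_by_webcast(documents)
same_hearing = drop_other_hearing(linked["2438419_29092021"])
print([r["case_id"] for r in same_hearing])
```

The same lookup works on the dictionary returned by json.load for the full metadata file, since its top-level keys are record indices like '4'.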

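The fields in documents, as shown in the example record above, are: webcast_id, hearing_date, hearing_title, hearing_type, appno, case_id, case_name, case_url, type, typedescription, document_date, collection, importance, court, issue, represented_by, respondent, articles, strasbourg_caselaw, external_sources, conclusion, separate_opinion, judges, and ecli.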

Usage

Loading transcripts

XML

The .xml format is nested and can be loaded, for example, with the load_transcript function provided in load_lacour.py.

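from load_lacour import load_transcript
from glob import glob
import pandas as pd

# Collect the entries of all XML transcripts in one list.
transcripts = []
for tf in glob('transcripts-xml/*.xml'):
    # load_transcript returns two values; only the first (t) is used here.
    t, w = load_transcript(tf, format='xml')
    transcripts += t

# Build one DataFrame over all transcripts.
df = pd.DataFrame(transcripts)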

TXT

To load the txt files, you can use load_lacour.py:

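from load_lacour import load_transcript
from glob import glob
import pandas as pd

# Collect the entries of all TXT transcripts in one list.
transcripts = []
for tf in glob('transcripts-txt/*.txt'):
    # load_transcript returns two values; only the first (t) is used here.
    t, w = load_transcript(tf)
    transcripts += t

# Build one DataFrame over all transcripts.
df = pd.DataFrame(transcripts)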

Loading document metadata

Load the .json file, e.g. with pandas:

import pandas as pd
df = pd.read_json('lacour_linked_documents.json', orient='index', dtype={'webcast_id':str})

or with the built-in json module:

import json
with open('lacour_linked_documents.json') as f:
    d = json.load(f)
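Several fields in the example record (for instance articles and judges) pack multiple values into one string. The sketch below splits such a field into a list, assuming the semicolon is the only separator, as it appears to be in the example; split_multi is a hypothetical helper name.

```python
def split_multi(value, sep=";"):
    """Split a multi-valued field into a list of non-empty, trimmed parts."""
    return [part.strip() for part in value.split(sep) if part.strip()]


# Values copied from the example record in the documents subset.
record = {
    "articles": "1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2",
    "respondent": "FRA",
}

articles = split_multi(record["articles"])
print(articles)
```

Applied to a DataFrame, the same helper could be passed to Series.apply on the relevant column.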

Questions and Opinions

The companion dataset for the experimental part using questions asked during the hearings and dissenting or concurring opinions can be found in the repository trusthlt/lacour-qando.

Data creation

Companion code for the creation of this dataset is available in the repository trusthlt/lacour-generation.