LaCour! Corpus

Companion dataset to the arXiv preprint presenting the LaCour! corpus.

Please use the following citation:

@article{held2023lacour,
    author = {Held, Lena and Habernal, Ivan},
    title = {{LaCour!: Enabling Research on Argumentation in Hearings of the European Court of Human Rights}},
    journal = {arXiv preprint},
    year = {2023},
    doi = {10.48550/arXiv.2312.05061},
}

Abstract: Why does an argument end up in the final court decision? Was it deliberated or questioned during the oral hearings? Was there something in the hearings that triggered a particular judge to write a dissenting opinion? Despite the availability of the final judgments of the European Court of Human Rights (ECHR), none of these legal research questions can currently be answered as the ECHR's multilingual oral hearings are not transcribed, structured, or speaker-attributed. We address this fundamental gap by presenting LaCour!, the first corpus of textual oral arguments of the ECHR, consisting of 154 full hearings (2.1 million tokens from over 267 hours of video footage) in English, French, and other court languages, each linked to the corresponding final judgment documents. In addition to the transcribed and partially manually corrected text from the video, we provide sentence-level timestamps and manually annotated role and language labels. We also showcase LaCour! in a set of preliminary experiments that explore the interplay between questions and dissenting opinions. Apart from the use cases in legal NLP, we hope that law students or other interested parties will also use LaCour! as a learning resource, as it is freely available in various formats at https://huggingface.co/datasets/TrustHLT/LaCour.

Contact person: Lena Held, lena.held@tu-darmstadt.de

tl;dr


:book:	Reading some ECHR hearing transcripts?	LaCour! Preview
:hugs:	Convenient and easy dataset usage	Huggingface Dataset
:arrow_down_small:	Download the individual transcript files	.txt .xml
:arrow_down_small:	Download the document metadata	documents
:woman_technologist:	Code to reproduce the dataset creation	trusthlt/lacour-generation
:interrobang:	Questions and opinions dataset	trusthlt/lacour-qando

Data

The dataset consists of two subsets, transcripts and documents.

Subset transcripts

The first subset, transcripts, contains the 154 transcripts of court hearings. It is provided in two formats, .xml and .txt. The text and information are identical in both formats.

Files in .txt format have the following structure:

[[Announcer;UNK]]

<<22.32;23.16;fr>>
La Cour!

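[[ ]] opens a speaker segment and holds the speaker's Role and Name, separated by a semicolon. << >> opens a snippet and holds its begin timestamp, end timestamp and language tag, also separated by semicolons. The snippet text follows on the next line.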
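The marker layout described above can be read with two regular expressions. The following is only a minimal sketch based on the single example shown here; parse_txt is a hypothetical helper, not the function from load_lacour.py, and the output key names are an assumption borrowed from the field list in this README.

```python
import re

# Assumed marker shapes, taken from the example above:
#   [[Role;Name]]           opens a speaker segment
#   <<begin;end;language>>  opens a snippet; the lines after it are its text
SEGMENT = re.compile(r"^\[\[(.*?);(.*?)\]\]$")
SNIPPET = re.compile(r"^<<([\d.]+);([\d.]+);(.*?)>>$")


def parse_txt(content):
    """Return one dict per snippet (hypothetical helper, not load_lacour.py)."""
    rows, role, name, current = [], None, None, None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        seg = SEGMENT.match(line)
        snip = SNIPPET.match(line)
        if seg:
            # A new speaker segment starts; remember role and name
            role, name = seg.groups()
            current = None
        elif snip:
            current = {
                "Role": role,
                "Name": name,
                "Begin": float(snip.group(1)),
                "End": float(snip.group(2)),
                "Language": snip.group(3),
                "text": "",
            }
            rows.append(current)
        elif current is not None:
            # Text line belonging to the most recent snippet
            current["text"] = (current["text"] + " " + line).strip()
    return rows


sample = "[[Announcer;UNK]]\n\n<<22.32;23.16;fr>>\nLa Cour!\n"
rows = parse_txt(sample)
print(rows)
```

For real work, the loader shipped in load_lacour.py is the safer choice, since it reflects the actual file conventions.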

Files in .xml format have the following structure:

<?xml version='1.0' encoding='utf-8'?>
<Transcript>
	<WebcastID>2438419_29092021</WebcastID>
	<SpeakerSegment>
		<Role>Announcer</Role>
		<Name>UNK</Name>
		<Snippet>
			<Language>fr</Language>
			<TimestampBegin>16.5</TimestampBegin>
			<TimestampEnd>17.1</TimestampEnd>
			<Text>La Cour!</Text>
		</Snippet>
		...
	</SpeakerSegment>
	...
</Transcript>

We provide this nested format to make potential annotation tasks easier.
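If a flat table is more convenient than the nested structure, the XML can be flattened with the standard library. This is a sketch under the assumption that every file follows the element names of the example above; flatten is a hypothetical helper and the output keys are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Sample built from the example above, with the "..." placeholders removed
sample_xml = b"""<?xml version='1.0' encoding='utf-8'?>
<Transcript>
  <WebcastID>2438419_29092021</WebcastID>
  <SpeakerSegment>
    <Role>Announcer</Role>
    <Name>UNK</Name>
    <Snippet>
      <Language>fr</Language>
      <TimestampBegin>16.5</TimestampBegin>
      <TimestampEnd>17.1</TimestampEnd>
      <Text>La Cour!</Text>
    </Snippet>
  </SpeakerSegment>
</Transcript>
"""


def flatten(root):
    """Turn the nested transcript into one dict per snippet (hypothetical helper)."""
    webcast_id = root.findtext("WebcastID")
    rows = []
    for segment in root.iter("SpeakerSegment"):
        role = segment.findtext("Role")
        name = segment.findtext("Name")
        for snippet in segment.iter("Snippet"):
            rows.append({
                "webcast_id": webcast_id,
                "Role": role,
                "Name": name,
                "Begin": float(snippet.findtext("TimestampBegin")),
                "End": float(snippet.findtext("TimestampEnd")),
                "Language": snippet.findtext("Language"),
                "text": snippet.findtext("Text"),
            })
    return rows


rows = flatten(ET.fromstring(sample_xml))
print(rows)
```

To read a file from disk instead of a string, ET.parse(path).getroot() yields the same kind of root element.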

Both file formats contain the following information:

webcast_id: the identifier for the hearing (allows linking to documents)
Role: the role/party the speaker represents (Announcer for announcements, Judge for judges, JudgeP for judge president, Applicant for representatives of the applicant, Government for representatives of the respondent government, ThirdParty for representatives of third party interveners)
Name: the name of the speaker (not given for Applicant, Government or ThirdParty speakers)
Begin: the start timestamp of the line (in seconds)
End: the end timestamp of the line (in seconds)
Language: the language spoken (in ISO 639-1)
text: the spoken line
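As an illustration of how these fields combine, the sketch below sums the speaking time per role from Begin and End. The rows are made-up illustrative values (only the first mirrors the XML example), and the key names are an assumption based on the list above; the loader in load_lacour.py may name its columns differently.

```python
from collections import defaultdict

# Made-up rows for illustration only; real rows come from the transcript files
rows = [
    {"Role": "Announcer", "Begin": 16.5, "End": 17.1, "Language": "fr"},
    {"Role": "JudgeP", "Begin": 20.0, "End": 35.5, "Language": "fr"},
    {"Role": "JudgeP", "Begin": 36.0, "End": 40.0, "Language": "en"},
]


def seconds_per_role(rows):
    """Sum End - Begin over all lines, grouped by Role."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Role"]] += row["End"] - row["Begin"]
    return dict(totals)


totals = seconds_per_role(rows)
for role, seconds in sorted(totals.items()):
    print(f"{role}: {seconds:.1f} s")
```

The same grouping works for Language instead of Role, for example to estimate how much of a hearing was held in French.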

Subset documents

The second subset, documents, contains information on all relevant documents in the HUDOC database that are linked to a webcast hearing. The link is established through the application number shared by the hearing and the case. To link transcripts with these documents, use the webcast_id. Each instance in documents describes one HUDOC document associated with a hearing, together with the metadata of that hearing.

Note: hearing_type states the type of the hearing, whereas type states the type of the document. If the hearing is a "Grand Chamber hearing", the "CHAMBER" document refers to a different hearing.

 '4': {
    'webcast_id': '2438419_29092021',
    'hearing_date': '2021-09-29 00:00:00',
    'hearing_title': 'H.F. and M.F. v. France and J.D. and A.D. v. France (nos. 24384/19 and 44234/20)',
    'hearing_type': 'Grand Chamber hearing',
    'appno': '44234/20',
    'case_id': '001-219333',
    'case_name': 'CASE OF H.F. AND OTHERS v. FRANCE',
    'case_url': 'https://hudoc.echr.coe.int/eng?i=001-219333',
    'type': 'GRANDCHAMBER',
    'typedescription': 15,
    'document_date': '2022-09-14 00:00:00',
    'collection': 'CASELAW;JUDGMENTS;GRANDCHAMBER;ENG',
    'importance': 1,
    'court': '8',
    'issue': 'Inter-ministerial instruction no. 5995/SG of 23 February 2018 on “Provisions to be made for minors on their return from areas of terrorist group operations (in particular the Syria-Iraq border area)”',
    'represented_by': 'DOSÉ M.',
    'respondent': 'FRA',
    'articles': '1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2',
    'strasbourg_caselaw': 'Abdi Ibrahim v. Norway [GC], no. 15379/16, § 180, 10 December 2021;Abdul Wahab Khan v. the United Kingdom (dec.), no. 11987/11, §§ 27-28, 28 January 2014;Airey v. Ireland, 9 October 1979, §§ 24-25, Series A no. 32;Al-Dulimi and Montana Management Inc. v. Switzerland [GC], no. 5809/08, §§ 134 and 145-146, 21 June 2016;[...]',
    'external_sources': 'Article 12 § 4 of the International Covenant on Civil and Political Rights (ICCPR);United Nations Human Rights Committee’s (UNCCPR) General Comment no. 27 on the Freedom of Movement under Article 12 of the ICCPR, adopted on 1 November 1999 (UN Documents CCPR/C/21/Rev.1/Add.9);Article 19 of the International Law Commission (ILC) Draft Articles on Diplomatic Protection and commentary;[...]',
    'conclusion': 'Preliminary objection dismissed (Art. 34) Individual applications;(Art. 34) Locus standi;Remainder inadmissible (Art. 35) Admissibility criteria;(Art. 35-3-a) Ratione loci;(Art. 35-3-a) Ratione personae;Violation of Article 3 of Protocol No. 4 - Prohibition of expulsion of nationals (Article 3 para. 2 of Protocol No. 4 - Enter own country);Respondent State to take individual measures (Article 46-2 - Individual measures);Non-pecuniary damage - finding of violation sufficient (Article 41 - Non-pecuniary damage;Just satisfaction)',
    'separate_opinion': 'TRUE',
    'judges': "Ganna Yudkivska;Jon Fridrik Kjølbro;Krzysztof Wojtyczek;Mārtiņš Mits;Robert Spano;Síofra O'Leary;Stéphanie Mourou-Vikström;Yonko Grozev;Georges Ravarani;Ksenija Turković;Lorraine Schembri Orland",
    'ecli': 'ECLI:CE:ECHR:2022:0914JUD002438419'
    }

The fields in documents are:

id: the identifier
webcast_id: the identifier for the hearing (allows linking to transcripts)
hearing_date: the date of the hearing
hearing_title: the title of the hearing
hearing_type: the type of hearing (Grand Chamber, Chamber or Grand Chamber Judgment Hearing)
appno: the application number which is associated with the hearing and case
case_id: the id of the case
case_name: the name of the case
case_url: the direct link to the document
type: the type of the document
typedescription: the numeric identifier of the exact document type (distinguishes e.g. Merits from Just Satisfaction; no key is provided)
document_date: the date of the document
collection: the categorization of the document, i.e. type of document, type of chamber, language
importance: the importance score of the case (1 is the highest importance, key case)
court: the identifier for the court that issued the document
issue: the references to the issue of the case
represented_by: the person(s) representing the applicant(s)
respondent: the code of the respondent government(s) (in ISO 3166-1 alpha-3)
articles: the articles of the European Convention on Human Rights that the case concerns
strasbourg_caselaw: the list of cases in the ECHR which are relevant to the current case
external_sources: the relevant references outside of the ECHR
conclusion: the short textual description of the conclusion
separate_opinion: whether the document contains a separate opinion
judges: the judges appearing in the associated document
ecli: the ECLI (European Case Law Identifier)
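Several of these fields pack multiple values into one semicolon-separated string, as the example record shows. The sketch below splits such a field into a list; split_field is a hypothetical helper, and the comparison with 'TRUE' assumes that separate_opinion is always stored as the string seen in the example.

```python
# Values copied from the example record above
record = {
    "articles": "1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2",
    "respondent": "FRA",
    "separate_opinion": "TRUE",
}


def split_field(value):
    """Split a semicolon-separated field into a list of trimmed entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


articles = split_field(record["articles"])
# Assumption: the flag is stored as the string 'TRUE', as in the example
has_separate_opinion = record["separate_opinion"] == "TRUE"

print(articles)
print(has_separate_opinion)
```

A plain split is not safe for every field: in the example, conclusion contains a semicolon inside parentheses ("Non-pecuniary damage;Just satisfaction"), so that field would need more careful handling.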

Usage

Loading transcripts

XML

The XML format is nested and can be loaded, for example, with the load_transcript function provided in load_lacour.py.

from load_lacour import load_transcript
from glob import glob
import pandas as pd

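# Collect the rows of all XML transcripts in one list
transcripts = []
for tf in glob('transcripts-xml/*.xml'):
    # load_transcript returns two values; the second one is not used here
    t, w = load_transcript(tf, format='xml')
    transcripts += t

# One DataFrame over all hearings
df = pd.DataFrame(transcripts)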

TXT

To load the txt files, you can use load_lacour.py:

from load_lacour import load_transcript
from glob import glob
import pandas as pd

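# Collect the rows of all TXT transcripts in one list
transcripts = []
for tf in glob('transcripts-txt/*.txt'):
    # load_transcript returns two values; the second one is not used here
    t, w = load_transcript(tf)
    transcripts += t

# One DataFrame over all hearings
df = pd.DataFrame(transcripts)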

Loading document metadata

Load the .json file, either with pandas or with the standard json module, e.g.

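# Option 1: pandas, one row per document (webcast_id is kept as a string)
import pandas as pd
df = pd.read_json('lacour_linked_documents.json', orient='index', dtype={'webcast_id':str})

# Option 2: plain dict keyed by document id
import json
with open('lacour_linked_documents.json') as f:
    d = json.load(f)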
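Once the document metadata is loaded as a dict, it can be indexed by webcast_id to look up the documents belonging to a hearing. The following is a minimal sketch: both helper functions are hypothetical, the small dict stands in for the loaded file, and its second entry is made up purely to show the filter. The filter follows the note in the Data section and assumes the literal values 'Grand Chamber hearing' and 'CHAMBER'.

```python
from collections import defaultdict

# Stand-in for the dict returned by json.load; entry '4' uses values from the
# example record, entry 'hypothetical' is made up for illustration
d = {
    "4": {
        "webcast_id": "2438419_29092021",
        "hearing_type": "Grand Chamber hearing",
        "type": "GRANDCHAMBER",
        "case_id": "001-219333",
    },
    "hypothetical": {
        "webcast_id": "2438419_29092021",
        "hearing_type": "Grand Chamber hearing",
        "type": "CHAMBER",
        "case_id": "hypothetical-chamber-document",
    },
}


def index_by_webcast(documents):
    """Group document records by the hearing they are linked to."""
    by_webcast = defaultdict(list)
    for doc_id, doc in documents.items():
        by_webcast[doc["webcast_id"]].append({"id": doc_id, **doc})
    return dict(by_webcast)


def documents_for_hearing(by_webcast, webcast_id, skip_other_hearings=True):
    """Return the documents of one hearing, optionally dropping CHAMBER
    documents of Grand Chamber hearings (they refer to a different hearing)."""
    docs = by_webcast.get(webcast_id, [])
    if skip_other_hearings:
        docs = [
            doc for doc in docs
            if not (doc["hearing_type"] == "Grand Chamber hearing"
                    and doc["type"] == "CHAMBER")
        ]
    return docs


by_webcast = index_by_webcast(d)
matched = documents_for_hearing(by_webcast, "2438419_29092021")
print([doc["id"] for doc in matched])
```

The webcast_id used for the lookup is the same identifier that appears in each transcript, so the result can be attached to the transcript rows of that hearing.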

Questions and Opinions

The companion dataset for the experimental part using questions asked during the hearings and dissenting or concurring opinions can be found in the repository trusthlt/lacour-qando.

Data creation

Companion code for the creation of this dataset is available in the repository trusthlt/lacour-generation.