Awesome

GerDaLIR

The German Dataset for Legal Information Retrieval (GerDaLIR) is a legal information retrieval dataset comprising a large collection of documents, passages and relevance labels. The large amount of training data we provide enables GerDaLIR to be used as a downstream task for German or multilingual language models. The task provided is a precedent retrieval task based on case documents from the open legal information platform Open Legal Data. Relevance labels are derived from references: If a passage contains a reference to one or more available documents, the passage is used as a query while the referenced cases are labelled as relevant.

More information about the dataset and its generation can be taken from the official GerDaLIR paper

Download

All files are formatted with tab-separators (*.tsv) and packed with gzip. Some files are packed to tarballs and compressed (*.tar.gz). Sizes specified refer to decompressed size. Details about the individual Files can be found in the dataset details section

Essential Files

The collection, the queries and the relevance labels can be downloaded with the links in the table below.

Filename	Description	Num Records	Size	Format
`collection.tsv.gz`	Collection - one passage per line	3.095.383	2.0 GB	`d_id, passage`
`queries.tar.gz`	Queries - train, dev and test	122.975	127 MB	`q_id, query`
`qrels.tar.gz`	Labels - train, dev and test	144.324	1.7 MB	`q_id, d_id`

Optional Files

Beneath the essential files, also a bunch of optional files can be downloaded. Among them is a file that maps document-ids back to the original file numbers (see details) . For convenience, we also provide passage-wise and document-wise BM25 rankings (candidates) that can be used for training or re-ranking. Passage-wise candidate files contain the whole passages as text, while document-wise candidate files only specify document-ids.

Filename	Description	Num Records	Size	Format
`refmap.tsv.gz`	Doc-Ids to reference number	131.446	6.2 MB	`d_id, slug, file`
`pass-candidates.train.tsv.gz`	Top-1000 passage candidates with text (train)	98.380.000	85 GB	`q_id, d_id, rank, text`
`pass-candidates.dev.tsv.gz`	Top-1000 passage candidates with text (dev)	12.297.000	11 GB	`q_id, d_id, rank, text`
`pass-candidates.test.tsv.gz`	Top-1000 passage candidates with text (test)	12.298.000	11 GB	`q_id, d_id, rank, text`
`doc-candidates.train.tsv.gz`	Top-1000 document candidates (train)	98.380.000	1.5 GB	`q_id, d_id, rank`
`doc-candidates.dev.tsv.gz`	Top-1000 document candidates (dev)	12.297.000	189 MB	`q_id, d_id, rank`
`doc-candidates.test.tsv.gz`	Top-1000 document candidates (test)	12.298.000	189 MB	`q_id, d_id, rank`

Dataset details

Pre-processing

All documents have been pre-processed with the goal to remove all text that is not natural language and remove hints that neural models might exploit. This includes html markup, margin numbers, references, dates and numbers in general. We parse dates as well as references to statutes and other cases using regular expressions, and replace the occurrences with a [DATE] or [REF] token respectively. Braced contents are removed (including the braces), which in the most cases are comprehensive reference descriptions. All remaining numbers are replaced by zeros.

Collection

The collection file contains potentially relevant documents split into passages. Note that passages have no own passage-ids assigned to them. Each line represents one passage and begins with the corresponding document-id (d_id). For document-level retrieval, all passages with the same document-id must be concatenated.

1       Das Zulassungsvorbringen der Klägerin begründet keine ernstlichen Zweifel an der Richtigkeit des angefochtenen Urteils . Zweifel in diesem Sinn si..
1       Daran fehlt es hier. Die Antragsbegründung, wonach die Klägerin entgegen den Ausführungen im angefochtenen Urteil zuverlässig sei und die Urteil...
1       In der Antragsbegründung fehlt es an jeglichen Ausführungen zu der ausführlichen Würdigung des Verwaltungsgerichts, die sich im Einzelnen mit de...
2       Tenor Auf die Beschwerden der Antragsteller wird der Beschluss des Verwaltungsgerichts Göttingen 0. Kammer vom [DATE] geändert. Die Antragsgegneri..
2       Durch Beschluss vom [DATE] , auf den wegen der Einzelheiten des Sachverhalts und der Begründung Bezug genommen wird, hat das Verwaltungsgericht den...

Queries

Queries are passages that referenced one or more collection documents. There is one query file each for training, development and testing. Query ids (q_id) are assigned globally so that the query files could be concatenated without any problem.

2       Nach [REF] ist eine Erlaubnis zu widerrufen, wenn nachträglich bekannt wird, dass die Voraussetzung nach § 0 Nummer 0 nicht erfüllt ist. Gem...
3       Erforderlich ist mithin eine Prognoseentscheidung unter Berücksichtigung aller Umstände des Einzelfalls dahingehend, ob der Betreffende wille...
4       [REF] ist in Reaktion auf das Urteil des Schleswig-Holsteinischen Landesverfassungsgerichts neu gefasst worden, vor dem Hintergrund, dass sich...
5       Die streitgegenständliche Satzung gibt nicht die Rechtsvorschriften an, welche zum Erlass der Satzung berechtigen, [REF] . Dies ist aber insbe...
6       Insofern gehört zur zutreffenden Angabe der zum Erlass der Satzung berechtigenden Rechtsvorschriften im Sinne des [REF] nicht nur die genaue A...

Qrels

The relevance labels to the queries are also split into sets for training, development and testing. To each query at least one relevance label exist. Multiple relevance labels for a single query result in multiple lines with one target document-id each.

2       118149
3       72511
4       74503
5       4240
5       72939

Reference Number Map

To each assigned document-id, we provide the corresponding Open Legal Data slug and the official file number. With the slug, the original case document can directly be accessed on the Open Legal Data website as follows: https://de.openlegaldata.io/case/<slug>.

1       ovgnrw-2020-11-03-4-a-236320                        4 A 2363/20
2       ovgni-2020-11-03-2-nb-25120                         2 NB 251/20
3       vg-regensburg-2020-11-03-ro-12-k-192080             12 K 19/2080
4       ovgnrw-2020-11-03-18-a-102119                       18 A 1021/19
5       vg-schleswig-holsteinisches-2020-11-03-1-b-12520    1 B 125/20

Benchmarking with GerDaLIR

As described in the paper, we take two types of ranking modalities into account: passage-wise and document-wise ranking. In the passage-wise ranking setting, each passage is mapped to the document it originates from. Thus, in a ranking of passages, multiple ranks may refer to the same document. To cast this to a regular ranking of documents, we discard all passages that refer to a document that was listed at a higher rank. This is equivalent to score max-pooling.

Baselines

We provide several baselines on our dataset. With Mode "P" and "D" we refer to passage-wise and document-wise ranking as described above. More details to the baseline methods can be found in our paper.

Method	Mode	MRR@10	nDCG@20	Recall@100	Recall@1000
TF-IDF	P	0.333	0.375	0.651	0.768
	D	0.336	0.386	0.701	0.809
BM25 (k1=1.20, b=0.75)	P	0.365	0.409	0.693	0.800
	D	0.386	0.434	0.734	0.827
BM25 tuned (k1=0.51, b=0.72)	P	0.372	0.417	0.703	0.803
BM25 tuned (k1=0.90, b=0.98)	D	0.391	0.439	0.737	0.829
WCS - GloVe	P	0.242	0.278	0.539	0.695
	D	0.134	0.166	0.420	0.625
WCS - fastText	P	0.257	0.295	0.582	0.726
	D	0.153	0.188	0.468	0.668
Neural Re-ranking - BERT	P	0.416	0.465	0.745	0.789
Neural Re-ranking - ELECTRA	P	0.436	0.481	0.743	0.789

How do I cite this work?

If you use this dataset for your research, please consider citing our Paper:

@inproceedings{wrzalik-krechel-2021-gerdalir,
    title = "{G}er{D}a{LIR}: A {G}erman Dataset for Legal Information Retrieval",
    author = "Wrzalik, Marco  and
      Krechel, Dirk",
    booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.nllp-1.13",
    pages = "123--128",
    abstract = "We present GerDaLIR, a German Dataset for Legal Information Retrieval based on case documents from the open legal information platform Open Legal Data. The dataset consists of 123K queries, each labelled with at least one relevant document in a collection of 131K case documents. We conduct several baseline experiments including BM25 and a state-of-the-art neural re-ranker. With our dataset, we aim to provide a standardized benchmark for German LIR and promote open research in this area. Beyond that, our dataset comprises sufficient training data to be used as a downstream task for German or multilingual language models.",
}