Awesome-LegalAI-Resources
This repository aims to collect and curate resources related to Legal AI, including datasets, websites, and other useful links. Whether you are a researcher, developer, or simply interested in the intersection of law and artificial intelligence, we hope this repository provides valuable information and references.
About
The rapid advancements in AI technologies have significantly impacted various domains, including the legal industry. The purpose of this repository is to collect and organize resources that cover a wide range of topics related to Legal AI, including but not limited to:
- Natural Language Processing (NLP) for legal text analysis
- AI-powered legal research tools
- Automated contract analysis and generation
- Predictive analytics in legal decision-making
- Legal document classification and summarization
- Ethical considerations in Legal AI
- Legal implications of AI adoption in the legal profession
This repository serves as a centralized hub for researchers, practitioners, and enthusiasts to discover, share, and collaborate on Legal AI resources. Whether you're looking for tutorials, datasets, websites or open-source projects, you'll find valuable material here.
General Corpus
-
MultiLegalPile: A 689GB corpus in 24 languages from 17 jurisdictions. Paper Link
Language: multilingual Country: multinational
-
MC4_legal: This dataset contains large text resources (~106GB in total) from mc4 filtered for legal data that can be used for pretraining language models. Link
Language: multilingual Country: multinational
-
EurlexResources: This dataset contains large text resources (~179GB in total) from EURLEX that can be used for pretraining language models. Link
Language: multilingual Country: multinational
-
LeXFiles: LeXFiles is a diverse English multinational legal corpus comprising 11 distinct sub-corpora that cover legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus contains approx. 19 billion tokens. Paper Link
Language: English Country: multinational
-
Pile of Law: A 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Paper Link
Language: English Country: Unknown
-
Spanish Legal Domain Corpora: The corpora comprise multiple digital resources totaling 8.9GB of textual data. Paper Link
Language: Spanish Country: Spain
-
GeLeCo: GeLeCo is a large German Legal Corpus for research, teaching and translation purposes. It includes the complete collection of federal laws, administrative regulations and court decisions published on three online databases by the German Federal Ministry of Justice and Consumer Protection and the Federal Office of Justice. Link
Language: German Country: Germany
-
CourtListener: The original Court Listener dataset is a collection of every court opinion published by every court in the United States. It covers 406 jurisdictions (out of 423), with opinions from 1754 to the present. It is constantly updated with newly filed opinions and digitized archives. Link
Language: English Country: America
Evaluation Benchmark
Multi Legal Task
-
LegalLAMA: LegalLAMA is a diverse probing benchmark suite comprising 8 sub-tasks that aims to assess how much legal knowledge pre-trained language models (PLMs) acquired during pre-training. Paper Link
Language: English Country: multinational
-
LexGLUE: LexGLUE comprises seven datasets: ECtHR Task A and B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, and CaseHOLD that are available for re-use and re-share with appropriate attribution. Paper Link
Language: English Country: multinational
-
LEXTREME: The dataset consists of 11 diverse multilingual legal NLU datasets. 6 datasets have one single configuration and 5 datasets have two or three configurations. This leads to a total of 18 tasks. Paper Link
Language: multilingual Country: multinational
-
LegalBench: LegalBench is a collaborative benchmark intended to evaluate English large language models on legal reasoning and legal text-based tasks. LegalBench currently consists of more than 90 tasks. Paper Link
Language: English Country: multinational
-
LBOX OPEN: LBOX OPEN is the first large-scale benchmark of Korean legal AI datasets. It consists of one legal corpus, two classification tasks, two legal judgement prediction (LJP) tasks, and one summarization task. Paper Link
Language: Korean Country: South Korea
-
GENTLE: We present GENTLE, a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation: dictionary entries, esports commentaries, legal documents, medical notes, poetry, mathematical proofs, syllabuses, and threat letters. Paper Link
Language: English Country: Unknown
-
SCALE: SCALE is an NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Paper Link
Language: multilingual Country: Switzerland
Legal Case Retrieval
-
LeCaRD: LeCaRD consists of 107 query cases and 10,700 candidate cases selected from a corpus of over 43,000 Chinese criminal judgements. Paper Link
Language: Chinese Country: China
-
LeCaRDv2: LeCaRDv2 is one of the largest Chinese legal case retrieval datasets with the widest coverage of criminal charges. The dataset comprises 800 query cases and 55,192 candidate cases extracted from 4.3 million criminal case documents. Link
Language: Chinese Country: China
-
COLIEE: The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual international competition whose aim is to achieve state-of-the-art methods for legal text processing. Task 1 is the legal case retrieval task. Task 3 is the statute law retrieval task. Paper Link
Language: English/Japanese Country: Canada/Japan
-
document-similarity: The task is to calculate a similarity score (in the range 0-1) between two case documents. The dataset contains 53,210 publicly available case documents from the Supreme Court of India and 12,814 Acts from the Indian judiciary. Paper Link
Language: English Country: India
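To make the 0-1 similarity score in the document-similarity task concrete, here is a minimal sketch that scores two texts with cosine similarity over raw term-frequency vectors. This is only an illustrative baseline and not the method used by the dataset authors; the function names `tokenize` and `cosine_similarity` and the sample sentences are assumptions introduced for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Return a similarity score in [0, 1] from raw term-frequency vectors."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # Only terms present in both documents contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # A document with no tokens is treated as sharing nothing.
        return 0.0
    # Clamp to 1.0 to absorb floating-point rounding.
    return min(1.0, dot / (norm_a * norm_b))


if __name__ == "__main__":
    # Hypothetical snippets, not taken from the dataset.
    first = "The appeal is dismissed"
    second = "The appeal is allowed"
    print(round(cosine_similarity(first, second), 2))
```

Because term counts are never negative, the score stays between 0 (no shared vocabulary) and 1 (identical term distributions). Stronger baselines would typically weight terms (for example with TF-IDF) or use learned embeddings.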
Question Answering
-
JEC-QA: the largest question answering dataset in the legal domain, collected from the National Judicial Examination of China. There are 26,365 questions in JEC-QA. Paper Link
Language: Chinese Country: China
-
CaseHOLD: This CaseHOLD dataset provides 53,000+ multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, that could be cited. Paper Link
Language: English Country: America
-
SARA: A novel dataset based on US tax law, together with test cases. Paper Link
Language: English Country: America
-
PrivacyQA: PrivacyQA is a corpus consisting of 1750 questions about the contents of privacy policies, paired with expert annotations. Paper Link
Language: English Country: America
Legal Case Entailment
-
COLIEE: The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual international competition whose aim is to achieve state-of-the-art methods for legal text processing. Task 2 is the legal case entailment task. Task 4 is the legal textual entailment task. Paper Link
Language: English/Japanese Country: Canada/Japan
-
Legal Linking: This paper describes a dataset and baseline systems for linking paragraphs from court cases to clauses or amendments in the US Constitution. Paper Link
Language: English Country: America
Document Classification
-
CAIL2018: CAIL2018 contains more than 2.6 million criminal cases published by the Supreme People’s Court of China. It consists of applicable law articles, charges, and prison terms, which are expected to be inferred according to the fact descriptions of cases. Paper Link
Language: Chinese Country: China
-
ECHR: This paper contributes a new publicly available English legal judgment prediction dataset of cases from the European Court of Human Rights (~11.5k cases). Paper Link
Language: English Country: European
-
Swiss-Judgment-Prediction: A multilingual (German, French, and Italian), diachronic (2000-2020) corpus of 85K cases from the Federal Supreme Court of Switzerland (FSCS). Paper Link
Language: multilingual Country: Switzerland
-
German Legal Decision Corpora: This paper presents two publicly available German legal text corpora. The first corpus contains 32,748 decisions from 131 German courts, enriched with metadata. The second is a subset of the first and consists of 200 randomly chosen judgements. Paper Link
Language: German Country: Germany
-
EURLEX57K: We release a new dataset of 57k legislative documents from EUR-LEX, the European Union’s public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. Paper Link
Language: English Country: European
-
German rental agreements: 601 sentences from the tenancy law of the German Civil Code and 312 sentences from German rental agreements, classified according to a semantic type system consisting of 9 different classes. Paper Link
Language: English Country: Germany
Summarization
-
BillSum: We introduce the BillSum dataset, which contains a primary corpus of 22,218 US Congressional bills and reference summaries split into a train and a test set. Paper Link
Language: English Country: America
-
EUR-Lex-Sum: We obtain up to 1,500 document/summary pairs per language, including a subset of 375 crosslingually aligned legal acts with texts available in all 24 languages. Paper Link
Language: multilingual Country: European
-
Plain English Summarization of Contracts: The dataset we propose contains 446 sets of parallel text. Paper Link
Language: English Country: America
-
Summarization-of-Privacy-Policies: This dataset was extracted from the text of privacy policy, terms of service, and cookie policy of 151 companies. The Points and Plain English Summaries are extracted from tosdr.org. Paper Link
Language: English Country: Unknown
-
Multi-LexSum: We introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing Civil Rights Litigation Clearinghouse (CRLC) writing. Paper Link
Language: English Country: Unknown
Entity Extraction
-
CDJUR-BR: We describe the development of the Golden Collection of the Brazilian Judiciary (CDJUR-BR), a set of fine-grained named entities annotated by experts in legal documents. It contains 44,526 annotations for 21 entities. Paper Link
Language: Portuguese Country: Brazil
-
Extracting Contract Elements: The paper describes and is accompanied by a new benchmark dataset of approximately 3,500 English contracts with gold contract element annotations. Paper
Language: English Country: England
-
LEVEN: LEVEN contains 108 event types in total, including 64 charge-oriented events and 44 general events. Paper Link
Language: Chinese Country: China
Others
-
MAUD: The Merger Agreement Understanding Dataset (MAUD) is an expert-annotated reading comprehension dataset based on the American Bar Association’s 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Paper Link
Language: English Country: America
-
VerbCL: This paper presents a new dataset that consists of the citation graph of court opinions, which cite previously published court opinions in support of their arguments. Paper Link
Language: English Country: America
-
MultiLegalSBD: Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. We curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Paper Link
Language: multilingual Country: multinational
-
FairLex: Our benchmarks cover four jurisdictions (European Council, USA, Switzerland, and China), five languages (English, German, French, Italian and Chinese) and fairness across five attributes (gender, age, region, language, and legal area). Paper Link
Language: multilingual Country: multinational
-
ContractNLI: This work proposes document-level natural language inference (NLI) for contracts, a novel, real-world application of NLI. The authors annotated and released the largest corpus to date, consisting of 607 annotated contracts. Paper Link
Language: English Country: America
-
Demosthenes: A novel corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid. Paper Link
Language: English Country: European
Websites
-
https://flk.npc.gov.cn/: all Chinese laws and regulations.
-
https://wenshu.court.gov.cn/: judicial documents in China.
-
https://www.westlaw.com/: a well-known legal research platform that provides access to legal documents, cases, statutes, commentaries, and legal news from around the world.
-
https://www.lexisnexis.com/: another widely used legal research tool that offers global legal documents, cases, statutes, news, and commentaries.
-
https://home.heinonline.org/: a specialized legal and law-related research database that includes legal documents, journals, statutes, and more from the United States and other countries.
-
https://case.law/: all official, book-published United States case law.
-
https://www.legifrance.gouv.fr/: a French legal publisher providing access to law codes and legal decisions.
-
http://scdb.wustl.edu/: information about every case decided by the US Supreme Court between 1791 and today.
-
https://www.statmt.org/europarl/: parallel text of the proceedings of the European Parliament, collected in 11 languages.
-
https://uscode.house.gov/download/download.shtml: downloadable version of the US Code in XML format.
-
https://www.uspto.gov/ip-policy/economic-research/research-datasets/patent-litigation-docket-reports-data: detailed patent litigation data on over 80k unique district court cases.
-
https://curia.europa.eu/jcms/jcms/j_6/: the official website of the European Court of Justice, offering access to European Union legal documents and cases.
-
https://www.justia.com/: provides access to a wide range of legal information, including cases, statutes, regulations, and legal articles. Covers both U.S. federal and state laws.
-
https://www.findlaw.com/: offers legal resources and information, including cases, statutes, regulations, and legal news. Covers U.S. federal and state laws.
-
https://www.courtlistener.com/: a free legal research platform that provides access to U.S. federal and state court cases, along with other legal documents and opinions.
-
https://www.pacer.gov/: the Public Access to Court Electronic Records (PACER) system provides access to U.S. federal court documents, including case filings, docket information, and court opinions. Registration and fees may apply.
-
https://www.law.cornell.edu/: operated by Cornell Law School, the Legal Information Institute (LII) offers access to U.S. federal and state laws, regulations, and court cases, along with legal articles and resources.
-
https://www.bailii.org/: provides access to legal materials from the United Kingdom and Ireland, including cases, legislation, and legal journals.
-
https://www.austlii.edu.au/: offers access to legal materials from Australia and neighboring countries, including cases, legislation, treaties, and law reform reports.
-
https://www.canlii.org/: provides access to Canadian legal documents, including cases, statutes, regulations, and court rules.
-
http://www.worldlii.org/: a free and independent global legal research resource, aggregating legal materials from various countries and regions.
Contact
If you believe any resources are missing, or if you have questions, suggestions, or concerns, please open an issue on the repository or contact us by email at liht22@mails.tsinghua.edu.cn.
License
This repository is licensed under the MIT License. You are free to use, modify, and distribute the content of this repository for both commercial and non-commercial purposes. However, we kindly request that you provide attribution by linking back to this repository.
We hope you find the Awesome-LegalAI-Resources repository valuable and discover new insights and tools for your Legal AI journey. If you have any questions, suggestions, or concerns, please don't hesitate to open an issue. Happy exploring!