Awesome
Legal Text Analytics
A list of selected resources, methods, and tools dedicated to Legal Text Analytics.
Please read the contribution guidelines before contributing. To add a resource, please raise a pull request. We also welcome discussion and proposals for new ideas (including additional content sections) as issues.
Contents
- Selected Tasks and Use Cases
- Methods
- Libraries
- Datasets and Data
- Large Language Models and GPT
- Annotation and Data Schemes
- Annotation Tools
- Software (interfaces)
- Research Groups, Labs, and Communities
- Tutorials
Selected Tasks and Use Cases
- Optical Character Recognition (find more information here)
- Legal Document Pre-processing (find more information here)
- Clause Segmentation and Sentence Boundary Detection
- Information Extraction and Named Entity Recognition (find more information here)
- Legal Norm Classification
- Machine Translation
- Document Comparison and Semantic Matching
- Text Summarization
- Argument Mining
- Question Answering
- Legal Case Outcome Prediction
- Legal and Regulatory Monitoring
- Legal Criticality Prediction
- Court View Generation
- Reference and Coreference Extraction
- Document Assembling and Generation
- Voice Transcription
- Anomaly Detection
- Data Anonymization
- Consistency Checking
- Natural Language Processing in the Legal Domain
Methods
- NLP Progress
- Text Visualizations
- Optical Character Recognition
- Rule-based methods for NLP, Apache Ruta, JAPE Grammar
- Statistical NLP
- Machine Learning Frameworks
- Neural networks and deep learning for NLP Tutorial
- Domain adaptation (e.g., research paper)
Libraries
- spaCy - Industrial-Strength Natural Language Processing
- scikit-learn - Machine learning in Python
- NLTK - Natural Language Toolkit
- Apache UIMA
- GATE - General Architecture for Text Engineering
- Hugging Face - more than 1000 pre-trained transformer/embedding models for the legal domain
- German BERT Model: Deepset AI
- Flair - SOTA NLP (incl. biomedical and legal data)
- Blackstone - Legal Named Entity Recognition and Text Categorizer
- Legal Reference Detection - Neo Search
- Legal Reference Detection - Open Legal Data
- Haystack - Transformers at scale for question answering & neural search
- Sentence Boundary Detection (US Caselaw)
- Quantitative Legal Studies
- CiteURL - an extensible tool to detect and hyperlink legal citations
- LexNLP – Python NLP library for legal text analytics
- Dutch Case Law Extractor - Functions to obtain published Dutch case law (rechtspraak) data and the available metadata associated with the cases
- Case Law Explorer - Materials for building a network analysis software platform for analyzing Dutch and European court decisions
Datasets and Data
- NLP Datasets
- An 800GB Dataset of Diverse Text for Language Modeling
- Meta Search: Google Dataset Search
- OpenLegalData
- IR Ad-hoc Ranking Benchmarks, Training Datasets, etc.
- Belgium: Belgian Statutory Article Retrieval Dataset (BSARD), including code
- Awesome German NLP
- German Dataset for Legal Information Retrieval (GerDaLIR)
- Legal Entity Recognition
- Legal Text Summarization
- Legal Text Translation
- Legal Document Classification
- Legal Sentence Classification (German)
- 100k German Court Decisions
- Legal Paper Datasets
- LexGLUE: a Benchmark Dataset for Legal Language Understanding in English
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
- MultiLegalPile: A 689GB Multilingual Legal Corpus
- MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
- MultiLegalNeg
- Awesome Legal Data
- Germany: Gesetze im Internet, Rechtsprechung im Internet, Verwaltungsvorschriften im Internet
- Germany: Annotated Court Decisions (Judgment style)
- Germany: German Federal Courts Dataset
- Germany: Quantitative dataset of asylum court hearings at German administrative courts. ASYFAIR
- Germany: Answering legal questions from laymen in German civil law system: Data and code. EACL Paper 2024
- Germany: Detecting void clauses in German standard form consumer contracts
- Germany: Aktenzeichen der Bundesrepublik Deutschland (AZ-BRD)
- Germany: Corpus des Deutschen Bundesrechts (C-DBR)
- Germany: Corpus der Entscheidungen des Bundesverfassungsgerichts (CE-BVerfG)
- Germany: Corpus der amtlichen Entscheidungssammlung des Bundesverfassungsgerichts (C-BVerfGE)
- Germany: Corona-Rechtsprechung des Bundesverfassungsgerichts (BVerfG-Corona)
- Germany: Corpus der Entscheidungen des Bundesverwaltungsgerichts (CE-BVerwG)
- Germany: Corpus der Entscheidungen des Bundesarbeitsgerichts (CE-BAG)
- Germany: Corpus der Entscheidungen des Bundespatentgerichts (CE-BPatG)
- Germany: Corpus der Entscheidungen des Bundesgerichtshofs (CE-BGH)
- Germany: Presidents and Vice-Presidents of the Federal Courts of Germany (PVP-FCG)
- Germany: Stoppwörter der Deutschen Rechtssprache (SW-DE-RS)
- France: The French Court Decision Structure dataset — FCD12K
- Switzerland: Swiss Legislation Corpus French and German
- Switzerland: Swiss Federal Supreme Court Dataset (SCD)
- Switzerland: Swiss Judgment Prediction
- Switzerland: Swiss Judgment Prediction XL
- Switzerland: Swiss Criticality Prediction
- Switzerland: Swiss Law Area Prediction
- Switzerland: Swiss Leading Decisions
- Switzerland: Swiss Legislation
- Switzerland: Swiss Rulings
- Switzerland: Swiss Leading Decision Summarization
- Switzerland: Swiss Citation Extraction
- Switzerland: Swiss Court View Generation
- Switzerland: Swiss Doc2Doc Information Retrieval
- Turkey: Prediction of Outcomes in the Higher Courts of Turkey
- India: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation
- ECtHR: Judicial Decisions of the European Court of Human Rights
- ECtHR: LaCour!: Enabling Research on Argumentation in Hearings of the European Court of Human Rights
- ECtHR: Argument Mining Corpus
- EU Law (eurlex R Package), Digital Corpus of the European Parliament (DCEP)
- EU Regulatory Compliance Information Retrieval
- EU LEXTREME
- Israel: The Israeli Supreme Court Database
- Canada: Federal Laws and Regulations (ftp://205.193.86.89/)
- UK: UK Law Reports & Case Law Search
- UK: Cambridge Law Corpus
- Australia: Open Australian Legal Corpus — The first and only multijurisdictional open corpus of Australian legislative and judicial documents
- US Statutory Law Interpretation Data Set
- US Caselaw Sentence Boundary Detection Dataset
- US Caselaw Functional and Issue Specific Segmentation Dataset
- US Caselaw Sentence Polarity Detection
- US Caselaw Access Project
- US Federal caselaw via CourtListener RECAP by the Free.Law project, includes an API
- US Supreme Court Database
- US House of Representatives Office of the Law Revision Counsel
- US Board of Veterans Appeals (BVA) Citation Prediction Dataset and Code
- Overview of Political Science Datasets: PolData
- International Law: Text of Trade Agreements (ToTA)
- International Law: Corpus of Decisions: International Court of Justice (CD-ICJ)
- International Law: Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ)
- United Nations: United Nations General Debate Corpus, United Nations Parallel Corpus
- Contract Understanding Atticus Dataset by The Atticus Project: A corpus of 13,000+ labels in 510 commercial legal contracts with rich expert annotations.
- Kira Systems M&A Dataset by Kira Systems: A non-commercial use dataset comprising 4,400 documents and labels for 50 legal concepts in the M&A Due Diligence setting.
- India: ILSI Dataset for Legal Statute Identification
- India: Dataset for Semantic Segmentation / Rhetorical Role Labeling
- India: Summarization with Multiple Datasets
- India: BUILDNyAI
- European Patent Office - EP full-text data for text analytics
- Google Patents Public Datasets: connecting public, paid, and private patent data
- World Patent Information (WPI) - Documents technical domains from the major patenting authorities
- Genocide Transcript Corpus (GTC)
Large Language Models and GPT
- See dedicated repository on Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) for Legal
- ChatGPT at OpenAI: Examples, Documentation, Pricing, Fine-tuning ChatGPT
- Sketch summarizing ChatGPT
- Large Language Models: Report by KI Bundesverband
- Large Language Models: Hugging Face Report
- Report on Limitations of ChatGPT
- GPT Takes the Bar Exam
- Legal Language Models
Annotation and Data Schemes
- Annotation guidelines for Legal Entity Recognition (Germany)
- Semantic Types of Legal Norms
- Annotation Guidelines for Sentence Boundary Detection in Caselaw (US)
- Annotation Guidelines for Sentence Value in Statutory Interpretation (US)
- SALI: Modern Legal Industry Standards
Annotation Tools
Software (interfaces)
- Case Law Explorer - Network analysis software platform for analyzing Dutch and European court decisions - User Guide
- Electronic Database on Investment Treaties (EDIT)
- GraphDoc - User-friendly graphical interface that allows building decision trees - codebase
- gesp - Download all publicly available German court decisions straight from your terminal
Research Groups, Labs, and Communities
- Stanford University - CodeX: The Stanford Center for Legal Informatics
- Technical University of Munich
- Technical University of Munich - Legal Tech Group
- Bucerius Center on the Legal Profession
- Suffolk Law School - Legal Innovation & Technology (LIT) Lab
- University of Ottawa - Legal Technology Lab
- University of Vienna - Department of Innovation and Digitalisation in Law
- University of Amsterdam - Leibniz Center for Law
- University of Helsinki - LegalTech Research Lab
- Hofstra University - Law, Logic & Technology Research Laboratory
- Computational Legal Studies
- CIRSFID-AI – University of Bologna
- IAAIL - International Association for AI and Law
- ASAIL - Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts
- Workshop on Natural Legal Language Processing: Papers, models, data sets, and related events
- Chinese AI and Law (CAIL)
- University of Copenhagen, iCourts, the Danish National Research Foundation's Centre of Excellence for International Courts
- Maastricht Law and Tech Lab
Tutorials
- Monkey Learn - Text Analysis
- Using NLP to understand laws
- Document Representation for Legal Texts
- Data Science for Lawyers - Learning Resources
- Coding for Lawyers (discontinued)
- Custom NLP Approaches to Data Anonymization
- Information Extraction in legal documents
- Legal NLP: Sentence classification and Explainable AI
- Legal AI Glossary
- Legal AI Learning Centre
Credits
Many thanks to our contributors and many more.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.