regulatory-statement-classification
Scripts, algorithms and files for classifying sentences in EU legislative documents (in PDF or HTML format) as either regulatory or non-regulatory in nature using the Institutional Grammar Tool, as well as for evaluating the accuracy of these approaches. This code is developed as part of the Nature of EU Rules project.
Requirements
Setup: before running any scripts in this repository
- Get a copy of the code:
git clone git@github.com:nature-of-eu-rules/regulatory-statement-classification.git
- Change into the regulatory-statement-classification/ directory:
cd regulatory-statement-classification/
- Create a new virtual environment, e.g.:
python -m venv path/to/virtual/environment/folder/
- Activate the new virtual environment, e.g. on macOS type:
source path/to/virtual/environment/folder/bin/activate
- Install the required libraries for the scripts in this virtual environment:
pip install -r requirements.txt
Description of scripts in this repository
rule-based-classification.py and rule-based-classification-batch.py
Given a list of English sentences originating in EU legislative documents, these scripts apply a rule-based approach using grammatical dependency parsing and predefined dictionaries to classify each sentence as either regulatory (1) or non-regulatory (0) in nature. The only difference in rule-based-classification-batch.py is that classification results are periodically saved to disk after all sentences in documents from a specific year have been processed, rather than all results for all documents being written to file only at the end.
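For intuition, the sketch below shows the general idea (dependency parsing combined with a dictionary of agent nouns to detect deontic statements). It assumes spaCy with its small English model; it is NOT the repository's actual implementation, and the names DEONTIC_VERBS and classify_sentence are hypothetical, introduced here purely for illustration.

# Illustrative sketch only -- not the algorithm implemented in rule-based-classification.py.
import json
import spacy

nlp = spacy.load("en_core_web_sm")
DEONTIC_VERBS = {"shall", "must", "may"}  # assumed dictionary of deontic markers

def classify_sentence(sentence, agent_nouns):
    """Return (label, attribute): 1/0 for regulatory/non-regulatory, plus the regulated entity."""
    doc = nlp(sentence)
    for token in doc:
        # look for a deontic auxiliary attached to a verb, e.g. "The applicant shall submit ..."
        if token.dep_ == "aux" and token.lower_ in DEONTIC_VERBS:
            head = token.head
            # the grammatical subject of that verb is a candidate 'attribute'
            for subj in (t for t in head.children if t.dep_ in ("nsubj", "nsubjpass")):
                if subj.lemma_.lower() in agent_nouns:
                    return 1, subj.text
    return 0, None

with open("agent_nouns.json") as f:                 # {"agent_nouns": ["applicant", ...]}
    agents = set(json.load(f)["agent_nouns"])
print(classify_sentence("The applicant shall submit the form within 30 days.", agents))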
Input
A CSV file with at least one column with the column header 'sent'. This column should contain English sentences that originate in EU legislative documents.
Output
The same CSV file as the input, with two additional columns: 'regulatory_according_to_rule' and 'attribute_according_to_rule'. These contain, respectively, the classification result (1 if the sentence is regulatory, 0 if not) and the name of the entity (the 'attribute') in the sentence that is being regulated by the regulatory statement. The accuracy of the attribute extraction is not currently measured and, based on cursory analysis, is likely lower than the classification accuracy itself.
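A quick way to inspect these output columns, assuming pandas is installed (an illustrative snippet, not part of the repository):

import pandas as pd

df = pd.read_csv("path/to/output.csv")
share_regulatory = df["regulatory_according_to_rule"].mean()
print(f"{share_regulatory:.1%} of sentences were classified as regulatory")
print(df[["sent", "regulatory_according_to_rule", "attribute_according_to_rule"]].head())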
Usage
- Check the command line arguments required to run the script by typing (use analogous instructions for rule-based-classification-batch.py):

python rule-based-classification.py -h

OUTPUT >

usage: rule-based-classification.py [-h] -in INPUT -out OUTPUT -agts AGENTS

Regulatory vs. Non-regulatory sentence classifier for EU legislation based on NLP dependency analysis

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  -in INPUT, --input INPUT
                        Path to input CSV file. Must have at least one column with header 'sent' containing sentences from EU legislation in English.
  -out OUTPUT, --output OUTPUT
                        Path to output CSV file in which to store the classification results.
  -agts AGENTS, --agents AGENTS
                        Path to JSON file which contains data of the form {'agent_nouns' : [...list of lowercase English word strings, each of which represents an entity with agency...]}. Some example words include 'applicant', 'court', 'tenderer' etc.
- Example usage:

python rule-based-classification.py --input path/to/input.csv --output path/to/output.csv --agents path/to/agent_nouns.json
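If you do not yet have an agent-nouns file, a minimal one in the format described by the help text above can be created like this (the word list is illustrative only; the three words come from the help text):

import json

agent_nouns = {"agent_nouns": ["applicant", "court", "tenderer"]}
with open("path/to/agent_nouns.json", "w") as f:
    json.dump(agent_nouns, f, indent=2)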
generate-sample-for-annotation-and-classificationperformance-evaluation.py
Given a list of metadata for EU legislative documents in CSV format (see the https://github.com/nature-of-eu-rules/data-extraction repo for scripts to download such data), this script generates a representative sample based on the variation in year and policy area across the input documents. This sample can be used for human labelling, both to train a classification model and to evaluate the accuracy of that model and of the rule-based algorithm implemented in rule-based-classification.py.
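As a rough illustration of what such stratified sampling can look like (not the script's actual logic), assuming pandas and the year and dc_string columns described under Output below:

# Illustrative sketch of proportional sampling per (year, policy area) stratum.
import pandas as pd

df = pd.read_csv("path/to/input.csv")   # metadata file, one row per document
SAMPLE_FRACTION = 0.05                  # assumed sampling rate

sample = (
    df.groupby(["year", "dc_string"], group_keys=False)
      .apply(lambda g: g.sample(frac=SAMPLE_FRACTION, random_state=42))
)
sample.to_csv("path/to/sample.csv", index=False)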
Input
See the metadata output of the eu_rules_metadata_extractor.py script in the https://github.com/nature-of-eu-rules/data-extraction repo. The input CSV file for generate-sample-for-annotation-and-classificationperformance-evaluation.py should have the same format.
Output
A CSV file whose rows are a subset of the rows of the input CSV file, with the following columns / metadata retained: celex (document identifier), form (legislation type), year (year when the legislation was published), dc_string (policy area), format (file extension, i.e. PDF/HTML). See the https://github.com/nature-of-eu-rules/data-extraction repo for more info.
Usage
- Check the command line arguments required to run the script by typing:

python generate-sample-for-annotation-and-classificationperformance-evaluation.py -h

OUTPUT >

usage: generate-sample-for-annotation-and-classificationperformance-evaluation.py [-h] -in INPUT -out OUTPUT

EU law sample document generator: generates a sample of EU legislative documents to annotate for regulatory sentence classification

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  -in INPUT, --input INPUT
                        Path to input CSV file. See output file of 'eu_rules_metadata_extractor.py' in the https://github.com/nature-of-eu-rules/data-extraction repo for the required columns
  -out OUTPUT, --output OUTPUT
                        Path to output CSV file which stores the generated sample (subset of the rows in the input file)
- Example usage:

python generate-sample-for-annotation-and-classificationperformance-evaluation.py --input path/to/input.csv --output path/to/output.csv
train-fewshot-classifier.py
Given a dataset of sentences labelled by human legal experts, this script fine-tunes a pre-trained few-shot model (facebook/bart-large-mnli) for the same sentence classification task that rule-based-classification.py tackles.
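For context, the same model can also be used out of the box as a zero-shot classifier, which makes a convenient baseline before any fine-tuning. The sketch below assumes the transformers library and an illustrative label phrasing; it is not the training procedure implemented in this script.

# Illustrative zero-shot baseline with the same underlying model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["regulatory", "non-regulatory"]   # assumed label phrasing

result = classifier(
    "The applicant shall submit the completed form within 30 days.",
    candidate_labels=labels,
)
print(result["labels"][0], result["scores"][0])   # top label and its score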
Input
An input CSV file with training data. At least two columns are required: 1) a column with all items to classify (in our case, English sentences); 2) a column with a human-assigned label (the integer 0 or 1) for each item. In our case, 0 corresponds to a non-regulatory sentence and 1 to a regulatory sentence.
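A tiny illustrative training file in this format could be produced as follows (the column names 'sent' and 'label' are arbitrary; pass whatever names you use via --itemscol and --classcol):

import pandas as pd

train = pd.DataFrame({
    "sent": [
        "Member States shall notify the Commission of any exemption granted.",
        "This Regulation enters into force on the twentieth day following its publication.",
    ],
    "label": [1, 0],   # 1 = regulatory, 0 = non-regulatory
})
train.to_csv("path/to/input.csv", index=False)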
Output
- One or more classification models (depending on the parameters specified when running the script), saved to disk as .model files.
- A CSV file with validation results and predicted labels for the classification task on the input data. Each row in the file gives the classification results for one specific combination of the script's input parameters.
Usage
- Check the command line arguments required to run the script by typing:

python train-fewshot-classifier.py -h

OUTPUT >

usage: train-fewshot-classifier.py [-h] -in INPUT -ic ITEMSCOL -cc CLASSCOL -b BSIZE -e EPOCHS -t TSPLIT -out OUTPUT

Fine-tune facebook/bart-large-mnli fewshot model to classify English sentences from EU law as either regulatory or non-regulatory

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  -in INPUT, --input INPUT
                        Path to input CSV file with training data.
  -ic ITEMSCOL, --itemscol ITEMSCOL
                        Name of column in input CSV file which contains the items to classify
  -cc CLASSCOL, --classcol CLASSCOL
                        Name of column in input CSV file which contains the classified labels for the items
  -b BSIZE, --bsize BSIZE
                        List of batch sizes e.g. [8,16,32]
  -e EPOCHS, --epochs EPOCHS
                        List of numbers indicating different training iterations or epochs to try e.g. [20,25,30]
  -t TSPLIT, --tsplit TSPLIT
                        Proportion of data to use as training data (the remainder will be used for validation). Number between 0 and 1. E.g. a value of 0.8 means 80 percent of the data will be used for training and 20 for validation.
  -out OUTPUT, --output OUTPUT
                        Path to output CSV file with a summary of training results
- Example usage:

python train-fewshot-classifier.py --input path/to/input.csv --itemscol 'item_col_name' --classcol 'itemlabel_col_name' --bsize '[8,16]' --epochs '[20,25]' --tsplit 0.8 --output path/to/output.csv
License
Copyright (2023) Kody Moodley, Christiaan Meijer, The Netherlands eScience Center
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.