data-extraction
Data extraction scripts for the Nature of EU Rules project.
Full text extractor
Given a list of CELEX identifiers for EU legislation, the eu_rules_fulltext_extractor.py script downloads the corresponding legislative documents from the EURLEX website. It stores the files in two folders: one for HTML documents and one for PDF documents. Some older documents are only available in PDF format on the website. The script first tries to download the HTML version of a document, because this format is easier to parse later on. If no HTML version is available, it downloads the PDF version instead. If neither the HTML nor the PDF version can be downloaded, the script adds the CELEX identifier to a list of documents that failed to download.
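The HTML-first, PDF-second fallback described above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the two URL patterns, the function name `download_all` and the injected `fetch` callable (which returns the document bytes, or `None` on failure) are assumptions made for this example.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional

# Assumed EURLEX URL patterns; the real script may build its URLs differently.
HTML_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:{celex}"
PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:{celex}"

# Hypothetical fetcher type: takes a URL, returns the body or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def download_all(celexes: Iterable[str], fetch: Fetcher,
                 html_dir: Path, pdf_dir: Path) -> List[str]:
    """Save each document as HTML if possible, else as PDF; return the failures."""
    html_dir.mkdir(parents=True, exist_ok=True)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    problematic: List[str] = []
    for celex in celexes:
        # HTML comes first because it is easier to parse later on.
        attempts = (
            (HTML_URL, html_dir / f"{celex}.html"),
            (PDF_URL, pdf_dir / f"{celex}.pdf"),
        )
        for url, target in attempts:
            content = fetch(url.format(celex=celex))
            if content:
                target.write_bytes(content)
                break
        else:
            # Neither format could be retrieved for this identifier.
            problematic.append(celex)
    return problematic
```

Passing the fetcher in as an argument keeps the fallback logic separate from the HTTP details, so it can be exercised without network access.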
Metadata extractor
Given a list of CELEX identifiers for EU legislation, the eu_rules_metadata_extractor.py script downloads the metadata for the corresponding legislative documents downloaded by eu_rules_fulltext_extractor.py. The metadata is extracted from the EU Publications Office's CELLAR SPARQL interface. If any information is missing in CELLAR, the script falls back to scraping it from the EURLEX webpage for that legislation. The resulting metadata is stored in an output CSV file. Below is a table describing the metadata we extract for this project:
# | metadata | description | example value | other possible values |
---|---|---|---|---|
1 | celex | CELEX identifier for legislation | 32005R0091 | 32019D0001 |
2 | author | Author of the legislation | European Commission | European Parliament, Council of the European Union |
3 | responsible_body | EU body or agent responsible for legislation | DG01/X/00 | SUD./X/00 |
4 | form | Type of legislation | R (regulation) | L (directive), D (decision) |
5 | title | Short textual summary of the major details about the legislation | Commission Regulation (EEC) No 1631/82 of 21 June 1982 on the supply of common wheat flour to Somalia as food aid | Commission Regulation (EEC) No 1679/82 of 29 June 1982 fixing, for the 1982/83 marketing year, the reference prices for pears |
6 | addressee | Agent to which the legislation is addressed | France | Germany |
7 | date_adoption | Date when legislation was accepted for implementation | 1997-03-11 | 1976-01-23 |
8 | date_in_force | Date when legislation became enforceable (can have multiple values) | 1997-03-11 | |
9 | date_end_validity | Date when legislation expires | 2023-02-02 | 2023-12-14 |
10 | directory_code | Identifier for official policy area of legislation (extracted as HTTP link to RDF description of the policy area) | http://publications.europa.eu/resource/authority/dir-eu-legal-act/014065 | http://publications.europa.eu/resource/authority/dir-eu-legal-act/04103010 |
11 | eurovoc | EUROVOC keyword classification or categorisation of this legislation's topics | Union transit, viticulture, wine, beverage industry, tobacco, administrative cooperation | export (EU), chemical fertiliser, France, inter-company agreement |
12 | subject_matters | EURLEX alternative keyword classification scheme of this legislation's topics | Competition, Agreements, decisions and concerted practices | Commercial policy, Protective measures |
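The CELLAR-first, EURLEX-second lookup described above the table could look roughly like this. It is only a sketch: `query_cellar` and `scrape_eurlex` are hypothetical callables standing in for the SPARQL query and the webpage scraper, and `FIELDS` is a subset of the columns in the table chosen for brevity.

```python
import csv
from typing import Callable, Dict, Iterable, Sequence

# Subset of the metadata columns from the table, for illustration only.
FIELDS = ["celex", "author", "form", "title", "date_adoption"]

# Hypothetical source type: maps a CELEX identifier to {field: value}.
Source = Callable[[str], Dict[str, str]]


def collect_metadata(celex: str, query_cellar: Source, scrape_eurlex: Source,
                     fields: Sequence[str] = FIELDS) -> Dict[str, str]:
    """Prefer CELLAR values; fill empty or missing fields from EURLEX."""
    primary = query_cellar(celex)
    missing = [f for f in fields if f != "celex" and not primary.get(f)]
    # Only contact the fallback source when something is actually missing.
    fallback = scrape_eurlex(celex) if missing else {}
    record = {"celex": celex}
    for field in fields:
        if field != "celex":
            record[field] = primary.get(field) or fallback.get(field, "")
    return record


def write_metadata(records: Iterable[Dict[str, str]], path: str,
                   fields: Sequence[str] = FIELDS) -> None:
    """Write the records to a CSV file with one column per field."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(records)
```

Fields that neither source provides are left as empty strings, so every row in the CSV has the same columns.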
Requirements
- Python 3.9.12+
- A tool for checking out a Git repository.
- Input CSV file with single column (no header) of CELEX identifiers for EU legislation
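As an illustration of the expected input format, the sketch below reads a single-column, header-less CSV file and flags entries that do not look like CELEX identifiers. The function name `read_celexes` is hypothetical, and the pattern is a simplifying assumption that matches the identifiers shown in this README (such as 32005R0091); real CELEX identifiers can take other shapes, so unmatched entries are reported rather than discarded.

```python
import csv
import re
from typing import List, Tuple

# Simplified, assumed shape: sector digit, 4-digit year, 1-2 letter document
# type, then a number of at least 4 digits. Not a full CELEX grammar.
CELEX_PATTERN = re.compile(r"^\d\d{4}[A-Z]{1,2}\d{4,}$")


def read_celexes(path: str) -> Tuple[List[str], List[str]]:
    """Return (matching, unexpected) identifiers from a one-column CSV file."""
    matching: List[str] = []
    unexpected: List[str] = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and empty first cells.
            if not row or not row[0].strip():
                continue
            celex = row[0].strip()
            if CELEX_PATTERN.match(celex):
                matching.append(celex)
            else:
                unexpected.append(celex)
    return matching, unexpected
```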
Usage steps for eu_rules_fulltext_extractor.py
- Get a copy of the code:

      git clone git@github.com:nature-of-eu-rules/data-extraction.git

- Change into the data-extraction/ directory:

      cd data-extraction/

- Create a new virtual environment, e.g.:

      python -m venv path/to/virtual/environment/folder/

- Activate the new virtual environment, e.g. on macOS:

      source path/to/virtual/environment/folder/bin/activate

- Install the libraries required by the script in this virtual environment:

      pip install -r requirements.txt

- Check the command line arguments required to run the script by typing:

      python eu_rules_fulltext_extractor.py -h

  OUTPUT >

      usage: eu_rules_fulltext_extractor.py [-h] -in INPUT -htp HTMLPATH -pdp PDFPATH -prp PROBPATH

      EURLEX PDF and HTML legislative documents downloader

      optional arguments:
        -h, --help            show this help message and exit

      required arguments:
        -in INPUT, --input INPUT
                              Path to input CSV file (single column, no header,
                              list of celex identifiers). Find more info about
                              CELEX identifiers here:
                              http://eur-lex.europa.eu/content/help/eurlex-content/celex-number.html
        -htp HTMLPATH, --htmlpath HTMLPATH
                              Path to directory where to store the extracted EU
                              legislative documents from EURLEX
                              (http://eur-lex.europa.eu/) in HTML format. Each
                              downloaded document will be named only with the
                              CELEX identifier e.g. 32012R0145.html
        -pdp PDFPATH, --pdfpath PDFPATH
                              Path to directory where to store the extracted EU
                              legislative documents from EURLEX
                              (http://eur-lex.europa.eu/) in PDF format. Each
                              downloaded document will be named only with the
                              CELEX identifier e.g. 32013R0148.pdf
        -prp PROBPATH, --probpath PROBPATH
                              Path to a directory. After execution, this script
                              will write a CSV file called
                              'problematic-celexes.csv' to this directory
                              containing a list of CELEX identifiers for
                              legislation that could not be downloaded for
                              whatever reason

- Example usage:

      python eu_rules_fulltext_extractor.py --input path/to/celex_nums.csv --htmlpath path/to/htmls/ --pdfpath path/to/pdfs/ --probpath path/to/problems/
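For readers curious how a command line interface like the one in the help output can be declared, here is a sketch using Python's standard argparse module. It is reconstructed from the help text alone and is not taken from the script's source; the function name `build_parser` and the shortened help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the help output shown in the usage steps."""
    parser = argparse.ArgumentParser(
        prog="eu_rules_fulltext_extractor.py",
        description="EURLEX PDF and HTML legislative documents downloader",
    )
    # A named group makes the options appear under "required arguments".
    required = parser.add_argument_group("required arguments")
    required.add_argument("-in", "--input", required=True,
                          help="Path to input CSV file of CELEX identifiers")
    required.add_argument("-htp", "--htmlpath", required=True,
                          help="Directory for downloaded HTML documents")
    required.add_argument("-pdp", "--pdfpath", required=True,
                          help="Directory for downloaded PDF documents")
    required.add_argument("-prp", "--probpath", required=True,
                          help="Directory for 'problematic-celexes.csv'")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.htmlpath, args.pdfpath, args.probpath)
```

Because each option is marked `required=True`, argparse exits with a usage message if any of the four paths is omitted.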
Usage steps for eu_rules_metadata_extractor.py
- Get a copy of the code:

      git clone git@github.com:nature-of-eu-rules/data-extraction.git

- Change into the data-extraction/ directory:

      cd data-extraction/

- Create a new virtual environment, e.g.:

      python -m venv path/to/virtual/environment/folder/

- Activate the new virtual environment, e.g. on macOS:

      source path/to/virtual/environment/folder/bin/activate

- Install the libraries required by the script in this virtual environment:

      pip install -r requirements.txt

- Check the command line arguments required to run the script by typing:

      python eu_rules_metadata_extractor.py -h

  OUTPUT >

      usage: eu_rules_metadata_extractor.py [-h] -in INPUT -out OUTPUT

      EURLEX PDF and HTML legislative documents metadata downloader

      optional arguments:
        -h, --help            show this help message and exit

      required arguments:
        -in INPUT, --input INPUT
                              Path to input CSV file (single column, no header,
                              list of celex identifiers). Find more info about
                              CELEX identifiers here:
                              http://eur-lex.europa.eu/content/help/eurlex-content/celex-number.html
        -out OUTPUT, --output OUTPUT
                              Path to a CSV file to store the metadata in e.g.
                              'path/to/metadata.csv'.

- Example usage:

      python eu_rules_metadata_extractor.py --input path/to/celex_nums.csv --output path/to/metadata.csv
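Once the metadata CSV has been written, it can be inspected with a few lines of Python. The sketch below tallies documents by type of legislation. It assumes that the output file has a header row whose column names match the metadata table (in particular a column named form), which this README does not state explicitly; `count_by_form` is a hypothetical helper name.

```python
import csv
from collections import Counter

# Labels taken from the "form" row of the metadata table.
FORM_LABELS = {"R": "regulation", "L": "directive", "D": "decision"}


def count_by_form(path: str) -> Counter:
    """Count rows per legislation type, assuming a header with a 'form' column."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Unknown codes are kept as-is rather than dropped.
        return Counter(FORM_LABELS.get(row["form"], row["form"]) for row in reader)
```

For example, `count_by_form("path/to/metadata.csv")` would return a Counter keyed by "regulation", "directive" and "decision", plus any codes not listed in `FORM_LABELS`.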
License
Copyright (2023) Kody Moodley, The Netherlands eScience Center
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.