data-extraction

Data extraction scripts for the Nature of EU Rules project.

Full text extractor

Given a list of CELEX identifiers for EU legislation, the eu_rules_fulltext_extractor.py script downloads the corresponding documents from the EURLEX website. It stores the files in two folders: one for HTML documents and one for PDF documents. The script first tries to download the HTML version of a document, because this format is easier to parse later on. If no HTML version is available (some older documents are only published as PDF on the website), it downloads the PDF version instead. If neither the HTML nor the PDF version can be downloaded, the script adds the CELEX identifier to a list of identifiers that encountered download errors.
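
The HTML-first, PDF-second fallback described above can be sketched as follows. This is a minimal illustration and not the script's actual code: the URL patterns, the `download_all` function, and the injectable `fetch` callable are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional

# Assumed EUR-Lex URL patterns; the real script may build its URLs differently.
HTML_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:{celex}"
PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:{celex}"

# Hypothetical fetcher type: takes a URL, returns the body or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def download_all(celex_ids: Iterable[str], html_dir: Path, pdf_dir: Path,
                 fetch: Fetcher) -> list[str]:
    """Save each document as HTML if possible, else as PDF; return failed ids."""
    html_dir.mkdir(parents=True, exist_ok=True)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    problematic = []
    for celex in celex_ids:
        attempts = (
            (HTML_URL, html_dir / f"{celex}.html"),  # first priority
            (PDF_URL, pdf_dir / f"{celex}.pdf"),     # fallback
        )
        for url, target in attempts:
            try:
                content = fetch(url.format(celex=celex))
            except Exception:
                content = None  # treat any fetch error as "not available"
            if content:
                target.write_bytes(content)
                break
        else:
            # Neither format could be downloaded.
            problematic.append(celex)
    return problematic
```

Passing the fetcher in as an argument keeps the fallback logic testable without network access; in practice it could be a thin wrapper around an HTTP client.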

Metadata extractor

Given a list of CELEX identifiers for EU legislation, the eu_rules_metadata_extractor.py script downloads the metadata for the legislative documents downloaded by eu_rules_fulltext_extractor.py. The metadata is extracted from the EU Publications Office's CELLAR SPARQL interface. If any information is missing in CELLAR, the script falls back to scraping it from the EURLEX webpage for that legislation. The resulting metadata is stored in an output CSV file. Below is a table describing the metadata we extract for this project:
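
As a rough illustration of the SPARQL step, the sketch below builds a query for one CELEX identifier and flattens a result in the standard SPARQL JSON results format. The endpoint URL, the CDM property names, and both helper functions are assumptions for this example and are not taken from the script; check them against the CELLAR documentation before relying on them.

```python
# Assumed endpoint and ontology terms; not confirmed by this README.
SPARQL_ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"

QUERY_TEMPLATE = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?work ?date_adoption WHERE {{
  ?work cdm:resource_legal_id_celex ?celex .
  OPTIONAL {{ ?work cdm:work_date_document ?date_adoption . }}
  FILTER(str(?celex) = "{celex}")
}}
"""


def build_query(celex: str) -> str:
    """Return a SPARQL query string for a single CELEX identifier."""
    return QUERY_TEMPLATE.format(celex=celex)


def rows_from_sparql_json(payload: dict) -> list[dict]:
    """Flatten a SPARQL JSON result into one plain dict per binding.

    Variables that are unbound in a row become empty strings, which a caller
    could use as the signal to fall back to scraping the EURLEX page.
    """
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({var: binding.get(var, {}).get("value", "") for var in variables})
    return rows
```

The query would be sent to the endpoint with an HTTP client and an `Accept: application/sparql-results+json` header; that part is omitted here.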

#	metadata	description	example value	other possible values
1	celex	CELEX identifier for legislation	32005R0091	32019D0001
2	author	Author of the legislation	European Commission	European Parliament, Council of the European Union
3	responsible_body	EU body or agent responsible for legislation	DG01/X/00	SUD./X/00
4	form	Type of legislation	R (regulation)	L (directive), D (decision)
5	title	Short textual summary of the major details about the legislation	Commission Regulation (EEC) No 1631/82 of 21 June 1982 on the supply of common wheat flour to Somalia as food aid	Commission Regulation (EEC) No 1679/82 of 29 June 1982 fixing, for the 1982/83 marketing year, the reference prices for pears
6	addressee	Agent to which the legislation is addressed	France	Germany
7	date_adoption	Date when legislation was accepted for implementation	1997-03-11	1976-01-23
8	date_in_force	Date when legislation became enforceable (can have multiple values)	European Commission	1997-03-11
9	date_end_validity	Date when legislation expires	2023-02-02	2023-12-14
10	directory_code	Identifier for official policy area of legislation (extracted as HTTP link to RDF description of the policy area)	http://publications.europa.eu/resource/authority/dir-eu-legal-act/014065	http://publications.europa.eu/resource/authority/dir-eu-legal-act/04103010
11	eurovoc	EUROVOC keyword classification or categorisation of this legislation's topics	Union transit, viticulture, wine, beverage industry, tobacco, administrative cooperation	export (EU), chemical fertiliser, France, inter-company agreement
12	subject_matters	EURLEX alternative keyword classification scheme of this legislation's topics	Competition, Agreements, decisions and concerted practices	Commercial policy, Protective measures
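
The example CELEX values in the table follow a regular layout: a sector digit, a four-digit year, a document-type letter, and a number. The sketch below splits an identifier along those lines. It is a simplified illustration only: the pattern does not cover every CELEX variant (for example corrigenda suffixes), `parse_celex` and `FORM_LABELS` are hypothetical names, and the README does not state whether the script derives the form column this way.

```python
import re
from typing import NamedTuple

# Simplified pattern: sector digit, 4-digit year, 1-2 type letters, 4-digit number.
CELEX_PATTERN = re.compile(r"^(\d)(\d{4})([A-Z]{1,2})(\d{4})$")

# Labels for the three forms listed in the table above.
FORM_LABELS = {"R": "regulation", "L": "directive", "D": "decision"}


class CelexParts(NamedTuple):
    sector: str
    year: int
    form: str
    number: str


def parse_celex(celex: str) -> CelexParts:
    """Split a CELEX identifier into its parts, or raise ValueError."""
    match = CELEX_PATTERN.match(celex.strip())
    if match is None:
        raise ValueError(f"Unrecognised CELEX identifier: {celex!r}")
    sector, year, form, number = match.groups()
    return CelexParts(sector, int(year), form, number)
```

For instance, `parse_celex("32005R0091")` yields sector `"3"`, year `2005`, form `"R"` and number `"0091"`.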

Requirements

Python 3.9.12+
A tool for checking out a Git repository.
Input CSV file with single column (no header) of CELEX identifiers for EU legislation
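
For reference, an input file in the expected shape (one CELEX identifier per line, no header) can be read with a few lines of Python. This is a sketch and not necessarily how the scripts read their input; `read_celex_ids` is a hypothetical helper name.

```python
import csv


def read_celex_ids(csv_path) -> list[str]:
    """Read a single-column, header-less CSV file of CELEX identifiers."""
    ids = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and strip stray whitespace around each id.
            if row and row[0].strip():
                ids.append(row[0].strip())
    return ids
```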

Usage steps for `eu_rules_fulltext_extractor.py`

Get a copy of the code:

 git clone git@github.com:nature-of-eu-rules/data-extraction.git

Change into the data-extraction/ directory:
```
 cd data-extraction/
```

Create a new virtual environment, e.g.:

 python -m venv path/to/virtual/environment/folder/

Activate the new virtual environment, e.g. on macOS:

 source path/to/virtual/environment/folder/bin/activate

Install required libraries for the script in this virtual environment:
```
 pip install -r requirements.txt
```

Check the command line arguments required to run the script by typing:

 python eu_rules_fulltext_extractor.py -h
 
 OUTPUT >
 
 usage: eu_rules_fulltext_extractor.py [-h] -in INPUT -htp HTMLPATH -pdp PDFPATH -prp PROBPATH

 EURLEX PDF and HTML legislative documents downloader

 optional arguments:
 -h, --help            show this help message and exit

 required arguments:
 -in INPUT, --input INPUT
                 Path to input CSV file (single column, no header, list of celex identifiers). Find more info about CELEX identifiers here: http://eur-lex.europa.eu/content/help/eurlex-content/celex-number.html
                 
 -htp HTMLPATH, --htmlpath HTMLPATH
                 Path to directory where to store the extracted EU legislative documents from EURLEX (http://eur-lex.europa.eu/) in HTML format. Each downloaded document will be named only with the CELEX identifier e.g.
                 32012R0145.html
                 
 -pdp PDFPATH, --pdfpath PDFPATH
                 Path to directory where to store the extracted EU legislative documents from EURLEX (http://eur-lex.europa.eu/) in PDF format. Each downloaded document will be named only with the CELEX identifier e.g.
                 32013R0148.pdf
                 
 -prp PROBPATH, --probpath PROBPATH
                 Path to a directory. After execution, this script will write a CSV file called 'problematic-celexes.csv' to this directory containing a list of CELEX identifiers for legislation that could
                 not be downloaded for whatever reason

Example usage:

 python eu_rules_fulltext_extractor.py --input path/to/celex_nums.csv --htmlpath path/to/htmls/ --pdfpath path/to/pdfs/ --probpath path/to/problems/
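
A command-line interface with the same flags as the help output above could be declared with `argparse` roughly as follows. This is a reconstruction from the help text, not the script's actual source; `build_parser` is a hypothetical name and the help strings are shortened.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="eu_rules_fulltext_extractor.py",
        description="EURLEX PDF and HTML legislative documents downloader",
    )
    # A separate group so these are listed under "required arguments".
    required = parser.add_argument_group("required arguments")
    required.add_argument("-in", "--input", required=True,
                          help="Path to input CSV file of CELEX identifiers")
    required.add_argument("-htp", "--htmlpath", required=True,
                          help="Directory for downloaded HTML documents")
    required.add_argument("-pdp", "--pdfpath", required=True,
                          help="Directory for downloaded PDF documents")
    required.add_argument("-prp", "--probpath", required=True,
                          help="Directory for problematic-celexes.csv")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.htmlpath, args.pdfpath, args.probpath)
```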

Usage steps for `eu_rules_metadata_extractor.py`