data-extraction

Data extraction scripts for the Nature of EU Rules project.

Full text extractor

Given a list of CELEX identifiers for EU legislation, the eu_rules_fulltext_extractor.py script downloads the corresponding documents from the EURLEX website. It stores the files in two folders: one for HTML documents and one for PDF documents. The script first tries to download the HTML version of a document, because this format is easier to parse later on. If no HTML version is available (some older documents are published only as PDF on the website), it downloads the PDF version instead. If neither the HTML nor the PDF version can be downloaded, the script adds the CELEX identifier to a list of identifiers that encountered download errors.
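
The HTML-first, PDF-fallback behaviour described above can be sketched as follows. This is a minimal illustration and not the script's actual code: the function names `fetch`, `download_document` and `download_all`, as well as the two EURLEX URL patterns, are assumptions made for this example.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed EURLEX URL patterns; the real script may build its URLs differently.
HTML_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:{celex}"
PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:{celex}"


def fetch(url, timeout=30):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None


def download_document(celex, html_dir, pdf_dir, fetch=fetch):
    """Try the HTML version first, then the PDF. Return the saved path or None."""
    attempts = (
        (HTML_URL, html_dir, ".html"),
        (PDF_URL, pdf_dir, ".pdf"),
    )
    for url_template, folder, suffix in attempts:
        content = fetch(url_template.format(celex=celex))
        if content:
            # Files are named only with the CELEX identifier, e.g. 32012R0145.html
            target = Path(folder) / f"{celex}{suffix}"
            target.write_bytes(content)
            return target
    return None


def download_all(celexes, html_dir, pdf_dir, fetch=fetch):
    """Download every document and return the CELEX identifiers that failed."""
    return [
        celex
        for celex in celexes
        if download_document(celex, html_dir, pdf_dir, fetch) is None
    ]
```

Passing `fetch` as a parameter is a design choice for this sketch only: it lets the fallback logic be exercised with a stub instead of live requests to EURLEX.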

Metadata extractor

Given a list of CELEX identifiers for EU legislation, the eu_rules_metadata_extractor.py script downloads the metadata for the corresponding legislative documents (the same documents that eu_rules_fulltext_extractor.py downloads). The metadata is extracted from the EU Publications Office's CELLAR SPARQL interface. If any information is missing in CELLAR, the script falls back to scraping it from the EURLEX webpage for that legislation. The resulting metadata is stored in an output CSV file. The table below describes the metadata we extract for this project:

| # | metadata | description | example value | other possible values |
|---|---|---|---|---|
| 1 | celex | CELEX identifier for legislation | 32005R0091 | 32019D0001 |
| 2 | author | Author of the legislation | European Commission | European Parliament, Council of the European Union |
| 3 | responsible_body | EU body or agent responsible for legislation | DG01/X/00 | SUD./X/00 |
| 4 | form | Type of legislation | R (regulation) | L (directive), D (decision) |
| 5 | title | Short textual summary of the major details about the legislation | Commission Regulation (EEC) No 1631/82 of 21 June 1982 on the supply of common wheat flour to Somalia as food aid | Commission Regulation (EEC) No 1679/82 of 29 June 1982 fixing, for the 1982/83 marketing year, the reference prices for pears |
| 6 | addressee | Agent to which the legislation is addressed | France | Germany |
| 7 | date_adoption | Date when legislation was accepted for implementation | 1997-03-11 | 1976-01-23 |
| 8 | date_in_force | Date when legislation became enforceable (can have multiple values) | 1997-03-11 | |
| 9 | date_end_validity | Date when legislation expires | 2023-02-02 | 2023-12-14 |
| 10 | directory_code | Identifier for official policy area of legislation (extracted as HTTP link to RDF description of the policy area) | http://publications.europa.eu/resource/authority/dir-eu-legal-act/014065 | http://publications.europa.eu/resource/authority/dir-eu-legal-act/04103010 |
| 11 | eurovoc | EUROVOC keyword classification or categorisation of this legislation's topics | Union transit, viticulture, wine, beverage industry, tobacco, administrative cooperation | export (EU), chemical fertiliser, France, inter-company agreement |
| 12 | subject_matters | EURLEX alternative keyword classification scheme of this legislation's topics | Competition, Agreements, decisions and concerted practices | Commercial policy, Protective measures |
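
As an illustration of the CELLAR lookup mentioned in this section, the sketch below builds a SPARQL query for one CELEX identifier and flattens a standard SPARQL JSON result into plain rows. It is not the script's actual query: the endpoint URL, the CDM property names (`cdm:resource_legal_id_celex`, `cdm:work_date_document`), the mapping of the latter to `date_adoption`, and the helper names are all assumptions that should be checked against the CELLAR documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed public endpoint of the CELLAR SPARQL interface.
SPARQL_ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"

# Assumed CDM property names; only one metadata field is requested here.
QUERY_TEMPLATE = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?date_adoption WHERE {{
  ?work cdm:resource_legal_id_celex "{celex}"^^xsd:string .
  OPTIONAL {{ ?work cdm:work_date_document ?date_adoption . }}
}}
"""


def build_query(celex):
    """Return a SPARQL query string for a single CELEX identifier."""
    # Simplification: accept only plain alphanumeric identifiers so that
    # nothing unexpected is interpolated into the query text.
    if not celex.isalnum():
        raise ValueError(f"unexpected CELEX identifier: {celex!r}")
    return QUERY_TEMPLATE.format(celex=celex)


def run_query(query, endpoint=SPARQL_ENDPOINT, timeout=60):
    """Send the query over HTTP GET and return the parsed JSON response."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

A variable that is unbound for a given document is simply absent from its row, which is the point at which a fallback such as scraping the EURLEX page would take over.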

Requirements

Usage steps for eu_rules_fulltext_extractor.py

  1. Get a copy of the code:

     git clone git@github.com:nature-of-eu-rules/data-extraction.git
    
  2. Change into the data-extraction/ directory:

     cd data-extraction/
    
  3. Create a new virtual environment, e.g.:

     python -m venv path/to/virtual/environment/folder/
    
  4. Activate the new virtual environment, e.g. on macOS type:

     source path/to/virtual/environment/folder/bin/activate
     
    
  5. Install required libraries for the script in this virtual environment:

     pip install -r requirements.txt
    
  6. Check the command line arguments required to run the script by typing:

     python eu_rules_fulltext_extractor.py -h
     
     OUTPUT >
     
     usage: eu_rules_fulltext_extractor.py [-h] -in INPUT -htp HTMLPATH -pdp PDFPATH -prp PROBPATH
    
     EURLEX PDF and HTML legislative documents downloader
    
     optional arguments:
     -h, --help            show this help message and exit
    
     required arguments:
     -in INPUT, --input INPUT
                     Path to input CSV file (single column, no header, list of celex identifiers). Find more info about CELEX identifiers here: http://eur-lex.europa.eu/content/help/eurlex-content/celex-
                     number.html
                     
     -htp HTMLPATH, --htmlpath HTMLPATH
                     Path to directory where to store the extracted EU legislative documents from EURLEX (http://eur-lex.europa.eu/) in HTML format. Each downloaded document will be named only with the CELEX identifier e.g.
                     32012R0145.html
                     
     -pdp PDFPATH, --pdfpath PDFPATH
                     Path to directory where to store the extracted EU legislative documents from EURLEX (http://eur-lex.europa.eu/) in PDF format. Each downloaded document will be named only with the CELEX identifier e.g.
                     32013R0148.pdf
                     
     -prp PROBPATH, --probpath PROBPATH
                     Path to a directory. After execution, this script will write a CSV file called 'problematic-celexes.csv' to this directory containing a list of CELEX identifiers for legislation that could
                     not be downloaded for whatever reason
    
  7. Example usage:

     python eu_rules_fulltext_extractor.py --input path/to/celex_nums.csv --htmlpath path/to/htmls/ --pdfpath path/to/pdfs/ --probpath path/to/problems/
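
Before running the command above, the input file and the three directories need to exist. The following sketch prepares them. The helper names and the CELEX pattern are assumptions for illustration; the pattern is deliberately simplified and only flags identifiers that look unusual, it does not reject them.

```python
import csv
import re
from pathlib import Path

# Simplified pattern for sector-3 (legislation) identifiers such as 32012R0145:
# sector digit, four-digit year, one or two letter document type, four-digit number.
# Real CELEX identifiers have more variants, so a mismatch is only a warning.
CELEX_PATTERN = re.compile(r"^3\d{4}[A-Z]{1,2}\d{4}$")


def write_input_csv(celexes, csv_path):
    """Write one CELEX identifier per row, without a header row.

    Returns the identifiers that do not match the simplified pattern.
    """
    suspicious = [celex for celex in celexes if not CELEX_PATTERN.match(celex)]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for celex in celexes:
            writer.writerow([celex])
    return suspicious


def make_output_dirs(*paths):
    """Create the HTML, PDF and problem-report directories if they are missing."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)
```

For example, `write_input_csv(["32012R0145", "32013R0148"], "path/to/celex_nums.csv")` followed by `make_output_dirs("path/to/htmls/", "path/to/pdfs/", "path/to/problems/")` would set up the paths used in the example command, assuming `path/to/` already exists.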
     
    

Usage steps for eu_rules_metadata_extractor.py

  1. Get a copy of the code:

     git clone git@github.com:nature-of-eu-rules/data-extraction.git
    
  2. Change into the data-extraction/ directory:

     cd data-extraction/
    
  3. Create a new virtual environment, e.g.:

     python -m venv path/to/virtual/environment/folder/
    
  4. Activate the new virtual environment, e.g. on macOS type:

     source path/to/virtual/environment/folder/bin/activate
     
    
  5. Install required libraries for the script in this virtual environment:

     pip install -r requirements.txt
    
  6. Check the command line arguments required to run the script by typing:

     python eu_rules_metadata_extractor.py -h
     
     OUTPUT >
     
     usage: eu_rules_metadata_extractor.py [-h] -in INPUT -out OUTPUT
    
     EURLEX PDF and HTML legislative documents metadata downloader
    
     optional arguments:
     -h, --help            show this help message and exit
    
     required arguments:
     -in INPUT, --input INPUT
                             Path to input CSV file (single column, no header, list of celex identifiers). Find more info about CELEX identifiers here: http://eur-lex.europa.eu/content/help/eurlex-content/celex-
                             number.html
     -out OUTPUT, --output OUTPUT
                             Path to a CSV file to store the metadata in e.g. 'path/to/metadata.csv'.
    
  7. Example usage:

     python eu_rules_metadata_extractor.py --input path/to/celex_nums.csv --output path/to/metadata.csv
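
Once the script has finished, the output CSV can be inspected with a few lines of Python. The sketch below counts documents per value of one column. It assumes that the output file starts with a header row whose column names match the metadata names in the table above (for example `form`); that layout, and the helper name `count_by_column`, are assumptions and not confirmed by the script itself.

```python
import csv
from collections import Counter


def count_by_column(csv_path, column="form"):
    """Count how often each non-empty value of `column` occurs in the CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # assumes the first row is a header
        return Counter(row[column] for row in reader if row.get(column))
```

For example, `count_by_column("path/to/metadata.csv", "form")` would return a counter keyed by legislation type, such as R, L and D.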
    
License

Copyright (2023) Kody Moodley, The Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.