hillary-clinton-emails

This is a work in progress: any help normalizing and extracting this data is much appreciated!

This repo contains code to transform Hillary Clinton's emails released through the FOIA request from raw PDF documents to CSV files and a SQLite database, making it easier to understand and analyze the documents.

A zip of the extracted data is available for download on Kaggle.

Check out some analytics on this data on Kaggle Scripts.

Note that conversion is very imprecise: there's plenty of room to improve the PDF conversion, the sender/receiver extraction, and the body text extraction.

Extracted data

This produces five main output files: four CSV files and one SQLite database.

Note that each table contains a numeric Id column. This Id column is only meant to be used to join the tables: it is internally consistent within a given release, but each entity may have a different Id when the data is updated.
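As a minimal sketch of how the Id column can be used to join tables, the following Python attaches person names to receiver rows. The column names Name, EmailId, and PersonId, along with the sample rows, are assumptions made for illustration and may not match the actual files.

```python
import csv
import io

# Tiny in-memory stand-ins for Persons.csv and EmailReceivers.csv.
# Apart from Id, the column names here (Name, EmailId, PersonId) are assumed.
PERSONS_CSV = "Id,Name\n1,Person A\n2,Person B\n"
RECEIVERS_CSV = "Id,EmailId,PersonId\n1,10,2\n2,10,1\n3,11,2\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def receivers_with_names(persons, receivers):
    """Join receiver rows to person names on Persons.Id."""
    name_by_id = {person["Id"]: person["Name"] for person in persons}
    return [
        # .get() yields None when a PersonId has no matching person row.
        {"EmailId": row["EmailId"], "Name": name_by_id.get(row["PersonId"])}
        for row in receivers
    ]


joined = receivers_with_names(read_rows(PERSONS_CSV), read_rows(RECEIVERS_CSV))
```

Because the Ids can change between releases, a join like this is only safe when all files come from the same extraction run.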

Emails.csv

This file currently contains the following fields:

Persons.csv

Aliases.csv

EmailReceivers.csv

database.sqlite

This SQLite database contains all of the above tables (Emails, Persons, Aliases, and EmailReceivers) with their corresponding fields. You can see the schema and ingest code under scripts/sqlImport.sql.
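One way to confirm which tables a given copy of the database contains is to query SQLite's built-in schema table. This is a rough sketch using Python's standard sqlite3 module. The helper names list_tables and count_rows are hypothetical and are not part of this repo.

```python
import sqlite3


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite file at db_path."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


def count_rows(db_path, table):
    """Return the number of rows in one table, checking the name first."""
    # Table names cannot be bound as SQL parameters, so validate against
    # the schema before interpolating the name into the query.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        return count
    finally:
        conn.close()
```

For example, list_tables("database.sqlite") should include Emails, Persons, Aliases, and EmailReceivers if the import completed, assuming the file is in the current directory.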

Contributing: next steps

Running the download and extraction code

Running make all in the root directory will download the data (~162 MB total) and create the output files, assuming you have all the requirements installed.
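After make all finishes, a quick check like the one below can report whether any of the five output files are missing. This is only a sketch: the output directory is passed in as a parameter because this README does not say where the files are written, and missing_outputs is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# The five output files named in this README.
EXPECTED_OUTPUTS = [
    "Emails.csv",
    "Persons.csv",
    "Aliases.csv",
    "EmailReceivers.csv",
    "database.sqlite",
]


def missing_outputs(output_dir):
    """Return the expected output file names not found in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]
```

An empty list means all five files are present. It says nothing about whether their contents are complete.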

Requirements

This has only been tested on OS X; it may or may not work on other operating systems.

References

The source PDF documents for this repo were downloaded from the WSJ Clinton Inbox search.

I created this project before I realized the WSJ had also open-sourced some of the code they used to create the Inbox Search. Since then, I've included some material from their open source project as well: I used their HRCEMAIL_names.csv to seed alias_person.csv, and I scraped metadata from foia.state.gov in a fashion similar to their downloadMetadata.py.