ECHO_EPA_SCRAPE_TO_MYSQL

This is a Python script, "scrapeECHOEPA", which makes several calls to the bash scripts described below.

These are all helper scripts and do not need to be run independently; each of them requires execute permission (chmod u+x <filename>).

wgetGsheet downloads the Google Sheet at https://docs.google.com/spreadsheets/d/1Z2rBoGqb_SXW6oAu12A6TCWEJGV1pk0YxL13P_Z5Wlw/export?format=csv&gid=2049992364 and saves it as "files.csv"

wgetEPA downloads a zip file from the ECHO EPA website and places it in the directory "zips"

unzipEPA takes a file located in "zips" and extracts it to a directory "CSV"

stripNulls strips null bytes from some of the CSV files, which would otherwise fail to import
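
The calls from the Python script to these helpers can be pictured as plain subprocess invocations. The snippet below is only an illustrative sketch, not the actual contents of "scrapeECHOEPA"; the argument each helper accepts (a file name taken from "files.csv") is an assumption.

```python
# Illustrative sketch only -- not the actual scrapeECHOEPA code.
# Assumes each helper script takes a single file-name argument.
import subprocess

def run_helper(script, *args):
    """Run one of the bash helper scripts and stop if it fails."""
    subprocess.run(["./" + script, *args], check=True)

run_helper("wgetGsheet")                      # refresh files.csv
run_helper("wgetEPA", "ECHO_EXPORTER.zip")    # hypothetical file name
run_helper("unzipEPA", "ECHO_EXPORTER.zip")
run_helper("stripNulls", "CSV/ECHO_EXPORTER.csv")
```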

Preferably, all of these bash scripts would be replaced by pure Python.
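
As an example of what that replacement could look like, the sketch below fetches the same Google Sheet with the standard library instead of wget. It is a hedged illustration, not part of the project.

```python
# Sketch of a pure-Python replacement for wgetGsheet (illustrative only).
import urllib.request

SHEET_URL = (
    "https://docs.google.com/spreadsheets/d/"
    "1Z2rBoGqb_SXW6oAu12A6TCWEJGV1pk0YxL13P_Z5Wlw/export"
    "?format=csv&gid=2049992364"
)

def fetch_files_csv(dest="files.csv"):
    """Download the control sheet and save it as files.csv."""
    with urllib.request.urlopen(SHEET_URL) as resp, open(dest, "wb") as out:
        out.write(resp.read())

if __name__ == "__main__":
    fetch_files_csv()
```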

scrapeECHOEPA loads a file "currentDBIndex" to determine which database (either a or b) is currently live. It then scrapes and unzips the files listed in "files.csv" using the bash scripts mentioned above. The script gets the connection info for the currently unused database from either db_a_private.csv or db_b_private.csv; these are included as sample files that need to be populated with your database info. This database user needs insert and truncate permission.
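
The a/b selection can be sketched roughly as follows. The column names used here (host, database, user, password) are assumptions; check the sample db_*_private.csv files for the real layout.

```python
# Rough sketch of the a/b database selection (CSV column names are assumed).
import csv

def load_target_db(index_path="currentDBIndex"):
    """Read the live index and return credentials for the *inactive* database."""
    with open(index_path) as f:
        live = f.read().strip()          # "a" or "b"
    target = "b" if live == "a" else "a"
    with open(f"db_{target}_private.csv", newline="") as f:
        creds = next(csv.DictReader(f))  # assumes a one-row CSV with headers
    return target, creds
```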

The directory "middleware" contains a short PHP script that relies on currentDBIndex and on either db_a_public.csv or db_b_public.csv; the database user referenced there should have select permission only. "middleware" should be moved somewhere inside the web root of the server, while the rest of the project should reside outside the web root.
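
The PHP script itself is not reproduced here; in Python terms, the lookup it performs amounts to something like the sketch below (the CSV layout is again an assumption). Note the contrast with the loader above: the middleware reads the *live* database, while scrapeECHOEPA writes to the inactive one.

```python
# Equivalent (Python) sketch of the lookup the PHP middleware performs;
# the real script is PHP and the CSV layout is assumed.
import csv

def public_credentials(index_path="currentDBIndex"):
    """Return the select-only credentials for the database that is live right now."""
    with open(index_path) as f:
        live = f.read().strip()          # "a" or "b"
    with open(f"db_{live}_public.csv", newline="") as f:
        return next(csv.DictReader(f))
```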

Dependencies are python3, php >= 5.3, csvkit, and wget.

"schema.psql" contains the schema for the tables listed in "files.csv" "schema.psql" should be loaded into your databases before "scrapeECHOEPA" is run.

Set-up

Irregularities

Things holding up automatic data copying and requiring alterations