Home

Awesome

ID3C: Infectious Disease Data Distribution Center

Data logistics system enabling real-time genomic epidemiology. Built for the Seattle Flu Study.

Navigation

Database

Currently PostgreSQL 15.

Initially aims to provide:

Design

The database is designed as a distribution center which receives data from external providers, repackages and stores it in a data warehouse, and ships data back out of the warehouse via views, web APIs, and other means. Each of these three conceptual areas are organized into their own PostgreSQL schemas within a single database.

The "receiving" area contains tables to accept minimally- or un-controlled data from external providers. The general expectation is that most tables here are logs (in the journaling sense) and will be processed later in sequential order. For example, participant enrollment documents from our consent and questionnaire app partner, Audere, are stored here when received by our web API.

The "warehouse" area contains a hybrid relational + document model utilizing standard relational tables that each have a JSON column for additional details. Data enters the warehouse primarily through extract-transform-load (ETL) routines which process received data and copy it into suitable warehouse rows and documents. These ETL routines are run via bin/id3c etl subcommands, where they're defined in Python (though lean heavily on Pg itself).

The "shipping" area contains views of the warehouse designed with specific data consumers and purposes in mind, such as the incidence modeling team.

While the receiving and shipping areas are expected to be fairly fluid and reactive to new and changing external requirements, the warehouse area is expected to change at a somewhat slower pace informed by longer-term vision for it.

Guidelines

General principles to follow when developing the schema.

Integration with other data systems

Although we're building our own data system, we want to design and create it with interoperability in mind. To this extent, our system should adopt or parallel practices and terminology from other systems when appropriate. For example:

Deploying

The database schema is deployed using Sqitch, a database change management tool that really shines. You can install it a large number of ways, so pick the one that makes most sense to you.

Development

For development, you'll need a PostgreSQL server and superuser credentials for it. The following commands assume the database server is running locally and your local user account maps directly to a database superuser.

Create a database named seattleflu using the standard Pg tools. (You can use another name if you want, maybe to have different dev instances, but you'll need to adjust the sqitch target you deploy to.)

createdb --encoding=UTF-8 seattleflu

Then use sqitch to deploy to it. (dev is a sqitch target configured in sqitch.conf which points to a local database named seattleflu.)

sqitch deploy dev

Now you can connect to it for interactive use with:

psql seattleflu

Testing and production

Our testing and production databases are configured as testing and production sqitch targets. When running sqitch against these targets, you'll need to provide a username via PGUSER and a password via an entry in ~/.pgpass.

Web API

Python 3 + Flask

Config

Starting the server

The commands pipenv run python -m id3c.api or pipenv run flask run will run the application's development server. To provide database connection details while starting the development server, run the command PGDATABASE=DB_NAME pipenv run flask run, substituting DB_NAME with the name of your database.

For production, a standard api.wsgi file is provided which can be used by any web server with WSGI support.

Examples

User authentication must be provided when making POST requests to the API. For example, you can run the following curl command to send JSON data named enrollments.json to the /enrollment endpoint on a local development server:

curl http://localhost:5000/enrollment \
  --header "Content-Type: application/json" \
  --data-binary @enrollments.json \
  --user USERNAME

Substitute your own local database username for USERNAME. This will prompt you for a password; you can also specify it directly by using --user "USERNAME:PASSWORD", though be aware it will be saved to your shell history.

CLI

Python 3 + click

Interact with the database on the command-line in your shell to:

The id3c command is the entry point. It must be run within the project environment, for example by using pipenv run id3c.

The LOG_LEVEL environment variable controls the level of terminal output. Levels are strings: debug, info, warning, error.

Development setup

Dependencies

Python dependencies are managed using Pipenv.

Install all the (locked, known-good) dependencies by running:

pipenv sync

Add new dependencies to setup.py and run:

pipenv lock
pipenv sync

and then commit the changes to Pipfile and Pipfile.lock.

Connection details

Details for connecting to the ID3C database are by convention controlled entirely by the standard libpq environment variables, service definitions, and password files.

For example, if you want to list the identifier sets available in the Seattle Flu Study testing database, you could create the following files:

~/.pg_service.conf

[seattleflu-testing]
host=testing.db.seattleflu.org
user=your_username
dbname=testing

~/.pgpass

testing.db.seattleflu.org:5432:*:your_username:your_password

Make sure the ~/.pgpass file is only readable by you since it contains your password:

chmod u=rw,og= ~/.pgpass

and then run:

PGSERVICE=seattleflu-testing pipenv run bin/id3c identifier set ls

These files will also allow you to connect using psql:

psql service=seattleflu-testing

Tests

Run all tests with:

pipenv run pytest -v

Run just type-checking tests with:

./dev/mypy