Bio-Index

Bio-Index is a tool that indexes genomic data stored in AWS S3 "tables" (typically generated by Spark) so that it can be rapidly queried and loaded. It uses a MySQL database to store the indexes and to look up where in S3 each "record" is located.

The Bio-Index has two entry points: a CLI used for basic CRUD operations, and a simple HTTP server exposing a REST API for pure querying.

Prerequisites

Python 3.8+

$ sudo amazon-linux-extras install python3.8

Make sure python3 and pip3 are on the path. You may need to do something like:

$ sudo ln -s `which python3.8` /usr/bin/python3
$ sudo ln -s /usr/local/bin/pip3 /usr/bin/pip3

Upgrade pip; otherwise, installing dependencies later will fail:

$ sudo python3 -m pip install --upgrade pip

Setup

First clone the git repository:

$ git clone https://github.com/broadinstitute/dig-bioindex.git

Then cd into the newly created directory and install the required packages:

$ sudo pip3 install -r requirements.txt

At this point, the BioIndex is installed on your system and you can run it with python3 -m bioindex.main:

$ python3 -m bioindex.main [--env-file <environment overrides>] <command> [args]

Configuring the BioIndex

The bio-index uses python-dotenv (environment variables) for configuration. There are two environment files of importance: .bioindex and .env. The .bioindex file contains environment variables for connecting to AWS if they need to differ from those in the AWS credentials file. Passing --env-file before the command overrides which environment file is used in place of .bioindex.

The following are the environment variables that can be set in the .bioindex file:

BIOINDEX_S3_BUCKET       # S3 bucket to index/read from
BIOINDEX_RDS_SECRET      # AWS SecretID used to connect to the RDS instance (*)
BIOINDEX_RDS_INSTANCE    # RDS instance name; used if no secret specified (*)
BIOINDEX_RDS_USERNAME    # RDS instance login; used if no secret specified (**)
BIOINDEX_RDS_PASSWORD    # RDS instance credentials; used if no secret specified (**)
BIOINDEX_BIO_SCHEMA      # RDS MySQL schema for the bio index (default=bio)
BIOINDEX_PORTAL_SCHEMA   # RDS MySQL schema for the portal (optional)
BIOINDEX_LAMBDA_FUNCTION # Lambda function that can be used for indexing remotely (optional)
BIOINDEX_GRAPHQL_SCHEMA  # File the GraphQL schema is written to and read from (optional)
BIOINDEX_GENES_URI       # Location of a GFF gene source (default=genes/genes.gff.gz)
BIOINDEX_RESPONSE_LIMIT  # Number of bytes to read from S3 per request (default=2 MB)
BIOINDEX_MATCH_LIMIT     # Number of matches to return per request (default=100)

(*)  - Either BIOINDEX_RDS_SECRET or BIOINDEX_RDS_INSTANCE is required
(**) - If BIOINDEX_RDS_INSTANCE is used, then username and password are required

Additionally, one can set a single environment variable (BIOINDEX_ENVIRONMENT), which should be the name of an AWS secret. If set, the BioIndex reads that secret as JSON and expects it to contain the rest of the environment setup.
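
For example, a BIOINDEX_ENVIRONMENT secret might hold something like this (illustrative values; the keys are simply the BIOINDEX_* variables listed above):

{
    "BIOINDEX_S3_BUCKET": "my-bucket",
    "BIOINDEX_RDS_SECRET": "my-rds-secret",
    "BIOINDEX_BIO_SCHEMA": "bio"
}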

Likewise, the environment can be overridden. The priority of values is as follows:

secret < .bioindex < environment

For example, consider the following .bioindex file:

BIOINDEX_S3_BUCKET=bio-index-dev

When run, the S3 bucket will be set to "bio-index-dev". Likewise, if the command line is run like so:

$ BIOINDEX_S3_BUCKET=bio-test python3 -m bioindex.main query gene SLC30A8

The S3 bucket used will be "bio-test".

NOTE: The only environment variables that must be set are BIOINDEX_S3_BUCKET and either BIOINDEX_RDS_SECRET or the other BIOINDEX_RDS_* variables. These tell the BioIndex both where the data is located and where to write/read the index data.

Creating Indexes

To create a new index, use the create command. Example:

$ python3 -m bioindex.main create my-index prefix/key/to/files/ phenotype,chrom:pos

The above would create a new index named my-index (or overwrite an existing one). It indicates that all the files to index in the S3 bucket are located under prefix/key/to/files/ (recursively), and that records should be indexed by phenotype first and then by locus: chrom:pos.

The "prefix" to the files should always be a directory name and end with /. Every object in S3 under it will be indexed, no matter how deeply nested.

The "schema" parameter for the index controls how each record is indexed. Every schema follows the same the same general format: keys,...,locus. Consider the following JSON record:

{
    "varId": "8:117962623:C:T",
    "dbSNP": "rs769898168",
    "chromosome": "8",
    "position": 117962623,
    "phenotype": "T2D",
    "pValue": 0.39,
    "beta": 0.3,
    "consequence": "splice_region_variant",
    "gene": "SLC30A8",
    "impact": "LOW"
}

This record may be indexed many different ways. For example:

phenotype                       # by phenotype key
gene                            # by gene key
varId                           # 1:1 lookup by variant ID
chromosome:position             # by locus only
phenotype,chromosome:position   # by phenotype, then by locus

The rules of indexing are as follows:

  1. Every field named in the schema must exist in each record.
  2. Fields before the locus are treated as keys and matched exactly when queried.
  3. The locus, if present, is always last, and is either a single position (chrom:pos) or a region (chrom:start-end).
  4. Within each S3 object, records must be sorted by the schema's fields, in order (see "Preparing S3 Objects" below).

Preparing S3 Objects

Once everything is set up, you can begin creating or preparing the objects in S3 to be indexed. Each object is expected to be in JSON-lines format and must be sorted in the order it is to be indexed! The only exception is when the index is always a 1:1 mapping with a single record (e.g. indexing by ID).

For example, if the schema phenotype,chromosome:position is used, then the objects in S3 are expected to be written (using Spark) like so:

# df is a PySpark DataFrame of the records to be indexed
df.orderBy(['phenotype', 'chromosome', 'position']) \
    .write \
    .json('s3://my-bucket/folder')

The above code would write out many part files to the bucket/path, each perfectly sorted and ready to be indexed using the index CLI command.

Indexing

Once an index has been created, simply use the index command and pass along a comma-separated list of indexes to build.

$ python3 -m bioindex.main index my-index,another-index

NOTE: You can also pass * to build all indexes!

You can also build indexes "remotely" using an AWS Lambda Function. To do this, see the DIG Indexer project, which is a Serverless project that can be used to deploy a Lambda Function to AWS. Once deployed, set the BIOINDEX_LAMBDA_FUNCTION environment variable and pass --use-lambda on the CLI for the index command. You can also adjust the number of workers (--workers), which is the number of Lambda functions that will execute in parallel.
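
For example (the index name and worker count here are illustrative):

$ python3 -m bioindex.main index my-index --use-lambda --workers 10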

Block Gzip Compression

Once you've indexed plain JSON files, you can compress them using bgzip and mark the index as compressed in the database. For the steps to compress the plain JSON, see README.md. Setting the compressed field in __Indexes will cause the BioIndex server to retrieve the relevant data with the bgzip command. To install bgzip on Amazon Linux, follow these steps (as of 06/2023, building from source is necessary to enable seamless S3 support):

  1. sudo yum install -y gcc make zlib-devel bzip2-devel xz-devel curl-devel ncurses-devel openssl-devel
  2. wget https://github.com/samtools/htslib/releases/download/1.17/htslib-1.17.tar.bz2 (replace with latest version from https://github.com/samtools/htslib/releases if you want)
  3. tar -xvf htslib-1.17.tar.bz2
  4. cd htslib-1.17
  5. ./configure --enable-s3 --enable-libcurl
  6. make
  7. sudo make install
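
Once installed, each plain JSON part file can be compressed with bgzip; a minimal sketch, assuming an illustrative file name:

$ # compresses in place, producing part-00000.json.gz
$ bgzip part-00000.json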

Querying Indexes

Once you've built an index, you can then query it and retrieve all the records that match various input keys and/or overlap the given region. For example, to query all records in the genes key space that overlap a given region:

$ python3 -m bioindex.main query genes chr3:983248-1180000
{'chromosome': '3', 'end': 1445901, 'name': 'CNTN6', 'source': 'symbol', 'start': 1134260, 'type': 'protein_coding'}

NOTE: If you'd like to limit the output, just pipe it to head -n.
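
For example:

$ python3 -m bioindex.main query genes chr3:983248-1180000 | head -n 10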

In addition to querying, there are also commands to count records, fetch all records, and match keys. Examples:

$ python3 -m bioindex.main count genes 8:100000000-200000000
1587

$ python3 -m bioindex.main match gene SLC30A
SLC30A1
SLC30A10
SLC30A2
SLC30A3
SLC30A4
SLC30A5
SLC30A6
SLC30A7
SLC30A8
SLC30A9

NOTE: The count command is an approximation. It reads the first 500 records and divides the total number of bytes to read from S3 by the average byte size per record.
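
For illustration: if the matching records span 1,000,000 bytes in S3 and the first 500 records average 630 bytes each, count would report 1,000,000 / 630 ≈ 1587.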

RsId => VariantId Index

This index looks up the variant ID that most frequently corresponds to a specified rsID. It is backed by a DynamoDB table. You can access it via the /api/bio/varIdLookup/<rsid> end-point, e.g.:

$ curl http://localhost:5000/api/bio/varIdLookup/rs1294894678

You can find more info about this index and how it's created here.

The GraphQL REST Server

In addition to a CLI, Bio-Index is also a FastAPI server that allows you to query records using GraphQL via REST calls.

Building the GraphQL Schema

GraphQL requires a schema to process queries. The schema is inferred from the data and built with the build-schema CLI command:

$ python3 -m bioindex.main build-schema --save

If you don't pass --save, the schema is simply printed to stdout. With --save, it is written to the filename specified by the BIOINDEX_GRAPHQL_SCHEMA environment variable (default: schema.graphql); you can change the destination by either providing --out <filename> or simply redirecting the printed output somewhere else.
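
For example, either of the following writes the schema to a custom file (assuming --out is honored alongside --save):

$ python3 -m bioindex.main build-schema --save --out my-schema.graphql
$ python3 -m bioindex.main build-schema > my-schema.graphql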

Once the schema has been saved, you can then start the server.

Starting the Server

The server is started using the serve command:

$ python3 -m bioindex.main serve --port 5000

REST Queries

The entire REST API can be explored both via the demo page and via the REST API documentation page.
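
For example, assuming the server is running locally on port 5000, a region query against the genes index might look like the following (the exact /api/bio/query route is an assumption; the documentation page lists the actual paths):

$ curl 'http://localhost:5000/api/bio/query/genes?q=chr8:100000-101000'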

Each request results in a JSON response that looks like so:

{
    "continuation": null,
    "nonce": "Ox4YfcJapxGYST_siDYjFtp150BZEMqC5JdyTuyTMUQ",
    "count": 1,
    "page": 1,
    "data": [
        {
            "chromosome": "8",
            "end": 100728,
            "ensemblId": "ENSG00000254193",
            "name": "AC131281.2",
            "start": 100584,
            "type": "processed_pseudogene"
        }
    ],
    "index": "genes",
    "limit": null,
    "profile": {
        "query": 0.138009,
        "fetch": 0.417972
    },
    "progress": {
        "bytes_read": 368,
        "bytes_total": 368
    },
    "q": [
        "chr8:100000-101000"
    ]
}

The count is the total number of records returned by this request.

The data is the array of records (if format=row) or a dictionary of columns (if format=column).

The profile shows how long the index query took vs. how much time was spent fetching the records from S3.

The progress shows how many bytes were read from S3 for this request and the total number of bytes that need to be read.

If the continuation value is non-null, it is a token string indicating there are more bytes left to be read and records left to be returned. The remaining records can be retrieved using the /api/bio/cont?token=<token> end-point.

If the continuation is followed to download more records, the page count increases with each subsequent call.
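
Here is a minimal sketch of paging through results from Python, assuming a local server on port 5000 and that the initial query route below is correct:

import requests

BASE = 'http://localhost:5000'

# initial query (the /api/bio/query route is an assumption)
resp = requests.get(f'{BASE}/api/bio/query/genes', params={'q': 'chr8:100000-101000'}).json()
records = list(resp['data'])

# follow continuation tokens until all bytes have been read
while resp['continuation'] is not None:
    resp = requests.get(f'{BASE}/api/bio/cont', params={'token': resp['continuation']}).json()
    records.extend(resp['data'])

print(f'fetched {len(records)} records over {resp["page"]} page(s)')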

Using Docker

In the image/ subfolder is a Dockerfile that can be used to build a Docker image. Alternatively, a pre-built image can be pulled from Docker Hub.

Building the Image

To build the image from scratch, run the following:

$ docker build -t broadinstitute/bioindex:latest image

Once built, running docker images should show it ready for use.

Executing Using Docker

When running the BioIndex from the Docker image, it's best to pass the environment data through with --env-file. If you want to make use of the GraphQL API, a volume also needs to be mounted pointing to where the BIOINDEX_GRAPHQL_SCHEMA file is located.

$ # list all indexes
$ docker run --env-file ./my-bioindex.env --rm broadinstitute/bioindex bioindex list

$ # build the schema and output it to stdout
$ docker run --env-file ./my-bioindex.env -v "$PWD:$PWD" --rm broadinstitute/bioindex bioindex build-schema

$ # start the server
$ docker run --env-file ./my-bioindex.env -v "$PWD:$PWD" --rm broadinstitute/bioindex bioindex serve

Genes URI

When executing queries, it's often more convenient to use a gene name instead of trying to pass a specific region. Since gene names are specific to species and assemblies, they are configurable using a GFF3 file. This can be a local file (by default, the one located in this repository), but it can also be a remote, hosted file. The attributes column is expected to contain either an ID or Name field set to the gene name to use. If an Alias attribute is also present, it is assumed to be a comma-separated list of alternate names for the gene, and those will also be included in the map.

Here is an example of the first few lines of the default GFF in this repository:

19  .  protein_coding  58856544  58864865  .  +  .  Name=A1BG;Alias=ENSG00000121410,HGNC:5,uc002qsd.5,MGI:2152878
10  .  protein_coding  52559169  52645435  .  +  .  Name=A1CF;Alias=ENSG00000148584,HGNC:24086,uc057tgv.1,MGI:1917115
12  .  protein_coding  9220260   9268825   .  +  .  Name=A2M;Alias=ENSG00000175899,HGNC:7,uc001qvk.2,MGI:2449119
12  .  protein_coding  8975068   9039597   .  +  .  Name=A2ML1;Alias=ENSG00000166535,HGNC:23336,uc001quz.6
1   .  protein_coding  33772367  33786699  .  +  .  Name=A3GALT2;Alias=ENSG00000184389,HGNC:30005,uc031plq.1,MGI:2685279

NOTE: GFF files are tab-delimited! The spacing shown above is only for readability.

The GFF file is only downloaded/parsed if needed. It is loaded on demand (only once per execution) when a query requiring a locus is given something other than a known region format (e.g. chromosome:start-end); whatever was provided is then assumed to be a gene name.
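
For example (assuming the default gene map includes SLC30A8), a gene name can stand in for an explicit region when querying:

$ python3 -m bioindex.main query genes SLC30A8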

fin.