
Please see the GATK website, where you can download a precompiled executable, read documentation, ask questions, and receive technical support. For GitHub basics, see here.

GATK 4

This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents of this repository are 100% open source and released under the Apache 2.0 license (see LICENSE.TXT).

GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.

Table of Contents

<a name="requirements">Requirements</a>

<a name="quickstart">Quick Start Guide</a>

<a name="downloading">Downloading GATK4</a>

You can download and run pre-built versions of GATK4 from the following places:
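For example, one distribution channel is the official Docker image on Docker Hub; a minimal sketch of pulling and running it (the tag shown is illustrative, substitute the release you want):

```
# Pull a released GATK4 image from Docker Hub (replace the tag with the release you want)
docker pull broadinstitute/gatk:latest

# Run the gatk wrapper inside the container to confirm it works and list the available tools
docker run --rm -it broadinstitute/gatk:latest gatk --list
```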

<a name="dockerSoftware">Tools Included in Docker Image</a>

Our docker image contains the following bioinformatics tools, which can be run by invoking the tool name from the command line:

We also include an installation of Python3 (3.10.13) with the following popular packages:

We also include an installation of R (4.3.1) with the following popular packages:

For more details on system packages, see the GATK Base Dockerfile; for details and versions of the Python3/R packages, see the Conda environment setup file.
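As a quick sanity check, you can open a shell inside the image and confirm the bundled Python3 and R installations (a minimal sketch; the image name and tag are illustrative):

```
# Start an interactive shell in the GATK container
docker run --rm -it broadinstitute/gatk:latest bash

# Inside the container, check the bundled interpreters
python --version   # expected to report Python 3.10.x
R --version        # expected to report R 4.3.x
```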

<a name="building">Building GATK4</a>

<a name="running">Running GATK4</a>

<a name="jvmoptions">Passing JVM options to gatk</a>

<a name="configFileOptions">Passing a configuration file to gatk</a>

<a name="gcs">Running GATK4 with inputs on Google Cloud Storage:</a>

<a name="sparklocal">Running GATK4 Spark tools locally:</a>

<a name="sparkcluster">Running GATK4 Spark tools on a Spark cluster:</a>

```
./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments
```
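For example, a hypothetical invocation against a standalone Spark master, with inputs and outputs on HDFS, could look like this (paths and master URL are illustrative):

```
# Submit PrintReadsSpark to a Spark cluster; paths and master URL are illustrative
./gatk PrintReadsSpark -I hdfs://namenode/data/input.bam -O hdfs://namenode/data/output.bam \
    -- --spark-runner SPARK --spark-master spark://spark-master:7077
```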

<a name="dataproc">Running GATK4 Spark tools on Google Cloud Dataproc:</a>

Once you're set up, you can run a Spark tool on your Dataproc cluster using a command of the form:

```
./gatk ToolName toolArguments -- --spark-runner GCS --cluster myGCSCluster additionalSparkArguments
```

<a name="R">Using R to generate plots</a>

Certain GATK tools may optionally generate plots using the R installation provided within the conda environment. Even if you are not interested in plotting, R is still required by several of the unit tests. Plotting is currently untested and should be viewed as a convenience rather than a primary output.

<a name="tab_completion">Bash Command-line Tab Completion (BETA)</a>

```
# Enable tab completion for the current shell session
source gatk-completion.sh

# Tab completion of tool names and arguments is now available
./gatk <TAB><TAB>

# To enable tab completion in every new shell, add the source line to your ~/.bashrc
echo "source <PATH_TO>/gatk-completion.sh" >> ~/.bashrc
```

<a name="developers">For GATK Developers</a>

<a name="dev_guidelines">General guidelines for GATK4 developers</a>

<a name="testing">Testing GATK</a>

<a name="lfs">Using Git LFS to download and track large test data</a>

We use git-lfs to version and distribute test data that is too large to check into our repository directly. You must install and configure it in order to be able to run our test suite.
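After installing git-lfs, a typical setup from your clone of the repository looks roughly like this (a sketch; consult the git-lfs documentation for your platform's install step):

```
# One-time setup: enable the git-lfs hooks for your user
git lfs install

# From the root of your gatk clone, download the large test data tracked by git-lfs
git lfs pull
```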

<a name="intellij">Creating a GATK project in the IntelliJ IDE (last tested with version 2016.2.4):</a>

<a name="debugging">Setting up debugging in IntelliJ</a>

<a name="intellij_gradle_refresh">Updating the Intellij project when dependencies change</a>

If there are dependency changes in build.gradle it is necessary to refresh the gradle project. This is easily done with the following steps.

<a name="jprofiler">Setting up profiling using JProfiler</a>

<a name="sonatype">Uploading Archives to Sonatype (to make them available via maven central)</a>

To upload snapshots to Sonatype you'll need the following:

```
# needed for snapshot upload
sonatypeUsername=<your sonatype username>
sonatypePassword=<your sonatype password>

# needed for signing a release
signing.keyId=<gatk key id>
signing.password=<gatk key password>
signing.secretKeyRingFile=/Users/<username>/.gnupg/secring.gpg
```

To perform an upload, use:

```
./gradlew uploadArchives
```

Builds are considered snapshots by default. You can mark a build as a release build by setting -Drelease=true.
The archive name is based on git describe.
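For example, a release (non-snapshot) upload might be invoked as follows (a sketch based on the flags described above):

```
# Mark the build as a release rather than a snapshot, then upload
./gradlew -Drelease=true uploadArchives
```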

<a name="docker_building">Building GATK4 Docker images</a>

Please see the Docker README in scripts/docker. This has instructions for the Dockerfile in the root directory.

<a name="releasing_gatk">Releasing GATK4</a>

Please see the How to release GATK4 wiki article for instructions on releasing GATK4.

<a name="gatkdocs">Generating GATK4 documentation</a>

To generate GATK documentation, run:

```
./gradlew gatkDoc
```

<a name="gatkwdlgen">Generating GATK4 WDL Wrappers</a>

<a name="zenhub">Using Zenhub to track github issues</a>

We use ZenHub to organize and track GitHub issues.

<a name="spark_further_reading">Further Reading on Spark</a>

Apache Spark is a fast and general engine for large-scale data processing. GATK4 can run on any Spark cluster, such as an on-premises Hadoop cluster with HDFS storage and the Spark runtime, or in the cloud using Google Dataproc.

In a cluster scenario, your input and output files reside on HDFS, and Spark will run in a distributed fashion on the cluster. The Spark documentation has a good overview of the architecture.

Note that if you don't have a dedicated cluster you can run Spark in standalone mode on a single machine, which exercises the distributed code paths, albeit on a single node.

While your Spark job is running, the Spark UI is an excellent place to monitor progress. Additionally, if you're running tests, adding -Dgatk.spark.debug=true lets you run a single Spark test and watch the Spark UI (at http://localhost:4040/) as it runs.
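For example, a single Spark test might be run with the debug flag like this (a sketch; the test class name is hypothetical, and whether the system property reaches the test JVM depends on the Gradle test configuration):

```
# Run one Spark test with the Spark UI left accessible at http://localhost:4040/ while it runs
./gradlew test --tests "*PrintReadsSparkIntegrationTest" -Dgatk.spark.debug=true
```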

You can find more information about tuning Spark and choosing good values for important settings, such as the number of executors and the memory allocation, at the following:

<a name="contribute">How to contribute to GATK</a>

(Note: section inspired by, and some text copied from, Apache Parquet)

We welcome all contributions to the GATK project. A contribution can be an issue report or a pull request. If you're not a committer, you will need to make a fork of the gatk repository and issue a pull request from your fork.

For ideas on what to contribute, check issues labeled "Help wanted (Community)". Comment on the issue to indicate that you're interested in contributing code and to share your questions and ideas.

To contribute a patch:

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some things to consider:

Thank you for getting involved!

<a name="discussions">Discussions</a>

<a name="authors">Authors</a>

The authors list is maintained in the AUTHORS file. See also the Contributors list on GitHub.

<a name="citing">Citing GATK</a>

If you use GATK in your research, please see this article for details on how to properly cite GATK.

<a name="license">License</a>

Licensed under the Apache 2.0 License. See the LICENSE.TXT file.