Awesome
DECA: Distributed Exome CNV Analyzer
Introduction
DECA is a distributed re-implementation of the XHMM exome CNV caller using ADAM and Apache Spark.
If you use DECA please cite:
Linderman MD, Chia D, Wallace F, Nothaft FA. DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark. BMC Bioinformatics. 2019;20:493. doi:10.1186/s12859-019-3108-7.
Getting Started
Installation
Note: These instructions are shared with other tools that build on ADAM.
Building from Source
You will need to have Maven installed in order to build DECA.
Note: The default configuration is for Hadoop 2.7.3. If building against a different version of Hadoop, please edit the build configuration in the
<properties>
section of thepom.xml
file.
$ git clone https://github.com/.../deca.git
$ cd deca
$ export MAVEN_OPTS="-Xmx512m"
$ mvn clean package
Installing Spark
You'll need to have a Spark release on your system and the $SPARK_HOME
environment variable pointing at it; prebuilt binaries can be downloaded from the
Spark website. DECA has been developed and tested with
Spark 2.1.0 built against Hadoop 2.7 with Scala 2.11, but any more recent Spark distribution should likely work.
Helpful Scripts
The bin/deca-submit
script wraps the spark-submit
commands to set up and launch DECA.
Commands
$ deca-submit
Usage: deca-submit [<spark-args> --] <deca-args> [-version]
Choose one of the following commands:
normalize : Normalize XHMM read-depth matrix
coverage : Generate XHMM read depth matrix from read data
discover : Call CNVs from normalized read matrix
normalize_and_discover : Normalize XHMM read-depth matrix and discover CNVs
cnv : Discover CNVs from raw read data
You can learn more about a command, by calling it without arguments or with --help
, e.g.
$ deca-submit normalize_and_discover --help
-I VAL : The XHMM read depth matrix
-cnv_rate N : CNV rate (p). Defaults to 1e-8.
-exclude_targets STRING : Path to file of targets (chr:start-end) to be excluded from analysis
-fixed_pc_toremove INT : Fixed number of principal components to remove if defined. Defaults to undefined.
-h (-help, --help, -?) : Print help
-initial_k_fraction N : Set initial k to fraction of max components. Defaults to 0.10.
-max_sample_mean_RD N : Maximum sample mean read depth prior to normalization. Defaults to 200.
-max_sample_sd_RD N : Maximum sample standard deviation of the read depth prior to normalization. Defaults to 150.
-max_target_length N : Maximum target length. Defaults to 10000.
-max_target_mean_RD N : Maximum target mean read depth prior to normalization. Defaults to 500.
-max_target_sd_RD_star N : Maximum target standard deviation of the read depth after normalization. Defaults to 30.
-mean_target_distance N : Mean within-CNV target distance (D). Defaults to 70000.
-mean_targets_cnv N : Mean targets per CNV (T). Defaults to 6.
-min_partitions INT : Desired minimum number of partitions to be created when reading in XHMM matrix
-min_sample_mean_RD N : Minimum sample mean read depth prior to normalization. Defaults to 25.
-min_some_quality N : Min Q_SOME to discover a CNV. Defaults to 30.0.
-min_target_length N : Minimum target length. Defaults to 10.
-min_target_mean_RD N : Minimum target mean read depth prior to normalization. Defaults to 10.
-o VAL : Path to write discovered CNVs as GFF3 file
-print_metrics : Print metrics to the log on completion
-save_zscores STRING : Path to write XHMM normalized, filtered, Z score matrix
-zscore_threshold N : Depth Z score threshold (M). Defaults to 3.
Using native library algebra libraries
Apache Spark includes the Netlib-Java library for high-performance linear algebra. Netlib-Java can invoke optimized BLAS and Lapack system libraries if available; however, many Spark distributions are built without Netlib-Java system library support. You may be able to use system libraries by including the DECA jar on the Spark driver classpath when running locally, e.g.
deca-submit --driver-class-path $DECA_JAR ...
or you may need to rebuild Spark as described in the Spark MLlib guide.
If you see the following warning messages in the log file, you have not successfully invoked the system libraries:
WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS:61 - Failed to load implementation from: com.github.fommil.neltlib.NativeRefARPACK
To build DECA with the optimized netlib native code in, you will need to invoke the native-lgpl
profile when running Maven:
mvn package -P native-lgpl
We cannot package this code by default, as netlib is licensed under the LGPL and cannot be bundled in Apache 2 licensed code.
Example Usage
Running DECA in "stand-alone" mode on a workstation
A small dataset (30 samples by 300 targets) is distributed as part of the XHMM tutorial.
An example DECA command to call CNVs from the pre-computed read-depth matrix and related files
on a 16-core workstation with 128 GB RAM is shown below; a step-by-step listing of the commands to call CNVs in the tutorial data is available here. Note that you will need to set the DECA_JAR
environment variable to point to the jar file created by mvn package
, set spark.local.dir
to a suitable temporary directory for your system and likely need to change the executor and driver memory to suitable values for your system. The exclude_targets.txt
and DATA.RD.txt
files from the XHMM tutorial data are also distributed as part of the DECA test resources in the deca-core/src/test/resources/
directory.
From within the unzip'd RUN directory, prepare exclude_targets.txt
:
cat low_complexity_targets.txt extreme_gc_targets.txt | sort -u > exclude_targets.txt
then run DECA:
deca-submit \
--master local[16] \
--driver-class-path $DECA_JAR \
--conf spark.local.dir=/path/to/temp/directory \
--conf spark.driver.maxResultSize=0 \
--conf spark.kryo.registrationRequired=true \
--driver-memory 16G \
-- normalize_and_discover \
-min_some_quality 29.5 \
-exclude_targets exclude_targets.txt \
-I DATA.RD.txt \
-o DECA.gff3
The resulting GFF3 file should contain
22 HG00121 DEL 18898402 18913235 9.167771318038923 . . END_TARGET=117;START_TARGET=104;Q_SOME=90;Q_START=8;Q_STOP=4;Q_EXACT=9;Q_NON_DIPLOID=90
22 HG00113 DUP 17071768 17073440 25.32122306047942 . . END_TARGET=11;START_TARGET=4;Q_SOME=99;Q_START=53;Q_STOP=25;Q_EXACT=25;Q_NON_DIPLOID=99
The exlude_targets.txt
file is the unique combination of the extreme_gc_targets.txt
and low_complexity_targets.txt
files provided in the tutorial data. The min_some_quality
parameter is set to 29.5 to mimic XHMM behavior which uses a
default minimum SOME quality of 30 after rounding (while DECA applies the filter prior to rounding). Depending on your particular
computing environment, you may need to modify the spark-submit
configuration parameters. spark.driver.maxResultSize
is set to 0 (unlimited)
to address errors collecting larger amounts of data to the driver.
The corresponding xcnv output from XHMM is:
SAMPLE CNV INTERVAL KB CHR MID_BP TARGETS NUM_TARG Q_EXACT Q_SOME Q_NON_DIPLOID Q_START Q_STOP MEAN_RD MEAN_ORIG_RD
HG00121 DEL 22:18898402-18913235 14.83 22 18905818 104..117 14 9 90 90 8 4 -2.51 37.99
HG00113 DUP 22:17071768-17073440 1.67 22 17072604 4..11 8 25 99 99 53 25 4.00 197.73
View a recording of the above installation and CNV calling workflow executed on OSX.
To call CNVs from the original BAM files compute the coverage:
deca-submit \
--master local[16] \
--driver-class-path $DECA_JAR \
--conf spark.local.dir=/path/to/temp/directory \
--conf spark.driver.maxResultSize=0 \
--conf spark.kryo.registrationRequired=true \
--driver-memory 16G \
-- coverage \
-L EXOME.interval_list \
-I *.bam \
-o DECA.RD.txt
followed by the normalize_and_discovery
command above (with DECA.RD.txt
as the input). DECA's coverage calculation is
designed to match the output of the GATK DepthOfCoverage command specified in the XHMM protocol, i.e. count fragment depth with
zero minimum base quality.
Running DECA on a YARN cluster
The equivalent example command to call CNVs on a YARN cluster with Spark dynamic allocation would be:
deca-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-memory 72G \
--executor-cores 5 \
--driver-memory 72G \
--driver-cores 5 \
--conf spark.driver.maxResultSize=0 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.yarn.driver.memoryOverhead=4096 \
--conf spark.kryo.registrationRequired=true \
--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=$(( 8 * 1024 * 1024 )) \
--conf spark.default.parallelism=10 \
--conf spark.dynamicAllocation.enabled=true \
-- normalize_and_discover \
-min_partitions 10 \
-exclude_targets "hdfs://path/to/exclude_targets.txt" \
-min_some_quality 29.5 \
-I "hdfs://path/to/DATA.RD.txt" \
-o "hdfs://path/to/DECA.gff3"
Note that many of the parameters above, e.g. driver and executor cores and memory, are specific to a particular cluster environment and would likely need to be modified for other environments.
Running DECA on AWS with Elastic MapReduce
DECA can readily be run on Amazon AWS using the Elastic MapReduce (EMR) Spark configuration. Data can be read from and written to S3 using the s3a:// scheme. For example, the 1000 Genomes data are available as a public dataset on S3 in the 1000genomes
bucket (i.e. s3a://1000genomes/...
). S3a is an overlay over the AWS Simple Storage System (S3) cloud data store which is provided by Apache Hadoop.
Note that unlike HDFS, S3 is an eventually-consistent filesystem and so you may encounter problems when trying to read recently written files, such as occurs at the end of the DECA operations when combining sharded files. When writing to S3 use the -multi_file
option to leave the files sharded for subsequent combination or analysis.
DECA has been tested with emr-5.12.2. Clusters can be created with the command-line tools or the AWS management console. A JSON file emr_config.json
is provided in the scripts directory to configure clusters for maximum resource utilization.
A bootstrap script emr_bootstrap.sh
is provided in the scripts directory for use as a bootstrap action. The bootstrap script can copy a pre-built JAR onto the cluster (faster) or build DECA directly from GitHub (slower). To use the bootstrap script, copy it to S3 and provide the S3 path as the bootstrap action when creating the cluster. To copy a pre-built JAR onto the cluster provide a s3 path to the DECA CLI jar, e.g. s3://path/to/deca-cli_2.11-0.2.1-SNAPSHOT.jar
, as the optional argument to bootstrap action. After connecting to the EMR master node via SSH, you can launch DECA as you would on any YARN cluster. For example the following command calls CNVs in the entire 1000 Genomes phase 3 cohort on a cluster of i3.2xlarge nodes.
deca-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 7 \
--executor-memory 22G \
--executor-cores 4 \
--driver-memory 22G \
--driver-cores 5 \
--conf spark.driver.maxResultSize=0 \
--conf spark.kryo.registrationRequired=true \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=$(( 104 * 1024 * 1024 )) \
--conf spark.default.parallelism=28 \
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35" \
-- normalize_and_discover \
-min_partitions 28 \
-exclude_targets "s3a://path/to/20130108.exome.targets.exclude.txt" \
-min_some_quality 29.5 \
-print_metrics \
-I "s3a://path/to/DATA.2535.RD.txt" \
-o "s3a://path/to/DATA.2535.RD.gff3" \
-multi_file
Alternately jobs can be launched as steps on cluster. In this approach, no bootstrap actions are needed; the JAR file can be downloaded directly from S3. The spark-submit
arguments are:
--class org.bdgenomics.deca.cli.DecaMain
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryo.registrator=org.bdgenomics.deca.serialization.DECAKryoRegistrator
(in addition to the arguments shown above) and the application arguments would be
normalize_and_discover
-min_partitions 28
-exclude_targets s3a://path/to/20130108.exome.targets.exclude.txt
-min_some_quality 29.5
-print_metrics
-I s3a://path/to/DATA.2535.RD.txt
-o s3a://path/to/DATA.2535.RD.gff3
-multi_file
Running DECA on Databricks
DECA can readily be run on Databricks on the Amazon cloud. DECA has been tested on Databricks Light 2.4 as a spark-submit job using the DECA jar fetched from a S3 bucket. As with EMR, data can be read from and written to S3 using the s3a:// scheme. The Databricks cluster was configured to access S3 via AWS IAM roles. Note that access to any public buckets, e.g. the 1000genomes bucket, must also be included in the cross account IAM role created according to the above instructions. The same issues with eventual consistency described above also apply when writing data to S3 from the Databricks cluster.
An example configuration for calling CNVs directly from the original BAM files:
[
"--class", "org.bdgenomics.deca.cli.DecaMain",
"--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
"--conf", "spark.kryo.registrator=org.bdgenomics.deca.serialization.DECAKryoRegistrator",
"--conf", "spark.kryo.registrationRequired=true",
"--conf", "spark.hadoop.fs.s3.impl=com.databricks.s3a.S3AFileSystem",
"--conf", "spark.hadoop.fs.s3a.impl=com.databricks.s3a.S3AFileSystem",
"--conf", "spark.hadoop.fs.s3n.impl=com.databricks.s3a.S3AFileSystem",
"--conf", "spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl",
"--conf", "spark.hadoop.fs.s3a.acl.default=BucketOwnerFullControl",
"--conf", "spark.hadoop.mapreduce.input.fileinputformat.split.minsize=536870912",
"s3://path/to/deca-cli_2.11-0.2.1-SNAPSHOT.jar",
"cnv",
"-L", "s3a://path/to/20130108.exome.targets.filtered.interval_list",
"-I", "s3a://path/to/1kg.bams.50.list",
"-l",
"-o", "s3a://path/to/DECA.50.gff3",
"-multi_file"
]
License
DECA is released under an Apache 2.0 license.