CoRAL - Complete Reconstruction of Amplifications with Long reads

Reference

CoRAL takes aligned, single-molecule long-read data (.bam) as input and identifies candidate ecDNA structures. A preprint is available here: https://www.biorxiv.org/content/10.1101/2024.02.15.580594v1

Installation

CoRAL can be installed and run on most modern Unix-like operating systems (e.g. Ubuntu 18.04+, CentOS 7+, macOS).

CoRAL requires python>=3.12; we recommend using venv/conda for managing Python/pip installations.

  1. Clone source

    git clone https://github.com/AmpliconSuite/CoRAL
    cd CoRAL
    
  2. Install packages

    • Option 1. Install with pip.

      pip install -r requirements.txt

      Pass --extra-index-url https://download.pytorch.org/whl/cpu to pip to avoid pulling in the very large GPU builds of PyTorch.

    • Option 2. Install with poetry.

      pip install poetry
      poetry install
      
  3. Download a Gurobi optimizer license (free for academic use)

    • Place the gurobi.lic file you download into $HOME/. This path is usually /home/username/gurobi.lic.
  4. Finish installing CNVkit dependencies (recommended)

    Rscript -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    Rscript -e 'BiocManager::install("DNAcopy")'
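
The installation requirements above can be checked with a short script before the first run. This is only a sketch, not part of CoRAL: the function name check_environment is hypothetical, and it tests just the two things stated above (Python >= 3.12 and a gurobi.lic file directly under $HOME).

```python
import sys
from pathlib import Path


def check_environment(home=None, version=None):
    """Return a list of problems; an empty list means both checks passed.

    `home` and `version` can be overridden for testing; by default the
    current user's home directory and the running interpreter are used.
    """
    home = Path(home) if home is not None else Path.home()
    version = tuple(version) if version is not None else tuple(sys.version_info[:2])
    problems = []

    # Requirement stated above: python>=3.12.
    if version < (3, 12):
        problems.append(
            f"Python {version[0]}.{version[1]} found, but >=3.12 is required"
        )

    # Step 3 above: gurobi.lic is expected directly under $HOME.
    license_path = home / "gurobi.lic"
    if not license_path.is_file():
        problems.append(f"Gurobi license not found at {license_path}")

    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If nothing is printed, both conditions hold. The script does not validate the license itself or any of the installed packages.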
    

Getting copy number calls

Before running CoRAL, you will need genome-wide copy number (CN) calls generated from your long-read data.

Command line arguments to run CoRAL

CoRAL and its various run modes can be used in the following manner:

coral [mode] [mode arguments]

The modes are as follows:

  1. seed: Identify and filter copy number gain regions where amplifications exist
  2. reconstruct: Perform breakpoint graph construction and cycle decomposition on the amplified seeds.
  3. plot: Create plots of decomposed cycles and/or a sashimi plot of the breakpoint graph.
  4. hsr: Identify candidate locations of chromosomal homogeneously staining region (HSR) integration points for ecDNA.
  5. cycle2bed: Convert the AmpliconArchitect (AA) style *_cycles.txt file to a .bed format. The AA format is also used by CoRAL.

1. seed

The reconstruct mode requires seed amplification intervals, so we suggest running seed mode first to generate them.

Usage: coral seed <Required arguments> <Optional arguments>

Required arguments:

Optional arguments:

2. reconstruct

Usage: coral reconstruct <Required arguments> <Optional arguments>

2.1 Required arguments:

2.2 Optional arguments:

2.3 Expected output:

CoRAL may identify and reconstruct several distinct focal amplifications in the input *.BAM sample. Each is organized as an amplicon: a connected component of amplified intervals together with the discordant edges that connect them. CoRAL writes the following files to the directory specified with --output_dir.

SequenceEdge: StartPosition, EndPosition, PredictedCN, AverageCoverage, Size, NumberOfLongReads
sequence	chr7:54659673-	chr7:54763281+	4.150534	45.907363	103609	576
sequence	chr7:54763282-	chr7:55127266+	89.340352	1052.714362	363985	40637
sequence	chr7:55127267-	chr7:55155020+	2.843655	32.729552	27754	172
sequence	chr7:55155021-	chr7:55609190+	89.340352	1013.182857	454170	49675
sequence	chr7:55609191-	chr7:55610094+	2.868261	31.027655	904	915
sequence	chr7:55610095-	chr7:56049369+	89.340352	1023.280633	439275	49106
sequence	chr7:56049370-	chr7:56149664+	4.150534	49.623899	100295	562
BreakpointEdge: StartPosition->EndPosition, PredictedCN, NumberOfLongReads
concordant	chr7:54763281+->chr7:54763282-	4.150534	26
concordant	chr7:55127266+->chr7:55127267-	2.843655	36
concordant	chr7:55155020+->chr7:55155021-	2.843655	32
concordant	chr7:55609190+->chr7:55609191-	2.697741	38
concordant	chr7:55610094+->chr7:55610095-	2.697741	41
concordant	chr7:56049369+->chr7:56049370-	4.150534	45
discordant	chr7:55610095-->chr7:55609190+	86.642611	869
discordant	chr7:56049369+->chr7:54763282-	85.189818	981
discordant	chr7:55155021-->chr7:55127266+	86.496697	978
Interval	1	chr7	54659673	56149664
List of cycle segments
Segment	1	chr7	54659673	54763281
Segment	2	chr7	54763282	55127266
Segment	3	chr7	55127267	55155020
Segment	4	chr7	55155021	55609190
Segment	5	chr7	55609191	55610094
Segment	6	chr7	55610095	56049369
Segment	7	chr7	56049370	56149664
List of longest subpath constraints
Path constraint	1	2+,3+,4+	Support<=6	Satisfied
Path constraint	2	4+,5+,6+	Support<=34	Satisfied
Cycle=1;Copy_count=82.34616279663038;Segments=2+,4+,6+;Path_constraints_satisfied=
Cycle=2;Copy_count=2.8436550275157644;Segments=0+,2+,3+,4+,5+,6+,0-;Path_constraints_satisfied=1,2

Note that if --output-all-path-constraints is specified, all path constraints given by long reads are written to the *.cycles file.
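
For downstream scripting, the sequence-edge and breakpoint-edge lines in the sample above can be read into simple records. The following is a minimal sketch rather than CoRAL's own parser: the names SequenceEdge, BreakpointEdge and parse_graph_lines are hypothetical, and it assumes the fields are tab-separated and ordered as in the two header lines of the sample.

```python
from typing import NamedTuple


class SequenceEdge(NamedTuple):
    start: str              # e.g. "chr7:54763282-"
    end: str                # e.g. "chr7:55127266+"
    predicted_cn: float
    average_coverage: float
    size: int
    num_long_reads: int


class BreakpointEdge(NamedTuple):
    kind: str               # "concordant" or "discordant"
    start: str
    end: str
    predicted_cn: float
    num_long_reads: int


def parse_graph_lines(lines):
    """Split graph lines into (sequence edges, breakpoint edges)."""
    seq_edges, bp_edges = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "sequence" and len(fields) >= 7:
            seq_edges.append(SequenceEdge(
                fields[1], fields[2], float(fields[3]),
                float(fields[4]), int(fields[5]), int(fields[6]),
            ))
        elif fields[0] in ("concordant", "discordant") and len(fields) >= 4:
            # Endpoints look like "chr7:55155021-->chr7:55127266+";
            # the first "->" separates the two oriented positions.
            start, end = fields[1].split("->", 1)
            bp_edges.append(BreakpointEdge(
                fields[0], start, end, float(fields[2]), int(fields[3]),
            ))
        # Header lines and anything else are ignored.
    return seq_edges, bp_edges


# A few lines copied from the sample output above.
SAMPLE = """\
SequenceEdge: StartPosition, EndPosition, PredictedCN, AverageCoverage, Size, NumberOfLongReads
sequence\tchr7:54763282-\tchr7:55127266+\t89.340352\t1052.714362\t363985\t40637
sequence\tchr7:55127267-\tchr7:55155020+\t2.843655\t32.729552\t27754\t172
BreakpointEdge: StartPosition->EndPosition, PredictedCN, NumberOfLongReads
concordant\tchr7:55127266+->chr7:55127267-\t2.843655\t36
discordant\tchr7:55155021-->chr7:55127266+\t86.496697\t978
"""

seq_edges, bp_edges = parse_graph_lines(SAMPLE.splitlines())
for edge in bp_edges:
    if edge.kind == "discordant":
        print(edge.start, "->", edge.end, "CN", edge.predicted_cn)
```

Keeping the oriented positions as strings avoids guessing at how chromosome names and strands should be split; a stricter parser could break them down further.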

3. plot

Usage: coral plot <Required arguments> <Optional arguments>

3.1 Required arguments: If --plot-graph is given, --graph is required. If --plot-cycles is given, --cycles is required.

Argument            Description
--ref <choice>      Reference genome choice. Must be one of [hg19, hg38, GRCh38, mm10]
--bam <file>        BAM file the run was based on
--graph <file>      AA-formatted _graph.txt file
--cycles <file>     AA-formatted _cycles.txt file
--output-dir <str>  Directory for output files
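
When plot mode is driven from a script, the dependency between the plot flags and their input files can be checked before the command is launched. The helper below is a hypothetical sketch, not part of CoRAL: build_plot_command is an invented name, and it treats --ref, --bam and --output-dir as always required, which is how the list above reads. It only uses flags named in this section.

```python
VALID_REFS = ("hg19", "hg38", "GRCh38", "mm10")


def build_plot_command(ref, bam, output_dir, graph=None, cycles=None,
                       plot_graph=False, plot_cycles=False):
    """Return the argument list for `coral plot`, or raise ValueError."""
    if ref not in VALID_REFS:
        raise ValueError(f"--ref must be one of {VALID_REFS}, got {ref!r}")
    # Rules stated above: each plot flag needs its matching input file.
    if plot_graph and graph is None:
        raise ValueError("--plot-graph requires --graph")
    if plot_cycles and cycles is None:
        raise ValueError("--plot-cycles requires --cycles")

    cmd = ["coral", "plot", "--ref", ref, "--bam", str(bam),
           "--output-dir", str(output_dir)]
    if graph is not None:
        cmd += ["--graph", str(graph)]
    if cycles is not None:
        cmd += ["--cycles", str(cycles)]
    if plot_graph:
        cmd.append("--plot-graph")
    if plot_cycles:
        cmd.append("--plot-cycles")
    return cmd


# File names here are placeholders, not files shipped with CoRAL.
print(" ".join(build_plot_command(
    "hg38", "sample.bam", "plots",
    graph="sample_graph.txt", plot_graph=True,
)))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the placeholder paths are replaced with real files.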

3.2 Optional arguments:

Argument                                  Default                         Description
--plot-graph                                                              Plot the AA graph file CN, SVs and coverage as a sashimi plot
--plot-cycles                                                             Plot the AA cycles file genome decompositions
--only-cyclic-paths                                                       Only visualize the cyclic paths in the cycles file
--num-cycles <int>                        [all]                           Only plot the first [arg] cycles from the cycles file
--max-coverage <float>                    [1.25x max coverage in region]  Do not extend coverage plot in graph sashimi plot above [arg] value
--min-mapq <int>                          15                              Do not use alignment in coverage plot with MAPQ value below [arg]
--gene-subset-list <str> <str> <str> ...  [all]                           Only indicate positions of the gene names in this list
--hide-genes                                                              Do not plot positions of genes
--gene-fontsize <float>                   12                              Adjust fontsize of gene names
--bushman-genes                                                           Only plot genes found in the Bushman lab cancer-related gene list ('Bushman group allOnco').
--region <chrom:pos1-pos2>                [entire amplicon]               Only plot genome region in the interval given by chrom:start-end

4. hsr

Usage: coral hsr <Required arguments> <Optional arguments>

4.1 Required arguments:

Argument              Description
--lr-bam <file>       Coordinate-sorted and indexed long read .bam file
--cycles <file>       AA-formatted _cycles.txt file
--cn-segs <file>      Long read segmented whole genome CN calls (.bed or CNVkit .cns file).
--normal-cov <float>  Estimated coverage of diploid genome regions

4.2 Optional arguments:

Argument                      Default  Description
--bp_match_cutoff <int>       100      Breakpoint matching cutoff distance (bp)
--bp_match_cutoff_clustering  2000     Crude breakpoint matching cutoff distance (bp) for clustering

5. cycle2bed

CoRAL can convert its cycles output from the AmpliconArchitect format (*_cycles.txt) into *.bed format (similar to Decoil), which simplifies downstream analysis of these cycles.

Usage: coral cycle2bed <Required arguments> <Optional arguments>

5.1 Required arguments:

5.2 Optional arguments:

Here is an example of cycle2bed output for the cycles file above (from GBM39).

#chr	start	end	orientation	cycle_id	iscyclic	weight
chr7	54763282	55127266	+	1	True	82.346163
chr7	55155021	55609190	+	1	True	82.346163
chr7	55610095	56049369	+	1	True	82.346163
chr7	54763282	56049369	+	2	False	2.843655
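
The relationship between the cycles file and the .bed rows above can be illustrated with a short sketch. This is not CoRAL's implementation: cycles_to_bed is a hypothetical name, and the rules it applies are inferred from the example alone (segment 0 marks the ends of a non-cyclic path, consecutive segments that are adjacent on the genome with the same orientation are merged into one row, and the weight is the copy count rounded to six decimals).

```python
def cycles_to_bed(lines):
    """Return rows of (chrom, start, end, strand, cycle_id, is_cyclic, weight)."""
    segments = {}  # segment id -> (chrom, start, end)
    rows = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] == "Segment":
            segments[int(parts[1])] = (parts[2], int(parts[3]), int(parts[4]))
        elif line.startswith("Cycle="):
            fields = dict(item.split("=", 1)
                          for item in line.strip().split(";") if "=" in item)
            tokens = [(int(tok[:-1]), tok[-1])
                      for tok in fields["Segments"].split(",")]
            # Assumption: a path containing segment 0 is not cyclic.
            is_cyclic = all(seg_id != 0 for seg_id, _ in tokens)
            merged = []
            for seg_id, strand in tokens:
                if seg_id == 0:
                    continue
                chrom, start, end = segments[seg_id]
                if merged:
                    p_chrom, p_start, p_end, p_strand = merged[-1]
                    if p_chrom == chrom and p_strand == strand:
                        # Assumption: merge segments that abut on the genome.
                        if strand == "+" and p_end + 1 == start:
                            merged[-1] = (chrom, p_start, end, strand)
                            continue
                        if strand == "-" and end + 1 == p_start:
                            merged[-1] = (chrom, start, p_end, strand)
                            continue
                merged.append((chrom, start, end, strand))
            weight = round(float(fields["Copy_count"]), 6)
            rows.extend(m + (int(fields["Cycle"]), is_cyclic, weight)
                        for m in merged)
    return rows


# Segment and cycle lines copied from the cycles sample shown earlier.
CYCLES = """\
Segment\t2\tchr7\t54763282\t55127266
Segment\t3\tchr7\t55127267\t55155020
Segment\t4\tchr7\t55155021\t55609190
Segment\t5\tchr7\t55609191\t55610094
Segment\t6\tchr7\t55610095\t56049369
Cycle=1;Copy_count=82.34616279663038;Segments=2+,4+,6+;Path_constraints_satisfied=
Cycle=2;Copy_count=2.8436550275157644;Segments=0+,2+,3+,4+,5+,6+,0-;Path_constraints_satisfied=1,2
"""

for row in cycles_to_bed(CYCLES.splitlines()):
    print(*row, sep="\t")
```

On these lines the sketch yields the same four rows as the example output. It does not merge the last and first segments of a cyclic path when they happen to be adjacent, and it expects the Segment lines to appear before the Cycle lines, as they do in the sample.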