damage-estimator-nf
Nextflow pipeline to run "Damage Estimator"
Description
This tool estimates DNA damage in samples sequenced on an Illumina platform in paired-end mode. There are 3 steps (starting from an aligned bam file):
- Split the paired-end reads into R1 and R2 using split_mapped_reads.pl (universal)
- Estimate the damage across reads using estimate_damage.pl
- Plot the result using R.
Cf. https://github.com/Ettwiller/Damage-estimator
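The three steps above can be sketched as a dry run that only prints the commands for one bam file. This is a minimal sketch, not the pipeline's actual code: the script flags are assumed to mirror the option names used in this README, while `--bam`, `--genome`, the arguments of plot_damage.R and the output file names are hypothetical. Check the Damage Estimator repository for the exact interface.

```shell
#!/usr/bin/env bash
# Dry run: print the three Damage Estimator steps for one BAM file.
# Nothing is executed; the commands are only echoed for inspection.
# ASSUMPTION: the flags mirror the option names used in this README;
# --bam, --genome, the plot_damage.R arguments and the output file names
# are illustrative and not taken from the tool itself.
print_damage_steps() {
  local bam="$1" ref="$2" de_path="$3"
  local id
  id="$(basename "$bam" .bam)"   # sample id derived from the BAM file name

  # Step 1: split the paired-end reads into R1 and R2 pileups
  echo "perl $de_path/split_mapped_reads.pl --bam $bam --genome $ref --mpileup1 $id.R1.mpileup --mpileup2 $id.R2.mpileup --Q 0 --mq 10"
  # Step 2: estimate the damage from the two pileups
  echo "perl $de_path/estimate_damage.pl --mpileup1 $id.R1.mpileup --mpileup2 $id.R2.mpileup --id $id --qualityscore 30 --max_coverage_limit 100 --min_coverage_limit 1 > $id.damage"
  # Step 3: plot the result with R
  echo "Rscript $de_path/plot_damage.R $id.damage $id.damage.plot"
}

print_damage_steps sample1.bam ref.fasta /path/DE
```

Printing the commands first makes it easy to compare them with what the pipeline runs for each bam file before launching anything.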
Dependencies
This pipeline is based on nextflow. As we have several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
External software:
- Damage Estimator
- samtools
- R with the ggplot2 package
The tool writes to a temporary folder, so check that yours is set in your .bash_profile (export TMPDIR=/data/tmp/, export TMP=/data/tmp).
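A quick way to verify the temporary folder before launching is sketched below. The helper name `check_tmp_dir` is hypothetical, and `/data/tmp` is only the example path from this README; adapt it to your system.

```shell
# Check that a temporary directory exists and is writable.
# check_tmp_dir is a hypothetical helper, not part of the pipeline.
check_tmp_dir() {
  dir="${1:-${TMPDIR:-/tmp}}"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir is writable"
  else
    echo "ERROR: $dir is missing or not writable" >&2
    return 1
  fi
}

# Check the folder currently exported as TMPDIR (falls back to /tmp)
check_tmp_dir "${TMPDIR:-/tmp}" || echo "Set TMPDIR and TMP in your .bash_profile first"
```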
You can avoid installing all the external software by only installing Docker. See the IARC-nf repository for more information.
Input
Type | Description |
---|---|
bam folder | Folder containing the bam files on which you want to run "Damage Estimator" |
Parameters
Mandatory
Name | Example value | Description |
---|---|---|
--bam_folder | PATH/FOLDER | folder containing .bam and .bam.bai files on which to run "Damage Estimator" (bams should preferably be generated by bwa mapping of Illumina paired-end sequencing) |
--de_path | PATH/DE | location of folder containing damage estimator files (.pl and .r) |
--ref | PATH/FILE | reference genome (fasta file) |
Optional
Name | Default value | Description |
---|---|---|
--Q | 0 | Phred score quality threshold (Sanger encoding). Only keep the bases with a Q score above a given threshold |
--mq | 10 | Mapping quality. Only keep the reads that pass a given threshold |
--max_coverage_limit | 100 | If a position has MAX or more reads (R1 or R2), the position is not used to calculate the damage. This option prevents high-coverage regions of the genome from being the main driver of the damage estimate. |
--min_coverage_limit | 1 | If a position has MIN or fewer reads (R1 or R2), the position is not used to calculate the damage. This option restricts the damage calculation to on-target regions (for enrichment protocols such as exome sequencing). |
--qualityscore | 30 | Discard the match or mismatch if the base on a read has a base quality below this threshold. This is an important parameter: the lower this limit, the less apparent the damage. |
For exome bams, we recommend: --Q 20 --mq 20 --max_coverage_limit 300 --min_coverage_limit 30
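The effect of the two coverage limits at their boundaries can be illustrated with a small check. This is only a sketch of the rule as worded in the table (positions with MIN or fewer reads, or MAX or more reads, are skipped); `position_is_used` is a hypothetical helper and not code from estimate_damage.pl.

```shell
# Return success (0) when a position with the given read depth is kept.
# Rule taken from the table: depth <= MIN or depth >= MAX is skipped.
# Defaults are the pipeline defaults (MIN=1, MAX=100).
position_is_used() {
  depth="$1"; min="${2:-1}"; max="${3:-100}"
  [ "$depth" -gt "$min" ] && [ "$depth" -lt "$max" ]
}

# Defaults: only depths 2 to 99 are used
for depth in 1 2 99 100; do
  if position_is_used "$depth"; then echo "$depth kept"; else echo "$depth skipped"; fi
done

# Exome recommendation (MIN=30, MAX=300): only depths 31 to 299 are used
for depth in 30 31 299 300; do
  if position_is_used "$depth" 30 300; then echo "$depth kept"; else echo "$depth skipped"; fi
done
```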
Usage
nextflow run iarcbioinfo/damage-estimator.nf --bam_folder BAM/ --de_path /path/ --genome_ref ref.fasta
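For exome data, the usage line can be combined with the recommended exome settings from the Parameters section, for instance as sketched below. The paths are the placeholders from the usage line. Note that the reference option is spelled `--genome_ref` in the usage line and `--ref` in the parameter table; this sketch copies the usage line, so check which name your version of the pipeline expects.

```shell
# Exome settings recommended in the Parameters section, kept in one variable.
EXOME_OPTS="--Q 20 --mq 20 --max_coverage_limit 300 --min_coverage_limit 30"

# Base command copied from the usage line (paths are placeholders).
CMD="nextflow run iarcbioinfo/damage-estimator.nf --bam_folder BAM/ --de_path /path/ --genome_ref ref.fasta $EXOME_OPTS"

echo "$CMD"     # print the full command for inspection
# eval "$CMD"   # uncomment to launch the pipeline
```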
Output
Type | Description |
---|---|
"SMR" file1 and file2 | Intermediate mpileup files generated by samtools ("Split Mapped Reads") containing all the positions in the genome with at least one read. The file in -mpileup1 correspond to the first in paired reads and the file in -mpileup2 correspond to the second in paired reads. |
Table | 6 columns: [1] raw count of variant type [2] variant type (e.g. G_T, G to T) [3] id (from the --id option) [4] frequency of variant [5] family (the variant type and reverse complement) [6] GIV-score. |
Graph | Representation of the table generated by plot_damage.R |
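The output table can also be inspected from the command line. The sketch below lists the variant types whose GIV-score (column 6) exceeds a cutoff. It assumes whitespace-separated columns in the order described above; the helper name, the cutoff of 1.5 and the demo rows are illustrative only and are not real results.

```shell
# List variant types whose GIV-score (column 6) exceeds a threshold.
# ASSUMPTION: columns are whitespace-separated and ordered as in the
# output table description above.
variants_above_giv() {
  table="$1"; threshold="$2"
  # "$6 + 0" forces a numeric comparison, so a header line (if any) is skipped
  awk -v t="$threshold" '($6 + 0) > t { print $2, $6 }' "$table"
}

# Synthetic rows for illustration only (values and family labels are made up).
demo="$(mktemp)"
cat > "$demo" <<'EOF'
120 G_T sampleA 0.0021 G_T-C_A 2.10
95 C_A sampleA 0.0010 G_T-C_A 0.48
80 A_G sampleA 0.0009 A_G-T_C 1.02
EOF

variants_above_giv "$demo" 1.5
rm -f "$demo"
```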
Contributions
Name | Email | Description |
---|---|---|
VOEGELE Catherine | voegelec@iarc.fr | Developer |