Awesome

GpC methylation from fast5:s to reference anchored calls (gcf52ref)

Set of scripts to convert guppy produced methylation calls from fast5 files to reference anchored files similar to nanopolish output.

NOTE: This is first quick test for converting guppy methylation call fast5:s to something more useful. Unfortunately the process is not very efficient and the output is not very useful. I'm looking forward for other toolchain built around SAM/BAM/CRAM format and MM and MP tags.

Requirements

Mostly ONT guppy basecaller and many other more or less standard bioinformatics tools. Conda environment defined in environment.yml will define them all.

Usage

See test/test_gcf52ref.sh for demonstration of downloading reference genome and fast5 file for 4000 reads from http://s3.amazonaws.com/nanopore-human-wgs/rel6/MultiFast5Tars/, rebasecalling, creating tags and mapping the reads to genome. The heavy lifting is done in a snakemake pipeline launched by scripts/fast5_to_cram.sh

Methylation to sam

Script scripts/extract_methylation_fast5_to_sam.py contains a very early code for outputting the methylation calls from guppy to SAM format using MM/MP tags in pull request for SAM specification by James Bonfield. Specifically the tags follow commit Oct 21, 2019.

In addition to MM/MP tags, the script can output the modification likelihoods are provided in hex string tags such that ml:Z:C+m,ffff04,A+a,000093. The syntax is ml:Z:([ACGTN][-+][a-z],([0-9a-f][0-9a-f])+);)+ where ml is the custom SAM tag, Z symbol for character string value. Next three groupings are as in 'Base modifications' in https://github.com/samtools/hts-specs/pull/418 Final group is the hexadecimal coded likelihood for the given type modification for each position in SEQ in the original strand. Note that the likelihoods are 'given the underlying base is called the one defined in the tag'. Each hexadecimal value ranging 00-ff is 255*P(data|base, modified).

Commandline

Launch script for processing reads input/path/for_fast5s/pass_fast5/*.fast5 to aligned cram files. This requires guppy_basecaller in PATH

usage:
scripts/fast5_to_cram.sh  -i input/path/for_fast5s/ -s SAMPLEID 
-C all     GPU to use (0,1,2,3 or all, default all)
-c guppy_config.cfg  Guppy configuration file. Default /home/kpalin/ont-guppy/data/dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg
-w WORKDIR Directory where to make the processes files and output.
-r REF_FASTA   Reference genome to use.
-s SAMPLEID  
-S server.remote.com   Remote SFTP server holding the data.
-K         Keep temporary files (except output )
-n         Dry run
-N 8       Number of splits
-h          Show this message and exit.

Python code used by above script to convert fast5:s to fastq:s.

usage: extract_methylation_fast5_to_sam.py [-h] [-o OUTPUT] [-f [FASTQ]]
                                           [--failed_reads FAILED_READS] [-L]
                                           [-F [FILTER]] [-V]
                                           input_fast5 [input_fast5 ...]

Extract base modifications from fast5 files called with Guppy 3.3. The
modifications are provided as MM and MP tags conforming to
 If requested,
the modification likelihoods are provided in hex string tags such that
ml:Z:C+m,ffff04,A+a,000093. The syntax is
ml:Z:([ACGTN][-+][a-z],([0-9a-f][0-9a-f])+);)+ ml is the custom SAM tag, Z
symbol for character string value. Next three groupings are as in 'Base
modifications' in https://github.com/samtools/hts-specs/pull/418 Final group
is the hexadecimal coded likelihood for the given type modification for each
position in SEQ in the original strand. Note that the likelihoods are 'given
the underlying base is called the one defined in the tag'. Each hexadecimal
value ranging 00-ff is 255*P(data|base, modified).

positional arguments:
  input_fast5           Input paths of fast5 files [default:None]

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output the unsorted SAM or passing reads fastq file
                        file here [default:/dev/stdout]
  -f [FASTQ], --fastq [FASTQ]
                        Produce output in fastq format instead of SAM. Store
                        SAM header to file named here. The SAM tags are stored
                        as read comments that can be copied over to SAM
                        [default:None const:/dev/null]
  --failed_reads FAILED_READS
                        Output failed reads in fastq format here. With SAM
                        output, the filter/fail is marked with flag
                        [default:/dev/stderr]
  -L, --likelihoods     Include also the raw likelihoods as 'ml' tag.
                        [default:False]
  -F [FILTER], --filter [FILTER]
                        Mark reads with average q less than this as vendor
                        failed [default:False]
  -V, --verbose         Be more (and more) verbose with output [default:0]

There is a also a script to convert the MM/MP tags to tsv similar to what nanopolish produces.

usage: cram_to_refpos.py [-h] [--chrom CHROM] [--start START] [--end END]
                         [--reference_fasta REFERENCE_FASTA] [--no_header]
                         cram

convert ont-cram file with MM/MP tags to tsv

positional arguments:
  cram                  aligned cram file containing MM/MP tags

optional arguments:
  -h, --help            show this help message and exit
  --chrom CHROM         chromosome to limit calls to.
  --start START         start coordinates to limit calls from
  --end END             end coordinates to limit calls from
  --reference_fasta REFERENCE_FASTA
                        indexed fasta file for reporting sequence context.
  --no_header           Don't output header line

###There are various notes and to-do:s involved:

As output from the script, the MM/MP tags are reported correctly in the '+' strand with respect to the SEQ record in the SAM. The suggested minimap2 (or essentially any) alignment will reverse complement some of the reads but the MM/MP tags will still be in the original strand coordinates.
The so-called likelihoods in guppy fast5 files are not likelihoods but (possibly/likely) some sort of posterior probabilities of modification given a call and nanopore signal: $P(mod|base call, signal) = P(mod, base call, signal)/ ( P(base call|signal) * P(signal) )$
Current (commit 1e10d0dc) version calls methylated C:s in any context but logged statistics are computed on CpG sites.
The calls are made assuming that ONT gives likelihoods so we get $P(mod|D) = P(D|mod)*P(mod)/P(D) = P(D|mod)/( P(D|mod)P(mod)+P(D|no-mod)P(no-mod))$ and assuming $P(mod)=0.9$ and $P(no-mod) = 0.1$. This prior results in 66.5% of CpG sites called methylated with phred score>=3 on a single colorectal adenocarcinoma sample.