We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/nanoporetech/bonito

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: support@nanoporetech.com for help with your application if it is not possible to upgrade to our new resources, or we are missing key features.


Scrappie basecaller

Scrappie is a technology demonstrator for the Oxford Nanopore Research Algorithms group.

Ref   : GACACAGTGAGGCTGCGTCTC-AAAAAAAAAAAAAAAAAAAAAAAAATTGCCCCTTCTTAAGTTTGCATTTAGATCTCTT
Query : GACACAG-GAGGCTGCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTGCCCCTTCTTAAGCTT-CA--CAGA-CT-TT

For a complete release history, see [RELEASES.md]

Dependencies

On Debian-based systems, the following packages are sufficient (tested on Ubuntu 14.04 and 16.04)

Docker files documenting the minimal installs for various flavours of Linux can be found in the docker/ directory.

The Intel MKL may be used to provide the BLAS library. The combination of the Intel icc compiler and linking against the MKL can result in significant performance improvements, a gain of 50% being observed on one machine.

On Mac OSX systems, the argp-standalone package is also required. The argp-standalone package can be installed using the brew package manager (http://brew.sh).

brew install argp-standalone

Scrappie uses the OpenMP extensions for multi-processing. These are supported by the system-installed compiler on most modern Linux systems but require a more modern version of the clang/llvm compiler than that installed on Mac OSX machines. Support for OpenMP was added to clang/llvm in version 3.7 (see http://llvm.org or use brew).
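Before running cmake, it can be useful to check whether a given compiler accepts OpenMP at all. The following is a minimal sketch, not part of Scrappie: the function name supports_openmp is hypothetical, and it assumes the compiler takes the -fopenmp flag directly (as gcc and recent clang/llvm releases do).

```shell
# Return success if the given compiler can build a trivial OpenMP program.
# Usage: supports_openmp [compiler]   (defaults to $CC, then cc)
# Hypothetical helper for illustration; not shipped with Scrappie.
supports_openmp() {
    _so_cc="${1:-${CC:-cc}}"
    _so_tmp=$(mktemp -d) || return 1
    cat > "$_so_tmp/omp_check.c" <<'EOF'
#include <omp.h>
int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }
EOF
    # Assumption: the compiler understands -fopenmp without extra flags.
    if "$_so_cc" -fopenmp "$_so_tmp/omp_check.c" -o "$_so_tmp/omp_check" >/dev/null 2>&1; then
        _so_status=0
    else
        _so_status=1
    fi
    rm -rf "$_so_tmp"
    return "$_so_status"
}

if supports_openmp; then
    echo "OpenMP: supported by ${CC:-cc}"
else
    echo "OpenMP: not supported by ${CC:-cc}"
fi
```

If the check fails for the system compiler, a different one can be tried by passing its path as the first argument.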

Compiling

mkdir build && cd build && cmake .. && make

If the HDF5 or OpenBLAS libraries are stored in non-standard locations, you can specify the HDF5_ROOT and/or OPENBLAS_ROOT cmake options.

cmake -DOPENBLAS_ROOT=/software/gcc/openblas -DHDF5_ROOT=/software/gcc/hdf5 ..

Running

# Set some environment variables.
# Allow scrappie to use as many threads as the system will support
export OMP_NUM_THREADS=`nproc`
# Use openblas in single-threaded mode
export OPENBLAS_NUM_THREADS=1
# Call a folder of reads via events
scrappie events reads ... > basecalls.fa
# Call a folder of reads from raw signal
scrappie raw reads ... > basecalls.fa
# Call individual reads
scrappie raw reads/read1.fast5 reads/read2.fast5 > basecalls.fa
# Or using a strand list (skipping first line)
tail -n +2 strand_list.txt | sed 's:^:/path/to/reads/:' | xargs scrappie raw > basecalls.fa
#  Using Scrappie in single-threaded mode
find path/to/reads/ -name \*.fast5 | parallel -P ${OMP_NUM_THREADS} scrappie raw --threads 1 > basecalls.fa
#  Dump read meta-data to tsv
scrappie raw --threads 1 path/to/reads/ | tee basecalls.fa | grep '^>' | cut -d ' ' -f 2- | python3 misc/json_to_tsv.py > meta_data.tsv
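The nproc command used above is part of GNU coreutils and may not be present on every system, Mac OSX included. The following sketch sets the same two variables with a fallback chain; the helper name ncpu is hypothetical and the order of the fallbacks is an arbitrary choice.

```shell
# Print the number of online CPUs, trying several tools in turn.
# Hypothetical helper for illustration; not shipped with Scrappie.
ncpu() {
    nproc 2>/dev/null \
        || getconf _NPROCESSORS_ONLN 2>/dev/null \
        || sysctl -n hw.ncpu 2>/dev/null \
        || echo 1
}

# Allow scrappie to use as many threads as the system will support
export OMP_NUM_THREADS="$(ncpu)"
# Use openblas in single-threaded mode
export OPENBLAS_NUM_THREADS=1
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```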

Commandline options

The commandline options accepted by Scrappie depend on whether it is being used to call via events, to call from raw signal, or to predict the squiggle from a sequence.

> scrappie help events
Usage: events [OPTION...] fast5 [fast5 ...]
Scrappie basecaller -- basecall via events

  -#, --threads=nreads       Number of reads to call in parallel
      --dump=filename        Dump annotated events to HDF5 file
      --dwell, --no-dwell    Perform dwell correction of homopolymer lengths
  -f, --format=format        Format to output reads (FASTA or SAM)
      --hdf5-chunk=size      Chunk size for HDF5 output
      --hdf5-compression=level   Gzip compression level for HDF5 output (0:off,
                             1: quickest, 9: best)
  -l, --limit=nreads         Maximum number of reads to call (0 is unlimited)
      --licence, --license   Print licensing information
      --local=penalty        Penalty for local basecalling
  -m, --min_prob=probability Minimum bound on probability of match
  -o, --output=filename      Write to file rather than stdout
  -p, --prefix=string        Prefix to append to name of each read
  -s, --skip=penalty         Penalty for skipping a base
      --segmentation=chunk:percentile
                             Chunk size and percentile for variance based
                             segmentation
      --slip, --no-slip      Use slipping
  -t, --trim=start:end       Number of events to trim, as start:end
  -y, --stay=penalty         Penalty for staying
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version
> scrappie help raw
Usage: raw [OPTION...] fast5 [fast5 ...]
Scrappie basecaller -- basecall from raw signal

  -#, --threads=nparallel    Number of reads to call in parallel
  -f, --format=format        Format to output reads (FASTA or SAM)
      --hdf5-chunk=size      Chunk size for HDF5 output
      --hdf5-compression=level   Gzip compression level for HDF5 output (0:off,
                             1: quickest, 9: best)
  -H, --homopolymer=homopolymer   Homopolymer run calc. to use: choose from
                             nochange (the default) or mean. Not implemented
                             for CRF.
  -l, --limit=nreads         Maximum number of reads to call (0 is unlimited)
      --licence, --license   Print licensing information
      --local=penalty        Penalty for local basecalling
  -m, --min_prob=probability Minimum bound on probability of match
      --model=name           Raw model to use: "raw_r94", "rgrgr_r94"
                             "rgrgr_r941","rgrgr_r10", "rnnrf_r94"
  -o, --output=filename      Write to file rather than stdout
  -p, --prefix=string        Prefix to append to name of each read
  -s, --skip=penalty         Penalty for skipping a base
      --segmentation=chunk:percentile
                             Chunk size and percentile for variance based
                             segmentation
      --slip, --no-slip      Use slipping
      --temperature1=factor  Temperature for softmax weights
      --temperature2=factor  Temperature for softmax bias
  -t, --trim=start:end       Number of samples to trim, as start:end
  -y, --stay=penalty         Penalty for staying
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version


> scrappie help squiggle
Usage: squiggle [OPTION...] fasta [fasta ...]
Scrappie squiggler

  -l, --limit=nreads         Maximum number of reads to call (0 is unlimited)
      --licence, --license   Print licensing information
  -m, --model=name           Squiggle model to use: "squiggle_r94",
                             "squiggle_r10"
  -o, --output=filename      Write to file rather than stdout
  -p, --prefix=string        Prefix to append to name of each read
      --rescale, --no-rescale   Rescale network output
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Output formats

Scrappie basecalling currently supports two output formats, FASTA and SAM. The default format is FASTA; SAM output is enabled using the --format SAM commandline argument.

Scrappie can emit SAM "alignment" lines containing the sequences but no quality information. No other fields, including a SAM header, are emitted. A CRAM or BAM file can be obtained using samtools (tested with version 1.4.1) as follows:

scrappie raw -f sam reads | samtools view -Sb - > output.bam
scrappie raw -f sam reads | samtools view -SC - > output.cram
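Because the SAM lines carry only the sequences, they can also be turned back into FASTA with standard text tools. The sketch below relies only on the SAM column layout (column 1 is the read name, column 10 the sequence); the function name sam_to_fasta and the sample record are hypothetical and are not actual Scrappie output.

```shell
# Convert SAM alignment lines on stdin to FASTA on stdout.
# Column 1 is the read name (QNAME), column 10 the sequence (SEQ).
# Lines starting with "@" (SAM header lines) are skipped if present.
sam_to_fasta() {
    awk -F '\t' '!/^@/ && NF >= 10 { print ">" $1; print $10 }'
}

# Hypothetical record for illustration: no qualities ("*" in column 11).
printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tGATTACA\t*\n' | sam_to_fasta
```

Any metadata that the FASTA output would carry in its description is not reconstructed by this conversion.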

FASTA

When the output format is FASTA (the default), some metadata is stored in the description
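The meta-data pipeline in the Running section suggests that the description is everything after the first space of the header line and that it can be read as JSON. Under that assumption, the sketch below separates the read name from the description; the function name fasta_headers and the sample record (including its key) are made up for illustration.

```shell
# Print "<read name><TAB><description>" for every FASTA header on stdin.
# Hypothetical helper for illustration; not shipped with Scrappie.
fasta_headers() {
    awk '/^>/ {
        line = substr($0, 2)          # drop the leading ">"
        sp = index(line, " ")         # the first space ends the read name
        if (sp == 0) { print line "\t"; next }
        print substr(line, 1, sp - 1) "\t" substr(line, sp + 1)
    }'
}

# Hypothetical record; the key "example_key" is not a real Scrappie field.
printf '>read1 {"example_key": 1}\nGATTACA\n' | fasta_headers
```

The second column can then be passed to any JSON-aware tool, such as the misc/json_to_tsv.py script used in the Running section.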

Squiggle format

When Scrappie is used to predict squiggles, it outputs a tab-separated file with the following columns:

Where squiggles for more than one sequence are requested, the entries are separated by a line containing a hash symbol '#' followed by the sequence name.
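For downstream tools that expect a single flat table, the '#' separator lines can be folded into an extra leading column. This is a minimal sketch: the function name squiggle_flatten is hypothetical, the sample rows are placeholders rather than real predictions, and any column-name line in the real output would be prefixed like a data row.

```shell
# Prefix every row with the sequence name taken from the preceding "#" line.
# Hypothetical helper for illustration; not shipped with Scrappie.
squiggle_flatten() {
    awk '/^#/ { name = substr($0, 2); sub(/^[ \t]+/, "", name); next }
         NF   { print name "\t" $0 }'
}

# Hypothetical two-sequence input; the values are placeholders.
printf '#seq1\n0\tA\t1.0\n1\tC\t2.0\n#seq2\n0\tG\t3.0\n' | squiggle_flatten
```

Each output row then starts with the name of the sequence it belongs to, followed by the original tab-separated columns unchanged.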

By default, the output of the squiggle prediction network is scaled into natural coordinates. The untransformed values are accessible by using the --no-rescale argument. When this is given, the 'standard deviation' and 'dwell' columns change as follows:

Gotchas and notes

Methylation and other modifications

The models underlying Scrappie are trained from PCR'd data. Methylated bases, and other modifications, will manifest as errors rather than as the appropriate canonical base. Models that call modified bases as their canonical bases will be released in a future version of Scrappie.