Awesome
We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/nanoporetech/bonito
This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: support@nanoporetech.com for help with your application if it is not possible to upgrade to our new resources, or we are missing key features.
Scrappie basecaller
Scrappie is a technology demonstrator for the Oxford Nanopore Research Algorithms group.
Ref : GACACAGTGAGGCTGCGTCTC-AAAAAAAAAAAAAAAAAAAAAAAAATTGCCCCTTCTTAAGTTTGCATTTAGATCTCTT
Query : GACACAG-GAGGCTGCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTGCCCCTTCTTAAGCTT-CA--CAGA-CT-TT
For a complete release history, see [RELEASES.md]
Dependencies
- A good BLAS library + development headers including cblas.h.
- The HDF5 library and development headers (with multi-threading support).
On Debian based systems, the following packages are sufficient (tested Ubuntu 14.04 and 16.04)
- libcunit1
- libcunit1-dev
- libhdf5
- libhdf5-dev
- libopenblas-base
- libopenblas-dev
Docker files documenting the minimal installs for various flavours of Linux can be found in the docker/ directory.
The Intel MKL may be used to provide the BLAS library. The combination of the Intel icc
compiler and linking against the MKL can result in significant performance improvements, a
gain of 50% being observed on one machine.
On Mac OSX systems, the argp-standalone package is also required. The argp-standalone package can be installed using the brew package manager (http://brew.sh).
brew install argp-standalone
Scrappie makes use of the OpenMP extensions for multi-processing. These are supported by the system installed compiler on most modern Linux systems but requires a more modern version of the clang/llvm compiler than that installed on Mac OSX machines. Support for OpenMP was adding in clang/llvm in version 3.7 (see http://llvm.org or use brew).
Compiling
mkdir build && cd build && cmake .. && make
If HDF5 or OpenBLAS libraries are stored in non-standard locations, you can specify the HDF5_ROOT and/or OPENBLAS_ROOT cmake option.
cmake -DOPENBLAS_ROOT=/software/gcc/openblas -DHDF5_ROOT=/software/gcc/hdf5 ..
Running
# Set some enviromental variables.
# Allow scrappie to use as many threads as the system will support
export OMP_NUM_THREADS=`nproc`
# Use openblas in single-threaded mode
export OPENBLAS_NUM_THREADS=1
# Call a folder of reads via events
scrappie events reads ... > basecalls.fa
# Call a folder of reads from raw signal
scrappie raw reads ... > basecalls.fa
# Call individual reads
scrappie raw reads/read1.fast5 reads/read2.fast5 > basecalls.fa
# Or using a strand list (skipping first line)
tail -n +2 strand_list.txt | sed 's:^:/path/to/reads/:' | xargs scrappie raw > basecalls.fa
# Using Scrappie in single-threaded mode
find path/to/reads/ -name \*.fast5 | parallel -P ${OMP_NUM_THREADS} scrappie raw --threads 1 > basecalls.fa
# Dump read meta-data to tsv
scrappie raw --threads 1 path/to/reads/ | tee basecalls.fa | grep '^>' | cut -d ' ' -f 2- | python3 misc/json_to_tsv.py > meta_data.tsv
Commandline options
The commandline options accepted by Scrappie depend on whether it is being used to call via events or from raw signal, or predicting the squiggle from the sequence.
> scrappie help events
Usage: events [OPTION...] fast5 [fast5 ...]
Scrappie basecaller -- basecall via events
-#, --threads=nreads Number of reads to call in parallel
--dump=filename Dump annotated events to HDF5 file
--dwell, --no-dwell Perform dwell correction of homopolymer lengths
-f, --format=format Format to output reads (FASTA or SAM)
--hdf5-chunk=size Chunk size for HDF5 output
--hdf5-compression=level Gzip compression level for HDF5 output (0:off,
1: quickest, 9: best)
-l, --limit=nreads Maximum number of reads to call (0 is unlimited)
--licence, --license Print licensing information
--local=penalty Penalty for local basecalling
-m, --min_prob=probability Minimum bound on probability of match
-o, --output=filename Write to file rather than stdout
-p, --prefix=string Prefix to append to name of each read
-s, --skip=penalty Penalty for skipping a base
--segmentation=chunk:percentile
Chunk size and percentile for variance based
segmentation
--slip, --no-slip Use slipping
-t, --trim=start:end Number of events to trim, as start:end
-y, --stay=penalty Penalty for staying
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
> scrappie help raw
Usage: raw [OPTION...] fast5 [fast5 ...]
Scrappie basecaller -- basecall from raw signal
Usage: raw [OPTION...] fast5 [fast5 ...]
Scrappie basecaller -- basecall from raw signal
-#, --threads=nparallel Number of reads to call in parallel
-f, --format=format Format to output reads (FASTA or SAM)
--hdf5-chunk=size Chunk size for HDF5 output
--hdf5-compression=level Gzip compression level for HDF5 output (0:off,
1: quickest, 9: best)
-H, --homopolymer=homopolymer Homopolymer run calc. to use: choose from
nochange (the default) or mean. Not implemented
for CRF.
-l, --limit=nreads Maximum number of reads to call (0 is unlimited)
--licence, --license Print licensing information
--local=penalty Penalty for local basecalling
-m, --min_prob=probability Minimum bound on probability of match
--model=name Raw model to use: "raw_r94", "rgrgr_r94"
"rgrgr_r941","rgrgr_r10", "rnnrf_r94"
-o, --output=filename Write to file rather than stdout
-p, --prefix=string Prefix to append to name of each read
-s, --skip=penalty Penalty for skipping a base
--segmentation=chunk:percentile
Chunk size and percentile for variance based
segmentation
--slip, --no-slip Use slipping
--temperature1=factor Temperature for softmax weights
--temperature2=factor Temperature for softmax bias
-t, --trim=start:end Number of samples to trim, as start:end
-y, --stay=penalty Penalty for staying
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
> scrappie help squiggle
Usage: squiggle [OPTION...] fasta [fasta ...]
Scrappie squiggler
-l, --limit=nreads Maximum number of reads to call (0 is unlimited)
--licence, --license Print licensing information
-m, --model=name Squiggle model to use: "squiggle_r94",
"squiggle_r10"
-o, --output=filename Write to file rather than stdout
-p, --prefix=string Prefix to append to name of each read
--rescale, --no-rescale Rescale network output
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Output formats
Scrappie basecalling current supports two ouput formats, FASTA and SAM. The default format is currently FASTA;
SAM format output is enabled using the --format SAM
commandline argument.
Scrappie can emit SAM "alignment" lines containing the sequences but no quality information. No other fields, include a SAM header are emitted. A CRAM or BAM file can be obtained using samtools
(tested with version 1.4.1) as follows:
scrappie raw -f sam reads | samtools view -Sb - > output.bam
scrappie raw -f sam reads | samtools view -SC - > output.cram
FASTA
When the output is set to FASTA (default) then some metadata is stored in the description
- The sequence ID is the name of the file that was basecalled.
- The description element of the FASTA header is a JSON string containing the following elements:
- Events
filename
Name of fast5 file that read came from (without full path).uuid
UUID for read.normalised_score
Normalised score (total score / number of events or blocks).nevents
Number of events processed.sequence_length
Length of sequence called.events_per_base
Number of events per base called.- Raw
filename
Name of fast5 file that read came from (without full path).uuid
UUID for read.normalised_score
Normalised score (total score / number of events or blocks).nblock
Number of blocks processed.sequence_length
Length of sequence called.blocks_per_base
Mean number of blocks per base.nsample
Number of samples in read.trim
Interval of samples used (lower inclusive, upper exclusive).
Squiggle format
When Scrappie is used to predict squiggles, it outputs a tab-separated file with the following columns:
- Position along reference
- Base from reference at position
- Normalised current
- Standard deviation of normalised current
- Dwell (samples)
Where the squiggles from more than one sequence is requested, the entries are separated by a line containing a hash symbol '#' followed by the sequence name.
By default, the output of the squiggle prediction network is scaled into natural
coordinates. The untransformed values are accessible by using the --no-rescale
argument. When this is given, the 'standard deviation' and 'dwell' columns change as follows:
- Standard deviation -> log Standard deviation
- Dwell -> -log Dwell
Gotya's and notes
- Model is hard-coded. Generate new header files using
- Events:
parse_events.py model.pkl > src/nanonet_events.h
- Raw:
parse_raw.py model.pkl > src/nanonet_raw.h
- Events:
- The normalised score (- total score / number of events) correlates well with read accuracy.
- Reads with unusual rate metrics (number of events or blocks / bases called) may be unreliable.
- Scrappie requires HDF5 library compiled with multi-threading support, see HDF5 concurrent access. If only single-threaded HDF5 library is available then single-threaded Scrappie can be built and parallelized with xargs -- see Running for details.
- The squiggle prediction is based on Laplace distributed errors.
Methylation and other modifications
The models underlying Scrappie are trained from PCR'd data. Methylated bases, and other modifications, will manifest as errors rather than the appropriate cannonical base. Models calling modified bases into cannonical bases will be released in future version of Scrappie.