plassembler
Automated Bacterial Plasmid Assembly Program
plassembler is a program designed for fast, automated assembly of plasmids in bacterial genomes that have been hybrid sequenced with long reads and paired-end short reads. It was originally designed for Oxford Nanopore Technologies long reads, but it also works with PacBio reads. As of v1.3.0, it also works well for long-read-only assembled genomes.
If you are assembling a small number of bacterial genomes manually, I would recommend starting by using Trycycler to recover the chromosome before using Plassembler to recover plasmids, especially the small ones.
Otherwise, I recommend you don't actually use Plassembler by itself. If you have more genomes or want to assemble your genomes in a more automated way, I would recommend Hybracter. If you use Hybracter, you will not need to use Plassembler separately, as it is built in. But please still cite Plassembler.
Quick Start
The easiest way to install plassembler is via conda:
conda install -c bioconda plassembler
Followed by database download and installation:
plassembler download -d <database directory>
And finally run plassembler:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length>
Please read the Installation section for more details, especially if you are an inexperienced command line user.
Container
Alternatively, a Docker/Singularity Linux container image is available for Plassembler (starting from v1.6.2) here. This will likely be useful for running Plassembler in HPC environments.
To install and run v1.6.2 with Singularity:
IMAGE_DIR="<the directory you want the .sif file to be in >"
singularity pull --dir $IMAGE_DIR docker://quay.io/gbouras13/plassembler:1.6.2
containerImage="$IMAGE_DIR/plassembler_1.6.2.sif"
# example command with test fastqs
singularity exec $containerImage plassembler download -d plassembler_db
singularity exec $containerImage plassembler run -l test_data/Fastqs/test_long_reads.fastq.gz \
-1 test_data/Fastqs/test_short_reads_R1.fastq.gz -2 test_data/Fastqs/test_short_reads_R2.fastq.gz -d plassembler_db \
-o output_test_singularity -t 4 -c 50000
Google Colab Notebook
If you don't want to install plassembler locally, you can run it without writing any code using the Colab notebook: https://colab.research.google.com/github/gbouras13/plassembler/blob/main/run_plassembler.ipynb
This is only recommended if you have one or a few samples to assemble. Because Google Colab resources are limited, each sample will probably take an hour or two. If you have more samples than this, a local install is recommended.
Manuscript
plassembler has been recently published in Bioinformatics:
George Bouras, Anna E. Sheppard, Vijini Mallawaarachchi, Sarah Vreugde, Plassembler: an automated bacterial plasmid assembly tool, Bioinformatics, Volume 39, Issue 7, July 2023, btad409, https://doi.org/10.1093/bioinformatics/btad409.
If you use plassembler, please see the full Citations section for a list of all programs plassembler uses under the hood, in order to fully recognise the creators of these tools for their work.
Documentation
The full documentation for Plassembler can be found here.
Table of Contents
- plassembler
- Automated Bacterial Plasmid Assembly Program
- Quick Start
- Manuscript
- Documentation
- Table of Contents
- plassembler v1.5.0 Update New Database (21 November 2023)
- plassembler v1.3.0 Updates (24 October 2023)
- Why Does Plassembler Exist?
- Why Not Just Use Unicycler?
- Other Features
- Quality Control
- Metagenomes
- Installation
- Unicycler v0.5.0 Installation Issues
- Running plassembler
- Outputs
- Benchmarking
- Acknowledgements
- Version Log
- Bugs and Suggestions
- Citations
plassembler v1.5.0 Update New Database (21 November 2023)
- If you upgrade to v1.5.0, you will need to update the database using plassembler download.
- Plassembler v1.5.0 incorporates a new expanded database thanks to the recent PLSDB release 2023_11_03_v2. Thanks @biobrad for the heads up.
plassembler v1.3.0 Updates (24 October 2023)
- plassembler long should yield improved results. It achieves this by treating long reads as both short reads (in the sense of creating a de Bruijn graph based short read assembly to begin) and long reads (for scaffolding) in Unicycler.
- While I'd still recommend short reads if you can get them, I am now confident that if your isolate has small plasmids in the long read set, plassembler long is very likely to find and recover them.
- For more information, see the documentation.
- The ability to specify --flye_assembly and --flye_info instead of --flye_directory, if you already have a Flye assembly for your long reads, has been added. Thanks to @incoherentian's issue.
- The ability to specify --no_copy_numbers with plassembler assembled, if you just want to run some plasmids against the PLSDB, has been added. Thanks to @gaworj's issue.
Why Does Plassembler Exist?
In long-read assembled bacterial genomes, small plasmids are difficult to assemble correctly with long read assemblers. They commonly have circularisation issues and can be duplicated or missed (see this, this and this). This recent paper in Microbial Genomics by Johnson et al also suggests that long read assemblers particularly miss small plasmids.
plassembler was therefore created as a fast automated tool to ensure plasmids are assembled correctly without duplicated regions for high-throughput uses - like Unicycler but a lot faster - and to provide some useful statistics as well (such as estimated plasmid copy numbers for both long and short read sets).
As it turns out (though this wasn't a motivation for making it), plassembler also recovers more small plasmids than the existing gold standard tool Unicycler. I think this is because it throws away chromosomal reads, similar to subsampling short read sets, which can improve recovery. As plasmid reads make up a larger proportion of the overall read set, there seems to be a higher chance of recovering smaller plasmids.
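To make the read-proportion argument concrete, here is a small worked example in Python. It is only an illustration: the genome sizes, copy number and the fraction of chromosomal reads removed are hypothetical, the sampling model (bases sequenced in proportion to replicon length times copy number) is a simplifying assumption, and the function is not part of plassembler.

```python
def plasmid_read_fraction(chrom_bp, plasmid_bp, copy_number, chrom_removed=0.0):
    """Expected fraction of sequenced bases that come from the plasmid.

    Simplifying assumption: bases are sampled in proportion to replicon
    length times copy number, and a fraction `chrom_removed` of the
    chromosomal bases has been filtered out before assembly.
    """
    chrom_bases = chrom_bp * (1.0 - chrom_removed)
    plasmid_bases = plasmid_bp * copy_number
    return plasmid_bases / (chrom_bases + plasmid_bases)


# Hypothetical isolate: 5 Mbp chromosome, 5 kbp plasmid at 10 copies per cell.
before = plasmid_read_fraction(5_000_000, 5_000, 10)
# The same isolate after discarding 99% of the chromosomal bases.
after = plasmid_read_fraction(5_000_000, 5_000, 10, chrom_removed=0.99)

print(f"plasmid fraction before filtering: {before:.4f}")
print(f"plasmid fraction after filtering:  {after:.4f}")
```

Under these made-up numbers the plasmid goes from roughly 1% of the data to roughly half of it, which is the intuition behind the improved recovery described above.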
You can see this increase in accuracy and speed in the benchmarking results for simulated and real datasets.
Plassembler also uses mash as a quick way to determine whether each assembled contig has any similar hits in PLSDB.
Additionally, due to its mapping approach, Plassembler can also be used as a quality control tool for checking whether your long and short read sets come from the same isolate. This may be particularly useful if your read sets come from different extractions, or you have multiplexed many samples (& want to avoid mislabelling).
Why Not Just Use Unicycler?
Unicycler is awesome and still a good way to assemble plasmids from hybrid sequencing - plassembler uses it! But there are a few reasons to use plassembler instead:
- Time. Plassembler throws away all the chromosomal reads (i.e. most of them) before running Unicycler, so it is much faster (generally 3-10x faster in wall-clock time).
- Accuracy. Benchmarking has shown plassembler is better than Unicycler in terms of recovering small plasmids.
- plassembler will output only the likely plasmids, and can more easily be integrated into pipelines. You shouldn't be assembling the chromosome using Unicycler anymore, so plassembler can get you only what is necessary from Unicycler.
- plassembler will give you summary depth and copy number stats for both long and short reads.
- plassembler can be used as a quality control to check if your short and long reads come from the same sample - if plassembler results in many non-circular contigs (particularly those that have no hits in PLSDB), it is likely because your read sets do not come from the same isolate! See Quality Control.
- You will get information on whether each assembled contig has a similar entry in PLSDB. Especially for common pathogen species that are well represented in databases, this will likely tell you specifically what plasmid you have in your sample.
- Note: Especially for less commonly sequenced species, I would not suggest that the absence of a PLSDB hit is necessarily meaningful, especially for circular contigs - those would likely be novel plasmids uncaptured by PLSDB.
Other Features
- Assembled mode.
- Thanks to a suggestion from gaworj, assembled mode has been added to Plassembler. This allows you to calculate the copy numbers of already assembled plasmids you may have, skipping assembly.
You can use this feature with plassembler assembled.
- Multi-mapped reads.
- All long reads that map to multiple contigs (mostly reads that map to both the chromosome and plasmids, but also reads that map to multiple putative plasmids) will be extracted when using the --keep_fastqs option. These may be of interest if you are looking at shared mobile genetic elements.
- Multiple chromosome bacteria/megaplasmids/chromids
- Plassembler should work with bacteria with multiple chromosomes, megaplasmids or chromids. In this case, I would treat the megaplasmids etc like chromosomes and assemble them using a long-read first approach with Trycycler or Dragonflye, as they are of approximately chromosome size.
- I'd still use Plassembler to recover small plasmids - for example, Plassembler v1.1.0 recovered the 77.5 kbp plasmid along with a 5386 bp contig (corresponding to phage phiX174, a common sequencing spike-in) in Vibrio campbellii DS40M4 (see this paper and this bioproject).
- -c needs to be smaller than the size of the largest chromosome-like element.
- For example, for the Vibrio example, which had approximately 1.8 Mbp and 3.3 Mbp chromosomes, I used -c 1500000.
Please see here for more details and an example.
- Phages, Phage-Plasmids and Other Extrachromosomal Replicons
- If you have sufficient hybrid sequencing data, Plassembler will theoretically recover assemblies of all non-chromosomal replicons, including phages and phage-plasmids.
- A good example of this is the Vibrio campbellii DS40M4 example, where Plassembler recovered the assembly of phage phiX174, albeit it was from sequencing spike-in contamination in that case.
- Plasmid Only Assembly
- You can also use Plassembler for plasmid-only assembly by passing --no_chromosome. Use this if your reads only contain plasmids that you would like to assemble.
Quality Control
plassembler can also be used for quality control to test whether your long and short read sets come from the same isolate, even within the same species.
Please see here for more details and some examples.
Metagenomes
plassembler is not currently recommended for metagenomic datasets because their high diversity makes it difficult to recover chromosome-length contigs for bacteria. Additionally, Unicycler is not recommended for metagenomes. However, plassembler was tested on a high-depth, very simple mock community dataset from this paper. It worked quite nicely, recovering the 5 known plasmids, but we don't anticipate it will work as well on your data! If you try it and it works, please let us know.
Please see here for more details.
Installation
Plassembler has been tested on Linux and MacOS machines.
Conda
The easiest way to install plassembler is via conda - Plassembler is on bioconda.
conda install -c bioconda plassembler
or mamba for quicker solving:
mamba install -c bioconda plassembler
This will install all the dependencies along with plassembler.
Pip
You can install the Python components of plassembler using pip.
pip install plassembler
You will then need to install the external dependencies separately. They are listed in build/environment.yaml:
- Flye >=2.9
- Unicycler >=0.4.8
- Minimap2 >=2.11
- fastp >=0.18.0
- chopper >=0.5.0
- mash >=2.2
- Raven >=1.8
- Samtools >=0.15.0
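After a pip install, it can be handy to check that the external tools are actually reachable before starting a run. The following is a minimal sketch, not part of plassembler: it assumes each tool's executable has the lowercase name shown in the list above, and it only checks for presence on PATH, not for the minimum versions.

```python
import shutil

# Assumed executable names, taken from the dependency list above.
REQUIRED_TOOLS = [
    "flye", "unicycler", "minimap2", "fastp",
    "chopper", "mash", "raven", "samtools",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing external dependencies: " + ", ".join(missing))
else:
    print("All external dependencies were found on PATH.")
```

Version requirements still need to be checked by hand, for example by running each tool with its own version flag.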
Source
Alternatively, the development version of plassembler can be installed manually via GitHub.
git clone https://github.com/gbouras13/plassembler.git
cd plassembler
pip install -e .
Unicycler v0.5.0 Installation Issues
plassembler works best with Unicycler v0.5.0. With Unicycler v0.4.8, plassembler should still run without any issue and provide a satisfactory assembly, but you will be warned of this when you run plassembler. plassembler will not work with any older version of Unicycler.
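The version rules above can be sketched as a small Python helper. This is an illustration rather than plassembler's own check: the function name and return labels are hypothetical, it assumes the version text contains a dotted x.y.z number (for example the text printed by unicycler --version), and it makes two further assumptions the text does not state, namely that versions newer than v0.5.0 are fine and that anything between v0.4.8 and v0.5.0 behaves like v0.4.8.

```python
import re


def classify_unicycler_version(version_text):
    """Classify a Unicycler version string against the rules described above.

    Returns "recommended" for v0.5.0 or newer, "supported_with_warning" for
    v0.4.8 up to (but not including) v0.5.0, and "unsupported" for anything
    older than v0.4.8.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    version = tuple(int(part) for part in match.groups())
    if version >= (0, 5, 0):
        return "recommended"
    if version >= (0, 4, 8):
        return "supported_with_warning"
    return "unsupported"


for text in ("Unicycler v0.5.0", "Unicycler v0.4.8", "Unicycler v0.4.7"):
    print(text, "->", classify_unicycler_version(text))
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, where "0.10.0" would sort before "0.5.0".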
Linux
For Linux environments, Unicycler v0.5.0 should be installed automatically with the plassembler bioconda installation.
You can force it as follows:
conda install -c bioconda plassembler unicycler==0.5.0
or manually install Unicycler v0.5.0 after installing plassembler:
conda install -c bioconda plassembler
pip3 install git+https://github.com/rrwick/Unicycler.git
MacOS
For MacOS environments, the current conda installation method will only install the latest available bioconda Unicycler version of v0.4.8.
Ryan Wick (the author of Unicycler) suggests that v0.5.0 should be used, as v0.4.8 is not compatible with the latest versions of SPAdes (see here). This will require another installation step on MacOS.
To install Unicycler v0.5.0, it is recommended that you install Unicycler from GitHub after installing Plassembler, as follows:
# installs plassembler into an environment called 'plassemblerENV' and activates it
conda create -n plassemblerENV plassembler
conda activate plassemblerENV
# installs Unicycler v0.5.0
pip3 install git+https://github.com/rrwick/Unicycler.git
Mac M1 users may need to change some compiler settings and install from the Unicycler github repo e.g.
# installs plassembler into an environment called 'plassemblerENV' and activates it
conda create -n plassemblerENV plassembler
conda activate plassemblerENV
# installs Unicycler v0.5.0
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install --makeargs "CXX=g++"
Running plassembler
To run plassembler, first you need to install the database in a directory of your choosing:
plassembler download -d <database directory>
Once this is finished, you can run plassembler as follows:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length>
- -c will default to 1000000 if it is absent.
To specify more threads:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length> -t <threads>
Plassembler defaults to 1 thread.
To specify a prefix for the output files:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length> -t <threads> -p <prefix>
To specify a minimum long read length and minimum read quality Q-score for filtering with chopper:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length> -t <threads> -p <prefix> -m <min length> -q <min quality>
- -m will default to 500 and -q will default to 9. Note that for some tiny plasmids, -m should be reduced or perhaps even set to 1 (see this paper ).
To overwrite an existing output directory, use -f:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length> -t <threads> -p <prefix> -m <min length> -q <min quality> -f
To use Raven instead of Flye as a long read assembler, use --use_raven:
plassembler run -d <database directory> -l <long read fastq> -o <output dir> -1 <short read R1 fastq> -2 <short read R2 fastq> -c <estimated chromosome length> -t <threads> --use_raven
Please see the documentation for more options.
Usage: plassembler run [OPTIONS]
Runs Plassembler
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-d, --database PATH Directory of PLSDB database. [required]
-l, --longreads PATH FASTQ file of long reads. [required]
-1, --short_one PATH R1 short read FASTQ file. [required]
-2, --short_two PATH R2 short read FASTQ file. [required]
-c, --chromosome INTEGER Approximate lower-bound chromosome length of
bacteria (in base pairs). [default: 1000000]
-o, --outdir PATH Directory to write the output to. [default:
plassembler.output/]
-m, --min_length TEXT minimum length for filtering long reads with
chopper. [default: 500]
-q, --min_quality TEXT minimum quality q-score for filtering long reads
with chopper. [default: 9]
-t, --threads TEXT Number of threads. [default: 1]
-f, --force Force overwrites the output directory.
-p, --prefix TEXT Prefix for output files. This is not required.
[default: plassembler]
--skip_qc Skips qc (chopper and fastp).
--pacbio_model TEXT Pacbio model for Flye. Must be one of pacbio-raw,
pacbio-corr or pacbio-hifi. Use pacbio-raw for
PacBio regular CLR reads (<20 percent error),
pacbio-corr for PacBio reads that were corrected
with other methods (<3 percent error) or pacbio-
hifi for PacBio HiFi reads (<1 percent error).
-r, --raw_flag Use --nano-raw for Flye. Designed for Guppy fast
configuration reads. By default, Flye will assume
SUP or HAC reads and use --nano-hq.
--keep_fastqs Whether you want to keep FASTQ files containing
putative plasmid reads and long reads that map to
multiple contigs (plasmid and chromosome).
--keep_chromosome If you want to keep the chromosome assembly.
--use_raven Uses Raven instead of Flye for long read assembly.
May be useful if you want to reduce runtime.
--flye_directory PATH Directory containing Flye long read assembly.
Needs to contain assembly_info.txt and
assembly_info.fasta. Allows Plassembler to Skip
Flye assembly step.
--flye_assembly PATH Path to file containing Flye long read assembly
FASTA. Allows Plassembler to Skip Flye assembly
step in conjunction with --flye_info.
--flye_info PATH Path to file containing Flye long read assembly
info text file. Allows Plassembler to Skip Flye
assembly step in conjunction with
--flye_assembly.
--no_chromosome Run Plassembler assuming no chromosome can be
assembled. Use this if your reads only contain
plasmids that you would like to assemble.
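If you are scripting many runs, the options above can be assembled programmatically. The sketch below is a hypothetical helper, not part of plassembler: it only uses flags that appear in the help text above, and the file and directory names in the example call are placeholders.

```python
import shlex


def build_run_command(database, long_reads, short_r1, short_r2, outdir,
                      chromosome=1000000, threads=1, prefix=None,
                      min_length=None, min_quality=None,
                      force=False, use_raven=False):
    """Build an argument list for `plassembler run` from the options above."""
    cmd = [
        "plassembler", "run",
        "-d", str(database),
        "-l", str(long_reads),
        "-1", str(short_r1),
        "-2", str(short_r2),
        "-o", str(outdir),
        "-c", str(chromosome),
        "-t", str(threads),
    ]
    # Optional flags are only added when requested, so that plassembler's
    # own defaults apply otherwise.
    if prefix is not None:
        cmd += ["-p", str(prefix)]
    if min_length is not None:
        cmd += ["-m", str(min_length)]
    if min_quality is not None:
        cmd += ["-q", str(min_quality)]
    if force:
        cmd.append("-f")
    if use_raven:
        cmd.append("--use_raven")
    return cmd


# Placeholder paths; replace them with your own files.
cmd = build_run_command("plassembler_db", "long.fastq.gz",
                        "R1.fastq.gz", "R2.fastq.gz", "sample_out",
                        chromosome=2500000, threads=8, force=True)
print(shlex.join(cmd))
```

Returning a list rather than a single string means the command can be passed directly to subprocess.run(cmd, check=True) without shell quoting problems.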
Outputs
Plassembler will output a _plasmids.fasta file, which will contain the assembled plasmid sequence(s) in FASTA format (including long and short read copy numbers in the header), and a _plasmids.gfa file, which will contain the assembly graph from Unicycler that can be visualised in Bandage.
Plassembler also outputs a _summary.tsv file, which gives the estimated copy number for each plasmid, for both short reads and long reads (see this paper for more details about plasmid copy numbers). It also gives each contig's top hit by mash distance in the PLSDB (if there is a hit), along with all its supporting information.
If plassembler fails to assemble any plasmids at all in _plasmids.fasta, all these files will still exist, but will be empty (to ensure plassembler can be easily integrated into workflow managers like Snakemake).
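In a downstream script, the "files exist but are empty" behaviour can be handled explicitly. The sketch below is a hypothetical helper that assumes the summary file is written as <output dir>/<prefix>_summary.tsv with a header row; it deliberately does not assume any particular column names. A temporary directory stands in for a real output directory in the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def read_summary(outdir, prefix="plassembler"):
    """Return the rows of <prefix>_summary.tsv as a list of dicts.

    An empty list means the file exists but holds no records, which is the
    "no plasmids assembled" case described above.
    """
    summary_path = Path(outdir) / f"{prefix}_summary.tsv"
    if summary_path.stat().st_size == 0:
        return []
    with summary_path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstration with an empty summary file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "plassembler_summary.tsv").write_text("")
    empty_rows = read_summary(tmp)

print("plasmids found" if empty_rows else "no plasmids assembled")
```

Checking the file size first avoids treating an empty file as an error, so a workflow step can branch on the result instead of failing.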
plassembler will also output a log file, a flye_output directory, which contains the output from Flye (it may be useful for deciding whether you need more sequencing reads, or whether some strange assembly artifact occurred), and a unicycler_output directory containing the output from Unicycler. If --use_raven is specified, a raven_output directory will be present instead.
Benchmarking
The benchmarking results for simulated and real datasets are available. The full benchmarking output can be found here.
All benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS.
TL;DR: Plassembler is much faster than Unicycler (3-10x usually) and is more accurate because it is more likely to recover low coverage plasmids that Unicycler might miss.
Acknowledgements
Many thanks are owed to Ryan Wick, who not only wrote Unicycler and some other code used in Plassembler, but also gave me some initial ideas about how to approach the plasmid assembly problem originally. If you are doing any bacterial genome assembly, you should read all of his work, but if you have read this far you probably already have.
Also thanks to Vijini Mallawaarachchi who helped refactor the code - if you are interested in recovering phages (especially in the metagenome context) please give phables a go.
Version Log
A brief description of what is new in each update of plassembler can be found in the HISTORY.md file.
Bugs and Suggestions
If you come across bugs with plassembler, or would like to make any suggestions to improve the program, please open an issue or email george.bouras@adelaide.edu.au.
Citations
plassembler has been recently published in Bioinformatics:
George Bouras, Anna E. Sheppard, Vijini Mallawaarachchi, Sarah Vreugde, Plassembler: an automated bacterial plasmid assembly tool, Bioinformatics, Volume 39, Issue 7, July 2023, btad409, https://doi.org/10.1093/bioinformatics/btad409.
If you use plassembler, please also consider citing where relevant:
- Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019). https://doi.org/10.1038/s41587-019-0072-8
- Li H., Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18 Pages 3094–3100 (2018), https://doi.org/10.1093/bioinformatics/bty191
- Wick RR, Judd LM, Gorrie CL, Holt KE Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13(6): e1005595 (2017). https://doi.org/10.1371/journal.pcbi.1005595
- Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools, Bioinformatics, Volume 25, Issue 16, 15 August 2009, Pages 2078–2079, https://doi.org/10.1093/bioinformatics/btp352
- Wick RR, Judd LM, Wyres KL, Holt KE. Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microb Genom. 2021 Aug;7(8):000631. doi: 10.1099/mgen.0.000631. PMID: 34431763; PMCID: PMC8549360.
- Schmartz GP, Hartung A, Hirsch P, Kern F, Fehlmann T, Müller R, Keller A, PLSDB: advancing a comprehensive database of bacterial plasmids, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D273–D278, https://doi.org/10.1093/nar/gkab1111.
- Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x.
- De Coster,W. and Rademakers,R. (2023) NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics, 39, btad311. https://doi.org/10.1093/bioinformatics/btad311.
- Vaser,R. and Šikić,M. (2021) Time-and memory-efficient genome assembly with Raven. Nat. Comput. Sci., 1, 332–336. https://doi.org/10.1038/s43588-021-00073-4.
- Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017 May;27(5):722-736. doi: https://doi.org/10.1101/gr.215087.116.
- Bouras, G., Roach, M. J., Mallawaarachchi V., Grigson., S., Papudeshi., B. (2023) Dnaapler: A tool to reorient circular microbial genomes https://github.com/gbouras13/dnaapler