<p align="center"><img src="logo.png" alt="Bacterial genome assemblies with multiplex-MinION sequencing"></p>

Completing bacterial genome assemblies with multiplex MinION sequencing

This repository contains supplemental data and code for our paper: Completing bacterial genome assemblies with multiplex MinION sequencing.

Here you will find the scripts used to generate our data and figures, links to the reads and assemblies, and summaries of our results. We have also included data from other assemblers not mentioned in the paper to facilitate comparison. If you have results from other assembly methods that you would like to share, we are happy to include them here. Please open a GitHub pull request with your results or create an issue on this repo.

Links to read data

I did not put the ONT fast5 files on figshare due to their size (157 GB before basecalling and 1.5 TB after basecalling). If you are interested in these, please contact me and we can try to work something out.

Basecalling and assembly

The ONT_barcode_basecalling_and_assembly.sh script carries out the following steps:

Each of these steps can be turned on/off using the variables at the top of the script. Details for some of the steps are described below.

Software versions used

ONT read processing

When basecalling ONT reads using Albacore (ONT's command-line basecaller), we used the --barcoding option to sort the reads into barcode bins. We then ran Porechop on each bin to remove adapter sequences and discard chimeric reads.

When running Porechop we also used its barcode binning, so that we could keep only reads where both Albacore and Porechop agreed on the barcode bin. For example, Albacore put 95064 reads in the bin for barcode 1. Of these, Porechop put 90919 in the barcode 1 bin, 118 reads into bins for other barcodes and 3941 reads into no bin, and it discarded 86 reads as chimeras. By using only the 90919 reads where Albacore and Porechop agreed, we minimised barcode cross-contamination.
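The agreement filter can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function name, the dict-based inputs and the `"none"` label are assumptions made for the example. The last lines check that the barcode 1 counts quoted above add up to the Albacore bin size.

```python
def reads_with_agreeing_bins(albacore_bins, porechop_bins):
    """Return IDs of reads that both tools placed in the same barcode bin.

    Both arguments are hypothetical dicts mapping read ID -> bin label.
    A read missing from porechop_bins (e.g. discarded as a chimera) or
    labelled "none" by either tool is never kept.
    """
    kept = set()
    for read_id, albacore_bin in albacore_bins.items():
        porechop_bin = porechop_bins.get(read_id)
        if albacore_bin != "none" and albacore_bin == porechop_bin:
            kept.add(read_id)
    return kept


# Toy input: r2 is re-binned, r3 is unbinned, r4 was discarded as a chimera.
albacore = {"r1": "BC01", "r2": "BC01", "r3": "BC01", "r4": "BC01"}
porechop = {"r1": "BC01", "r2": "BC02", "r3": "none"}
print(sorted(reads_with_agreeing_bins(albacore, porechop)))

# Barcode 1 counts from the text: the four Porechop outcomes sum to the Albacore bin.
agree, other_bin, no_bin, chimeric = 90919, 118, 3941, 86
total = agree + other_bin + no_bin + chimeric
print(total, f"{agree / total:.1%} of the Albacore bin retained")
```

Requiring agreement trades a few percent of the reads for a lower chance that a read from another sample ends up in the bin.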

All reads shorter than 2 kbp were discarded for each sample – due to the long read N50s this was a very small proportion of the reads. For samples which still had more than 500 Mbp of reads, we subsampled the read set down to 500 Mbp. This was done using read quality – specifically the reads' minimum qscore over a sliding window. This means that the discarded reads were the ones which had the lowest-quality regions, as indicated by their qscores. This was done with the fastq_to_fastq.py script in this repo.
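The length filter and quality-based subsampling can be sketched roughly as below. This is not the logic of fastq_to_fastq.py itself: the function names, the tuple layout and the 50-base window are assumptions chosen for illustration, and the real script may score or select reads differently.

```python
def min_window_quality(qualities, window=50):
    """Lowest mean qscore over any sliding window (window size is an assumed value)."""
    if not qualities:
        return 0.0
    if len(qualities) <= window:
        return sum(qualities) / len(qualities)
    window_sum = sum(qualities[:window])
    lowest = window_sum
    for i in range(window, len(qualities)):
        # Slide the window one base to the right.
        window_sum += qualities[i] - qualities[i - window]
        lowest = min(lowest, window_sum)
    return lowest / window


def subsample(reads, min_length=2000, target_bases=500_000_000, window=50):
    """Drop short reads, then keep the best-scoring reads up to target_bases.

    reads is a list of (name, sequence, qualities) tuples, where qualities
    is a list of integer Phred scores (a hypothetical in-memory layout).
    """
    long_enough = [r for r in reads if len(r[1]) >= min_length]
    if sum(len(r[1]) for r in long_enough) <= target_bases:
        return long_enough  # already at or under the target, nothing to discard
    ranked = sorted(long_enough,
                    key=lambda r: min_window_quality(r[2], window),
                    reverse=True)
    kept, total = [], 0
    for read in ranked:
        if total >= target_bases:
            break
        kept.append(read)
        total += len(read[1])
    return kept


# Tiny demonstration with scaled-down thresholds.
toy_reads = [
    ("short", "A" * 5, [30] * 5),
    ("good", "A" * 10, [30] * 10),
    ("dip", "A" * 10, [30] * 4 + [2] * 2 + [30] * 4),
    ("ok", "A" * 10, [20] * 10),
]
kept = subsample(toy_reads, min_length=8, target_bases=15, window=2)
print([name for name, _, _ in kept])
```

Ranking by the worst window rather than the mean qscore penalises reads that contain a stretch of poor sequence even if the rest of the read is good.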

Illumina-only assembly

We used the trimmed Illumina reads as input for SPAdes and Unicycler:

For SPAdes, the contigs.fasta file was taken as the final assembly.

ONT-only assembly

The subsampled ONT reads were used as input for Canu and Unicycler:

Hybrid assembly

The trimmed Illumina reads and subsampled ONT reads were used as input for hybrid assemblies:

Polishing with Nanopolish

We used Nanopolish to get the most accurate possible ONT-only assemblies:

For this step we used the full set of trimmed ONT reads (before the read sets were subsampled). After using nanopolish extract to produce a fasta file from Albacore's output directory, we used this script to exclude reads where Porechop disagreed on the bin.

We tried a second round of Nanopolish but found that it did not significantly change the results, so here we only report results from a single round of Nanopolish.

Error rate estimation

The files in the error_rate_estimation directory were used to get error rate estimates for assemblies. We...

Because it uses only large (10+ kbp) contigs, this method only covers non-repetitive DNA. Error rates in repetitive regions may be higher.

Depth per replicon

The files in the depth_per_replicon directory were used to generate Figure S4 which shows the read depth for each plasmid, relative to the chromosomal depth, for both Illumina and ONT reads. It shows that small plasmids are very underrepresented in ONT reads.
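The relative depth described above is a simple ratio. The sketch below shows the calculation only; the function name is hypothetical and the depths are made-up values for illustration, not numbers from the paper or from the depth_per_replicon files.

```python
def relative_depth(replicon_depth, chromosome_depth):
    """Read depth of a replicon expressed relative to the chromosomal depth."""
    if chromosome_depth <= 0:
        raise ValueError("chromosome depth must be positive")
    return replicon_depth / chromosome_depth


# Made-up depths for illustration only; not values from the paper.
example_depths = {"chromosome": 100.0, "large_plasmid": 150.0, "small_plasmid": 5.0}
for name, depth in example_depths.items():
    print(name, relative_depth(depth, example_depths["chromosome"]))
```

A value well below 1 in the ONT reads for a replicon that is at or above 1 in the Illumina reads is the kind of underrepresentation the figure reports for small plasmids.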

ONT-only error rates

The files in the nanopore_only_error_rates directory were used to generate Figure S3, which shows Canu error rates (before and after Nanopolish) against ONT read depth.

Result table

The results.xlsx file contains statistics on each read set and assembly. The summaries below were taken from this table.
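Several of the summaries report N50. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the contig length at which contigs of that length or longer contain at least half of the assembled bases. This is a generic illustration and not necessarily how the values in results.xlsx were computed.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```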

Results: Illumina-only assemblies

| Assembler | Mean contigs | Mean N50 | Complete large plasmids | Complete small plasmids | Estimated error rate |
| --- | --- | --- | --- | --- | --- |
| SPAdes | 379.1 | 218,479 | n/a | n/a | 0.0001% |
| Unicycler | 191.8 | 293,648 | 2 / 28 | 12 / 29 | 0.0000% |

Links to assemblies

Metrics

Conclusions

Overall, Unicycler and SPAdes perform similarly when assembling the Illumina reads – not surprising, since Unicycler uses SPAdes to assemble Illumina reads. Unicycler achieves slightly better values because it uses a wider k-mer range than SPAdes does by default. Experimenting with larger values for SPAdes' -k option would probably give results close to Unicycler's.

Both Unicycler and SPAdes had extremely low error rates. This means that their assemblies are in very good agreement with ABySS and Velvet for the non-repetitive sequences assessed. This agreement between different assemblers supports our assumption that Illumina-only assemblies have near-perfect base-level accuracy (at least for non-repetitive sequence).

The SPAdes mean contig count is greatly inflated by sample INF163 which has some low-depth contamination. The SPAdes assembly has many contigs from this contamination, but they are filtered out in the Unicycler assembly. Excluding that sample, the mean contig count for SPAdes is 213.7, much closer to Unicycler's value.
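The two means quoted above imply roughly how many contigs the contaminated sample contributed. The sketch below works this out, assuming 12 samples in total (inferred from the "/ 12" denominators in the results summaries, not stated in this paragraph). Because both means are rounded, the result is only approximate.

```python
def outlier_value(mean_all, n_all, mean_without):
    """Value of one excluded sample, given the mean with and without it."""
    return mean_all * n_all - mean_without * (n_all - 1)


# 12 samples is an assumption taken from the "/ 12" denominators in the results.
inf163_contigs = outlier_value(379.1, 12, 213.7)
print(f"Approximate SPAdes contig count for INF163: {inf163_contigs:.0f}")
```

Under that assumption INF163 accounts for roughly 2,200 contigs, about ten times the mean of the other samples.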

As expected for short reads, neither assembler was very good at completing large plasmids, as they usually contained sequence shared with other replicons. Even though exact completed-plasmid counts aren't available for SPAdes, it seemed to perform similarly to Unicycler on small plasmids – assembling them into single contigs when they contain only unique sequence, and into incomplete contigs when they share sequence with each other.

Results: ONT-only assemblies

| Assembler | Mean N50 | Complete chromosomes | Complete large plasmids | Complete small plasmids | Estimated error rate (pre-Nanopolish) | Estimated error rate (post-Nanopolish) |
| --- | --- | --- | --- | --- | --- | --- |
| Canu | 4,784,356 | 4 / 12 | 23 / 28 | 0 / 29 | 1.2219% | 0.6681% |
| Unicycler | 4,965,584 | 7 / 12 | 27 / 28 | 5 / 29 | 1.0164% | 0.6164% |

Links to assemblies

Metrics

Conclusions

Neither Canu nor Unicycler was particularly good at recovering small plasmids. This is probably because the small plasmids are very underrepresented in the ONT reads. Unicycler did manage to assemble a few small plasmids, while Canu didn't get any. Altering Canu's settings as described here may help.

The estimated error rates for both Canu and Unicycler are much higher than Illumina-only assemblies: near 1% (i.e. one error per ~100 bp in the assembly). Unicycler's error rates were slightly lower than Canu's, probably due to its repeated application of Racon to the assembly. Running Racon on Canu's assembly would most likely result in a similar error rate to Unicycler's assemblies. Nanopolish was able to repair about half of the errors.
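The figures in this paragraph can be checked with a little arithmetic on the ONT-only error rates from the summary (pre-Nanopolish and post-Nanopolish, in percent). The helper names below are hypothetical; the sketch only converts a percentage error rate into bases per error and computes the fraction of errors removed by polishing.

```python
def bases_per_error(error_rate_percent):
    """Average number of assembly bases per error for a percentage error rate."""
    return 100.0 / error_rate_percent


def fraction_repaired(before_percent, after_percent):
    """Fraction of the original errors that polishing removed."""
    return (before_percent - after_percent) / before_percent


# (pre-Nanopolish, post-Nanopolish) error rates in percent, from the summary.
rates = {"Canu": (1.2219, 0.6681), "Unicycler": (1.0164, 0.6164)}
for assembler, (before, after) in rates.items():
    print(f"{assembler}: one error per {bases_per_error(before):.0f} bp before, "
          f"one per {bases_per_error(after):.0f} bp after, "
          f"{fraction_repaired(before, after):.0%} of errors repaired")
```

By this calculation Nanopolish removed about 45% of Canu's errors and about 39% of Unicycler's, consistent with "about half".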

Results: hybrid assemblies

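| Assembler | Mean N50 | Complete chromosomes | Complete large plasmids | Complete small plasmids | 100% complete | Estimated error rate |
| --- | --- | --- | --- | --- | --- | --- |
| SPAdes | 4,391,534 | n/a | n/a | n/a | n/a | 0.0000% |
| Canu+Pilon | 4,831,660 | 4 / 12 | 23 / 28 | 0 / 29 | 0 / 12 | 0.0039% |
| Unicycler | 5,334,509 | 12 / 12 | 28 / 28 | 18 / 29 | 7 / 12 | 0.0000% |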

Links to assemblies

Metrics

Conclusions

Unicycler does quite well here because hybrid assemblies are its primary focus. Of its five assemblies which were not 100% complete, four were due to incomplete small plasmids (underrepresented in the ONT reads). The remaining incomplete assembly (sample INF164) was due to a discrepancy between the Illumina and ONT reads – an 18 kbp sequence was present in the Illumina sample but absent in the ONT sample, causing an incomplete component in the assembly graph. This was most likely caused by a biological change between cultures used for DNA extraction.

Unicycler and SPAdes both produce their hybrid assemblies by scaffolding an Illumina-only assembly graph. This explains why their error rates are as low as those of the Illumina-only assemblies.

For Canu+Pilon, the error rates after each of the five rounds of Pilon polishing were 0.0427%, 0.0051%, 0.0039%, 0.0039% and 0.0039%. The rate plateaued at 0.0039%, suggesting that three rounds of Pilon polishing are sufficient. The error rate never got as low as the SPAdes/Unicycler error rate, indicating that Pilon could correct most but not all errors in the ONT-only assembly. I'm not sure what's causing this and it deserves closer investigation!
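The plateau can be found programmatically from the per-round error rates listed above. The function below is a hypothetical helper written for this illustration, not part of the repository's scripts: it returns the last round that still lowered the error rate by more than a chosen tolerance.

```python
def rounds_until_plateau(error_rates, tolerance=0.0):
    """Return the 1-based index of the last round that improved the error rate.

    error_rates[i] is the rate after round i + 1. A round counts as an
    improvement only if it lowers the rate by more than tolerance.
    """
    for i in range(1, len(error_rates)):
        if error_rates[i - 1] - error_rates[i] <= tolerance:
            return i
    return len(error_rates)


# Error rates (in percent) after each of the five Pilon rounds, from the text.
pilon_rates = [0.0427, 0.0051, 0.0039, 0.0039, 0.0039]
print(rounds_until_plateau(pilon_rates))
```

For these values the function returns 3, matching the observation that the fourth and fifth rounds did not change the estimate.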