Awesome
Lyve-SET
LYVE version of the Snp Extraction Tool (SET), a method of using hqSNPs to create a phylogeny.
Installation
After downloading, run make install
. Some versions are unstable. For the latest stable version, please see the releases tag on GitHub. Additionally, there is a method for installing by containers.
See INSTALL.md for more information including prerequisite software
For the impatient
Toy dataset
Here is a way to just try out the test dataset. It runs the lambda dataset (1st param) and places the results into the lambda directory (2nd param)
set_test.pl --numcpus 8 lambda lambda
# try out a larger dataset
set_test.pl --numcpus 8 listeria_monocytogenes listeria_monocytogenes
With your data
set_manage.pl --create yourProject
# paired end reads have to be shuffled into one file per sample
shuffleSplitReads.pl --numcpus 8 -o interleaved *.fastq.gz
# then moved into your project dir
mv interleaved/*.fastq.gz yourProject/reads/
# cleanup
rmdir interleaved
cp reference.fasta yourProject/ref/
launch_set.pl --numcpus 8 -ref yourProject/ref/reference.fasta yourProject
Usage
To see the help for any script, run it without options or with --help
. For example, set_test.pl -h
. The following is the help for the main script, launch_set.pl
:
Usage: launch_set.pl [project] [-ref reference.fasta|reference.gbk]
If project is not given, then it is assumed to be the current working directory.
-ref ref.fasta The reference genome assembly. If it is
a genbank or embl file, then it will be
converted to reference.gbk.fasta and will
be used for SNP annotation. If a fasta
is given, then no SNP annotation will
happen.
Default: project/reference/reference.fasta
COMMON OPTIONS
--allowedFlanking 0 allowed flanking distance in bp.
Nucleotides this close together cannot be
considered as high-quality.
--min_alt_frac 0.75 The percent consensus that needs
to be reached before a SNP is called.
Otherwise, 'N'
--min_coverage 10 Minimum coverage needed before a
SNP is called. Otherwise, 'N'
--presets "" See presets.conf for more information
--numcpus 1 number of cpus
PERFORM CERTAIN STEPS
--mask-phages Search for and mask phages in the reference genome
--mask-cliffs Search for and mask 'Cliffs' in pileups
SKIP CERTAIN STEPS
--nomatrix Do not create an hqSNP matrix
--nomsa Do not make a multiple sequence alignment
--notrees Do not make phylogenies
--singleend Treat everything like single-end. Useful
for when you think there is a single-
end/paired-end bias.
OTHER SHORTCUTS
--fast Shorthand for --downsample --mapper snap --nomask-phages
--nomask-cliffs --sample-sites
--downsample Downsample all reads to 50x. Approximated according
to the ref genome assembly
--sample-sites Randomly choose a genome and find SNPs in a quick
and dirty way. Then on the SNP-calling stage,
only interrogate those sites for SNPs for each
genome (including the randomly-sampled genome).
MODULES
--read_cleaner none Which read cleaner? Choices: none, CGP, BayesHammer
--mapper smalt Which mapper? Choices: smalt, snap
--snpcaller varscan Which SNP caller? Choices: varscan, vcftools
SCHEDULER AND MULTITHREADING OPTIONS
--queue all.q default queue to use
--numnodes 50 maximum number of nodes
--qsubxopts '-N lyve-set' Extra options to pass to qsub. This is not
sanitized; internal options might overwrite yours.
--noqsub Do not use the scheduler, even if it exists
LOCATIONS OF FILE DIRECTORIES
--readsdir readsdir/ where fastq and fastq.gz files are located
--bamdir bamdir/ where to put bams
--vcfdir vcfdir/ where to put vcfs
--tmpdir tmpdir/ tmp/ Where to put temporary files
--msadir msadir/ multiple sequence alignment and tree files (final output)
--logdir logdir/ Where to put log files. Qsub commands are also stored here.
--asmdir asmdir/ directory of assemblies. Copy or symlink the reference genome assembly
to use it if it is not already in the raw reads directory
Run a test dataset
See: examples.md for more details.
Also see: testdata.md for more details on making your own test data set.
The script set_test.pl
will run an actual test on a given dataset
Runs a test dataset with Lyve-SET
Usage: set_test.pl dataset [project]
dataset names could be one of the following:
escherichia_coli, lambda, listeria_monocytogenes, salmonella_enterica_agona
NOTE: project will be the name of the dataset, if it is not given
--numcpus 1 How many cpus you want to use
--do-nothing To print the commands but do not run system calls
# will run the entire lambda phage dataset and produce meaningful results in ./lambda/msa/
$ set_test.pl lambda lambda
Examples
See: examples.md for more details.
The script set_manage.pl
sets up the project directory and adds reads, and you should use the following syntax. Note that paired end reads should be in interleaved format. Scripts that interleave reads include run_assembly_shuffleReads.pl
in the CG-Pipeline package (included with make install
) and also shuffleSequences_fastq.pl
in the Velvet package.
Lyve-SET also has a special script shuffleSplitReads.pl
that will shuffle many reads at once as shown below in the example.
# Shuffle your reads if they are not already. This command
# creates a folder interleaved and creates interleaved files
$ shuffleSplitReads.pl --numcpus 8 -o interleaved *.fastq.gz
# Create the project directory `setTest`
$ set_manage.pl --create setTest
# Add reads
$ for i in interleaved/*.fastq.gz; do
> set_manage.pl setTest --add-reads $i
> done;
# Add assemblies (optional)
$ set_manage.pl setTest --add-assembly file1.fasta
$ set_manage.pl setTest --add-assembly file2.fasta
# Specify your reference genome
$ set_manage.pl setTest --change-reference file3.fasta
Run Lyve-SET with as few options as possible
$ launch_set.pl setProj
More complex
$ launch_set.pl setProj --queue all.q --numnodes 20 --numcpus 16 --notrees
Output files
Most output files that you will want to see are under project/msa. However for more details please see docs/output.md.
To visualize the results, please see docs/VIZ.md.
Getting help
- Check out the FAQ first to see if your question has already been asked
- Join the Google Group at https://groups.google.com/forum/#!forum/lyve-set. This is sometimes the only way I can count the number of people/institutions using Lyve-SET and so I appreciate the head count.
- Tips and tricks
Citing lyve-SET
https://github.com/lskatz/lyve-SET
Katz LS, Griswold T, Williams-Newkirk AJ, Wagner D, Petkau A, et al. (2017) A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens. Frontiers in Microbiology 8.
See Also
Other SNP Pipelines
Lyve-SET sites
- Please register for my head count at https://groups.google.com/forum/#!forum/lyve-set
Sites or organizations that use Lyve-SET
- PathoBacTyper http://halst.nhri.org.tw/PathoBacTyper/
- PulseNet https://www.cdc.gov/pulsenet/