Home

Awesome

<img src="https://github.com/chasewnelson/snpgenie/blob/master/logo1a.jpg?raw=true" alt="SNPGenie logo by Elizabeth Ogle" width="450" height="175" align="middle">

SNPGenie is a collection of Perl scripts for estimating π<sub>N</sub>/π<sub>S</sub>, d<sub>N</sub>/d<sub>S</sub>, and gene diversity from next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. Each analysis uses its own script:

  1. WITHIN-POOL ANALYSIS. Use snpgenie.pl, the original SNPGenie. Analyzes within-sample π<sub>N</sub>/π<sub>S</sub> from pooled NGS SNP data. SNP reports (CLC, Geneious, or VCF) must each correspond to a single population/pool, with variants called relative to one reference sequence (one sequence in one FASTA file). Example:

     snpgenie.pl --vcfformat=4 --snpreport=<variants>.vcf --fastafile=<ref_seq>.fa --gtffile=<CDS_annotations>.gtf
    
  2. WITHIN-GROUP ANALYSIS. Use snpgenie\_within\_group.pl. Traditional within-group analysis of d<sub>N</sub>/d<sub>S</sub> among aligned sequences in a FASTA file, similar to <a target="_blank" href="http://www.megasoftware.net/">MEGA</a>, with automation, sliding windows, parallelism, and overlapping gene features. Example:

     snpgenie_within_group.pl --fasta_file_name=<aligned_seqs>.fa --gtf_file_name=<CDS_annotations>.gtf --num_bootstraps=10000 --procs_per_node=8
    
  3. BETWEEN-GROUP ANALYSIS. Traditional between-group analysis comparing two or more groups (FASTA files) of aligned sequences, such as possible using <a target="_blank" href="http://www.megasoftware.net/">MEGA</a>, with automation, sliding windows, parallelism, and overlapping gene features. Example:

     snpgenie_between_group.pl
    

UPDATE FOR VCF INPUT: given myriad VCF formats, it is NECESSARY TO SPECIFY THE FORMAT OF THE VCF SNP report using the --vcfformat argument. Format 4 is the most common. For details, see the section on VCF.

Contents

<a name="introduction"></a>Introduction

New applications of next-generation sequencing (NGS) use pooled samples containing DNA from multiple individuals to perform population genetic analyses. SNPGenie is a Perl program which analyzes the results of a single-nucleotide polymorphism (SNP) caller to calculate nucleotide diversity (including its nonsynonymous and synonymous partitions, π<sub>N</sub> and π<sub>S</sub>) and gene diversity. Variant calls are typically present in annotation tables and assume that the pooled DNA sample is representative of some population of interest. For example, if one is interested in determining the nucleotide diversity of a virus population within a single host, it might be appropriate to sequence the pooled nucleic acid content of the virus in a blood sample from that host. Comparing π<sub>N</sub> and π<sub>S</sub> for a gene product, or comparing gene diversity at polymorphic sites of different types, may help to identify genes or gene regions subject to positive (Darwinian) selection, negative (purifying) selection, or random genetic drift. SNPGenie also includes such features as minimum allele frequency trimming (see Options), and can be combined with upstream applications such as maximum-likelihood SNP calling techniques (e.g., see Lynch et al. 2014). For additional background, see Nelson & Hughes (2015) in the References.

<a name="snpgenie"></a>SNPGenie

<a name="snpgenie-input"></a>SNPGenie Input

SNPGenie is a command-line application written in Perl, with no additional dependencies beyond the standard Perl package (i.e., simply download and run). Input includes:

  1. One Reference Sequence file, containing one reference sequence, in FASTA format (.fa/.fasta);
  2. One file with CDS annotations in Gene Transfer Format (.gtf); and
  3. One or more tab-delimited (.txt) SNP Reports, each corresponding to a single pooled-sequencing run (i.e., sample from a population) in CLC, VCF, or Geneious formats, with variants called relative to the reference sequence.

All input files must be placed in the same directory. Then simply run the appropriate SNPGenie script from the command line (see Options if you wish for more control). To do this, first download the snpgenie.pl script and place it in your system’s $PATH, or simply in your working directory. Next, place your SNP report(s), FASTA (.fa/.fasta), and GTF (.gtf) files in your working directory. Open the command line prompt (or Terminal) and navigate to the directory containing these files using the "cd" command in your shell. Finally, execute SNPGenie by typing the name of the script and pressing the <RETURN> (<ENTER>) key:

snpgenie.pl

Further details on input and options are below.

<a name="ref-seq"></a>INPUT (1) One Reference Sequence File

Only one reference sequence must be provided in a single FASTA (.fa/.fasta) file (i.e., one file containing one sequence). Thus, all SNP positions in the SNP reports are called relative to the single reference sequence. Because of this one-sequence stipulation, a script has been provided to split a multi-sequence FASTA file into its constituent sequences for multiple or parallel analyses; see Additional Scripts below.

<a name="gtf"></a>INPUT (2) One Gene Transfer Format File

The Gene Transfer Format (.gtf) file is tab (\t)-delimited, and must include non-redundant records for all CDS elements (i.e., open reading frames, or ORFs) present in your SNP report(s). Thus, if multiple records exist with the same gene name (e.g., different transcripts), ONLY ONE may be included in your analysis at a time. The GTF should also include any ORFs which do not contain any variants (if they exist). SNPGenie expects every coding element to be labeled as type "CDS", and for its product name to follow a "gene_id" tag. In the case of CLC and Geneious SNP reports, this name must match that present in the SNP report. If a single coding element has multiple segments (e.g., exons) with different coordinates, simply enter one line for each segment, using the same product name; they will be concatenated. Finally, for sequences with genes on the reverse '-' strand, SNPGenie must be run twice, once for each strand, with the minus strand's own set of input files (i.e., the '-' strand FASTA, GTF, and SNP report); see A Note on Reverse Complement ('-' Strand) Records below. For more information about GTF, please visit <a target="_blank" href="http://mblab.wustl.edu/GTF22.html">The Brent Lab</a>. To convert a GFF file to GTF format, use a tool like <a target="_blank" href="https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread">gffread</a>. A short GTF example follows:

reference.gbk	CLC	CDS	5694	8369	.	+	0	gene_id "ORF1";
reference.gbk	CLC	CDS	8203	8772	.	+	0	gene_id "ORF2";
reference.gbk	CLC	CDS	1465	4485	.	+	0	gene_id "ORF3";
reference.gbk	CLC	CDS	5621	5687	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	7920	8167	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	5395	5687	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	7920	8016	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	4439	5080	.	+	0	gene_id "ORF6";
reference.gbk	CLC	CDS	5247	5549	.	+	0	gene_id "ORF7";
reference.gbk	CLC	CDS	4911	5246	.	+	0	gene_id "ORF8";

<a name="SNP-Reports"></a>INPUT (3) SNP Report File(s)

Each SNP report should contain variant calls for a single pooled-sequencing run (i.e., sample from population), or alternatively a summary of multiple individual sequences. Its site numbers should refer to the exact reference sequence present in the FASTA input. SNP reports must be provided in one of the following formats:

<a name="clc"></a>CLC Genomics Workbench

A <a target="_blank" href="http://www.clcbio.com/products/clc-genomics-workbench/">CLC Genomics Workbench</a> SNP report must include the following columns to be used with SNPGenie, with the unaltered CLC column headers:

In addition to containing the aforementioned columns, the SNP report should ideally be free of thousand separators (,) in the Reference Position, Count, and Coverage columns (default .txt output). The Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the CLC input file is correct.

<a name="geneious"></a>Geneious

A <a target="_blank" href="http://www.geneious.com/">Geneious</a> SNP report must include the following columns to be used with SNPGenie, with the unaltered Geneious column headers:

In addition to containing the aforementioned columns, the SNP report should ideally be free of extraneous characters such as thousand separators (,). The Variant Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the Geneious output is correct.

<a name="vcf"></a>Variant Call Format (VCF)

A <a target="_blank" href="https://github.com/samtools/hts-specs">VCF</a> SNP report must include the following columns, with the unaltered VCF column headers:

Because VCF files vary widely in format, SNPGenie now requires users to specify exactly which VCF format is being used with the --vcfformat argument. New formats are being added on a case-by-case basis; users should contact the author to have new formats incorporated. FORMAT 4 is the most common, and necessary if your file contains multiple <SAMPLE> columns. Note that these formats are specific to SNPGenie; they do not refer to VCF version numbers. All supported formats are:

  1. FORMAT (1): --vcfformat=1. Multiple individual genomes have been sequenced separately, with SNPs summarized in the VCF file. SNPGenie will require the following in the INFO column:

    • AN, the number of alleles sequenced (e.g., "AN=60"). If absent, will use NS, the number of samples (i.e., individual sequencing experiments) being summarized (e.g., "NS=30");
    • AF, the fractional (decimal) allele frequency(-ies) for the variant alleles in the same order as listed in the ALT column (e.g., "AF=0.200"). If multiple variants exist at the same site, their frequencies must be separated by commans (e.g., "AF=0.200,0.087").
  2. FORMAT (2): --vcfformat=2. Variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). SNPGenie will require the following in the INFO column:

    • DP, the coverage or total read depth (e.g., "DP=4249");
    • AF, the fractional (decimal) allele frequency(-ies) of the variant alleles in the same order as listed in the ALT column (e.g., "AF=0.01247"; "AF=0.01247,0.08956" for two variants; etc.).
  3. FORMAT (3): --vcfformat=3. Like format 2, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). Unlike format 2, SNPGenie will require the following in the INFO column:

    • DP4, containing the number of reference and variant reads on the forward and reverse strands in the format DP4=<num. fw ref reads>,<num. rev ref reads>,<num. fw var reads>,<num. rev var reads> (e.g., "DP4=11,9,219,38").
    • N.B.: if multiple single nucleotide variants exist at the same site, this VCF format does not specify the number of reads for each variant. SNPGenie thus approximates by dividing the number of non-reference reads evenly amongst the variant alleles.
  4. FORMAT (4): --vcfformat=4. THE MOST COMMON FORMAT. Like formats 2 and 3, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). For this format, SNPGenie requires AD and DP data in the FORMAT and <SAMPLE> columns, where FORMAT is the final column header before the <SAMPLE> column(s) begins. The order of the data keys in the FORMAT column and the data values in the <SAMPLE> columns must match.

    • The FORMAT column must include the AD (allele depth; read count) tag, which refers to the read depth of the reference and each variant allele in the <SAMPLE> column(s). Values for variant allele(s) must be in the same order as listed in the ALT column (e.g., "AD" in the FORMAT column and "75,77" in the <SAMPLE> column). Note that the first value in AD refers to the read count of the reference nucleotide, followed by any variant nucleotides. Thus, for sites with one SNP, this will contain two values, and the field for the sample should be <ref_count>,<alt_count>; for two SNPs, <ref_count>,<alt1_count>,<alt2_count>; and so on;
    • For coverage (total read depth), include DP in the FORMAT column, which refers to the total read depth in the <SAMPLE> column(s) (e.g., "DP" in the FORMAT column and "152" in the <SAMPLE> column). If the value of DP does not equal the sum of reads for all variant nucleotides (sum of AD values), the latter will be used.

As usual, you will want to make sure to maintain the VCF file's features, such as TAB(\t)-delimited columns. Unlike some other formats, the allele frequency in VCF is a decimal.

<a name="revcom"></a>A Note on Reverse Complement ('-' Strand) Records

Many large genomes have coding products on both strands. In this case, SNPGenie must be run twice: once for the '+' strand, and once for the '-' strand. This requires FASTA, GTF, and SNP report input for the '-' strand. I provide a script for converting input files to their reverse complement strand in the Additional Scripts below. Note that, regardless of the original SNP report format, the reverse complement SNP report is in a CLC-like format that SNPGenie will recognize. For both runs, the GTF should include all products for both strands, with products on the strand being analyzed labeled '+' and coordinates defined with respect to the beginning of that strand's FASTA sequence. Also note that a GTF file containing only '-' strand records will not run; SNPGenie does calculations only for the products on the current + strand, using the '-' strand products only to determine the number of overlapping reading frames for each variant.

<a name="options"></a>Options

To alter how SNPGenie works, the following options may be used:

For example, if you wanted to specify a minimum allele frequency of 1% and specify all three input files, you could run the following:

snpgenie.pl --minfreq=0.01 --snpreport=mySNPreport.txt --fastafile=myFASTA.fa --gtffile=myGTF.gtf

<a name="output"></a>Output

SNPGenie creates a new folder called SNPGenie_Results within the working directory. This contains the following TAB-delimited results files:

  1. SNPGenie_parameters.txt, containing the input parameters and file names.

  2. SNPGenie_LOG.txt, documenting any peculiarities or errors encountered. Warnings are also printed to the Terminal (shell) window.

  3. site_results.txt, providing results for all polymorphic sites. Note that, if the population is genetically homogenous at a site, even if it differs from the reference or ancestral sequence, it will not be considered polymorphic. Also keep in mind that columns are sorted by product first, then site number, with noncoding sites at the end of the file. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none on this strand.
    • site. The site coordinate of the nucleotide in the reference sequence.
    • ref_nt. The identity of the nucleotide in the reference sequence.
    • maj_nt. The most common nucleotide in the population at this site.
    • position_in_codon. If present in a CDS annotation, the position of this site within its codon (1, 2, or 3).
    • overlapping_ORFs. The number of additional CDS annotations overlapping this site. For example, if the site is part of only one open reading frame, the value will be 0. If the site is part of two open reading frames, the value will be 1.
    • codon_start_site. The site coordinate of the relevant codon's first nucleotide in the reference sequence.
    • codon. The identity of the relevant codon.
    • pi. Nucleotide diversity at this site.
    • gdiv. Gene diversity at this site.
    • class_vs_ref. This site's classification, as compared to the reference sequence. For example, if the site contains only one SNP, and that SNP is synonymous, the site will be classified as Synonymous. Nonsynonymous or Synonymous.
    • class. This site's classification as compared to all sequences present in the population. For example, if the population contains both A and G residues at the third site of a GAA (reference) codon, then the site will be Synonymous, because both GAA and GAG encode Glu. On the other hand, if the population also contains a C at this site, the site will be Ambiguous, because GAC encodes Asp, meaning both nonsynonymous and synonymous polymorphisms exist at the site. Nonsynonymous, Synonymous, or Ambiguous.
    • coverage. The NGS read depth at the site.
    • A. The number of reads containing an A (adenine) nucleotide at this site. N.B.: may be fractional if the coverage and variant frequency given in the SNP report do not imply a whole number.
    • C. For C (cytosine), as for A.
    • G. For G (guanine), as for A.
    • T. For T (thymine), as for A.
  4. codon_results.txt, providing results for all polymorphic sites. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none.
    • site. The site coordinate of the nucleotide in the reference sequence.
    • codon. The identity of the relevant codon.
    • num_overlap_ORF_nts. The number of nucleotides in this codon (up to 3) which overlap other ORFs (in addition to the current "product" annotation).
    • N_diffs. The mean number of pairwise nucleotide comparisons in this codon which are nonsynonymous (i.e., amino acid-altering) in the pooled sequence sample. The numerator of π<sub>N</sub>.
    • S_diffs. The mean number of pairwise nucleotide comparisons in this codon which are synonymous (i.e., amino acid-conserving) in the pooled sequence sample. The numerator of π<sub>S</sub>.
    • N_sites. The mean number of sites in this codon which are nonsynonymous, given all sequences in the pooled sample. The denominator of π<sub>N</sub> and mean d<sub>N</sub> versus the reference.
    • S_sites. The mean number of sites in this codon which are synonymous, given all sequences in the pooled sample. The denominator of π<sub>S</sub> mean d<sub>S</sub> versus the reference.
    • N_sites_ref. The number of sites in this codon which are nonsynonymous in the reference sequence.
    • S_sites_ref. The number of sites in this codon which are synonymous in the reference sequence.
    • N_diffs_vs_ref. This codon's mean number of nonsynonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean d<sub>N</sub> versus the reference.
    • S_diffs_vs_ref. This codon's mean number of synonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean d<sub>S</sub> versus the reference.
    • gdiv. Mean gene diversity (observed heterozygosity) for this codon's nucleotide sites.
    • N_gdiv. Mean gene diversity for this codon's nonsynonymous polymorphic sites.
    • S_gdiv. Mean gene diversity for this codon's synonymous polymorphic sites.
  5. product_results.txt, providing results for all CDS elements present in the GTF file for the '+' strand. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none.
    • N_diffs. The sum over all codons in this product of the mean number of pairwise nucleotide comparisons which are nonsynonymous (i.e., amino acid-altering) in the pooled sequence sample. The numerator of π<sub>N</sub>.
    • S_diffs. The sum over all codons in this product of the mean number of pairwise nucleotide comparisons which are synonymous (i.e., amino acid-conserving) in the pooled sequence sample. The numerator of π<sub>S</sub>.
    • N_diffs_vs_ref. The sum over all codons in this product of the mean number of nonsynonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean π<sub>N</sub> versus the reference.
    • S_diffs_vs_ref. The sum over all codons in this product of the mean number of synonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean π<sub>S</sub> versus the reference.
    • N_sites. The mean number of sites in this product which are nonsynonymous, given all sequences in the pooled sample. The denominator of π<sub>N</sub> and mean d<sub>N</sub> versus the reference.
    • S_sites. The mean number of sites in this product which are synonymous, given all sequences in the pooled sample. The denominator of π<sub>S</sub> and mean d<sub>S</sub> versus the reference.
    • piN. (π<sub>N</sub>.) The mean number of pairwise nonsynonymous differences per nonsynonymous site in this product.
    • piS. (π<sub>S</sub>.) The mean number of pairwise synonymous differences per synonymous site in this product.
    • mean_dN_vs_ref. The mean number of nonsynonymous differences from the reference per nonsynonymous site in this product.
    • mean_dS_vs_ref. The mean number of synonymous differences from the reference per synonymous site in this product.
    • mean_gdiv_polymorphic. Mean gene diversity (observed heterozygosity) at all polymorphic nucleotide sites in this product.
    • mean_N_gdiv. Mean gene diversity at all nonsynonymous polymorphic nucleotide sites in this product.
    • mean_S_gdiv. Mean gene diversity at all synonymous polymorphic nucleotide sites in this product.
  6. population_summary.txt, providing summary results for each population's sample (SNP report) with respect to the '+' strand. Columns are:

    • file. The SNP report analyzed.
    • sites. Total number of sites in the reference genome.
    • sites_coding. Total number of sites in the reference genome which code for a protein product on the analyzed '+' strand, given the CDS annotations in the GTF file.
    • sites_noncoding. Total number of sites in the reference genome which do not code for a protein product, given the CDS annotations in the GTF file.
    • pi. Mean number of pairwise differences per site in the pooled sample across the whole genome.
    • pi_coding. Mean number of pairwise differences per site in the pooled sample across all coding sites in the genome.
    • pi_noncoding. Mean number of pairwise differences per site in the pooled sample across all noncoding sites in the genome.
    • N_sites. The mean number of sites in the genome which are nonsynonymous, given all sequences in the pooled sample. The denominator of π<sub>N</sub> and mean d<sub>N</sub> versus the reference.
    • S_sites. The mean number of sites in the genome which are synonymous, given all sequences in the pooled sample. The denominator of π<sub>S</sub> and mean d<sub>S</sub> versus the reference.
    • piN. The mean number of pairwise nonsynonymous differences per nonsynonymous site across the genome of the pooled sample.
    • piS. The mean number of pairwise synonymous differences per synonymous site across the genome of the pooled sample.
    • mean_dN_vs_ref. The mean number of nonsynonymous differences from the reference per nonsynonymous site across the genome of the pooled sample.
    • mean_dS_vs_ref. The mean number of synonymous differences from the reference per synonymous site across the genome of the pooled sample.
    • mean_gdiv_polymorphic. Mean gene diversity (observed heterozygosity) at all polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_N_gdiv. Mean gene diversity at all nonsynonymous polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_S_gdiv. Mean gene diversity at all synonymous polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_gdiv. Mean gene diversity at all nucleotide sites in the genome of the pooled sample.
    • sites_polymorphic. The number of sites in the genome of the pooled sample which are polymorphic.
    • mean_gdiv_coding_poly. Mean gene diversity at all polymorphic nucleotide sites in the genome of the pooled sample which code for a protein product, given the CDS annotations in the GTF file.
    • sites_coding_poly. The number of sites in the genome of the pooled sample which are polymorphic and code for a protein product, given the CDS annotations in the GTF file.
    • mean_gdiv_noncoding_poly. Mean gene diversity at all polymorphic nucleotide sites in the genome of the pooled sample which do not code for a protein product, given the CDS annotations in the GTF file.
    • sites_noncoding_poly. The number of sites in the genome of the pooled sample which are polymorphic and do not code for a protein product, given the CDS annotations in the GTF file.
  7. sliding_window_length<Length>_results.txt, containing codon-based results over a sliding window, with a default length of 9 codons.

<a name="snpgenie-within"></a>SNPGenie Within-Group

The script snpgenie_within_group.pl estimates mean d<sub>N</sub> and d<sub>S</sub> within a group of aligned sequences in a single FASTA file. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the <a target="_blank" href="http://search.cpan.org/~dlux/Parallel-ForkManager-0.7.5/ForkManager.pm">Parallel::ForkManager</a> module. If this isn't already installed, please use CPAN to install it before running the script. <a target="_blank" href="http://www.cpan.org/modules/INSTALL.html">Click here</a> to learn how.

Note that within-group mean d<sub>N</sub> and d<sub>S</sub> are mathematically equivalent to π<sub>N</sub> and π<sub>S</sub> when groups contain sequences sampled from a single population.

Provide the script with the following arguments:

The format for using this script is:

    snpgenie_within_group.pl --fasta_file_name=<aligned_sequences>.fa --gtf_file_name=<coding_annotations>.gtf --num_bootstraps=<bootstraps> --procs_per_node=<procs_per_node>

For example, on a node with access to 64 processors in a high performance computing environment, you might submit a job including the following (DON'T run this from the head node!):

    snpgenie_within_group.pl --fasta_file_name=my_virus_genomes.fa --gtf_file_name=my_virus_genes.gtf --num_bootstraps=10000 --procs_per_node=64

Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).

To execute bootstrapping and/or sliding window analyses, run the processing script SNPGenie_sliding_windows.R (see Sliding Windows and Bootstrapping).

SNPGenie_sliding_windows.R replaces the bootstrapping and sliding window script snpgenie_within_group_processor.pl, which is no longer supported. Its (deprecated) description is below:

The format for using the script is:

snpgenie_within_group_processor.pl --codon_file=within_group_codon_results.txt --sliding_window_size=10 --sliding_window_step=1 --num_bootstraps=10000

A typical workflow thus includes one run of snpgenie_within_group.pl to obtain codon results, followed a run of SNPGenie_sliding_windows.R (previously snpgenie_within_group_processor.pl) for each coding product to obtain sliding window results with bootstrap results.

<a name="snpgenie-between"></a>SNPGenie Between-Group

The script snpgenie_between_group.pl estimates mean d<sub>N</sub> and d<sub>S</sub> between two or more groups of sequences in FASTA format. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the <a target="_blank" href="http://search.cpan.org/~dlux/Parallel-ForkManager-0.7.5/ForkManager.pm">Parallel::ForkManager</a> module. If this isn't already installed, please use CPAN to install it before running the script. <a target="_blank" href="http://www.cpan.org/modules/INSTALL.html">Click here</a> to learn how.

Just run the script in a directory containing two or more FASTA files, each containing a group of aligned sequences which are also aligned to one another across groups (files), ending with a .fa, .fas, or .fasta extension. In other words, each FASTA file contains one group. For example, one might wish to compare sequences of the West Nile Virus 2K gene derived from three different mosquito vector species. In such a case, all sequences would first be aligned. Next, all sequences from mosquito species 1 would be placed together in one FASTA file; all sequences from mosquito species 2 would be placed in a second FASTA file; and all sequences from mosquito species 3 would be placed in a third FASTA file.

SNPGenie automatically detects all FASTA files in the working directory. The GTF file (see Gene Transfer Format for details) must be specified at the command line, along with an optional the number of bootstraps and processes:

snpgenie_between_group.pl --gtf_file=<CDS_annotations>.gtf --num_bootstraps=<integer> --procs_per_node=<integer>

Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).

To execute bootstrapping and/or sliding window analyses, run the processing script SNPGenie_sliding_windows.R (see Sliding Windows and Bootstrapping).

SNPGenie_sliding_windows.R replaces the bootstrapping and sliding window script snpgenie_between_group_processor.pl, which is no longer supported. Its (deprecated) description is below:

The format for using the script is:

snpgenie_between_group_processor.pl --between_group_codon_file=<between_group_codon_results>.txt --sliding_window_size=10 --sliding_window_step=1 --codon_min_taxa_total=5 --codon_min_taxa_group=2 --num_bootstraps=10000 --procs_per_node=8

A typical workflow thus includes one run of snpgenie_between_group.pl to obtain codon results, followed a run of SNPGenie_sliding_windows.R (previously snpgenie_between_group_processor.pl) for each coding product to obtain sliding window results with bootstrap results.

<a name="sliding-windows-bootstrapping"></a>Sliding Windows and Bootstrapping

Bootstrapping and/or sliding window analyses can be computed for any codon results file produced by snpgenie.pl, snpgenie_within_group.pl, or snpgenie_between_group.pl. The codon results file must (1) contain one codon per line; (2) contain the columns N_diffs, N_sites, S_diffs, and S_sites; and (3) contain results for only one gene/product (no redundant codon numbers).

To run this analysis, use processing script SNPGenie_sliding_windows.R with the following unnamed arguments:

  1. CODON RESULTS FILE. The name or path of the codon_results.txt file produced using snpgenie.pl or snpgenie_within_group.pl. Must contain results for only one gene/product.
  2. NUMERATOR SITE TYPE. Usually 'N', for nonsynonymous (d<sub>N</sub>; π<sub>N</sub>).
  3. DENOMINATOR SITE TYPE. Usually 'S', for synonymous (d<sub>S</sub>; π<sub>S</sub>).
  4. SLIDING WINDOW SIZE (CODONS). Must be ≥2; ≥10 recommended.
  5. SLIDING WINDOW STEP SIZE (CODONS). Must be ≥1.
  6. NUMBER OF BOOTSTRAP REPLICATES PER WINDOW (OPTIONAL). Must be ≥2; DEFAULT is 1000.
  7. MINIMUM NUMBER OF DEFINED CODONS PER CODON POSITION (OPTIONAL). Must be ≥2; DEFAULT is 6. For snpgenie.pl results, a minimum read depth of 100 is assumed at all sites; it is incumbent on the user to check for sufficient coverage in their data, and to filter variants for false positives.
  8. MULTIPLE HITS CORRECTION (OPTIONAL). Must be "NONE" or "JC" (Jukes-Cantor); DEFAULT=NONE.
  9. NUMBER OF CPUS (OPTIONAL). Must be ≥1; DEFAULT=1.
  10. STRING TO PREPEND TO OUTPUT LINES (OPTIONAL). DEFAULT="" (empty string).

For example, to compute d<sub>N</sub>/d<sub>S</sub> or π<sub>N</sub>/π<sub>S</sub> in sliding windows of 40 codons with a step size of 1 codon; 1,000 bootstrap replicates per window; requiring a minimum of 100 defined codons (e.g., unambiguous reads) per position; using no multiple hits correction; and parallelizing over 6 CPUs, call the following:

SNPGenie_sliding_windows.R codon_results_oneProduct.txt N S 40 1 1000 100 NONE 6 > SNPGenie_sliding_windows_oneProduct.out

The output of this script provides the following TAB-delimited columns:

<a name="how-snpgenie-works"></a>How SNPGenie Works

SNPGenie calculates gene and nucleotide diversities for sites in a protein-coding sequence. Nucleotide diversity, π, is the average number of nucleotide variants per nucleotide site for all pairwise comparisons. To distinguish between nonsynonymous and synonymous differences and sites, it is necessary to consider the codon context of each nucleotide in a sequence. This is why the user must submit the starting and ending sites of the coding regions in the .gtf file, along with the reference FASTA sequence file, so that the numbers of nonsynonymous and synonymous sites for each codon may be accurately estimated by a derivation of the Nei-Gojobori (1986) method.

SNPGenie first splits the coding sequence into codons, each of which contains 3 sites. The software then determines which are nonsynonymous and synonymous by testing all possible polymorphisms present at each site of every codon in the sequence. Because different nucleotide variants at the same site may lead to both nonsynonymous and synonymous polymorphisms, fractional sites may occur (e.g., only 2 of 3 possible nucleotide substitutions at the third position of AGA cause an amino acid change; thus, that site is considered 2/3 nonsynonymous and 1/3 synonymous). Next, the SNP report is consulted for the presence of variants to produce a revised estimate of the number of sites based on the presence of variant alleles. Variants are incorporated through averaging, weighted by their frequency. Although it is relatively rare, high levels of sequence variation may alter the number of nonsynonymous and synonymous sites in a particular codon.

SNPGenie calculates the number of nucleotide differences for each codon in each ORF as follows. For every variant in the SNP Report, the number of variants is calculated as the product of the variant’s relative frequency and the coverage at that site. For each variant nucleotide (up to 3 non-reference nucleotides), the number of variants is stored, and their sum is subtracted from the coverage to yield the reference’s absolute frequency. Next, for each pairwise nucleotide comparison at the site, it is determined whether the comparison represents a nonsynonymous or synonymous change. If the former, the product of their absolute frequencies contributes to the number of nonsynonymous pairwise differences; if the latter, it contributes to the number of synonymous pairwise differences. When comparing codons with more than one nucleotide difference, all possible mutational pathways are considered, per the method of Nei & Gojobori (1986). The sum of pairwise differences is divided by the total number of pairwise comparisons at the codon (<sub>n</sub>C<sub>2</sub>, where n = coverage) to yield the mean number of differences per site of each type. This is calculated separately for nonsynonymous and synonymous comparisons. Calculating nucleotide diversity codon-by-codon enables sliding-window analyses that may help to pinpoint important nucleotide regions subject to varying forms of natural selection.

For further background, see Nelson & Hughes (2015) in the References.

<a name="additional-scripts"></a>Additional Scripts

Previous versions of the additional scripts, provided to help in the preparation of SNPGenie input, have moved to the <a target="_blank" href="https://github.com/chasewnelson/EBT">Evolutionary Bioinformatics Toolkit (EBT)</a> repository. Here, you can find the following scripts for use in converting files to the reverse complement strand, to be run as follows:

vcf2revcom.pl <file_name>.vcf <seq_length>

gtf2revcom.pl <file_name>.gtf <seq_length>

fasta2revcom.pl <seq>.fa

Tajima.D(1000, 15, 0.05)

<a name="troubleshooting"></a>Troubleshooting

<a name="citation"></a>Citation

When using this software, please refer to and cite:

Nelson CW, Moncla LH, Hughes AL (2015) <a target="_blank" href="http://bioinformatics.oxfordjournals.org/content/31/22/3709.long">SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data</a>. Bioinformatics 31(22):3709-11. doi: 10.1093/bioinformatics/btv449.

and this page:

https://github.com/chasewnelson/snpgenie

<a name="studies-using-snpgenie"></a>Studies Using SNPGenie

<a name="acknowledgments"></a>Acknowledgments

SNPGenie was written with support from a University of South Carolina Presidential Fellowship (2011-2015), a National Science Foundation (NSF) Graduate Research Fellowship (GRF; 2013-2016), and an NSF East Asian and Pacific Summer Institutes (EAPSI; 2013) Fellowship. SNPGenie has been further maintained by a Gerstner Scholars Fellowship from the Gerstner Family Foundation at the American Museum of Natural History (2016-2019). The logo image was designed by Elizabeth Ogle (2015).

<a name="contact"></a>Contact

If you have questions about using SNPGenie, please click on the "<a target="_blank" href="https://github.com/chasewnelson/snpgenie/issues">Issues</a>" tab at the top of this page and begin a new thread, so that others might benefit from the discussion.

Other correspondence should be addressed to Chase W. Nelson:

<a name="references"></a>References