Home

Awesome

WARNING: If you see the following error, then downgrade your spp to 1.14 or 1.15.x.

Error in lwcc(x, y, s, e, return.peaks = return.peaks, bg.x = bg.x, bg.y = bg.y,  :
  INTEGER() can only be applied to a 'integer', not a 'character'

Summary

This package computes informative enrichment and quality measures for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data. It can also be used to obtain robust estimates of the predominant fragment length or characteristic tag shift values in these assays.

Introduction

This set of programs operate on mapped Illumina single-end read datasets in tagAlign or BAM format. They can be used to

  1. Compute the predominant insert-size (fragment length) based on strand cross-correlation peak
  2. Compute Data quality measures based on relative phantom peak
  3. Call Peaks and regions for punctate binding datasets

Citations

If you are using the code or results in any formal publication please cite

[1] Landt SG1, Marinov GK, Kundaje A et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31. doi: 10.1101/gr.136184.111.

[2] Kharchenko PK, Tolstorukov MY, Park PJ, Design and analysis of ChIP-seq experiments for DNA-binding proteins Nat Biotechnol. 2008 Dec;26(12):1351-9

Dependencies

NOTE: The current package does not run on a MacOS or Windows.

Files

  1. spp_1.14.tar.gz : modified SPP peak-caller package (The original SPP-peak caller package was written by Peter Kharchenko [2], https://github.com/hms-dbmi/spp)
  2. run_spp.R : The script to compute the frag length, data quality characteristics based on cross-correlation analysis and/or peak calling

Installation

  1. First make sure that you have installed R (version 3.1 or higher)

  2. Also, you must have the Boost C++ libraries installed. Most linux distributions have these preinstalled. If not, you can easily get these from your standard package manager for your linux distribution. e.g synaptic package manager (apt-get) for ubuntu or emerge for gentoo.

  3. Clone the repo.

    $ git clone https://github.com/kundajelab/phantompeakqualtools
    
  4. Install the following R packages

    • snow (if you want parallel processing)
    • snowfall
    • bitops
    • caTools
    • Rsamtools (in the Bioconductor package)
    $ cd phantompeakqualtools
    $ R
    (From within R)
    > install.packages("snow", repos="http://cran.us.r-project.org")
    > install.packages("snowfall", repos="http://cran.us.r-project.org")
    > install.packages("bitops", repos="http://cran.us.r-project.org")
    > install.packages("caTools", repos="http://cran.us.r-project.org")
    > source("http://bioconductor.org/biocLite.R")
    > biocLite("Rsamtools",suppressUpdates=TRUE)
    > install.packages("./spp_1.14.tar.gz")
    
  5. If your alignment files are BAM, you must have the samtools executable in your path so that the R script run_spp.R can call it using the system() command You can get samtools from (here)[http://samtools.sourceforge.net/] You can add the following line to your ~/.bashrc file

    export PATH="<path_to_samtools_executable>:${PATH}"
    
  6. Run run_spp.R

    Rscript run_spp.R <options>
    

General usage

Usage: Rscript run_spp.R <options>

Typical usage

  1. Determine strand cross-correlation peak / predominant fragment length OR print out quality measures. -out=<outFile> will create and/or append to a file named <outFile> several important characteristics of the dataset.

    Rscript run_spp.R -c=<tagAlign/BAMfile> -savp -out=<outFile>
    

    The file contains 11 tab delimited columns.

    col.abbreviationdescription
    1FilenametagAlign/BAM filename
    2numReadseffective sequencing depth i.e. total number of mapped reads in input file
    3estFragLencomma separated strand cross-correlation peak(s) in decreasing order of correlation.
    4corr_estFragLencomma separated strand cross-correlation value(s) in decreasing order (COL2 follows the same order)
    5phantomPeakRead length/phantom peak strand shift
    6corr_phantomPeakCorrelation value at phantom peak
    7argmin_corrstrand shift at which cross-correlation is lowest
    8min_corrminimum value of cross-correlation
    9NSCNormalized strand cross-correlation coefficient (NSC) = COL4 / COL8
    10RSCRelative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
    11QualityTagQuality tag based on thresholded RSC (codes= -2:veryLow, -1:Low, 0:Medium, 1:High, 2:veryHigh)

    The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output. In almost all cases, the top (first) value in the list represents the predominant fragment length. If you want to keep only the top value simply run

    sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
    

    You can run the program on multiple datasets in parallel and append all the quality information to the same <outFile> for a summary analysis.

    NSC values range from a minimum of 1 to larger positive numbers. 1.1 is the critical threshold. Datasets with NSC values much less than 1.1 (< 1.05) tend to have low signal to noise or few peaks (this could be biological eg.a factor that truly binds only a few sites in a particular tissue type OR it could be due to poor quality)

    RSC values range from 0 to larger positive values. 1 is the critical threshold. RSC values significantly lower than 1 (< 0.8) tend to have low signal to noise. The low scores can be due to failed and poor quality ChIP, low read sequence quality and hence lots of mismappings, shallow sequencing depth (significantly below saturation) or a combination of these. Like the NSC, datasets with few binding sites (< 200) which is biologically justifiable also show low RSC scores.

    Qtag is a thresholded version of RSC.

  2. Peak calling

    Rscript run_spp.R -c=<ChIP_tagalign/BAM_file> -i=<control_tagalign/BAM_file> -fdr=<fdr> -odir=<peak_call_output_dir> -savr -savp -savd -rf
    Rscript run_spp.R -c=<ChIP_tagalign/BAM_file> -i=<control_tagalign/BAM_file> -npeak=<npeaks> -odir=<peak_call_output_dir> -savr -savp -savd -rf
    
  3. For IDR analysis you want to call a large number of peaks (relaxed threshold) so that the IDR model has access to a sufficient noise component.

    Rscript run_spp.R -c=<ChIP_tagalign/BAM_file> -i=<control_tagalign/BAM_file> -npeak=300000 -odir=<peak_call_output_dir> -savr -savp -rf -out=<resultFile>
    

Notes

Input file formats

  1. BAM format: This is a binary alignment format specified in http://samtools.sourceforge.net/SAM-1.3.pdf You MUST have samtools installed to use run_spp.R with BAM files

  2. TagAlign files: This a text-based BED3+3 alignment format that is easier to manipulate. It contains 6 tab delimited columns.

    col.abbrv.typedescription
    1chromstringName of the chromosome
    2chromStartintThe starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
    3chromEndintThe ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
    4sequencestringSequence of this read
    5scoreintIndicates uniqueness or quality (preferably 1000/alignmentCount).
    6strandcharOrientation of this read (+ or -)

NOTE: You dont have to store the sequence of reads in the sequence field as the peak caller never really uses that field. You can just put the letter 'N' in that field. This saves space significantly.

For the IDR rescue strategy, one needs to use shuffled and subsampled version of the alignment files. This is much easier to implement on tagAlign text files using the unix shuf command. So it is recommended to preferably use the tagAlign format.

Converting BAM to TAGALIGN files

It is very quick to convert pre-filtered BAM files (after removing unmapped, low quality reads, multimapping reads and duplicates) to gzipped tagAlign files using

samtools view -F 0x0204 -o - <bamFile> | awk 'BEGIN{OFS="\t"}{if (and($2,16) > 0) {print $3,($4-1),($4-1+length($10)),"N","1000","-"} else {print $3,($4-1),($4-1+length($10)),"N","1000","+"} }' | gzip -c > <gzip_TagAlignFileName>

Output file formats

  1. NarrowPeak/RegionPeak format: The output peak file is in BED6+4 format known as tagAlign. It consists of 10 tab-delimited columns

    col.abbrv.typedescription
    1chromstringName of the chromosome
    2chromStartintThe starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
    3chromEndintThe ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
    4namestringName given to a region (preferably unique). Use '.' if no name is assigned.
    5scoreintIndicates how dark the peak will be displayed in the browser (1-1000). If '0', the DCC will assign this based on signal value. Ideally average signalValue per base spread between 100-1000.
    6strandchar+/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
    7signalValuefloatMeasurement of overall (usually, average) enrichment for the region.
    8pValuefloatMeasurement of statistical signficance (-log10). Use -1 if no pValue is assigned.
    9qValuefloatMeasurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
    10peakintPoint-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

Releted pipelines and tools

References

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Landt SG*, Marinov GK*, Kundaje A*, Kheradpour P*, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C, Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu XS, Ma L, Milosavljevic A, Myers RM, Park PJ, Pazin MJ, Perry MD, Raha D, Reddy TE, Rozowsky J, Shoresh N, Sidow A, Slattery M, Stamatoyannopoulos JA, Tolstorukov MY, White KP, Xi S, Farnham PJ, Lieb JD, Wold BJ, Snyder M. Genome Res. 2012 Sep;22(9):1813-31. doi: 10.1101/gr.136184.111.

Contributors