Home

Awesome

INSTRUCTIONS

ABC -- (Allele-specific Binding from ChIP-Seq)

Identifies potential allele-specific binding events at known heterozygous positions within the aligned reads of a ChIP-Seq experiment. ABC requires at a minimum two (2) files a sorted BAM/SAM file of a ChIP-Seq experiment and a file containing the position, strand and allele information of heterozygous Single Nucleotide Variants (SNVs), either SNPs and/or Mutations. ABC calls allele-specific binding by identifying a bias in the distribution of the SNV alleles while attempting to control for potential false positives. If you have genomic sequence data you can use the allele ratio in the DNA as the expected frequency to control for chromosome copy number.

CAUTION

Care should be taken in the selection of SNVs. Homozygous SNVs or SNVs within duplications will appear to have strong allele-specific effects. It is recommended that you filter SNPs prior to running ABC (i.e., those mapping to a motif). In addition, ABC does not filter duplicated reads, therefore the user may wish to remove duplicated reads prior to running ABC.

RUNNING ABC

Usage: perl ABC.pl --align-file <input.sam> --snv-file <snv filename> --out <output filename>

--align-file (Required)		Specify the ChIP-Seq BAM/SAM file. Note: the BAM/SAM file must be sorted. If a BAM file is supplied ABC will create a temporary SAM file.

--bg (Optional)			Specify a .bedgraph file capturing the ChIP-Seq signal (shifted read pileups) of the BAM/SAM file.
				This can be useful to prioritize SNPs with low coverage, since they may fall within centre of the +ve and -ve strand peaks. 
				This phenomenon is caused by short reads and is not necessary for longer reads.

--snv-file (Required)		A tab-delimited text file containing the list of heterozygous SNPs. 
				Important: the CHR column should match the build used for the alignments (ie. chr# for hg19 or just the # for b37).

				The format of the SNV file is as follows (Do not include the header):

				SNV_ID	CHR	POSITION	STRAND REF_ALLELE	OBSERVED_ALLELES ALLELE_RATIO_DNA
       
				rs111	chr1	1111	+	A	A/C	0.5
				SNV1	chr10	10234	-	C	C/G	0.4

				It is the responsibility of the user to verify the quality of the SNVs (i.e., do they pass Hardy-Weinberg equilibrium, etc...).

--out (Optional)		Specify the output file prefix (default ABC).
				(i.e., ABC will create two output files ABC.dist and ABC.align)

--min-reads (Optional) 		The minimum number of reads covering the a SNVs (default: 25).
				(ie. ABC will report only those SNPs/Mutations with # reads overlapping them.)

--d (Optional)			Divide chromosomes into segments until reaching d lines for faster retrieval (default: 2000). 
				A large BAM/SAM file can take a long time to process. You may try to increase the value of d. 
				However, very large numbers will not necessarily increase speed. Consider splitting the alignment file by chromosome.

--mw-thres (Optional)		P-value threshold for the Mann-Whitney U test used to test a bias in the read position between the SNV alleles. (default: 0.05)
				To report all SNVs set this parameter to 0.

--f-thres (Optional)		P-value threshold for the Fisher's exact test used to test for a bias between the strand distribution of the SNV alleles. (default: 0.05)
       	     			To report all SNVs set this parameter to 0.

--min-mapq (Optional).
			 	Set the minimum allowed MAPQ score for each read (default: 0).
	
--verbose (Optional)		Print progress and SNV results summary to screen.	

--help				Prints command line options

VISUALIZING SNV RESULTS

Create a figure of the distribution of reads containing a SNV of interest

This step requires that the ABC has finished running. Once ABC is finished a figure can be generated by specifying the output file prefix used in the initial run and the SNV ID as follows:

Usage: perl ABC.pl --out <output filename prefix> --visualize-snv <SNV ID>

OUTPUT

A table of the results can be found in the output file with the .dist extension.

SNV			SNV Identifier
CHR			Chromosome
BP			Position on chromosome
REF			Reference allele
OBS			Observed alleles
A1			Allele 1
N_A1			Number of reads containing A1
F_A1			Frequency of A1
N_A1_POS		Number of reads aligned to the +ve strand containing A1
N_A1_NEG		Number of reads aligned to the -ve strand containing A1
A2			Allele 2 
N_A2			Number of reads containing A2
F_A2			Frequency of A2
N_A2_POS		Number of reads aligned to the +ve strand containing A2
N_A2_NEG		Number of reads aligned to the -ve strand containing A2
N_TOTAL	Total number of reads overlapping a SNV, including: N_ERRORS, N_OMITTED, MISSING_N
N_ERRORS		Number of reads containing a nucleotide other than A1 or A2
N_OMITTED			Number of reads where the SNV falls within a clipped or deleted region
MISSING_N		Number of reads containing an ambigous nucleotide N
MAX		The maximum value of from the .bedgraph file (if specified) 	within the 2x read length window centered on the SNV
P_BINOM		The P-value used to call allele-specific binding
P_MANN_WHIT	The P-value used to call a bias in read position between alleles 
P_FISHER	The P-value used to call a bias in the strand distribution between alleles
P_CHISQ	 	(Same as P_FISHER greater number of reads required)
P_STRAND		A binomial p-value of the differences in strand abundance 
A1_Position		Position of A1 within reads
A1_Strand		Strand of reads containing A1
A2_Position		Position of A2 within reads 
A2_strand		Strand of reads containing A2

The alignments separated by the alleles of each SNV can be viewed in the output file with the .align extension.

TUTORIAL (CASE EXAMPLES)

Setting up the software environment

Perl and R should be installed on the system; instructions regarding their installation can be found from http://www.perl.org/get.html and http://cran.r-project.org, respectively.

Additionally the perl module Statistics::R should be installed using the following command in the terminal:
	cpan Statistics::R
Further information on how to install Perl modules can be found here: http://www.cpan.org/modules/INSTALL.html

Download ABC and example data

First, our ABC tool and documentation are publicly available from https://github.com/mlupien/ABC.

Second, to test ABC users should download the test SAM file (ERR022033.sorted.sam) from http://www.pmgenomics.ca/lupienlab/tools.html (766MB compressed using gzip). The file is a 
sorted SAM file of the aligned ChIP-Seq reads for the FOXA1 in MCF7 cells.

Decompress ERR022033.sorted.sam.gz using gunzip or other related tools. 

The FOXA1 test data can be generated following these steps. 

1.	The ChIP-Seq data can be downloads here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM631471.

2.	The ERR022033.sra can be converted to a .fastq file with the sratoolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) using fastq-dump. 

	fastq-dump ERR022033.sra

3.	The resulting ERR022033.fastq file can be aligned to the reference genome using Bowtie2 (Langmead and Salzberg, 2012) and converted to a .bam file with samtools (Li, et al., 2009).

	bowtie2 –x /path to index/hg19 –U ERR022033.fastq | samtools view –bS - > ERR022033.bam

4.	The .bam file can be easily sorted with samtools and converted back to a .sam file.

	samtools sort ERR022033.bam ERR022033.sorted
	samtools view -h -o ERR022033.sorted.sam ERR022033.sorted.bam

Third, users should create a tab-delimited text file containing the SNV information. In this case study, we have created ABC_SNPs.txt, available at
http://www.pmgenomics.ca/lupienlab/tools.html, which contains the following line once decompressed using gunzip or other related tools.

	rs4784227	chr16	52599188	+	C	C/T	0.5

Finally ABC can be run using the following command in the terminal.

	perl ABC.pl --snv-file ABC_SNPs.txt -–align-file ERR022033.sorted.sam --out ABC_SNPs 

The results can be visualized (limited to one SNP at a time) with the following command.

	perl ABC.pl --out ABC_SNPs --visualize-snv rs4784227

ABC will output a .pdf file (rs4784227.ABC.SNPs.align.pdf) of the read distributions for the specified SNV