Home

Awesome

SUPPA2

Fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Publications:

Here we provide an extensive explanation about how SUPPA works and all the running options. Read our short SUPPA tutorial on an example dataset.


Table of Contents



Overview


SUPPA is a flexible and powerful tool to study splicing at the transcript isoform or at the local alternative splicing event level, across multiple conditions, and at high speed and accuracy. SUPPA has various modular operations that can be run separately to:

Slide1.jpg

Fig 1. SUPPA generates the alternative splicing events from an input annotation file (GTF format). The method reads transcript and gene information solely from the "exon" lines in the GTF. It then generates the events and outputs an ioe file, which contains the for each event the transcripts that describe either form of the event. Specifically, it provides the transcripts that contribute to the numerator (one form of the event) and the denominator (both forms of the event) of the PSI calculation. For the generation of PSI values, SUPPA reads the ioe file generated in the previous step and a transcript expression file with the transcript abundances to calculate the PSI value for each of the events.

figure2_for_readme_bitbucket.jpg

Fig 2. (a) SUPPA calculates the magnitude of splicing change (ΔPSI) (for events or transcripts) and its significance across multiple biological conditions, using two or more replicates per condition. Conditions are analyzed in a sequential order specified as input. (b) Statistical significance is calculated by comparing the observed ΔPSI between conditions with the distribution of the ΔPSI between replicates as a function of the expression of the transcripts defining the events (for events) or as a functon of the gene expression (for transcripts). When there is a large (>10) number of replicates per condition, you can also run SUPPA with a classical statistical test (Wilcoxon) per local event or per transcsript (not shown in the figure). (c) Using the output from the differential splicing analysis, SUPPA can cluster events or transcripts with similar splicing patterns across conditions using a density-based clustering algorithm.

We provide below detailed information on how to install and run SUPPA. Please join the SUPPA google-group for sharing your thoughts and questions (suppa-users@googlegroups.com).


Installation


SUPPA has been developed in Python 3.4.

If necessary, to install python3 we recommend to download from the official site https://www.python.org/downloads/ the corresponding version for your OS.

A installation using pip is available using the next command:

pip install SUPPA==2.3

By default SUPPA is installed into the Python package library directory. The following command can be executed to obtain the directory location:

pip show SUPPA 

SUPPA is ready to use. Once downloaded, it can be used directly from the command line by specifying the absolute path to the SUPPA executable (suppa.py).

Important: we have discontinued the conda download, as it is out of sync with the github. We recommend directly copying the code from github. The code is ready to use without any additional steps.

Alternatively, you can obtain the code for the SUPPA v2.4 latest release here: https://github.com/comprna/SUPPA/releases/tag/v2.4


Command and subcommand structure


SUPPA works with a command/subcommand structure:

suppa.py subcommand options

where the subcommand can be one of these five:

Note: Unless suppa was installed using a dependency manager (i.e. suppa would just be in the PATH), you can run suppa as:

python3.4 suppa.py subcommand options

Generation of transcript events and local alternative splicing events


SUPPA can work with local alternative splicing events or with transcripts "events" per gene. The local alternative splicing events are standard local splicing variations (see below), whereas a transcript event is an isoform-centric approach, where each isoform in a gene is described separately.

SUPPA generates the AS events or transcript events from an input annotation file (GTF format). The method reads transcript and gene information solely from the "exon" lines. It then generates the events and outputs an event file: .ioe format for local AS events, and .ioi format for transcripts.

The ioe file provides for each AS event in a gene, the transcripts that describe either form of the event. Specifically, it provides the transcripts that contribute to the numerator (one form of the event) and the denominator (both forms of the event) of the PSI calculation.

The ioi file provides for each transcript in a gene, the set of all transcripts from that gene from which the transcript relative abundance is calculated. Transcript events are important as they can describe splicing variations that are complex and cannot be encapsulated in a simple event (see e.g. Sebestyen et al. 2015 https://www.ncbi.nlm.nih.gov/pubmed/25578962).

Different local event types generated by SUPPA:

Events_legend1.jpg

Events_legend2.jpg

Fig 3. The figure describes the nomenclature for events in the forward (upper panel) and reverse (lower panel) strands. Each event is identified uniquely by a set of coordinates: The start (s) and end (e) coordinates for the different exonic regions involved in the event. The external coordinates of the event are only used for the RI, AF and AL events. The form of the alternative splicing event that includes the region in black is the one for which the relative inclusion level (PSI) is given. The gray area denotes the other alternative form of the event. For instance, for RI the inclusion level is given for the form that retains the intron. Important: for the non-symmetrical events note that the meaning of the coordinates varies depending on the strand

Slide2.jpg

Fig 4. In case the option for variable boundaries is used -b V (see below), an user input variability (detault: 10nt) is allowed in some of the boundaries (indicated with grey arrows). For some even types there will be less coordinates defining the event, as some boundaries are not tested any more. The variable boundaries (grey arrows) allow incorporating other transcripts contributing to the event, and therefore mimicking more closely the PSI calculate from RT-PCR primers. The events are still reported with the indicated coordinates. Those events that are redundant in this new setting will have the same transcripts contributing to the inclusion and skipping forms, however, they will be still reported in separate lines. As before, the form of the alternative splicing event that includes the region in black is the one for which the relative inclusion level or PSI is given, whereas the gray area denotes the other alternative form of the event.

Input files

An annotation file in GTF format is required (see e.g. http://mblab.wustl.edu/GTF22.html):

chr14 Ensembl exon  73741918  73744001  0.0 - . gene_id "ENSG00000000001"; transcript_id "ENST00000000001.1"; 
chr14 Ensembl exon  73749067  73749213  0.0 - . gene_id "ENSG00000000001"; transcript_id "ENST00000000001.1";  
chr14 Ensembl exon  73750789  73751082  0.0 - . gene_id "ENSG00000000001"; transcript_id "ENST00000000001.1"; 
chr14 Ensembl exon  73753818  73754022  0.0 - . gene_id "ENSG00000000001"; transcript_id "ENST00000000001.1"; 

The generateEvents operation uses the lines where the feature (column 3) is "exon". It then reads the different transcripts and genes. For that purpose "gene_id" and "transcript_id" tags are required in the attributes field (column 9). For local AS events, this command also generates a GTF file with the calculated events and with a track header ready to be uploaded into the UCSC browser for visualization (see below).

Command and options

To generate the events from the GTF file one has to run the following command:

python3.4 suppa.py generateEvents [options]

List of options available:

The command line to generate local AS events will be of the form:

python3.4 suppa.py generateEvents -i <input-file.gtf> -o <output-file> -f ioe -e <list-of-events>

The command to generate the transcript "events" would be of the form:

python3.4 suppa.py generateEvents -i <input-file.gtf> -o <output-file> -f ioi 

Output files

For transcripts, the generateEvents operation outputs one single file: An ioi file that shows the transcript "events" in each gene. This is a tab separated file with the following fields

ioi


  1. seqname: field 1 from the input GTF file of the generateEvents operation (generally the chromosome name)

  2. gene_id: ID of the gene where the event appears taken from the GTF file. (comma separated list in case of using the --pool-genes option)

  3. event_id: ID of the event, formatted as gene_id;transcript_id

  4. transcript_id: ID of the transcript that defines the event, for which the relative inclusion (PSI) is calculated.

  5. Total transcripts: IDs of the all transcripts in the gene (including the transcript in 4.)

Note that we call a transcript "event", a transcript in a gene.

An example of an ioi file is the following one:

seqname	gene_id	event_id	inclusion_transcripts	total_transcripts
chr14   ENSG00000133961.15      ENSG00000133961.15;ENST00000556772.1    ENST00000556772.1       ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000355058.3,ENST00000359560.3,ENST00000555394.1,ENST00000454166.4,ENST00000544991.3,ENST00000560335.1,ENST00000559312.1,ENST00000554521.2,ENST00000555738.2,ENST00000535282.1,ENST00000554014.2,ENST00000553997.1,ENST00000557486.1,ENST00000555859.1,ENST00000555307.1,ENST00000554394.1,ENST00000554315.1,ENST00000556989.1,ENST00000555987.1,ENST00000556112.1,ENST00000557774.1,ENST00000554818.1,ENST00000557031.1,ENST00000556700.1,ENST00000556600.1,ENST00000553415.1,ENST00000557577.1,ENST00000557581.1
chr14   ENSG00000133961.15      ENSG00000133961.15;ENST00000554818.1    ENST00000554818.1       ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000355058.3,ENST00000359560.3,ENST00000555394.1,ENST00000454166.4,ENST00000544991.3,ENST00000560335.1,ENST00000559312.1,ENST00000554521.2,ENST00000555738.2,ENST00000535282.1,ENST00000554014.2,ENST00000553997.1,ENST00000557486.1,ENST00000555859.1,ENST00000555307.1,ENST00000554394.1,ENST00000554315.1,ENST00000556989.1,ENST00000555987.1,ENST00000556112.1,ENST00000557774.1,ENST00000554818.1,ENST00000557031.1,ENST00000556700.1,ENST00000556600.1,ENST00000553415.1,ENST00000557577.1,ENST00000557581.1
chr14   ENSG00000133961.15      ENSG00000133961.15;ENST00000556600.1    ENST00000556600.1       ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000355058.3,ENST00000359560.3,ENST00000555394.1,ENST00000454166.4,ENST00000544991.3,ENST00000560335.1,ENST00000559312.1,ENST00000554521.2,ENST00000555738.2,ENST00000535282.1,ENST00000554014.2,ENST00000553997.1,ENST00000557486.1,ENST00000555859.1,ENST00000555307.1,ENST00000554394.1,ENST00000554315.1,ENST00000556989.1,ENST00000555987.1,ENST00000556112.1,ENST00000557774.1,ENST00000554818.1,ENST00000557031.1,ENST00000556700.1,ENST00000556600.1,ENST00000553415.1,ENST00000557577.1,ENST00000557581.1
chr14   ENSG00000133961.15      ENSG00000133961.15;ENST00000557774.1    ENST00000557774.1       ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000355058.3,ENST00000359560.3,ENST00000555394.1,ENST00000454166.4,ENST00000544991.3,ENST00000560335.1,ENST00000559312.1,ENST00000554521.2,ENST00000555738.2,ENST00000535282.1,ENST00000554014.2,ENST00000553997.1,ENST00000557486.1,ENST00000555859.1,ENST00000555307.1,ENST00000554394.1,ENST00000554315.1,ENST00000556989.1,ENST00000555987.1,ENST00000556112.1,ENST00000557774.1,ENST00000554818.1,ENST00000557031.1,ENST00000556700.1,ENST00000556600.1,ENST00000553415.1,ENST00000557577.1,ENST00000557581.1

For local AS events, the generateEvents operation outputs two files:

  1. An ioe file that shows the relationship between each event and the transcripts that define that particular event.

  2. A GTF file (with a track header) to load into the UCSC genome browser to visualize the different local AS events.

The name of the output is generated as follows:

<output-file>.<event-type>.ioe/gtf

<output-file>: -o option specified when launching the program.

<event_type>: a two letter code referring to the event type (SE, A3, A5, MX, RI, AF, AL).

The ioe file has the following fields:

ioe


  1. seqname: field 1 from the input GTF file of the generateEvents operation (generally the chromosome name)

  2. gene_id: ID of the gene where the event appears. (comma separated list in case of using the --pool-genes option)

  3. event_id: ID of the event.

  4. Inclusion transcripts: IDs of the transcripts that define the form of the event for which we calculate the PSI (e.g. exon inclusion) and contribute to the numerator of the PSI formula. For more details see Figures above.

  5. Total transcripts: IDs of the all transcripts that define either of the two forms of the event (e.g. inclusion and skipping) and contribute to the denominator of the PSI formula. For more details see Figures above.

Event ID: The event_id is formatted as follows:

<gene_id>;<event-type>:<seqname>:<coordinates-of-the-event>:<strand>

where:

Inclusion transcripts: The IDs must be the same as those provided in the expression (TPM) files. These transcripts define the inclusion form of the event. This form is chosen according to the convention described above in Figures 3 and 4:

Total transcripts: The IDs must be the same as those provided in the expression (TPM) files. Total transcripts are the transcripts that define both forms of the event and therefore contribute to the denominator of the PSI formula.

An example of an ioe file is the following one:

seqname	gene_id	event_id	inclusion_transcripts	total_transcripts
chr14	ENSG00000133961.15	ENSG00000133961.15;SE:chr14:73763993-73789838:73789937-73822334:-	ENST00000556112.1    ENST00000556112.1,ENST00000555859.1
chr14	ENSG00000133961.15	ENSG00000133961.15;SE:chr14:73876776-73925740:73926011-73929235:-	ENST00000553415.1    ENST00000553415.1,ENST00000556700.1
chr14	ENSG00000133961.15	ENSG00000133961.15;SE:chr14:73744001-73745989:73746132-73749067:-	ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000359560.3,ENST00000355058.3,ENST00000535282.1   ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000359560.3,ENST00000355058.3,ENST00000535282.1,ENST00000554546.1,ENST00000356296.4,ENST00000555394.1,ENST00000454166.4,ENST00000560335.1,ENST00000555738.2,ENST00000554014.2
chr14	ENSG00000133961.15	ENSG00000133961.15;SE:chr14:73749213-73750789:73751082-73753818:-	ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000359560.3,ENST00000355058.3,ENST00000555394.1,ENST00000535282.1,ENST00000553997.1,ENST00000557486.1    ENST00000554546.1,ENST00000356296.4,ENST00000557597.1,ENST00000555238.1,ENST00000556772.1,ENST00000359560.3,ENST00000355058.3,ENST00000555394.1,ENST00000535282.1,ENST00000553997.1,ENST00000557486.1,ENST00000454166.4,ENST00000560335.1,ENST00000555738.2

Important: --pool-genes

This option is important when creating ioe/ioi from annotations that are not loci-based, e.g.: RefSeq and UCSC genes. Unlike Ensembl or Gencode, which annotate gene loci, i.e. a set of transcripts will be uniquely be related to a gene at a locus, other annotations, like UCSC and Refseq dowloaded from UCSC, do not have this unequivocal link of transcripts to a genomic locus.

We thus re-cluster transcripts according to their overlap in genomic extent on the same strand, and according to them sharing at least one splice-site position in the genome. This method borrows from the (pre-Gencode) Ensembl function to create genes from transcripts (see Curwen et al. 2004 https://www.ncbi.nlm.nih.gov/pubmed/15123590). If transcripts are not appropriately clustered, transcript relative abundances or event inclusion levels may not make sense.

Using the --pool-genes option is also advisable to use with Ensembl and Gencode. The annotation contains genes with overlapping transcripts that share a great deal of sequence, hence their relative contribution to alternative splicing events should be taken into account.

GTF for local events

This output file is aimed for visualization and it is a regular gtf in half-open coordinate convention with a track header to be directly uploaded in UCSC. In this file, each of the two possible forms of an event is described as a separated transcript_id. Moreover, the gene_id is defined as the event_id. Note that SUPPA generally operates with closed coordinates.

The -l | --exon-length option of the eventGenerator operation defines the length used to display the constitutive exons in the GTF file and it is used only for visualization purposes. For instance, in a Skipping exon event the flanking exons will be shown with a length given by this parameter.


PSI calculation for Transcripts and Events


SUPPA reads the ioi or ioe file generated in the previous step and a transcript expression file with the transcript abundances (TPM units) to calculate the relative abundance (PSI) value per sample for each transcript or each local event.

Input files

To calculate the PSI for local events, the required input files are an ioe file and a transcript expression file.

However, to calculate the PSI per isoform, the required input files are an GTF file and a transcript expression file.

The transcript expression file is a tab separated file where each line provides the estimated abundance of each transcript (in TPM units). This file might contain multiple columns with the expression values in different samples. The expression file must have a header with the naming of the different expression fields, i.e., the sample name of each expression value.

An example of a transcript expression file for one single sample:

sample1
transcript1 <expression>
transcript1 <expression>
transcript1 <expression>

A transcript expression file with multiple samples:

sample1 sample2 sample3 sample4
transcript1 <expression>  <expression>  <expression>  <expression>
transcript2 <expression>  <expression>  <expression>  <expression>
transcript3 <expression>  <expression>  <expression>  <expression>

Note: these files have a header with only the sample names (1 less column)

PSI per transcript Isoform

At the moment the PSI per transcript isoform is calculated in the following way:

python3.4 suppa.py psiPerIsoform [options]

List of options available:

An example of the usage of the program is:

python3.4 suppa.py psiPerIsoform -g <gtf-file> -e <expression-file> -o <output-file>

PSI per local event

To calculate the PSI value for each event from the ioe and the transcript expression file one has to run the following command:

python3.4 suppa.py psiPerEvent [options]

List of options available:

An example of the usage of the program is:

python3.4 suppa.py psiPerEvent --ioe-file <ioe-file> --expression-file <expression-file> -o <output-file> 

Output files

These operations generate a psi file per each expression file. This file follows a similar format to the input expression file used in the psiPerEvent operation, where in the first column you find the event ID (transcript event or local AS event ID), and the following columns you find the PSI values in each sample. The PSI values are normalized between 0 and 1 ([0,1]). There are two exceptions for the PSI value:

Example of a PSI file for transcript "events":

sample1 sample2 sample3
ENSG00000000003.10;ENST00000373020.4	0.9891352763622773      0.8702144635757746      0.9758296160419822
ENSG00000000003.10;ENST00000494424.1	0.012124454316514821    0.01863029006342797     0.014658155865899102
ENSG00000000003.10;ENST00000496771.1	0.9067534025774345      0.9116862902516033      0.911059790520273

Example of a PSI file for local AS events:

sample1 sample2 sample3
ENSG00000000003.14;SE:chrX:100630866-100632485:100632568-100633405:-    0.9420871643913276      0.9304369364607596	0.8999960720735679
ENSG00000000419.12;SE:chr20:50940933-50941105:50941209-50942031:-       0.02302214526941425     0.08920651168486651	0.01093999389147593

Combining multiple expression files


Transcript expression files used with SUPPA typically come from calculations with multiple samples. To facilitate the generation of a single file with all the transcript expression values for all samples, SUPPA distribution includes a program to combine multiple simple transcript expression files into one single file:

python3.4 suppa.py joinFiles [options]

where the options are:

We show below an example of the usage of the program for reading multiple output files from Sailfish to join together the 3rd column, given that all files have in the first column the transcript ids (which are kept for the output):

python3.4 suppa.py joinFiles -f tpm -i sample1.tpm sample2.tpm sample3.tpm -o all_samples_tpms  

The output will look like an expression file with multiple files as described above.

Important: Right now this method assumes that the individual files have in the first column the IDs that are common across all files, and in the second column the PSI or TPM values to be joined. This will be changed in the future to a more flexible method so that generic tab separated files can be used directly.


Differential splicing analysis for transcripts and local events


SUPPA reads the PSI for the (transcripts or local) events and the transcript expression values from multiple samples, grouped by condition, and the ioe/ioi file, to calucate the events that are differentially spliced between a pair of conditions. SUPPA can test multiple conditions with a variabe number of samples per condition.

Input files

The ioe/ioi file, the PSI files and the transcript expression files are required.

The PSI and transcript expression files are both tab separated files, where each line provides the (transcript or local) event PSI values and the estimated abundance of each transcript, respectively. These files must contain multiple columns with the PSI values and expression values in the different samples, always keeping the same order of samples in both files.

The PSI and transcript expression files must have a header with the sample names. For instance, the PSI file could look like this:

sample1 sample2 sample3 sample4
event1  <psi_value> <psi_value> <psi_value> <psi_value>
event2  <psi_value> <psi_value> <psi_value> <psi_value>
event3  <psi_value> <psi_value> <psi_value> <psi_value>

and the transcript expression file should be then of the form:

sample1 sample2 sample3 sample4
transcript1 <expression>  <expression>  <expression>  <expression>
transcript2 <expression>  <expression>  <expression>  <expression>
transcript3 <expression>  <expression>  <expression>  <expression>

where the expression values are given in TPM units.

Note: these files have a header with only the sample names (1 less column)

Important: SUPPA will read one PSI file and one transcript expression file per condition. Each of these files will contain the multiple replicates (or individual samples) that are grouped into a given condition, and should be specified in the same order within the files. Additionally, you should have the same IDs in both conditions.

Command and options

To calculate the dpsi from the ioe, psi and the expression file one has to run the following command:

python3.4 suppa.py diffSplice [options]

List of options available:

Differential transcript usage

An example of the usage of the program with transcripts is, indicating that replicates are paired (-pa), to apply a multple testing correction (-gc):

python3.4 suppa.py diffSplice --method <empirical> --input <ioi-file> --psi <Cond1.psi> <Cond2.psi> --tpm <Cond1_expression-file> <Cond2_expression-file> --area <1000> --lower-bound <0.05> -pa -gc -o <output-file>

Differential splicing with local events

An example of the usage of the program with local events, applying a multple testing correction (-gc):

python3.4 suppa.py diffSplice --method <empirical> --input <ioe-file> --psi <Cond1.psi> <Cond2.psi> --tpm <Cond1_expression-file> <Cond2_expression-file> --area <1000> --lower-bound <0.05> -gc -o <output-file>

Output files

The differential splicing operation generates a dpsi file and a psivec file for both cases, differential splicing of events and differential transcript usage.

dpsi file

dpsi is a tab separated file, with the (transcript or local) event ID in the first column, followed by a variable even number of fields, two for each pair of conditions being compared:

  1. Cond1_Cond2_dPSI: Event PSI difference (ΔPSI) between Cond1 and Cond2 (ΔPSI = PSI_2 - PSI_1).

  2. Cond1_Cond2_pvalue: Significance of the difference of PSI between Cond1 and Cond2

An example of a dpsi file for transcript events is the following one:

Cond1_Cond2_dPSI	Cond1_Cond2_p-val
ENSG00000000419.8;ENST00000413082.1	0.1204556785	0.0119388277
ENSG00000000419.8;ENST00000371583.5	0.0940010396	0.0897896221
ENSG00000000419.8;ENST00000494752.1	0.1904161332	0.2039492118

An example of a dpsi file for local events is the following one:

Cond1_Cond2_dPSI	Cond1_Cond2_p-val
ENSG00000000419;A3:chr20:49557492-49558568:49557470-49558568:-	0.1307455855	0.0484515485
ENSG00000000419;SE:chr20:49557470-49557642:49557746-49558568:-	0.0469449126	0.0954045953
ENSG00000000419;SE:chr20:49557470-49557666:49557746-49558568:-	0.0164051352	0.1518481518

By default, SUPPA calculates ΔPSI values pairwise between each pair of adjacent conditions as provided, and calculating the PSI difference between a given condition and the previous one. For instance, for three conditions 1,2,3, the ΔPSI values will be calculated for 2 - 1 (Cond1_Cond2_dPSI) and 3 - 2 (Cond2_Cond3_dPSI). On the other hand, if the --combination flag is indicated, ΔPSI values are calculated pairwise between all the possible combinations of conditions. For the previous example with conditions provided as 1,2,3, the ΔPSI values will then be calculated for 2 - 1 (Cond1_Cond2_dPSI), 3 - 1 (Cond1_Cond3_dPSI) and 3 - 2 (Cond2_Cond3_dPSI).

psivec file

psivec is a tab separated file, with the (transcript or local) event ID in the first column, followed by a variable number of fields, each giving the PSI value per replicate:

  1. sample PSI: PSI value for a given replicate/condition.

An example of an psivec file is the following one:

sample1 sample2 sample1 sample2
ENSG00000000003;A5:chrX:99890743-99891188:99890743-99891605:- 0.14343855447180462 0.02929736320730957 0.12495621749266525 -1.0
ENSG00000000003;A5:chrX:99890743-99891605:99890743-99891790:- 1.0     1.0     1.0       -1.0
ENSG00000000419;A3:chr20:49557492-49557642:49557470-49557642:-  0.0     0.0     0.0       0.0
ENSG00000000419;A3:chr20:49557492-49558568:49557470-49558568:-  0.004508763698090278  0.0     0.028759955911389606 0.0638008250417565 
ENSG00000000419;A5:chr20:49557470-49557642:49557470-49557666:-  0.4495047957648805  0.490664195110865 0.6304018235083193   0.6781317189253738 

Note: these files have a header with only the sample names (1 less column)


Cluster analysis


Using the relative abundances (PSI values) of (transcript or local) events in all samples, and the information of which events change significantly in at least one comparison, SUPPA calculates clusters of events using a density-based clustering. Density-based clustering has various advantages over other methods: it does not require to choose the number of clusters, as this is driven by the data; and it cluster together events that even though they might not have similar PSI values, they behave similarly across conditions as long as there are sufficient events between them. This function works in the same way for transcripts or for local alternative splicing events.

Input files

A dpsi file and psivec file are required (see definitions above). The dpsi file contains the information about which events are signficantly differentially spliced in each pairwise comparison. For instance, a dpsi file with data from the comparison of three conditions will look like:

Cond1_Cond2_dPSI  Cond1_Cond2_p-val Cond2_Cond3_dPSI  Cond2_Cond3_p-val
event1    <dpsi_value>    <dpsi_value>    <dpsi_value>    <dpsi_value>
event2    <dpsi_value>    <dpsi_value>    <dpsi_value>    <dpsi_value>
event3    <dpsi_value>    <dpsi_value>    <dpsi_value>    <dpsi_value>

The psivec file contains the PSI values for all samples, either per replicate or the average PSI value per condition, averaging over the replicates. For instance:

sample1 sample2 sample3 sample1 sample2 sample3 
event1    <psi_value> <psi_value> <psi_value> <psi_value> <psi_value> <psi_value>
event2    <psi_value> <psi_value> <psi_value> <psi_value> <psi_value> <psi_value>
event3    <psi_value> <psi_value> <psi_value> <psi_value> <psi_value> <psi_value> 

Command and options

SUPPA will use the psivec file to cluster events according to the PSI values across samples using those events that show significant change in at least one pairwise comparison using the dpsi file. Two methods are available: DBSCAN and OPTICS. Both methods require as input the minimum number of events in a cluster. OPTICS also requires as the maximum reachability distance (s), which represents the maximum distance in PSI space of an event to a cluster To perform the clustering from the dpsi and the psivec one has to run the following command:

python3.4 suppa.py clusterEvents [options]

List of options available:

An example of the usage of the program is:

python3.4 suppa.py clusterEvents --dpsi <dpsi-file> --psivec <psivec-file> --sig-threshold <0.05> --eps <0.05> --min-pts <20> --groups <1-3,4-6> -o <output-file> 

Note: In the example above, the replicates 1,2,3 and 4,5,6 are grouped into 2 conditions.

Output files

SUPPA clusterEvents operation generates a clustvec file per each pair of dpsi and psivec files.

clustvec

clustvec is a tab separated file, with the event ID in the first column, the cluster ID as second column, and then followed by a variable even number of fields, one for each condition, giving the mean PSI value per condition:

  1. event_id: Event ID.

  2. cluster_id: Cluster ID. An integer starting from 0 that indicates to which cluster the events belongs to. Singletons (non-clustered events) are indicated with -1.

  3. cond_PSI_avg: (variable number of columns): Average PSI value for each condition.

An example of a clustvec file is the following one:

Note: The clustvec file doesn't contain a header. The header is for illustraion purpouse only.

event_id  cluster_id  cond1_PSI_avg cond2_PSI_avg
ENSG00000004897;SE:chr17:45247408-45249283:45249430-45258928:-  1 0.967582617827  0.885135296685
ENSG00000004961;A5:chrX:11129580-11130139:11129543-11130139:+ 1 0.950872555163  0.720740016826
ENSG00000005189;A3:chr16:20824624-20826249:20824624-20826307:+  0 0.632147193342  0.332088364504
ENSG00000005189;A5:chr16:20818139-20818274:20818027-20818274:+  0 0.344510537149  0.161862745525
ENSG00000005189;SE:chr16:20851790-20855256:20855348-20855951:+  -1  0.853283401943  0.548674896371
ENSG00000005194;SE:chr16:57463192-57464170:57464251-57466399:-  -1  0.971516032983  0.829712829609

and similarly for transcript events: Note: The clustvec file doesn't contain a header. The header is for illustraion purpouse only.

event_id	cluster_id	cond1_PSI_avg	cond2_PSI_avg
ENSG00000133961.15;ENST00000556772.1 0    0.728937981178  0.223877291002  
ENSG00000133961.15;ENST00000554818.1 1    0.041992298773  0.508135647228
ENSG00000133961.15;ENST00000556600.1 -1   0.106752966827  0.127736627782

The cluster_id is


License


SUPPA is released under the MIT license.