Home

Awesome

<!-- title: ExOrthist pipeline output: html_document: default pdf_document: default --- -->

License: MIT Nextflow version Singularity version Docker version

<!-- <img align="middle" href="https://github.com/biocorecrg/exon_intron_orthology_pipeline" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/logo_s.png?raw=true" /> --> <!-- <img style="float:left; padding-right:20px" src="/Users/federica/Documents/CRG_PhD/weekly_meetings/20_10_15/exorthist_new.png" width=300 height=350 /> --> <!-- <style> figcaption { text-align: left; font-size: 0.7em; } </style> -->

ExOrthist pipeline

<figure> <img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/Exorthist_logo.png" width=270 height=400 /> <figcaption>Original logo by Queralt Tolosa</figcaption> <figure>

Meeting the ExOrthist

ExOrthist is a Nextflow based pipeline to infer exon orthology groups at all evolutionary distances. As a crucial innovation, ExOrthist sequentially evaluates three features when inferring exon homologous relationships: conservation of up/downstream intron positions and phases, conservation of the exon sequence, and conservation of the up/downstream exon sequences. ExOrthist has three modules:

<figure> <img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/ExOrthist_overview.png" width=500 height=500 /> <figure>

Table of contents

Requirements

ExOrthist requires the following software:

Additionally, liftOver, bedtools as well as specific pairwise liftOver files are required to run get_liftovers.pl [see below].

Installation

Install Nextflow (tested with 24.10.0):

curl -s https://get.nextflow.io | bash

Clone the ExOrthist repository:

git clone https://github.com/biocorecrg/ExOrthist.git

Install Docker:

ExOrthist will take care of downloading the required docker image from DockerHub and eventually convert it into a singularity one.

ExOrthist main module

ExOrthist main module infers exon homologous pairs and exon orthogroups within the gene orthogroups provided as input (e.g. generated by Orthofinder, Broccoli or similar tools) for all the species for which annotation (GTF) and genomic (fasta) files are provided. In order to fine-tune the orthology calls, evolutionary distance ranges between each pair of species (short, medium long) also need to be specified [see Algorithm for a detailed explanation of the main.nf logic]. In order to help the users reduce the computational burden of their own runs, we are sharing pre-computed pairwise protein alignments between some common model organisms, which can be integrated by new ExOrthist main runs. [see below]

Running ExOrthist main.nf

The pipeline can be launched in this way:

NXF_VER=24.10.0 nextflow run main.nf [-profile docker | -profile singularity | ... ] -bg > log.txt

There are also several profiles available for different institutions or systems (source: nf-core configs). Example:

NXF_VER=24.10.0 nextflow run main.nf -profile crg -bg > log.txt

If the pipeline crashes at any step, it can be re-launched using the -resume option (- not --):

NXF_VER=24.10.0 nextflow run main.nf -bg -resume > log.txt

Test run

The ExOrthist repository includes a folder named test containing all the input files necessary for a test run. The relative configuration files (nextflow.config and params.yaml) are also provided and can be used as templates for customized runs.

The test run will extract the exon orthology for 37 gene orthogroups shared between hg38 (human), mm10 (mouse) and bosTau9 (cow) selected from a Broccoli run. In order to familiarize yourself with ExOrthist main output, simply run the following code from the ExOrthist-master directory:

NXF_VER=24.10.0 nextflow run main.nf [-profile docker | -profile singularity | ... ] > test_log.txt

ExOrthist will save all outputs in the output_test directory. The following sections explain all inputs and outputs in detail.

Inputs

params.yaml file

The root directory provides a params.yaml file with the default test values. This file can be used as a template for generating a file with your own custom parameters.

To use that custom values, -params-file can be used as below:

NXF_VER=24.10.0 nextflow run main.nf [-profile docker | -profile singularity | ... ] -bg -params-file my_params.yaml > my_test.txt

Alternatively, the arguments in the params.yaml can be specified as independent command-line flags (use: --). The command-line provided values overwrite the default ones.

Required inputs

--genomes: a folder with genome files (fasta). The prefix (i.e. the species ID) must match the corresponding annotation file. The files can be compressed (.gz). Example:

hg38_gDNA.fasta
mm10_gDNA.fasta
bosTau9_gDNA.fasta

--annotations: a folder with annotation files (GTF). The prefix (i.e. the species ID) must match the corresponding genome file. The files can be compressed (.gz). Example:

hg38_annot.gtf
mm10_annot.gtf
bosTau9_annot.gtf

--cluster: a tsv file containing gene orthogroups with the following format: ClusterID SpeciesID GeneID. The SpeciesID must match the prefixes of --genomes and --annotations. All genes represented in the gene orthogroups should be protein-coding genes present in the provided GTF annotations. Otherwise, ExOrthist will create warnings at different steps. The file can be compressed (.gz). Example:

GF000001	hg38    ENSG00000151690
GF000001	mm10    ENSMUSG0000010.039
GF000001	bosTau9 ENSBTAG00000007719
GF000002	hg38    ENSG00000091127
GF000002	mm10    ENSMUSG00000057541
GF000002	bosTau9	ENSBTAG00000007743
GF000003	hg38    ENSG00000029534
GF000003	mm10    ENSMUSG00000031543
GF000003	bosTau9	ENSBTAG00000003275

--alignmentnum: number of alignments to submit within each chunk (default: 1000) [see Algorithm, section B].
--orthogroupnum: number of gene orthogroups to process together during the exon clustering step (default: 500) [see Algorithm, section D.
--evodists: a tsv file with evolutionary distance ranges (short, medium, long) for all pairs of species. Pairwise species combinations between all species of interest need to be specified. This information is used to adapt the stringency of the pairwise exon orthology call based on the evolutionary distance [see Algorithm, section C]. Example:

hg38	mm10    short
hg38	bosTau9 short
mm10	bosTau9 short

--long_dist, --medium_dist, --short_dist: list of conservation cut-offs for the pairwise orthology call corresponding to each evolutionary range. The list is in the format "int_num,ex_sim,ex_len,prot_sim".
int_num: whether the two intron positions around the exon (2), only one (1) or none (0) must be conserved. When set to 0, the exon orthology call will not take conservation of intron positions into account.
ex_sim: minimum percentage of sequence similarity between exon pairs. The same cut-off is applied to the conservation of the query exon and its up/downstream exons.
ex_len: minimum ratio between the lengths of a query-target exon pair (shortest/longest). Values from 0 to 1. Values closer to 1 require more similar exon lengths.
prot_sim: minimum global protein similarity required for a query and target protein isoforms to be compared.

Example for each evolutionary distance range:

--long_dist "2,0.10,0.40,0.15"
--medium_dist "2,0.30,0.60,0.20"
--short_dist "2,0.50,0.60,0.25"

See the ExOrthist paper for how the optimal parameters for each evolutionary distance range were derived.

--output: output folder destination.

Facultative inputs

--extraexons: a folder with tsv files of non-annotated exons. The prefix (i.e. the species ID) must match the annotation, genome and cluster files. Example:

hg38_extraexons.tab
mm10_extraexons.tab
bosTau_extraexons.tab

Required format (the header is expected):

ExonID	GeneID	Exon_coords_A	C1_Ref	C2_Ref
HsaEX0000026	ENSG00000255857	chr12:120210930-120211106	chr12:120209839-120209934	chr12:120212375-120212919
HsaEX0000056	ENSG00000258038	chr14:28837253-28837330	chr14:28832006-28832060	chr14:28841804-28842082
HsaEX0000064	ENSG00000258498	chr14:101558633-101558834	chr14:101559800-101560025	chr14:101557754-101557819
HsaEX0000091	ENSG00000137871	chr15:56917065-56917212	chr15:56917540-56918571	chr15:56890014-56890167

NB: The ExonIDs reported in this example are the vastID of exons quantified by vast-tools. However, the ExonID column accepts any kind of identifier. In case an "official" identifier is missing, simply assign a numerical ID (e.g. 1,2,3, etc.).
NB1: ExOrthist differentiates between C1 (upstream) and C2 (downstream) based on the strand, exactly as defined in transcription. In contrast, programs like rMATS ignore strand information and always report neighboring exons following the (+) strand definition of upstream (C1) and downstream (C2).
For the extraexon table, make sure that the exon coordinates satisfy these statements:

(+) strand:           C1_Ref < Exon_coords_A < C2_Ref
(-) strand:           C2_Ref < Exon_coords_A < C1_Ref

--bonafide_pairs: a tsv file with exon pairs considered bona fide orthologs to be incorporated in the exon orthology inference [see Algorithm, section C]. The input for this argument can be generated by the user by manual curation, but also by running the script get_liftovers.pl to obtain liftOver-based relationships among closely related species [see below]. Example (the header is expected):

GeneID_Sp1 Exon_Coord_Sp1 GeneID_Sp2 Exon_Coord_Sp2 Sp1 Sp2
ENSG00000171055	chr2:36552056-36552268:-	ENSMUSG00000056121  chr17:78377717-78377890:- hg38  mm10
ENSG00000171055	chr2:36578597-36578832:-	ENSMUSG00000056121	chr17:78400630-78400865:- hg38  mm10
ENSG00000171055	chr2:36558438-36558513:-	ENSMUSG00000056121	chr17:78384744-78384819:- hg38  mm10
<!-- **--minmembership:** minimum membership score value used to filter the final exon cluster file [see Algorithm, section D]. Default=X -->

--orthopairs: a file with gene orthologous pairs (for all species combinations) between the genes represented in --cluster. This file can be easily obtained from Orthofinder output (see Orthofinder, Results Files: Orthologues Directory) or it is directly provided as Broccoli output (see Broccoli, step 4). If provided, the gene orthopairs are used to recluster the exon orthogroups for each pair of species and later derive exon orthologous relationships only between direct orthologs. [see Algorithm, section D]. Example:

ENSMUSG00000067996  ENSBTAG00000031354
ENSMUSG00000067998  ENSBTAG00000031354
ENSMUSG00000072674  ENSBTAG00000020939
ENSG00000189129 ENSBTAG00000020939
ENSG00000189129 ENSMUSG00000094800
ENSG00000167863 ENSMUSG00000034566

--prevaln: path to the output folder of a previous ExOrthist main.nf run. If this argument is provided, the previously generated protein alignments are integrated into the current run. Only the alignments between newly introduced proteinID pairs (for previously run species pairs) or extra species pairs will be generated de novo [see Algorithm, section B].

Pre-computed protein alignments for various pairs of species have been generated using ExOrthist v1.0.0 and can be downloaded from the links below. Species annotations: hg38: Ensembl v88, mm10: Ensembl v88, danRer11: Ensembl v99, dm6: Ensembl Metazoa: v26.

--email: email address for notifications (yourmail@yourdomain)

Outputs

Default outputs

output/:

ExOrthist creates a folder in the output directory for each species, containing the following files:
output/${species}/:

ExOrthist creates a folder in the output directory for each species pair, containing the following files:
[NB: in all cases, both species1 and species2 are considered as query and target in turn.]
output/${species_pair}/:

Facultative outputs

ExOrthist is also able to consider non-annotated exons in the orthology inference, which can be added via the --extraexons argument [see Facultative inputs]. If non-annotated exons are added, ExOrthist also generates the following files:
output/${species}/:

ExOrthist provides the possibility to subset the exon orthology files based on reclustering information. This step is performed when a file with orthologous pairs from all species pairwise combinations is provided with the --orthopairs argument [see Facultative inputs]. In that case, an extra "reclustering"" folder is saved to the output folder directory with the following files:
output/reclustering/:

Algorithm

ExOrthist infers exon orthology groups within gene orthogroups (or clusters) provided as input (e.g. generated by Orthofinder, Broccoli or similar tools) for all the species for which annotation (GTF) and genomic (fasta) files are provided [see Inputs for details].

ExOrthist starts by creating files with annotation information for all the considered species [see Algorithm, section A]. Next, it works by species pairs (species1-species2) and within gene orthogroups, generating Intron Position Aware (IPA) protein alignments for all isoforms in species1 vs all isoforms in species2. Considering species1 as query and species 2 as target (and vice versa), ExOrthist extracts alignment features at the protein, exon and intron level [see Algorithm, section B]. For each query exon, the best exon match in each target isoform is selected based on sequence similarity.

All extracted features are translated into partial scores used to infer pairwise exon homologous relationships. In particular, ExOrthist considers five partial scores reflecting the conservation of different features of the exon-intron context: (1, 2) conservation of the immediately upstream and downstream intron positions and phases, (3) conservation of the query exon sequence and (4, 5) conservation of the immediately upstream and downstream exon sequences. Only the exon matches for which all features are above the specified conservation cutoffs will be selected as potential exon homologs. Importantly, ExOrthist sequentially evaluates the conservation of these features (1-2, 3, 4-5). Pairs of exons whose upstream and downstream intron positions and phases are not conserved will not be considered orthologs, independently of their sequence conservation. As an optional feature, ExOrthist allows to set different conservation cut-offs for short, medium and long evolutionary distances [see Inputs].

The sum of all partial scores gives a global score ranging from 0 to 1 and represents the overall goodness of a match. ExOrthist uses the global score to select the best exon match in a target gene for each query exon, specifically among the exon matches passing all the previous filters. The selected query exon-target gene best matches are considered pairwise exon homologs. While the ExOrthist logic requires a query exon to match a unique exon in the target gene, each target-gene exon can potentially be matched by more query exons in the same gene. This setting captures cases of in-tandem exon duplication while preserving the information about which duplicated exon is more similar to the ancestral one [see Algorithm, section C].

Finally, the selected exon homologous matches for all species pairs are joined and translated in a directed graph, from which exon orthogroups (clusters) are inferred. ExOrthist computes a Membership Score (MS) for each exon in its relative exon cluster based on graph properties. Best reciprocal matches (i.e. a query exon is the best match of its own target exon) are taken into account in the MS computation. [see Algorithm, section D].

At the end of a run, ExOrthist returns two files with complete and filtered exon clusters information, together with relevant intermediate files generated in each step.

A. Input generation

ExOrthist first checks that all the necessary input files (both genome and GTF for each species) and information (e.g. evo distances for all species pairs) are provided. It then generates files with annotation information for each considered species, which will be the input of all species pairwise comparisons. [see Outputs]

B. Pairwise alignments and feature extraction

For each species pair and gene orthogroup, ExOrthist executes the following steps.
It first generates Intron Position Aware (IPA) pairwise protein alignments between all isoforms of species1 and species2. These alignments are divided into chunks of --alignmentnum [see Inputs], which will be processed in parallel. Considering species1 as query and species2 as target (and vice versa), ExOrthist parses the IPA alignments to derive:

ExOrthist performs the alignments and the realignments in a parallelized way, to later join all the best isoform-wise matches for a given species pair. If the output of a previous ExOrthist run is provided with the --prevaln flag, the alignment and realignment steps are skipped for all the query and target isoforms which have already been aligned [see Facultative inputs]. Pre-computed alignments between some common model organisms can also be downloaded from here.

C. Scoring and best matches selection

ExOrthist keeps executing the processes separately for each species pair. In here, it first filters the exon matches (at the target isoform level) based on the conservation of different features of the exon-intron context. It then selects the best match (at the target gene level) for each query exon among the filtered isoform-level matches. The filter is based on 5 partial scores derived from the protein, exon and intron alignment features relative to each exon pair [see Algorithm, section B]:

  1. Conservation of upstream intron phase: ranging [-0.25, 0.25]. Positive values indicate phase conservation.
  2. Conservation of downstream intron phase: ranging [-0.25, 0.25]. Positive values indicate phase conservation.
  3. Conservation of the exon sequence: ranging [0, 0.2]. 0.2 indicates 100% sequence similarity.
  4. Conservation of the upstream exon sequence: ranging from [0, 0.15]. 0.15 indicates 100% sequence similarity.
  5. Conservation of the downstream exon sequence: ranging from [0, 0.15] 0.15 indicates 100% sequence similarity.

These partial scores for internal exons are then summed up to generate a global score ranging [0,1]. In the case of first or last exons, for which scores (2) and (5) or (1) and (4) do not exist, respectively, the global score is divided by 0.6 to make it range [0,1]. The best pairwise IPA alignment for a pair of exons can be automatically retrieved using the script retrieve_IPA_aln.pl.
For each species pair, different filtering criteria will be applied based on the evolutionary distance specified in the --evodists file argument and the relative parameters (int_num, exsim, exlen) defined in the --long_dist, --medium_dist and --short_dist arguments [see Inputs].
The up/downstream intron conservation, the exon sequence conservation and the up/downstream exons conservation are sequentially evaluated in the filtering.

In the end, among these pre-filtered matches, the match with the highest global score is selected (and considered as an homologous match) for each query exon and target gene. While the ExOrthist logic requires a query exon to match a unique target exon, each target exon can potentially be matched by more query exons. Overlapping query exons (i.e. alternative forms of the same exon) might be present in the pool of filtered matches. In order to univocally derived homologous relationships for each exon, ExOrthist selects the form with the highest number of matches (among all target genes) as representative of its overlap group. In case bone fide exon matches are provided with the --bonafide_pairs option, these matches are given priority over all the other matches inferred by ExOrthist (see next section).

Addition of manually curated exon orthology pairs

ExOrthist also allows the addition of bona fide homology pairs, which will be directly integrated in the exon orthogroups inference. Such exons can be specified with the --bonafide_pairs flag [see Inputs].

To help generate a list with high-confidence relationships across short evolutionary distances, ExOrthist includes a get_liftovers.pl script that extracts exon matches in a target species using the liftOver tool. This scripts needs annotation files (GTF) of the two considered species, the gene orthogroups file, and a UCSC over.chain file to derive genome-wide exon pairs; alternatively, a list of exons from the query species can be provided.

get_liftovers.pl parses exons at the CDS level by default (as ExOrthist), but it can work with mRNA exons if the option --type exon is specified. Furthermore, the script can require matches to have at least one cannonical dinucleotide using the --canonical_ss flag. Additional information is provided in the help message of get_liftovers.pl.

NB: Whenever bona fide exon pairs are provided by the users, they will be automatically chosen as representative of their overlapping group ([see Algorithm, section C]. Thus, they will be given priority over all exon matches inferred between annotated exons and (eventually) the additional not-annotated exons provided by the user [see facultative inputs].

D. Exon clustering

After deriving exon homologous relationships between each pair of species, ExOrthist works on the combined orthopair information to infer the exon orthogroups. For each gene orthogroup, ExOrthist builds a directed graph (through the R igraph package igraph) with exons as nodes and their pairwise homology relationships represented as edges. In case of a best reciprocal match between homologous exons, two directed edges will be drawn between the correspondent nodes. ExOrthist then applies the R igraph edge-betweenness algorithm to select the optimal graph topology, with communities highly intra-connected and lowly inter-connected. Although the directionality of the graph is not considered by the edge- betweenness algorithm, the reciprocality of the matches is represented by the number of edges. The exon communities identified in the optimal topology correspond to the multi-species exon orthogroups returned by the pipeline. For each exon, ExOrthist also computes a Membership Score (MS), which reflects its degree of similarity to all the exons belonging to the same orthogroup (OG). The MS is defined as follows:

MS = (IN_degree + OUT_degree + N_reciprocals) / (2*(TOT_exons_in_OG - SPECIES_exons_in_OG) + (TOT_genes_in_OG - SPECIES_genes_in_exOG))

with:

<!-- The default value of X can be modified through the **--minmembership** argument [[see Inputs](#inputs)]. <span style="color:red"> This still needs to be implemented. </span> -->

The number of gene orthogroups to be processed within a unique nextflow job can be specified by the --orthogroupnum argument [see Inputs].
ExOrthist provides the possibility to subset the exon orthology files based on reclustering information. This step is performed when a file with gene orthologous pairs from all species pairwise combinations is provided with the --orthopairs argument [see Inputs]. In that case, an extra reclustering folder is saved to the output directory [see Facultative outputs]

Exon clusters statistics

ExOrthist includes a script (get_cluster_stats.pl) to calculate some basic statistics on the generated exon clusters (orthogroups; OGs). This script only requires the output folder of main.nf as input. It consists of three related tables: a) the number of CDS exons from each species and the percent of those present in the final OGs (i.e. with at least one homolog); b) different types of OGs depending on the homology relationships (1:1, etc); and c) for those OGs in which at least one species is missing a homolog, the number of cases missed per species. Example for a genome-wide run between hg38, mm10 and bosTau9:

perl ~/ExOrthist/bin/GetStatsExonsClusters.pl --main_output hg38_mm10_bosTau9-test/

Summary statistics of exon orthogroups (OGs)
Species	Total CDS exons	Exons in geneOGs	Exons in OGs	% recovered
bosTau9	198432	186534	170678	91.50%
hg38	208106	192725	174243	90.41%
mm10	205956	188058	173892	92.47%

Exon OG type		Number	% from total OGs
Total exon OGs		177949	100%
OGs 1:1			148255	83.31%
OGs 2:2			303	0.17%
OGs >=3 exons/species	715	0.40%
OGs >=4 exons/species	280	0.16%
OGs missing species	25832	14.52%

Missing species		Number	% from missing
bosTau9			10585	40.98%
hg38			7104	27.50%
mm10			8143	31.52%

Retrieve IPA alignments

ExOrthist also offers the retrieve_IPA_aln.pl script to facilitate the retrieval and visualization of the best protein alignment between a pair of exons of interest. The script only requires the output folder of a main.nf run, the geneIDs of the query/target genes and the exon coordinates of the query/target exons. See the script help for further details on how to run it.

ExOrthist exint_plotter module

The exint plotter module allows to visualize conservation/changes in the exon-intron structure between a provided query gene and its homologs (in other species) starting from the output of ExOrthist main module. exint_plotter.nf is composed by (1) a few processes calling python scripts which build the necessary input files and (2) a process calling an R script generating the plot.

Running ExOrthist exint_plotter

nextflow run main.nf --wf plot [-with-docker | -with-singularity] -bg > exint_plotter_log.txt

Test run

After running ExOrthist main.nf module on the test set provided in this github repository [see above], it will be possible to perform a test run of the exint_plotter workflow.
The exint_plotter worfklow will take one of the genes for which the main.nftest run inferred an exon orthgroup (specifically: ENSG00000159055), and plot its exon-intron structure together with that of all its homologs. To get acquainted with the exint_plotter` output, simply run the following command:

nextflow run main.nf --wf plot [-profile docker | -profile singularity | ... ] > plot_test_log.txt

By default, ExOrthist will save a PDF file containing the exint plot in the output_plot directory (the expected output is shown in the Plot structure section). The following sections explain all inputs and outputs in detail.

Inputs:

params.yaml file

A params.yaml file with the default test values is provided in the root directory. This can be used as a template for generating a file with your own custom parameters.

To use that custom values, -params-file can be used as below:

NXF_VER=24.10.0 nextflow run main.nf [-profile docker | -profile singularity | ... ] --wf plot -bg -params-file my_params.yaml > my_test.txt

Alternatively, the arguments in the params.yaml can be specified as independent command-line flags (use: --). The command-line provided values overwrite the default ones.

Required inputs

--geneID: query gene ID (it has to match a geneID included in the EX_clusters.tab in --output).
--output: output directory of the ExOrthist main.nf workflow run.
--output_plot: output folder destination. [See Output]

Facultative inputs

--ordered_species: a comma-separated list of speciesID, specifying the vertical order (top-bottom) of the species in the plot. The query gene is always printed on top. If not provided, the order of the species will be derived from the gene cluster file in the --output directory (gene_cluster_file.gz). Example:

--ordered_species "hg38,mm10,bosTau9"

--sub_orthologs: subset of the gene_cluster_file.gz in --output, with a selection of homologs of --geneID to be plotted.
--relevant_exs: comma-separated list of coordinates (chr:start-stop) of query gene exons (NB: coordinates must match the CDS portion of the exon), which will be highlighted in the plot with different colors. All homologous exons in the target genes will be highlighted with the same color. This feature is useful to quickly visualize the conservation of one/few exons of interest. Example:

--relevant_exs ""chr21:32274830-32274896""

--isoformID: protein isoform ID of an isoform of query gene. It has to match the isoform ID in the GTF provided as input of the main.nf run. The exons included in the specified isoform will be highlighted with a red border.

Output

ExOrthist saves the exint plot in a pdf file containing the query geneID (e.g ENSG00000159055_exint_plot.pdf) in the --output folder. If the --isoformID argument is provided, this is introduced in the file name (e.g. ENSG00000159055-ENSP00000290130_exint_plot.pdf).

Plot structure

Homologous genes are plotted on a parallel horizontal axis. Exons are illustrated as gray rectangles. The query gene is plotted on top of the others, with all its exons represented with the sequential genomic order and relative exon length. Exons in the target genes are vertically aligned to their homologous exon in the query species.
The following plot is the one returned by the exint_plotter.nf test run:

<br /> <img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/Exint_plotter_example.png" />

Features representation

ExOrthist compare_exon_sets module

ExOrthist contains a module to perform evolutionary comparisons of sets of exons (e.g. regulated exons) between pairs of species. There are two main types of analyses: providing an exon set for a query species or an exon set for both species. In the first case, it will report a series of conservation statistics at the genomic level in the target species. In the second case, it will report statistics at the genomic level, but it will also assess the overlap between the two sets (i.e. regulatory conservation).

The module relies on a single perl script, compare_exon_sets.pl which takes as arguments the two species identifiers (as used in main.nf), one or two lists of exons of interest, and the main output folder for main.nf. Alternatively to the latter, individual arguments can be used to provide specific gene or exon orthogroup files and CDS exons.

The input query exon list(s) must contain gene ID, exon coordinate and, optionally, a third column with regulatory information. There are three types of regulatory information that can be provided, using the option --dPSI_info: (a) none, no information provided in the third column (all exons in the list are considered REGULATED); (b) qual_call, qualitative information of the regulation (valid values: UP, DOWN, REGULATED, NO_CHANGE, NO_COVERAGE (=NA or missing)); (c) dPSI: a numeric value from -100 to 100 corresponding to a change in inclusion levels (PSI) between two conditions. For dPSI, a delta PSI cut-off for an exon to be considered as UP or DOWN regulated should be provided using the option --min_dPSI. Finally, if --dPSI_info is not provided, compare_exon_sets.pl will automatically detect the type of regulatory information.

Running compare_exon_sets for a single exon set

When a single set of exons is provided, compare_exon_sets.pl will assess the conservation of each exon from the query species (sp1) in the genome of the target species (sp2). This conservation is referred to as Genome-conservation and will be assessed at two levels: (i) gene-level: whether or not the exon is in a gene with an ortholog in the target species; and (ii) exon-level: whether or not the exon has an ortholog in the target species (i.e. the target species has an exon in the same exon orthogroup). Example output text of the summary statistics:

- Gene-level stats (mm10 => dm6):
Genes with mm10 exons in the exon lists				664
Genes with mm10 exons with gene orthologs in dm6		355	53.46%

- Exon-level stats (mm10 => dm6):
Exons from mm10 in exon list					827
Exons from mm10 with gene orthologs in dm6			453	54.78%
Exons from mm10 with exon orthologs in dm6 (G-conserved)	43	5.20%
    Only genes with orthologs in dm6					9.49%

Running compare_exon_sets for two exon sets

When two sets of exons are provided, one for each compared species, both will be used as target and query (sp1 <=> sp2). Therefore, for each set of exons, compare_exon_sets.pl will assess the Genome-conservation at the gene and exon levels, and the Regulatory-conservation at the exon level. In particular, a Regulatory-conserved (R-conserved) exon is a Genome-conserved (G-conserved) exon whose ortholog is also regulated (i.e. the exon orthogroup of the regulated query exon contains at least one regulated target exon).

The following scheme highlights all different conservation scenarios:

<br /> <img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/Compare_exon_sets_A-01.png" width=700 height=550 />

Moreover, for all regulated exons in both species that fall in orthologous genes, compare_exon_sets.pl will perform a pairwise comparison to define whether the pair of exons are: (i) R-conserved; (ii) best exon matches (even if they do not fulfill all the conditions required, see XXXX); (iii) non-orthologous, when it can be confidently determined that the two exons fall in different regions of the proteins; or (iv) unclear, when neither of this can be determined confidently. The different scenarios and sub-scenarios are summarized in the following figure: <br />

<img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/Compare_exon_sets_B-01.png" />

compare_exon_sets.pl run on two exon sets by default provides two kinds of outputs: (1) the summary statistics, which are printed to standard output and (2) a graphical output, saved in a file named Compare_exons_summary-sp1-sp2.pdf in the working directory. Moreover, the flag --print_out allows to generate two extra files containing the output of the pairwise comparison: OrthoGenes_with_reg_exons-sp1-sp2.tab and Conserved_exons-sp1-sp2.tab, which contain the information about each exon as well as the result of the exon orthology test.

(1) Example output text of the summary statistics:

- Gene-level stats:
   - mm10 => dm6
Genes with mm10 exons in the exon list	664
Genes with mm10 exons with gene orthologs in dm6	355	53.46%
Genes with mm10 exons with gene orthologs in dm6 with regulated exons	50	7.53%

   - dm6 => mm10
Genes with dm6 exon in the exon list	276
Genes with dm6 exons with gene orthologs in mm10	196	71.01%
Genes with dm6 exons with gene orthologs in mm10 with regulated exons	41	14.86%


- Exon-level stats:
   - mm10 => dm6
Exons from mm10 in exon list	827
Exons from mm10 with gene orthologs in dm6	453	54.78%
Exons from mm10 with exon orthologs in dm6 (G-conserved)	43	5.20%
    Out of genes with orthologs		9.49%
Exons from mm10 with regulated exon orthologs in dm6 (R-conserved)	4	0.48%
    Out of genes with orthologs		0.88%
    Percent of R-conserved / G-conserved exons in mm10		9.30%
Exons from mm10 with gene orthologs with regulated exons in dm6	80	9.67%
    Exon orthologous	4	5.00%
    Best hit exon	8	10.00%
    Unclear case	7	8.75%
    Not conserved	61	76.25%

   - dm6 => mm10
Exons from dm6 in exon list	407
Exons from dm6 with gene orthologs in mm10	312	76.66%
Exons from dm6 with exon orthologs in mm10 (G-conserved)	73	17.94%
    Out of genes with orthologs		23.40%
Exons from dm6 with regulated exon orthologs in mm10 (R-conserved)	4	0.98%
    Out of genes with orthologs		1.28%
    Percent of R-conserved / G-conserved exons in dm6		5.48%
Exons from dm6 with gene orthologs with regulated exons in mm10	77	18.92%
    Exon orthologous	4	5.19%
    Best hit exon	6	7.79%
    Unclear case	6	7.79%
    Not conserved	61	79.22%

(2) Example of graphical output: <br />

<img align="middle" src="https://github.com/biocorecrg/exon_intron_orthology_pipeline/blob/master/docs/Compare_exon_sets_C-01.png" />

References

Marquez Y, Mantica F, Cozzuto L, Burguera D, Hermoso-Pulido H, Ponomarenko J, Roy SW, Irimia M. (2021). ExOrthist: a tool to infer exon orthologies at any evolutionary distance. Genome Biol, 22:239.