Documentation Home {#mainpage}

C++ tool to check for and enumerate terraces in phylogenetic tree space.


Usage: terraces/build/release/app <nwk file> <gene/site file>

Terraphast takes a .nwk file in Newick format and a genes/sites file, which records whether gene i is present (1) or absent (0) in species j.

The program output states some properties of the input data, the species whose leaf edge is used as the new tree root, and the resulting supertree in compressed Newick format.

Compressed Newick Format: The resulting supertree representation can be plain Newick, but can also contain the following two notation enhancements:

Both enhancements were chosen such that the result is standard Newick format if there is only one possible supertree.

The Terrace Phenomenon and Problem

In recent years, it has become common practice to infer phylogenies on so-called multi-gene datasets. Concatenated multi-gene datasets usually exhibit holes, that is, sequence data for some species might not be available for some genes Gi in the concatenated dataset. This can happen for several reasons: for instance, a species might simply not have a specific gene Gi, or the gene has not been sequenced for some of the species. After concatenating genes (partitions) we therefore end up with an alignment that contains patches of missing data:

index       0123

Species 1   AC--
Species 2   AG--
Species 3   ACTT
Species 4   --AG
Species 5   --GG
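The alignment above can be reduced to the kind of 0/1 presence information that the genes/sites file describes. The following is a minimal sketch, not Terraphast's own parser or file format: the function name `presence_matrix`, the `Partition` type, and the assumption that G1 covers columns 0-1 and G2 covers columns 2-3 are all introduced here for illustration only.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: half-open column range [first, second)
// covered by one gene/partition in the concatenated alignment.
using Partition = std::pair<std::size_t, std::size_t>;

// Returns matrix[s][g] == 1 if species s has at least one non-gap
// character in partition g, and 0 otherwise.
std::vector<std::vector<int>> presence_matrix(
    const std::vector<std::string>& alignment,
    const std::vector<Partition>& partitions) {
    std::vector<std::vector<int>> matrix;
    for (const std::string& row : alignment) {
        std::vector<int> present;
        for (const Partition& p : partitions) {
            bool has_data = false;
            for (std::size_t col = p.first; col < p.second && col < row.size(); ++col) {
                if (row[col] != '-') {
                    has_data = true;
                }
            }
            present.push_back(has_data ? 1 : 0);
        }
        matrix.push_back(present);
    }
    return matrix;
}
```

Called with the five rows above and the partitions {0, 2} and {2, 4}, this yields the rows (1,0), (1,0), (1,1), (0,1), (0,1): only Species 3 has data for both genes.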

Under the likelihood model conditions that generate terraces, the log likelihood LnL(T) of a tree T can be computed as follows: LnL(T) = LnL(T|G1) + LnL(T|G2), where T|Gi denotes the tree topology induced by T for the species/sequences in partition i for which we have sequence data. In our example, the trees induced by G1 and G2 each contain only three taxa (species 1, 2, 3 and species 3, 4, 5, respectively). There is only one unrooted tree topology with three taxa. On the other hand, there are 15 possible unrooted topologies for 5-taxon trees. So all 15 possible 5-taxon trees for our example dataset induce the same per-gene/partition trees and therefore span a terrace of size 15. This example dataset is bad: it does not contain any signal for disentangling the phylogenetic history of these 5 species, since the two partitions are only connected via species 3.
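The counts 1 and 15 used above follow from the standard formula for the number of unrooted binary tree topologies on n labelled taxa, (2n-5)!! = 1 * 3 * 5 * ... * (2n-5). The sketch below is only an illustration of that formula; the function name `count_unrooted_topologies` is made up here and is not part of Terraphast.

```cpp
#include <cstdint>

// Number of unrooted binary tree topologies on n labelled taxa,
// i.e. the double factorial (2n-5)!! for n >= 3.
// For fewer than 3 taxa there is exactly one (trivial) topology.
// Note: the result overflows a 64-bit integer for roughly 20 or more taxa.
std::uint64_t count_unrooted_topologies(unsigned n) {
    if (n < 3) {
        return 1;
    }
    std::uint64_t count = 1;
    const std::uint64_t last = 2ULL * n - 5;  // largest odd factor
    for (std::uint64_t k = 3; k <= last; k += 2) {
        count *= k;
    }
    return count;
}
```

For n = 3 the loop body never runs and the result is 1; for n = 5 the factors 3 and 5 give 15, which matches the terrace size in the example.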

Terraces: two distinct comprehensive (containing all n species) trees are on a terrace if all induced per-partition subtrees of the two trees are identical. This phenomenon was named and described in [SMS11].

Knowing about the phenomenon of terraces, researchers might want to know (i) whether a given tree is on a terrace, (ii) how many trees there are on that terrace, and (iii) what the trees on that terrace look like.

The Basic Approach

TO PUT IN HERE:

A Short Guide to the Code

This can be found here.

Improvements and Optimizations to the Basic Approach

Implemented:

Planned:

References

[SMS11] Michael J Sanderson, Michelle M McMahon, and Mike Steel. Terraces in phylogenetic tree space. Science, 333(6041):448–450, 2011.