Notes 20100511 CS 798 Course Notes Phylogeny
From SnOwy - Ed's Wiki Notebook
lecturer: Dr. Dan G. Brown
Contents |
Using Dissimilarity Matrices
- distance matrix methods
- if given one of these distance matrices, how do we reconstruct a tree that can make use of them?
- ultrametric trees and ultrametric matrices
A dissimilarity matrix should look like ...
- in general, D is a symmetric n by n matrix for D[i, i] = 0 for all i;
- D[i, j] ≥ 0 for all i, j.
- Tree T is an ultrametric tree for D if n leaves
- each leaf is assigned to a row of D.
- in the way we expect; each leaf has n-taxa, one taxon for each leaf.
- is a rooted tree.
- each internal node of T is labelled with a positive number.
- for a path from leaf to root, the labels strictly increase
- for any leaves i, j -- label of their least common ancestor is D[i, j]
- in a given tree
7
/ \
5 4
/ | | \
A B C D
(we really wanted to label A to D as 1 up to n (but that was confusing since the weights are given by integers too).)
A B C D
A 0 5 7 7
B 0 7 7
C 0 4
D 0
- the above is an ultrametric tree and an ultrametric matrix respectively
Let's consider a symmetric matrix...
- D is a symmetric matrix with zeros on the diagonal
- satisfy the rule ...
- for every i ≠ j ≠ k there is a tie for max[D[i, j], D[i, k], D[j, k]]
- D is an ultrametric matrix
- then if D is ultrametric, then there exists a tree T for it that is ultrametric and vice versa
- that means that if we pick any three values, then the largest one is common to two [taxa] (?)
- if we have an ultrametric tree, then we can create the distance matrix that goes along with it.
What does this have to do with phylogeny?
- presumption: the matrix says something about the distance between taxa
- different presumption: the matrix says something about time lapse between ancestral organisms
- the distance is the time in the past when the two taxa diverged
Theorem -- ultrametric matrix
- if the theorem is true: if we can create an ultrametric tree from an ultrametric matrix, then this should be easy to do.
- let's prove this theorem.
- Theorem: if D is ultrametric, then there exists an ultrametric tree (T) for D
- AND we can construct it in O(n^2) time.
- this is the only quadratic time algorithm we will discuss this term;
- notice that the runtime is equal to the read time-- there are (n(n-1))/2 unique entries (O(n^2)).
- (it's a tradition in the field to give runtimes in the number of taxa)
- if this theorem is true, we can divide the partitions into the left side of the taxa and the right side of the taxa
- this is true, however that is in O(n^3) time -- using O(n^2) time algorithm
- consider G nodes are taxa, start at an arbitrary taxon:
- start at an arbitrary taxon
- pick the shortest-distance edge counting out of that taxon
- remove the taxon you just left
- repeat until we've vistited all taxa
1 2 3 4 5 (1,4) starting from 1-- find the shortest path (is 4), join; remove (1). (1,4) 2 3 4 5 (4,3) continuing from 3-- find the shortest path (is 3), join; remove (4). ((1,4),3) 2 5 (((1,4),3),5) 2 ((((1,4),3),5),2)
- we now look for the longest edge in this path:
- let's say is (3,4)
- then that means that the left side of the path is one half of the tree-- and the right path is the other half
- we then recurse-- we don't need to recalculate the path!
- we continue in this direction-- ordering the taxa takes n^2 time, then we do something with the path drawn
highest weight edge in L
|
*---*---*---*---*---*---*---*---*
i i' p q j
{---------------} {-----------}
| |
Lp Lq
- Claim: for every i in Lp, j in Lq; D[i, j] = D[p, q]
- this is induced from the ultrametricity max(,,) rule!
- we need: D[i',q] = D[p,q]
- we know: D[i',q] ≥ D[i',p]
- two possibilities: Either '>'(1) or '='(2).
- if (1), then there must be a tie for biggest -- then by ultrametricity, D[i',p] = D[p,q]
- if (2): D[i',q] = D[i',p] AND D[i',p] > D[p,q], we didn't pick p and q correctly.
- SO! Every node on the left side ends up being distance p for every node for the right side.
- D[p,q] forevery pair lp in Lp, lq in Lq respectively.
D[p,q]
/ \
/ \
/ \
lp in Lp lq in Lq
- we don't need to re-run the algorithm to create the path (the ordering algorithm)
- just recurse by continuing to look for the largest edge
- what if there's a tie for the longest edge?
- then we see a root with multiple immediate subtrees.
Stipulations on the above algorithm
- does this create a good phylogeny?
- not sure, in reality-- phylogeny distance matrices are only going to be nearly ultrametric
- the above does not really have a name-- it might be referred to as a most distant neighbour approach (is not furthest neighbour joining however).
- this path referred to above is actually a minimum spanning tree!
Dirty Data: only almost ultrametric
- what if D is only almost ultrametric?
- this all depends on the molecular clock hypothesis.
- why?
- D[i, j] is approximately the time when i & j diverged.
- if we look at a whole bunch of contemporary taxa, then we should be seeing a bunch of events in history--
- as we move forward in time, there should be fewer and fewer changes on each level of the tree
- until there are zero changes in the leaf line of the tree
- this hypothesis turns out to be false.
- that's ok, we're just using this as a framework for discussion
- we assume that everyone has the same molecular clock (metronome)
- if the rate of evolution (change) is consistent across each edge of the tree, then the phylogenetic construction is valid
- we not only estimate the number of mutations, we also map how this occurs across time.
- example: rodent mutations accumulate twice as fast than primates; plant mutations take a very long time; prokarya accumulate mutations very quickly
- we can say this is relatively true when we discuss say, only primates, only rodentia, etc.;
- sequence mutations accumulate in a clock-like fashion.
- let's discuss the minimum spanning tree
- if there is a shortest edge, then it will always appear in the minimum spanning tree (the path)
- we will always have the shortest edge
- this suggests a very different kind of algorithm
Utilizing the Shortest Pair
- we can find the shortest pair and join them
- that is consistent with building an ultrametric tree
- (even if the distance matrix was particularly non-ultrametric, it is possible to bifurcate against longest distance -- no phylogeny algorithms like this though)
Phylogeny is not Clustering
- Internal nodes are actually varieties of life
- Models are designed by population geneticists
UPGMA or Average Linkage Clustering
- we need to change the definition of D here-- D[i, j] was the time since i, j split; now it's the distance from i, j ==> D[i, j] *= 2;
- each taxon is in its own cluster, while > 1 cluster, find Ci,Cj such that D[Ci,Cj] is minimum
- the new cluster Ck -- Ck = Ci join (union) Cj
- remove Ci and Cj
- D[k, l] = (|Ci| * D[i, l] + |Cj| * D[j, l]) / (|Ci|+|Cj|)
- there exists a top-down version of this bottom-up algorithm; this may be more fragile.
- So we can see k is the parent of i, j.
- D[i,k] = D[j,k] = D[i,j] / 2
- retains ultrametric
- this is a tree building algorithm.
- runtime is O(n^3)
- all of the actual work actually occurs in finding Ci,Cj
- we can treat this as a heap
/\ /\
/ \ / \
/ /\ / /\
/\ /\ \ / /\ \
i j k
- as we went from i,j to k, we should expect that the distance from i, j to everything is equal to the distance between k and everything (except for the stuff from i to j)
- this means the weighted average is not really necessary
- really used to minimize error
- if we replace the averaging function with min, then it's "minimum-linkage clustering"; ... "maximum-linkage clustering"; etc.,
- UPGMA has acquired a lot of confidence from phylogenists than many of the other methods
Stipulations on UPGMA
- there are O(n^2lgn) algorithms derived from UPGMA
- what if D has many values that are not filled in (where there are many values that are just "BIG")
- then we can run the algorithm as O(E log n) time-- where E is the number of entries.
- there is no O(n^2) algorithm for UPGMA (open problem).
- are the o(n^2lgn) algorithms useful?