Notes 20100518 CS 798 Phylogeny Statistical and Other Methods
From SnOwy - Ed's Wiki Notebook
Contents |
Summary
- Basically, we started by asking how we evaluate how good a distance tree is -- one solution is to minimize the distance between the distance matrix (D hat) (induced by the distance tree) and the actual distance.
- We went over two methods: both rely on creating an initial tree that is approximate.
- We can prune and splice subtrees to improve the d-hat
- We can switch the nearest neighbours to improve d-hat-- that is, switching a ((1, 2), (3, 4)) vs ((1, 3), (2, 4)).
Notes 20100518 - Neighbour Joining Continued
Claim: NJ applied to an additive matrix yields a tree compatible with that matrix;
Q[i, j] = D[i, j] - r[i] - r[j]
Meaning: That the sum of the path length equals the additive entries in the matrix D; The edge labels correspond to the entries in the matrix D.
Paper by David Brian -- gives an excellent explanation.
Today: Distance matrix methods and parsimony methods
-- Is this the only way that we can compute a distance matrix?
-- Statistician cap...
Given D, Construct T -> \hat{D}.
\hat{D} is additive, it's coming from a tree.
We want to construct T so that || D - \hat{D} || is minimized.
LS(D, \hat{D}) = sum of i, j for w(i, j) [D[i, j] - \hat{D}[i, j]]^2
w(i, j) are weights (for most of this lecture, most 'w's are 1)
\hat{D}[i, j] is the path length between i and j -- but for what tree?
A C
\ /
e1 e3
>-e5-< \hat{D}[A,D] = e1 + e5 + e4
e2 e4
/ \
B D
A C
\ /
e1 e3
>-e5-< \hat{D}[A,D] = e1 + e2
e2 e4
/ \
D B
A B
\ /
e1 e3
>-e5-<
e2 e4
/ \
C D
We are calculating the number of unrooted topologies.
(n = 3) => 1 topology (n = 4) => 3 topologies (n = 5) => 3 * 5 topologies (n = 6) => 3 * 5 * 7 top ologies (n) => 1 * 3 * 5 * . . . (2n - 5)
Roughly -- (2^n)n! Steppomg Least Squares Phylogeny -- LS(D, \hat{D}) is NP-hard
For any topology, -- run time is O(n^3 • 2^n • n!) -- let 2^n • n! be T(n) .
Nearest neighbour interchange.
- phylogenetic tree has a single edge that is considered whose orientation is altered
Subtree pruning and regraft.
- edge is cut, attempt to rejoin the two subtrees
- linear number of edges
Criteria for termination: - hilllimbih- slgotdas
Stepping back...How do we comput T <=> \hat{D} given a topology?
A
\ /
e1 e3
>-e5-<
e2 e4
/ \
j
\hat{D}[i, j] = e1, e2, e7 [[D[i, j] - \hat{D}[i, j]]^2 = [D[i, j] - e1 - e3 - e6]^2
We assume all 'w' = 1.
How do we minimize the distances? Use calculus.
\partial{Q} / \partial{e_i} = -2 sum of i, j for (D[i, j] - \hat{D}[i, j]); X i,j,k = 1 n-3 equations, n-3 unknowns => O(n^2)
Sum for i not j for ([D[i,j] - \hat{D}[i,j]] ^ 2) / (Var[D[i, j]])
Var[D[i,j]] = k[D[i,j]^2]
This is the basic structure that's built into least squares phylogeny.
An LS algorithm is consistent: ||D-\hat{D}|| should approach zero as the corresponding trees get better (closer to reality)
Minimum evolution principle
(a different class of algorithms, in contrast to the LS algorithms)
Do LS approx for all topologies-- pick topology that minimizes sum of evolutionary distances.
-- turns out to be inconsistent. -- phylogenists will use this method only as a follow up whereas LS algorithms are often preferred
Distances
What is D[i,j]? - an estimate of accumulated time - an estimate of number of mutations per site - some additive quantity that is based on the edges of the tree (actual, not mathematical)
Simplest possible model - ACGT -- mutation rate is 'u' --> p(A -> C) = u/3; p(A -> G) = u/3; p(A -> T) = u/3; - "mutation" should really be "substitution" - stochastic process -- simplest model -- ignored features: indels etc., - this is a poisson process!