Notes 20100518 CS 798 Phylogeny Statistical and Other Methods

From SnOwy - Ed's Wiki Notebook

Jump to: navigation, search

Contents

Summary

Notes 20100518 - Neighbour Joining Continued

Claim: NJ applied to an additive matrix yields a tree compatible with that matrix;

Q[i, j] = D[i, j] - r[i] - r[j]

Meaning: That the sum of the path length equals the additive entries in the matrix D; The edge labels correspond to the entries in the matrix D.

Paper by David Brian -- gives an excellent explanation.

Today: Distance matrix methods and parsimony methods

-- Is this the only way that we can compute a distance matrix?

-- Statistician cap...

Given D, Construct T -> \hat{D}.

\hat{D} is additive, it's coming from a tree.

We want to construct T so that || D - \hat{D} || is minimized.

LS(D, \hat{D}) = sum of i, j for w(i, j) [D[i, j] - \hat{D}[i, j]]^2

w(i, j) are weights (for most of this lecture, most 'w's are 1)

\hat{D}[i, j] is the path length between i and j -- but for what tree?

A            C
 \          /
  e1      e3
    >-e5-<     \hat{D}[A,D] = e1 + e5 + e4
  e2      e4
 /          \
B            D


A            C
 \          /
  e1      e3
    >-e5-<     \hat{D}[A,D] = e1 + e2
  e2      e4
 /          \
D            B


A            B
 \          /
  e1      e3
    >-e5-<
  e2      e4
 /          \
C            D

We are calculating the number of unrooted topologies.

(n = 3) => 1 topology (n = 4) => 3 topologies (n = 5) => 3 * 5 topologies (n = 6) => 3 * 5 * 7 top ologies (n) => 1 * 3 * 5 * . . . (2n - 5)

Roughly -- (2^n)n! Steppomg Least Squares Phylogeny -- LS(D, \hat{D}) is NP-hard

For any topology, -- run time is O(n^3 • 2^n • n!) -- let 2^n • n! be T(n) .

Nearest neighbour interchange.

Subtree pruning and regraft.

Criteria for termination: - hilllimbih- slgotdas

Stepping back...How do we comput T <=> \hat{D} given a topology?

A            
 \          /
  e1      e3
    >-e5-<
  e2      e4
 /          \
             j

\hat{D}[i, j] = e1, e2, e7 [[D[i, j] - \hat{D}[i, j]]^2 = [D[i, j] - e1 - e3 - e6]^2

We assume all 'w' = 1.

How do we minimize the distances? Use calculus.

\partial{Q} / \partial{e_i} = -2 sum of i, j for (D[i, j] - \hat{D}[i, j]); X i,j,k = 1 n-3 equations, n-3 unknowns => O(n^2)

Sum for i not j for ([D[i,j] - \hat{D}[i,j]] ^ 2) / (Var[D[i, j]])

Var[D[i,j]] = k[D[i,j]^2]

This is the basic structure that's built into least squares phylogeny.

An LS algorithm is consistent: ||D-\hat{D}|| should approach zero as the corresponding trees get better (closer to reality)

Minimum evolution principle

(a different class of algorithms, in contrast to the LS algorithms)

Do LS approx for all topologies-- pick topology that minimizes sum of evolutionary distances.

-- turns out to be inconsistent. -- phylogenists will use this method only as a follow up whereas LS algorithms are often preferred

Distances

What is D[i,j]? - an estimate of accumulated time - an estimate of number of mutations per site - some additive quantity that is based on the edges of the tree (actual, not mathematical)

Simplest possible model - ACGT -- mutation rate is 'u' --> p(A -> C) = u/3; p(A -> G) = u/3; p(A -> T) = u/3; - "mutation" should really be "substitution" - stochastic process -- simplest model -- ignored features: indels etc., - this is a poisson process!

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox