From SnOwy - Ed's Wiki Notebook
Projects
- Send in survey (HR)
- Send in project summaries
- Website is now online
Information
- The projects should have something to do with evolution, phylogeny
- Use office hours
- "Find something neat and work on it"
Last Time
- parsimony -- the Hamming distance Steiner tree (HDST) problem
- working toward a proof
- theorem: Given a vertex-cover (VC) intstance G=(V,E) and the parameter k,
- we can produce a HDST instance S subset of {0, 1}n
- such that the VC instance has a k-vertex-cover iff the HDST instance has a solution with k Steiner points
- is np-hard since VC is np-hard; -- HDST problem instance is np-hard
Reduction
- if (i, j) in E -- (E(dge) in G(raph))
(0 --- 010 --- 010 --- 0) = <i,j>
i j
- the HDST instance and also add <>
- trivial if y subset of v is the vertex cover then add <i> for all i in y.
- going in the other direction is a bit harder--
- assume there exists a k-Steiner point solution--
- (we can prove a simple lemma)
- then there exists a k-Steiner point solution where all of the Steiner points are of the form <i>
- and
- there is a node for each vertex of the
- ?
- Suppose not-- then pick a node in the Steiner tree that is of maximum distance in the tree from the origin (<>)
-
Let's discuss parsimony in tree construction
- All we needed from the previous discussion are two things
- That Steiner trees are 'similar to' minimum spanning trees except they allow for additional artificial points to reduce path lengths
- That the Steiner tree can be considered an expression of parsimony
- 4 Taxa
- binary characters
AB | CD
AC | BD
AD | BD
A 0 0 0
B 0 0 0
C 0 0 1
D 0 1 1
A0 1C
\0 1/
*- | -*
/ \
B0 1D
A0 0B
\0 0/
*- | -*
X X
C1 1D
- we are counting mutations -- that is for taxa A, B, C D;
- parsimony on four taxa-- count the number of 0011, 0101, 1001, 1100, 1010, 0110 characters
- each votes for its 'favourite' topology; most votes wins.
- The true topology:
A C
\p q p/
*---*
/q q\
B C
- we only have the parameters p, q to keep things simple
- assume that m goes to infinity--
- we have an enormous amount of data
- we only need to look at 0011, 0101, 0110 -- the remaining are reflections (identical, symmetrical) of these three.
- We need to see what the probabilities of seeing the characters 0011 or 0101 or 0110 (where a bit quartet is referred to as a single character)
- let's be specific: the letters 'p' and 'q' are each the probability of mismatch
- Pr[0011] = (1-p)(1-q)(1-q)pq + ... (each of these terms contributes to the overall polynomial of p, q)
- Pr[0101] = ...
- Pr[0110] = ...
- it's always the case that: Pr[0011] is as big as Pr[0110]
- If we do some factoring, we can show that Pr[0011] - Pr[0101] = (1-2q)(q(1-q)-p2)
- 1-2q is always positive
- (1-2q)(q(1-q)-p2) is positive as long as p2 < q(1-q)
- in a plot of q (independent), p (dependent), the range and domain are 0.5; there is an asymptotic approach to q = 0.5. -- the point 0,0 is valid.
- these sequences are aligned that assumes we don't need to worry about gaps
- we are comparing columns
- those values that exceed q = 0.5 causes the probability of finding the correct topology to converge to zero (as in, Zero; not just less than one).
- Framework
- ground truth: tree (topology and edge probabilities)
- that generates data
- then there's an algorithm that maps data to the topology
- the statistical inconsistency argument: as m -> infinity; that is, as the model generates more and more data--
- we want it to be the case that the algorithm's topology is the correct one to converge to 1.0;
- if that is the case, - then the algorithm is statistically consistent
- in the true topology above, the generative model that creates the characters
- generative models for trees exist; if the tree were like this, and the edge length were this dot dot dot, then we can create sequences
- separately; there are phylogeny models that produce topologies given data
- So the whole point is to ask-- if we go to an infinite number of exemplars, do we converge to 1.0?
- FOR THIS specific generative model, p < sqrt (q (1-q)) is consistent.
- physically-- it's saying that this tree is bad (inconsistent)-- where the ground truth is weird--
A C
\ /
\ /
\ /
\ /
*-*
/ \
B D
- -- then the probability of ever finding this tree is zero.
- in the parsimony frame, there is only one mutation in a given character.
- problems-- we assume that the real world uses this generative model!
- we also assume that we get data from the real world coming from p < sqrt (q (1-q)) ... which may not be true.
A different framework
- coming out of parsimony
- "Compatibility"
- Parsimony: argument: mutations are rare, forced mutation isn't common.
- compatibility: a specific kind of mutation is rare: homoplasy
- homoplasy
- a homoplastic character
- In reality: this implies a tree where sections of the tree are divided by a different character of the alphabet
- just s
- In this framework, we can calculate the probability of success as before--
- the inconsistency problem boils down to something similar-- the 'inconsistent' region still exists
Where are we?
- Distance methods
- UPGMA
- assuming we have good distances, this is consistent on clock trees -- inconsistent on non-clock-trees
- NJ
- consistent (if we have a good distance matrix to begin with)
- Least squares
- consistent -- by arguing that the tree that minimizes the least squares criterion will yield the correct distance matrix
- unfortunately, NP-hard (prune and splice; neighbour switching)
- Minimum evolution principle
- Parsimony
- inconsistent -- and also NP-hard (people still like this)
- Quartets
So how do we reconstruct trees WITHOUT the NP-hardness?
- let's assume that it's always possible to produce quartets (four-taxon trees) -- this is false, but let's pretend it's true for this discourse
- can we go back to polynomial world with quartets?
- why quartets?
- is there sufficient information in quartets to assist in reconstruction?
- (recall we use duplets in distance methods)