From SnOwy - Ed's Wiki Notebook
Previous
- Distance matrix based phylogeny
- Build an alignment
- Estimate distance from it
- Align(S, T); S, T are sequences
S: ATGGCT-AGT
|||X||-|X|
T: ATGCCTAACT
- 7 match, 2 mismatches, 1 gap
- The HMM that models the alignment
- has three states: (Mis)Match, InsertS, InsertT
- See ch. 2, 3;
- Can arbitrarily add weights against gaps => high number of mismatches
- or weights against mismatches => high number of gaps
- problem: model-- foreshadows result
- Project options: Phylogeny vs Alignment
- Many phylogenists will hand tweak alignments after they've been produced
- but they appear to have absolute faith in the software (probability, phylogenetic, etc.,) that is run afterward
Scoring a phylogeny algorithm
- True tree (T)...; Algorithm returned tree (U)
T
A---*---B
|
G---*---*---*---*---C
| | |
F E D
U
A---*---B
|
G---*---*---*---*---E
| | |
F C D
T:
AB|CDEFG
CD|ABEFG
CDE|ABFG
FG|ABCDE
U:
AB|
DE|
CDE|
FG|
- S1 is an entry for T
- S2 is an entry for U
- Distance = ( |S1-S2|+|S2-S1| ) /2
- In order to evaluate a method absolutely, we must know what happened in reality.
Moving on away from Distance Matrix Methods
- Because mutations are rare, we should not see them very often
- Mutations are more likely to be deleterious than beneficial
- Working with five taxa, let's consider a column
1 T
2 A
3 C
4 C
5 T
T A
1 2
! -> \ /
* <--- A
! --> |
* <--- C
/ \
3 * <- C
C |\ < !
4 5
C T
- Where '!' is a mutation-- there has to have been at least three mutations
- three mutations forced
- parsimony: the total number of mutations should be as small as possible
- (parsimony => "greedy", "miserly")
- perhaps a better tree...
T1 T5
\ /
* <- T
! --> |
* <- A
! -> / \
C > * A2
/ \
C3 C4
- two mutations forced
- input is an n cross m character matrix
- n, row is a taxon
- m, column is a single position in all sequences 'presumed to share evolutionary ancestry'
- assume no gaps
- goal: pick a topology to minimize the total number of forced mutations across all m columns
- fewer forced mutation is better than more mutations
- this model assumes that all mutations are independent
- For all topologies, for all columns; compute: the best way to assign a character to the internal nodes
- the total score for the topology is the sum of the column scores
- pick the best one.
- O(T(n)mn) -- T(n) the number of topologies for n; m is the number of columns; n is the number of taxa -- NP-hard.
- All we need to do is to determine the characters of the internal nodes
Algorithm
\ /
*
| /
*-ROOT-*
/ \
- we arbitrarily root the tree to begin its consideration
ROOT
/ \
* *
/| |\
. . . * u
/ \
. .
- we know the values at the leaves (these are fixed, these are the characters at each of the taxa)
- for a node u; C(u,a) = minimum number of forced mutations in the subtree rooted at u if n has a symbol a.
- We want... min C(r, a) when a is in Σ
- Leaves: C(x, a) = 0 if a is 'right' letter, infinite; otherwise C(u, a) =
*u
/ \
v-subree w-subtree
- Our options are simple-- a fits in neither, a fits in v, or a fits in w.
- C(u,a) = min[c(v,a), min(foreach b in Σ)[c(v, b) + 1]] + min[c(w,a), min(foreach b in Σ)[c(w, b) +1]]
- O(n.|Σ|2)
- Can we do this in sigmaless time?
- i.e. can we do the c(u,a) in constant time?
- min b in Σ can be computed once and stored. It is not affected by 'u'.
- with this device,
Hardness of NP-hard problems in bioinformatics
- n does it Sigma; water iuf
] Mpre Sigma; = {0, 4}
- Input is n elemnents of {0, 1}m
- D[i, j] = # of positions where i and j differ
- Parsimony <=> Hamming distance in Steiner tree problem
- Edges in hypercube corresponding to hamming distance of 1
- Steiner tree has size 1.0
- 1 new node required
- Parsimony tree has distance 3.0
- if there exists k Steiner points, the tree has n+k-1 total length
Proof -- convert np-hard problem into case of Steiner tree
- exact cover by three sets
- from vertex-cover distance
- G = (V, E)
- where V = (1 ... m)
- where E = subset of (1 ... m)2
- of size n
- Instance of Hamming distance Steiner tree (HDST):
- n vectors if ei = (j, k)-- add...
0---010---010---0
j k
- to our HDST instance
- If there exists a vector cover Q subset V; Q = {v1 ... vk},
- then the vectors
0 ... 1 ... 0
vi
- form a valid set of Steiner points for HDST
- Not as easy -- if there exists a 'k'-point Steiner tree--
- then there exists a k-element vertex cover