From SnOwy - Ed's Wiki Notebook
Distance Matrices-- What are they measuring?
- Continuous Markov Chain -- a model for mutation of a sequence
AAAAA
AGAAA
AGATA
AGACA
- In this simplest model, there is no possibility (that is, zero modeled probability) of two mutation events occurring at exactly the same time.
* AAAAA
.02 /
* AGAAA
.10 / \ .01
* *
AATAG AGAAA
- This is a Poisson process -- nothing prevents a descendant from reverting to its ancestor's sequence after passing through some intermediate sequence
* *
| |u/2
|u *
| |u/2
* *
- Additive distance process: we don't care when something happens, only that 'u' units of time elapse
- Each time step indicates one point mutation
- Here's a corresponding FSA
A ---- C
| \ / |
| \/ |
| /\ |
| / \ |
T ---- G
- There is also a self loop for each nucleotide
- All state transitions, self loops included, have the same rate u/3, so each nucleotide sees blips at a total rate of 4u/3
- In t time steps, E[# of Poisson blips] = 4/3 ut
- u is the substitution rate per unit of time
- t is the number of time units
- Pr[0 blips] = e ^{(-4/3)ut} -- the probability that no blip occurs
- Where a blip is a state transition -- seen as a change in the sequence
- Pr[A -> A in t units] = e ^{(-4/3)ut} + .25(1- e ^{(-4/3)ut})
- Pr[A -> C in t units] = .25(1- e ^{(-4/3)ut})
- Both probabilities (a nucleotide ending up as itself, and a nucleotide ending up as one specific other nucleotide) converge to 1/4 as ut goes up
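A small sketch of the two probabilities above, to check the convergence to 1/4 numerically. The function names are made up for illustration; only the formulas come from the notes.

```python
import math

def jc_prob_same(u, t):
    """Pr[X -> X in t units]: no blip, or a blip that lands back on X."""
    decay = math.exp(-(4.0 / 3.0) * u * t)
    return decay + 0.25 * (1.0 - decay)

def jc_prob_diff(u, t):
    """Pr[X -> one specific other nucleotide in t units]."""
    decay = math.exp(-(4.0 / 3.0) * u * t)
    return 0.25 * (1.0 - decay)

for ut in (0.0, 0.1, 1.0, 10.0):
    same = jc_prob_same(1.0, ut)
    diff = jc_prob_diff(1.0, ut)
    # One way to stay the same and three ways to change, so the total is 1.
    print(f"ut={ut:5.1f}  same={same:.4f}  diff={diff:.4f}  total={same + 3 * diff:.4f}")
```

Only the product ut matters, which is why the loop fixes u = 1 and varies t.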
- ut is our branch length
- as a function of the percent mismatch, we can get our measure of ut -- our expectation of ut
- we end up with a graph that has a vertical asymptote at mismatches = 0.75 (x axis): ut (y axis) goes to infinity as the mismatch fraction approaches 0.75.
- if i & j differ in x fraction of positions, D[i, j] = (-3/4) ln (1 - (4/3) x)
- this curve isn't linear: the expected mismatch fraction saturates at 0.75 because, after a long time, a token can change back into itself
- there is the issue that when two sequences have 0.25 token identity or less (x >= 0.75), the formula is undefined and they are "infinitely" distant
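A minimal sketch of the distance correction above, assuming two already aligned, ungapped sequences of equal length. The function names are hypothetical, and returning infinity for x >= 0.75 is one possible way to handle the saturated case, not something the notes prescribe.

```python
import math

def mismatch_fraction(a, b):
    """Fraction x of positions at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(1 for p, q in zip(a, b) if p != q) / len(a)

def jc_distance(x):
    """D = (-3/4) ln(1 - (4/3) x), the estimate of ut from mismatch fraction x."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0 and 1")
    if x >= 0.75:
        return math.inf  # assumption: report saturated pairs as infinitely distant
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * x)

x = mismatch_fraction("ATGGC", "AACGC")
print(x, jc_distance(x))
```

For small x the corrected distance is close to x itself; the correction grows quickly as x approaches 0.75.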
UPGMA experiments
* *
\.1 .02 /.1
*-----*
/.1 \.1
* *
- the above tree is an ultrametric tree
- experiment 1: vary the sequences by length (length is m)
- success(m = 20) = 60%
- success(m = 100) = 64%
- success(m = 500) = 82%
* *
\.5 .1 /.5
*-----*
/.5 \.5
* *
- experiment 2: increase the differences by a factor of 5
- success(m = 20) = 47%
- success(m = 100) = 60%
- success(m = 500) = 80%
- the single nucleotide substitution model used here is the Jukes Cantor model, wherein every substitution is possible and all have the same probability
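The notes do not record how the experiments were run, so the following is only a sketch of the clustering step: UPGMA applied to the exact path lengths of the first tree (leaf branches .1, internal branch .02). The leaf names a, b, c, d and the helper `upgma` are invented for illustration. In the experiments, the distances would instead be estimated from simulated sequences of length m, and a run would presumably count as a success when the recovered grouping matches the true one.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves by average linkage and return a nested-tuple tree.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, tuple of the leaf names it contains).
    clusters = [(n, (n,)) for n in names]

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# True path lengths on the first tree: a, b hang off one internal node,
# c, d off the other.
leaf_len, internal_len = 0.1, 0.02
sides = {"a": "left", "b": "left", "c": "right", "d": "right"}
dist = {}
for x, y in combinations(sides, 2):
    crosses = sides[x] != sides[y]
    dist[frozenset((x, y))] = 2 * leaf_len + (internal_len if crosses else 0.0)

tree = upgma(list(sides), dist)
print(tree)
```

With exact distances the within-pair distance (0.2) is only slightly below the cross-pair distance (0.22), which suggests why short sequences, with noisy distance estimates, fail more often.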
Now including Transitions (α) and Transversions (β)
- in one unit of time, we now expect to see α + 2β substitutions
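- A <-> G, T <-> C α (transitions: purine <-> purine, pyrimidine <-> pyrimidine)
- A <-> T, A <-> C, G <-> T, G <-> C β (transversions: purine <-> pyrimidine; each nucleotide has two of them)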
- α / (2β) = R
- R is based on observation, so it depends on the organism
- (in the Jukes Cantor model, R = 1/2)
- (α + 2β = u)
- this works very well for neutrally evolving non-coding nuclear DNA
- that is, natural selection has no say
- calculating P, Q -- P = Pr[transition], Q = Pr[transversion]
- P + Q = x (-- i and j differ in x fraction of positions)
- computing P and Q may not be a good idea... ?
- two-parameter Kimura model -- R, u.
- picking a value of R while attempting to estimate distances is a popular way to calculate distances u as a tree is constructed ?
- in practice people normally estimate an R, then calculate u
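A sketch of how P and Q could be counted from two aligned sequences and turned into a distance. The notes do not write out the distance formula, so the standard Kimura two-parameter estimate, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), is assumed here. Function names are hypothetical, and the input is assumed to be ungapped uppercase A/C/G/T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_p_q(a, b):
    """Return (P, Q): fractions of sites showing a transition / a transversion."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(a), transversions / len(a)

def k2p_distance(P, Q):
    """Kimura two-parameter distance (assumed standard formula)."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        return math.inf  # assumption: saturated pairs are reported as infinite
    return -0.5 * math.log(a) - 0.25 * math.log(b)

P, Q = count_p_q("AGCT", "GGCA")
print(P, Q, k2p_distance(P, Q))
```

When P = x/3 and Q = 2x/3, the expression reduces to the Jukes Cantor distance (-3/4) ln(1 - (4/3) x), matching R = 1/2.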
- DNA will either be AT rich or GC rich --
- Now we have a model where the kinds of substitutions that we see respect a certain distribution of nucleotides
- We can build a model such that we take the stationary probability of the target
- We add the probabilities that things become a character A with βΠA
- Kimura...
- for the change going from T to C, the transition term is α_Y ΠC / (ΠC + ΠT), where Y denotes the pyrimidines (C, T)
- There is one beta and two alphas -- one corresponding to purines, the other to pyrimidines
- HKY model -- when the two alphas are the same
- HSF? -- when ?
Protein Models
- The models we discussed today describe only nucleotides
- The closest thing for amino acids describes a Jukes Cantor process for codons -- the PAM matrix
- BLOSUM is based on observed frequencies
- Neither of these makes a correct assumption about reality -- the first assumes no selection gradient, the second does not describe which sequence is more ancestral
A-T
- p = Pr[2 ends of an edge differ]
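- Var(\hat{D}) = 9p(1-p)/((3-4p)^2 n), where \hat{D} is the Jukes Cantor distance estimated from n sites (Var(\hat{p}) itself is p(1-p)/n)
- Gives justification for being uncertain about distances from short sequences (small n)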
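A quick numeric look at the variance expression above, which by the delta method is the approximate variance of the Jukes Cantor distance estimate. The sequence lengths reuse m = 20, 100, 500 from the UPGMA experiments; p = 0.2 is an arbitrary illustrative value, and the function name is made up.

```python
def jc_distance_variance(p, n):
    """Approximate variance 9p(1-p) / ((3-4p)^2 n) for n sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    if n <= 0:
        raise ValueError("n must be positive")
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)

for n in (20, 100, 500):
    var = jc_distance_variance(0.2, n)
    # The standard deviation shrinks like 1/sqrt(n).
    print(f"n={n:4d}  variance={var:.5f}  std dev={var ** 0.5:.4f}")
```

The (3-4p)^2 term in the denominator also means the variance blows up as p approaches 0.75, so long branches are uncertain even for long sequences.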
* ATGGC i
|
| d(i, j)?
|
* AACGC j
- To get d, we can guess using the named methods today
- However, how do we know that i and j are lined up correctly?
- we have to use an alignment
- this is where there's the circularity in this field :(
AGTTGTCT
AC--GTAT
- Reintroducing gaps
- Alignment: O(m^2)
- heuristically: roughly O(m)
- (alignment algorithm)
- the formal definition varies; in general: adding a finite number of dashes to maximize the scoring function of the two sequences
- gaps do not align over other gaps
- alignment algorithm assumptions are dangerous
- we assign scores to match, mismatch, gap open, and gap close
- all of these scores correspond directly to a probabilistic model of evolution
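- note: match = 1, mismatch = -1, gap = -1 is a simple unit scoring; the closely related "edit distance" uses costs match = 0, mismatch = 1, gap = 1 and is minimized rather than maximized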
- this is dangerous because we want to create a phylogeny, but we're already assigning values to the evolutionary distance
- we are indirectly assigning the parameters of the very model we want to find.
- Problem: Alignment assumes that we know distance!
- So we compute the maximum likelihood distance
- so we simultaneously attempt to compute the parameters of the alignment AS WELL as the distance matrix
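To make the O(m^2) alignment cost concrete, here is a minimal dynamic programming sketch that computes only the best global alignment score. It uses a single per-position gap score rather than separate gap open and gap close scores, which is a simplification of what the notes describe. The function name and defaults are illustrative.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b, filling an O(len(a) * len(b)) table."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a dash
                score[i][j - 1] + gap,       # b[j-1] against a dash
            )
    return score[-1][-1]

print(global_alignment_score("AGTTGTCT", "ACGTAT"))
# With costs 0 / 1 / 1 written as scores 0 / -1 / -1, the negated result
# is the edit distance.
print(-global_alignment_score("AGTTGTCT", "ACGTAT", match=0, mismatch=-1, gap=-1))
```

The choice of match, mismatch, and gap values is exactly the hidden evolutionary assumption the notes warn about: changing them can change which alignment, and hence which distance, comes out.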