Notes 20100617 CS 798 Computational Phylogenetics
From SnOwy - Ed's Wiki Notebook
Contents |
Markov Chain Monte Carlo
- Build a Markov Chain that mixes rapidly
- ( def : mixes rapidly → achieves convergence in a reasonable amount of time )
- Run the chain to stationarity many times
- Summarize PrD based on our sampled points
- in the metropolis algorithm, we have a graph where the graph has a number of points representing our sample space
- we transition from point to point based on the probability of the points occurring
- if the point to transition to has higher probability than the current point, we go to that point
- if that point has lower probability, we use a random assignment based on the odds ratio of the two points (the new point divided by the old point)
- we want chains that is well connected, and that does not have attractor hubs
- Metropolis
- k-regular, undirected, uniform, random walk
- we end up with an algorithm that breaks all of these four parameters
- in the Metropolis-Hastings algorithm, we fix this
- we have a probability distribution Q(a → b) = ...
- Pr[ Propose to go to b | Currently in a ]
- we have a probability distribution Q(a → b) = ...
- we accept the proposal (i.e. adopt the transition) ...
- with probability equal to the odds ratio as before but we scale the odds ratio given Pr[b] / Pr[a] ...
- min[1, Q(b→a)/Q(a→b) × Pr[b]/Pr[a] ]
- with probability equal to the odds ratio as before but we scale the odds ratio given Pr[b] / Pr[a] ...
- requires that each edge has a corresponding edge going the other direction with a positive probability
- (otherwise we would get some zero transitions)
- PrD[b] / PrD[a] = (Pr[D | b] / Pr[D | a]) × (Pr[b] / Pr[a])
- we now need to think of Pr[b]/Pr[a]
- so we need to know this ratio
- there are two parts to our priors
- topology
- edge lengths or parameterizations
Back to Bayes
- in order to do this, we need to make some claims about how we think evolution happens
- several religious camps of thought about how to justify using the Bayesian frame for phylogenetics
- Bayesians are distinguished by unjustified priors
- -- but how are you going to represent what you know?
- Almost everything a Bayesian believes can be rephrased in other flavours of statistics
- If your prior matters a lot, then you're doing it wrong
- Bayesians are distinguished by unjustified priors
- possible to use Bayes' theorem but we don't believe in priors--
- that means we always believe in flat priors-- that is Pr[b]/Pr[a] always equals 1
- turns out this isn't always so good...
- a flat prior works for finite sets but not so good for continuous sample spaces
- in some sense, we assume that our parameterization is right
- one approach: parameter p is uniform in [0, 1]
- or Parameter p2 is uniform in [0, 1]
Coalescent processes
- a coalescent event is where one parent has two children (no sex)
- in the continuous version of the coalescent processes two simultaneous events is not allowed
- coalescent processes are useful models for population genetics etc.
- at time t, a population of size n, at time t-1, a population of size n-- where a parent may give rise to several children
- a child may only have one parent
- we can view this as a sequence of rows of poisson blips
- other models include complications such as bottlenecks where n′ becomes n/2 for a generation
- finally, there are models that allow for meiosis (mating)
- the trivial case as we have described it works for humans for Y-chromosome (men) or mitochondria (women)
- also -- very underparameterized-- the only thing that affects edge length is population size: very nice to study
Trusting the analysis
- our experiments so far are unitary-- we have some data, it is processed and a result is returned
- it would be nice if each time we ran this process, we received the same result
- do we have statistics?
- we could study multiple genes
- dissatisfying: each experiment is not an independent replicate-- genes do not evolve independently
- i.e. one gene gains function, another one loses-- two genes coevolve, one gene is stable, another is allowed to be very unstable
- -- one way to try to get independent replicates (not really)
- we could try to use some non-parametric statistic approaches
- boot strapping and jack-knifing
- sequence alignment column frequencies are used to create a background distribution
- assumes each column is independent of one another (false)
- boot strapping parameters is the same as the probabilistic chromosome in the population genetic algorithms
- we say that we technically sample m columns with replacement
- jack-knifing is boot strapping where we choose km columns such that k ≤ 1.0
- sequence alignment column frequencies are used to create a background distribution
- boot strapping and jack-knifing
- we could study multiple genes
- why are these two good ideas?
- assuming that the data is large enough-- the statistician and machine learning people say...
- as 'm' grows without bound, choosing some value of 'm' (length, number of elements in the parameter vector) will produce the same distribution we observed
- note, in the above, we used ungapped sequence alignments.
What do we do with 104 trees?
- Summarization of trees
- Given 'k' tree topologies over the same 'n' taxa: ...
- we want to find a tree (we call t*) that summarizes them best
- def : summarizes → finding a tree that best represents them-- most 'median' tree