Notes 20100610 CS 798 Course
From SnOwy - Ed's Wiki Notebook
Inference with a Flat Prior back in Inference World
- for all the topologies that include some particular subtree
- a subquartet ((A,B),(C,D))
- what is the average likelihood of the data?
- likelihood of data is generally something unbelievably small
- compare to ((A,C),(B,D)) or ((A,D),(B,C))
- so we don't know what the right tree is-- but we can estimate the likelihood of a given quartet
- if the ratio of one quartet's average likelihood to a competing quartet's is >1, then the first quartet is favoured-- if it is <1, then it is disfavoured
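
The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-likelihood values are made up, and the helper names `log_mean_exp` and `log_quartet_ratio` are hypothetical. Because the likelihood of the data is unbelievably small, the sketch works with log-likelihoods and averages them with the log-sum-exp trick, so nothing underflows to zero.

```python
import math


def log_mean_exp(log_values):
    """Return log(mean(exp(v))) without underflow, using the log-sum-exp trick."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total / len(log_values))


def log_quartet_ratio(loglik_first, loglik_second):
    """Log of (average likelihood of first group) / (average likelihood of second group)."""
    return log_mean_exp(loglik_first) - log_mean_exp(loglik_second)


# Hypothetical log-likelihoods of the topologies that contain each subquartet.
# These numbers are invented for illustration; they are not from the course.
LOGLIK_AB_CD = [-5000.0, -5002.0, -5001.0]   # topologies containing ((A,B),(C,D))
LOGLIK_AC_BD = [-5010.0, -5008.0, -5012.0]   # topologies containing ((A,C),(B,D))

if __name__ == "__main__":
    log_ratio = log_quartet_ratio(LOGLIK_AB_CD, LOGLIK_AC_BD)
    # A ratio > 1 corresponds to a log ratio > 0.
    verdict = "favoured" if log_ratio > 0 else "disfavoured"
    print(f"log ratio = {log_ratio:.3f}: ((A,B),(C,D)) is {verdict}")
```

Working in log space means "ratio > 1" becomes "log ratio > 0", which is the same test without the tiny numbers.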
Using inference
- If we assumed
- Bayes' theorem
- Pr(A | B) = Pr(B | A) Pr(A) ÷ Pr(B)
- Given the data: Pr(A | Data) = Pr(Data | A) Pr(A) ÷ Pr(Data)
- For an alternative A': Pr(A' | Data) = Pr(Data | A') Pr(A') ÷ Pr(Data)
- Pr(A | Data) ÷ Pr(A' | Data) = (Pr(Data | A) ÷ Pr(Data | A')) ⋅ (Pr(A) ÷ Pr(A'))
- the factor Pr(Data | A) ÷ Pr(Data | A') is the likelihood ratio
- the factor Pr(A) ÷ Pr(A') is the prior ratio
- note that we want to get rid of Pr(Data) where possible--
- if we can't get rid of it, then we have to estimate
- Pr(Data) = ∑_A Pr(Data | A) Pr(A), summed over all models A -- we don't want to estimate this.
- suppose we have a prior distribution over some family F of models, saying how likely we expect each member to be
- we could imagine doing Bayesian inference
- the edge length of a given edge is normally distributed ...
- N(.8, .1) -- truncated at 0 ⇐ N(real average, real standard deviation)
- let's suppose we see an alignment that corresponds to a distance of .75
- now what do we think?
- what is the p that maximizes Pr(p | Data)?
- Pr(p | .75) ∝ Pr(.75 | p) ⋅ Pr(p | prior)
- = N(p, σ)[.75] ⋅ N(.8, .1)[p]
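
As a rough illustration of the edge-length example above, the following sketch finds the p that maximizes the unnormalized posterior by a simple grid search. It assumes a prior of N(.8, .1) truncated at 0 and an observed distance of .75. The notes do not give σ, so the default of 0.1 used in the check below is an assumption, as are the function names and the grid settings.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def unnormalized_posterior(p, observed, sigma, prior_mean, prior_sd):
    """Pr(observed | p) * Pr(p), with the prior truncated at 0.

    The truncation only rescales the prior by a constant, so it does not
    move the maximum; it just rules out p <= 0.
    """
    if p <= 0.0:
        return 0.0
    return normal_pdf(observed, p, sigma) * normal_pdf(p, prior_mean, prior_sd)


def map_estimate(observed, sigma, prior_mean=0.8, prior_sd=0.1,
                 step=0.001, upper=2.0):
    """Grid search for the p in (0, upper] that maximizes the posterior."""
    n = int(round(upper / step))
    grid = [i * step for i in range(1, n + 1)]
    return max(
        grid,
        key=lambda p: unnormalized_posterior(p, observed, sigma,
                                             prior_mean, prior_sd),
    )


if __name__ == "__main__":
    # sigma = 0.1 is an assumed value, not one given in the notes.
    print(map_estimate(observed=0.75, sigma=0.1))
```

With equal standard deviations for the prior and the likelihood, the maximum lands halfway between the prior mean and the observation; a smaller σ pulls the estimate toward the observed .75.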
Note to self
- review the notion of prior in Bayes' theorem
One specific case
- where the world is not so good
- let's say that the true model f^ does not fall in F
- that is, reality is not a member of our set of models
- trivial example
- suppose that every edge of a tree has two distances (parameters) attached to it
- which correspond to fast evolution and slow evolution
- one such model will generate a distribution over all of the choices of columns
- that is, we have two magical buttons in the generative model where fast produces long edge lengths, slow produces short.
- suppose f^ is such a model, but F is the Jukes-Cantor model of evolution
- there exists a small (5 taxa) example where MLE in F is the wrong topology
Consistency
- as the number of columns goes to infinity, is the MLE the true topology?
- MLE = maximum likelihood estimate
- this point of contention actually became motivation for the phylogeny wars
- in verifying consistency, we say that there exists a family of generative models (F)
- which contains the one generating our data
- each model f in F gives a distribution over all possible columns in the character matrix
- each column in the generative model is an independent event (assumption)
- the generative model, f^ should be contained in F
- as m (the number of columns) grows without bounds,
- the LLN (law of large numbers) implies that the observed frequency of each kind of column converges to its probability under f^
- these are profile columns
- every model f in F gives a distribution of columns, we just know that f^ is one of those models
- there is no guarantee that some f does not give the same distribution as another f
- consider f in F,
- Pr(Data | f) = Π_{i=1..m} Pr(col i | f) = Π_C Pr(C | f)^(count of columns of type C)
- for large m, Pr(Data | f) → Π_C Pr(C | f)^(Pr(C | f^) ⋅ m)
- (1/m) log Pr(Data | f) → Σ_C Pr(C | f^) log Pr(C | f) -- maximum if f is the same as f^
- where C ranges over the types of column (where a type is a distribution of alphabetical tokens)
- Assume f^ in F
- then as m → ∞, Pr(Data | f^) is the maximum of Pr(Data | f) over all the choices of f ∈ F
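
The limiting argument above can be checked numerically on a toy case. The sketch below assumes a hypothetical family F of three models over three column types (the names `f1`..`f3`, the column types `x`, `y`, `z`, and all probabilities are invented). For whichever member plays the role of f^, it computes the limiting per-column log-likelihood, the sum over column types of Pr(type | f^) log Pr(type | f), and confirms that f = f^ attains the maximum.

```python
import math


def expected_loglik(true_dist, model_dist):
    """Sum over column types c of Pr(c | f^) * log Pr(c | f)."""
    return sum(
        q * math.log(model_dist[c])
        for c, q in true_dist.items()
        if q > 0
    )


def best_model(true_dist, family):
    """Name of the model in the family with the largest expected log-likelihood."""
    return max(family, key=lambda name: expected_loglik(true_dist, family[name]))


# Hypothetical family F: each model is a distribution over three column types.
FAMILY = {
    "f1": {"x": 0.5, "y": 0.3, "z": 0.2},
    "f2": {"x": 0.4, "y": 0.4, "z": 0.2},
    "f3": {"x": 0.2, "y": 0.3, "z": 0.5},
}

if __name__ == "__main__":
    for true_name, true_dist in FAMILY.items():
        print(f"generating model {true_name}: "
              f"best fit is {best_model(true_dist, FAMILY)}")
```

Note that the toy family gives each model a distinct column distribution. If two members gave the same distribution, they would tie, which is the identifiability caveat raised in the notes.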
Discussion
- the object is to prove consistency
- we have taken the statistician's method by looking at how a model behaves when the quantity of data it generates becomes large
- the problem is that we can't guarantee that there aren't alternative generative models that are capable of generating the same data
- we can't touch that problem when using only a consistency proof
Likelihood Methods
- Is pruning a gigantic combinatorial space actually worth it?
- Consistency