Notes 20100506 CS 798 Dan Brown, Phylogeny Kmer Counting
From SnOwy - Ed's Wiki Notebook
Contents |
Word Counting, Building Phylogenies, Alignment Free
- Pairwise alignment: oblique way to describe how much change is required to convert one sequence to another
- Recommended Books: Biological Sequence Analysis -- http://www.amazon.ca/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713/ref=sr_1_1?ie=UTF8&s=books&qid=1273160157&sr=8-1
- Get the second edition.
- Learn the probabilistic underpinning of pairwise alignments
Alignment history
Keywords probability sequence kullback-leibler kmer shannon vector JS distance
- a combinatoric approach is O(n^2m^3)
- a later idea-- Pick P2: count for all w in sigma^k the number of times w is a substring of Si for all i.
- we use this as a signature for some sequence Si for some value of k;
- we normalize this to a probability vector we call fi
- remark: okay, we're now talking about the kmer counting
- we can call this a distance -- D(i,j) = JS(fi, fj) =
- (1/2)KL[fi, (fi+fj)/2] + (1/2)KL[fj, (fi+fj)/2]
- KL[x,y] = sum(i in omega) xi log (xi/yi)
- is a probability vector
- this has an information theory meaning
- KL says how much more information we need to describe y if we already have x.
- when the distribution is similar, distance is small-- vice versa;
- KL is not symmetric. -- but JS is symmetric-- it's the average of the KL.
- Metric: satisfies the triangle inequality, dist(self, self) = 0,
- JS does not satisfy the triangle inequality-- but it's square root does.
- so now we have D(i,j) --
- We would like it to be the case where a tree with weighted edges in a phylogenic tree ...
- the modern taxa should be related to one another given their branch lengths
- NJ -- "does well sometimes" -- that's actually very good!
- is robust enough to handle some noisy distance
- UPGMA -- has a few unrealistic expectations about how evolution actually operates
Variable names
- S = sequence
- w = substring
- m = number of sequences
- n = number of taxa
How long does it take to compute this distance
- linear on sequence length
- run time is reduced to O(n^2m)
- why doesn't everyone use this method?
- the sequence alignments allow us to make some sequence-sensitive (not only token-sensitive) assertions about evolutionary events
- two other alignment methods are allowed
Kolmogorov complexity
- Si and Sj
- kappa(Si) = length of the shortest program whose output is Si.
- note: kappa(Si) is not computable -- cannot approximate -- is interesting as a mathematical object
- runtime is infinite
- it should be the case that most highly compressible language should have the shortest program
- how do we use this to measure the distance between strings (relate to Si, Sj)
- so let's say that Si and Sj are similar
- we could probably write a program to write Sj with a very small change in Si--
- if Si and Sj are vastly different, then we can't make some trivial change (in size of program) in order to get Sj.
- d(a,b) = ( k(ab) - min(k(a), k(b)) ) / max( k(a), k(b) )
- -- ab is a concatenation of a, b;
- remember that kappa is not computable
- how do we relate this to complexity
- any program that outputs Si is an upper bound of kappa
- for instance, the output of gzip(Si) concatenated with sequence Sj is the program that creates Sj.
- a way to represent the distance between a, b is to draw the alignment!
- O(n^2m) if creating a program that generates a sequence based on a different program (for a different sequence)
Average Common Substring Length -- Another method
- Si, Sj
- Si = AGTTCTT; Sj = CCGTTATA
A G T T C T T | C C G T T A T A
-
1 -----
3 ---
2 -
1 -
1 ---
2 -
1
- ACS[Si, Sj]
- dij = 1/(ACS[Si,Sj])
- got popular in 2003, team in Isreal
- has runtime of O(n^2m)
- if the average common substring length is similar, then the sequences are more likely to be similar
Going back to the Kmer counting
- foreach w in strings of k: fi(w) = Pr[a subword of Si is w]
- how do we get k?
- authors used books as an example
as a function of k, how many words in w {a ... z}k appear at least twice?
- words are not words in the book (Peter Pan), they remove spaces, punctuation etc.,
- we expect few words when k is small, we expect few words when k is big-- this number bulges in the middle-- when k = 6
- could have included spaces, punctuation, could have used words built from English words as opposed to letters.
- worry that biosequences are less compressible
- lower bound is given by this function.
upper bound.
- there should be a point where increasing the number of tokens does not increase the frequency of words
- f[a1 ... ak] =approx= f[a1 ... a(k-1)] * (f[a2 ... ak])/(f[a2 ... ak-1])
- the above is a Markovian assumption where the left side is the actual observed frequency and the right side is some approximation
- CRE[k] = sum of l = k+1 to infinity for KL [F hat l, fl]
- this happens at k = 16 for Peter Pan.
- gives a multi-modal curve that starts high and drops to zero.
Going to phylogeny
- hierarchical clustering -- Dr. Brown says: category error.
- phylogenies are a special case of hierarchical clustering: these are categories though that represent species that have existed though.
- the book thing is a nice clustering strategy but does not really say anything about ancestors
- incidentally-- this is perfect for Andre/Mike's problem: clustering :/ ... however, the objective was taxonomy :/
- let's think about it... k = 3...
Insertion...
A T G T A C T A | A T G C T A
---
-----
-----
-----
-----
- l+2k-2 -- where l is the length of the insertion, k is the length of a word.
Deletion...
A T G T A C T A | A T G T A A G C T A
---
-----
-----
-----
-----
- what about things that cause sequence alignments to die? -- inversions (not today) and translocations...
A T G T A C T A | A T G T A T A C
----- -----
- what about duplications? Behaves like an insertion.
- ROBUST!
- hey what should k be? log_4(n)? -- we're scaling to a probability vector-- it looks like there's no reason why this shouldn't work
- they've just done it this way
- See paper-- method -- a few heuristics are used
- words that are very common are thrown out (AAAAAAAAAAAAA)
- low complexity regions are thrown out (useful? err...)
Hints!
- WABI: Wheeler -- When you have a giant distance matrix -- Ninja
- http://www.wabi09.org/papers.php