Notes 20100622 CS 798 A Silly Weird Topic in Phylogenetics
From SnOwy - Ed's Wiki Notebook
Serious: What do I do if my algorithm gave me 15k trees?
- some of our algorithms can give us back multiple trees
- algorithms based on parsimony for instance
- multiple optima
- Bayesian algorithms
- sample from posterior distribution
- independent experiments -- experiments aren't actually independent; genes actually coevolve
- the gene tree / species tree problem
- where we attempted to get
- solving the orthologue problem
- given n-taxa, can we identify the genes that should actually be part of the tree?
- def : orthologues -> genes that are related between species, are different due to speciation events
- the standard way to identify orthologues is with a reference sequence and some corresponding sequence given a set of available sequences
- why does this happen?
- 2000 Human Genome (bad sequencing back then -- roughly 1% error rate)
- 2001 Mouse Genome (again -- badness happened here)
- sequencing errors, algorithm weaknesses
- 2002, 2003 -- Rat Genome (better sequencing)
- 2003 Human Genome -- finished, good quality!
- there's also the likelihood that not all of the taxa have all of the orthologues
Supertree Methods
- Given T1 ... Tk all the trees on the sizme n taxa,
- How do we summarize them?
- we probably want a tree-- we probably want the summary to be a single tree
- evolution in the real world doesn't actually follow a single tree
- horizontal gene transfer
- philosophically suspect anyway: there's not just one tree.
- assume OK, let's work with the hypothetical
- we want this tree to be reasonably informative
*
/ \
/ \
/\ \
/ \ /\
/\ /\ / /\
A B CD E F G
*
/ \
/ \
/ /\
/\ / /\
/ /\ / /\ \
B C D E A F G
- different ideas for summaries
- strict concensus:
- only the splits found in all trees are included: i.e. (C, D).
- not very good -- we lose a lot of information-- we only have the taxa CD, C and D.
- consider the Mt tree -- for t ≥ 50
- here, a split is kept if it is found in > t % of the trees (Boundary condition: ≥ 100%, so M100 = strict consensus)
- M50 tree is the majority tree.
- Is this a tree?
- Could we obtain (A,C) and (A,B) as two majority splits?
- won't happen: requires at least one tree that has both splits-- (the solution space is broken into two halves)
- possible if t < 50
- Can we find the majority tree?
- find all the splits in each tree and vote for them - O(n2) per tree
- find all the splits with > k/2 votes - O(kn2) for summarization
- build a tree - O(n2) to build a tree
- O(n2k) to build the majority tree
Bit Vectors used to Describe Branching A B C D E F G 0 0 1 1 0 0 0 -- describes a branch who breaks off into a left subtree(A, B, E, F, G) and a right subree(C, D) 0 0 1 0 0 0 0 -- describes a lower branch who breaks off into left subtree(tree-C) and subtree(C)
- Can we do this in O(nk) time?
- use hash tables to make voting more efficient
- we reconstruct the treee more cleverly
- Nima Amenta, Federick Clarke, Katherine St. John, 2000
- reminder: n taxa, k trees.
- we're using the bitvectors as keys and the values are the number of votes <int, int>
- f(x) = p1x + p2 mod p3 where p3 is a prime -- this is the hashing function
- hash table collisions-- two values of 'x' have the same value of 'f(x)'
- g(x) = q1x + q2 mod q3 where q3 is roughly (nk)2, is a prime
- To make a vote for x, compute f(x)--
- see if there exists an entry in my main hash table's entry for f(x) for the new key g(x)
- if so: increment it.
- if not: add a new table entry with that key and give it value 1
- we have created a hash table of hash tables
- the first round of hashing is going to be full of collisions
- we have all hash entries with > k/2 votes
- each hash entry points to the first tree that created it and its exact split
- traverse the trees again, looking for a tree that contains each pair of eventual splits yielding a correct ordering for the nodes of the majority rules tree.
Silly: What if we have historic taxa available?
...