Notes 20101005 BIOL 614 McConkey Class
From SnOwy - Ed's Wiki Notebook
Last week: HMM (HMMER) / PSSM (PSI-BLAST)
Next week: Models of evolution -- backend scoring functions
Topics for today
- multiple sequence alignments
- phylogenetic trees (intro)
Phylogeny
- we can consider classical taxonomy as phylogenetics -- we now take molecular data instead of phenotypical characters
- ~100 years old
- requires: similarity score
- PhyML (hosted on the ATGC platform) -- "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood"
- available as a web server program
- input: PHYLIP / MSA -- DNA / AA / Interleaved / Sequential
- substitution model: scoring functions
- transition / transversion ratio
- substitution rate categories
- gamma shape parameter
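A rough sketch of what the transition / transversion ratio measures, assuming two already-aligned DNA sequences. The function name `ts_tv_counts` is made up for these notes and is not part of PhyML or any other program; real programs estimate the ratio inside the substitution model rather than by raw counting.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_counts(seq_a, seq_b):
    """Count transitions and transversions between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical positions, gaps and ambiguity codes.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions

ts, tv = ts_tv_counts("AACGT", "GATGA")
print(ts, tv)
```

The ratio itself is `ts / tv` when `tv` is non-zero; the raw count ignores multiple hits at the same site, which is the kind of thing the substitution model is meant to correct.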
Molecular phylogeny overview
- in this class: molecular data
- data analysis
- selection of sequences
- multiple sequence alignment
- substitution model (BLOSUM, PAM)
- handling indels
- possibly delete entire column?
- tree building
- tree evaluation
- why are we building a tree in the first place?
- xenologs (horizontal transfer) -- paralogs (gene duplication) -- orthologs (speciation)
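One simple way to handle indels, as raised above, is to delete every alignment column that contains a gap. A minimal sketch, assuming the MSA is a list of equal-length strings and "-" is the gap character; `drop_gap_columns` is a name invented for these notes.

```python
def drop_gap_columns(alignment, gap="-"):
    """Return the alignment with every gap-containing column removed."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    # zip(*alignment) walks the alignment column by column.
    kept = [col for col in zip(*alignment) if gap not in col]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]

print(drop_gap_columns(["AC-GT", "ACTGT", "A-TGT"]))
```

This throws away whatever signal the indels themselves carry, so it is a trade-off rather than the only option.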
Selecting sequences
- figuring out the phylogeny of those organisms
- COG -- Clusters of Orthologous Groups
Sequence data -- DNA / RNA / Protein
- protein works better in phylogeny for distantly related organisms
- DNA and RNA work better for closely related organisms
- exception: ribosomal RNA sequences can be used more widely, since rRNA is present in many organisms
- fine details -- protein sequences are going to be too similar
- DNA sequences have more changes for each sequence
- DNA is translated into protein sequences to create the primary alignment -- the DNA itself is then used for the final analysis
- DNA synonymous substitutions can be detected this way
- single nucleotide substitutions, sequential (temporally, not spatially) substitutions, coincidental substitutions
- DNA is closer linked to the molecular clock
- DNA -- noncoding regions, pseudogenes, transversion, transition rate
Multiple Sequence Alignments
- building an MSA (ClustalW)
- progressive alignment - sequences aligned in a specified order
- local / global weighting (T-Coffee)
- information from other sequences in pairwise alignment
- similar regions are downweighted ...
- ungapped segment alignment (DiAlign)
- no gap penalty; aligns short global regions, merges sequences (with gaps)
- iterative re-alignment (MUSCLE)
- sequence subsets removed and re-aligned
ClustalW
- start off with n sequences
- perform a (complete) set of pairwise global alignments -- n(n-1) ÷ 2 alignments
- guide tree created -- most similar to least similar
- progressively align sequences -- dynamic programming
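The first two ClustalW steps can be sketched as follows: score all n(n-1) ÷ 2 pairs by global alignment, then rank the pairs from most to least similar. This is only an illustration; the match / mismatch / gap values and the names `nw_score` and `ranked_pairs` are assumptions for these notes, not ClustalW's actual scoring scheme or code.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def ranked_pairs(seqs):
    """Score every pair of named sequences, most similar pair first."""
    scored = [
        (nw_score(seqs[x], seqs[y]), x, y)
        for x, y in combinations(sorted(seqs), 2)
    ]
    return sorted(scored, key=lambda item: -item[0])

seqs = {"a": "ACGT", "b": "ACGT", "c": "AGGT", "d": "TTTT"}
for score, x, y in ranked_pairs(seqs):
    print(x, y, score)
```

With four sequences this gives 4 × 3 ÷ 2 = 6 pairs. A real guide tree would be built by clustering these scores; the ranked list only shows which pairs would be joined first.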
Expectation Values Can't Be Used Here
- moot when it comes to MSA
- we've already assumed that we're aligning homologous sequences
- we don't really ever calculate e-values for that reason
- scoring functions thus assess something else...
Conserved Domain Database
- CDD will also return a functional summary
Phylogenetic Trees Continued
MSA Continued
- an initial error can occur for the primary pairwise alignments
MSA - Features
- conserved regions -- e.g. hydrophobic transmembrane segments
- cysteine (disulphide bridges) should be conserved
MrBayes - Software
- will mask away a certain region riddled with gaps to compare available information
- that is, will treat sequences as though they have the same length in primary tree construction
- why am I interested?
- is this the "view" concept in my first subproject?
Parsimony vs Likelihood
- parsimony: fewest changes (shortest edit distance)
- maximum likelihood: parameterizes the evolutionary functions involved and picks the tree that makes the observed data most probable under that model
- but does likelihood actually have a good foundation in reality?
- we have difficulty observing point mutations let alone bigger insertions and nucleotide repeat diseases ...
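To make "fewest changes" concrete, here is a small sketch of Fitch's algorithm, which counts the minimum number of changes one alignment column needs on a given rooted binary tree. The nested-tuple tree format and the taxon names are assumptions made for this example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    tree is either a leaf name (str) or a 2-tuple of subtrees;
    states maps each leaf name to its character in the column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

column = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}
tree_1 = (("human", "chimp"), ("mouse", "rat"))
tree_2 = (("human", "mouse"), ("chimp", "rat"))
print(fitch(tree_1, column)[1], fitch(tree_2, column)[1])
```

Summing this count over all columns gives the parsimony score of a tree; the tree with the smallest total is preferred. Searching over all possible trees is the expensive part and is not shown here.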
Evaluation of method
- consistency -- given different information (trials, genes, columns) within a single dataset
- robustness -- given injected random mutations
- efficiency -- outcome (accuracy, precision, conciseness) ÷ price (time, computation, complexity)
Bootstrapping -- robustness checking
- can we keep finding the same branching order if we randomly resample (with replacement) the columns of the same dataset? (columns are the multiple trials)
- cost scales linearly with the number of bootstrap replicates (makes sense, the tree building is re-performed that many times)
- the MSA columns -- sans columns with gaps -- are the trials that get resampled
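A minimal sketch of how one bootstrap replicate is drawn, assuming the MSA is a list of equal-length strings with gap columns already removed. `bootstrap_replicate` is a name invented for these notes; the tree-building step that would be run on each replicate is left out.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the width."""
    columns = list(zip(*alignment))
    # Draw as many columns as the original has; repeats are allowed.
    picks = [rng.choice(columns) for _ in columns]
    return ["".join(chars) for chars in zip(*picks)]

alignment = ["ACGT", "ACGA", "TCGA"]
rng = random.Random(0)  # fixed seed only so the run is repeatable
for _ in range(3):
    print(bootstrap_replicate(alignment, rng))
```

Each replicate keeps the rows (sequences) intact and only changes which columns are present and how often. Support for a branch is then the fraction of replicate trees in which that branch reappears.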
Human friendly displays
- cladogram -- phylogenetic branch lengths are not to scale -- pretty for a diagram
- phylogram -- phylogenetic branch lengths are to scale
Species vs gene, protein trees
- species tree -- built from one representative that is orthologous across all members of the tree -- stands for the evolution of the entire group
- speciation -- reproductive isolation
- internal nodes represent species
- gene tree -- gene duplications, paralogues, etc.
- allowed to occur in the tree -- is likely to be different from the species tree
- gene families
- gene loss can change the drawn phylogenetic tree: missing taxon.
Additive vs. non-additive trees
- additive: the distance between any two terminal elements equals the sum of the branch lengths on the path joining them
- additivity depends on dataset!
- phylogenetic trees should be additive
- non-additive: microarray data -- no particular assumption of common ancestor between clustered elements
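Whether a dataset is additive can be tested directly on its distance matrix with the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. A sketch, assuming a symmetric matrix given as a list of lists; the function name and the example distances are made up for these notes.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix."""
    for i, j, k, m in combinations(range(len(d)), 4):
        sums = sorted([
            d[i][j] + d[k][m],
            d[i][k] + d[j][m],
            d[i][m] + d[j][k],
        ])
        # The two largest sums must match for a tree to fit exactly.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances read off a tree with leaf branches 1, 2, 4, 5 and an
# internal branch of length 3 separating (A, B) from (C, D).
tree_like = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
print(is_additive(tree_like))
```

Real sequence distances rarely pass this test exactly, which is one reading of "additivity depends on dataset": tree-building methods then look for the tree that fits the distances best.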