Notes 20100914.143119 BIOL 614 McConkey Bioinformatics Course
Introduction
- Assumed student background: knows how to use BLAST and understands its algorithm; may know tBLAST (translated BLAST); unlikely to know about BLAST's E-values.
- This is a practical course -- theory is covered up to the point where we understand why we would select a particular tool
- The course is a good way to fill in knowledge gaps
Topics
- Sequence alignment and BLAST statistics
- Multiple sequence alignment (MSA) -- 'a little more information'
- Clustering, phylogeny
- Clustering: describes taxa that are similar
- Phylogeny: see Müller's course for more.
- protein structural modeling
- binding sites, predictions, etc.
- functional annotation: genes, proteins
- data analysis in genomics, proteomics
- microbial genome sequence assembly
Topics from the floor
- next gen sequencing
- CNVs
- Perhaps: machine learning methods
Statistics
- various statistics used to detect errors and inaccuracies
- multiple hypothesis testing, etc.
Evaluation
Independent Project
- projects should be related to research
- the impetus of the course is to apply the methods discussed to your own academic interests
Seminar Component
- discuss the project selected
- also includes the theory behind the tool
- roughly 30 minute presentation
New in this version
- Lecture component
- Lab component
- C2-160: Friesen Computer Lab - Thursdays 2:30pm to 4:30pm.
- designed as a set of example cases
- gives practical understanding of what to expect in terms of returned data from tools
Notes
Scoring Matrices and BLAST Scoring
- Read "What is dynamic programming" - Sean R Eddy
- Read "... BLOSUM62 ..." - Sean R Eddy
Pairwise alignment
- try to pick out the best scoring alignments
- depends on objective function
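- should also estimate how likely it is that the alignment arose just by chance (E-values)
- scoring functions depend on amino acid or nucleotide substitution patterns
- Expectation value (E-value): the expected number of alignments scoring at least this high that would occur by chance alone in a search of this size; for small values it approximates a probability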
- low-complexity regions
- sequences that contain very short repeated sequences
- low-complexity filters: mask such regions so they do not produce spurious alignments
- example: leucine rich regions that form alpha helices.
- align DNA if no protein coding region available
- multiple domains
- entire domain insertions or deletions
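- Score -- raw scores, bit scores
- Raw score is the sum of substitution scores and gap penalties over the alignment
- Bit score is the raw score normalized by the statistical parameters (λ, K) of the scoring system, so that scores obtained with different scoring matrices are comparable
- Either score can be used to derive an E-value, given the query and database lengths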
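The relationship between raw score, bit score and E-value can be sketched with the Karlin-Altschul formulas, S' = (λS - ln K) / ln 2 and E = K·m·n·e^(-λS) = m·n·2^(-S'). This is only a minimal sketch: the function names are hypothetical, and the λ, K, m and n values are illustrative assumptions rather than values from the lecture. The real parameters depend on the scoring matrix and gap penalties and are reported in BLAST's output.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_raw(raw_score, lam, k, m, n):
    """Expected number of chance alignments scoring >= raw_score: E = K*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * raw_score)

def e_value_from_bits(bits, m, n):
    """Same E-value computed from the bit score: E = m*n*2^(-S')."""
    return m * n * 2.0 ** (-bits)

def p_value_from_e(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E); close to E when E is small."""
    return -math.expm1(-e_value)

# Assumed, illustrative parameters (not taken from the notes).
LAMBDA = 0.267      # scale parameter of the scoring system
K = 0.041           # search-space correction constant
QUERY_LEN = 250     # m: query length in residues
DB_LEN = 1_000_000  # n: total database length in residues

raw = 100
bits = bit_score(raw, LAMBDA, K)
e_raw = e_value_from_raw(raw, LAMBDA, K, QUERY_LEN, DB_LEN)
e_bits = e_value_from_bits(bits, QUERY_LEN, DB_LEN)
print(f"bit score = {bits:.1f}, E (from raw) = {e_raw:.3g}, E (from bits) = {e_bits:.3g}")
```

Both routes give the same E-value, which is the point of the bit score: once the score is normalized, only the search-space size m·n is needed.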
Deriving Scoring Matrices
- let $q_{a,b}$ be the probability that an alignment of amino acids a, b occurs in a homologous sequence pair
- let $p_a$, $p_b$ be the overall frequencies of occurrence of a, b
- probability of random alignment is the product of the aa frequencies $p_a p_b$
- log odds ratio for a, b is ...
- Odds ratio = $q_{a,b} / (p_a p_b)$
- Odds ratio $\equiv$ p(alignment is authentic) / p(alignment is random)
- we assume that each column is an independent event...
- total odds ratio = $\prod_u \left( q_{a,b} / (p_a p_b) \right)_u$, where u indexes the alignment columns
- logarithm allows us to sum instead of multiply -- computationally and numerically more sane.
- $S = \sum_u \log\left( q_{a,b} / (p_a p_b) \right)_u = \sum_u (S_{a,b})_u$
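The derivation above can be sketched as a small calculation. Everything here is a toy assumption: the two-letter alphabet, the frequencies, the base-2 logarithm (the notes do not state a base) and the function names. Real matrices such as BLOSUM62 also scale and round the log-odds values, which is omitted here.

```python
import math

# Toy background frequencies p_a (assumed values).
p = {"A": 0.6, "B": 0.4}

# Toy target frequencies q_{a,b}: probability of seeing a aligned with b
# in homologous sequences (assumed values; symmetric and summing to 1).
q = {
    ("A", "A"): 0.5, ("A", "B"): 0.1,
    ("B", "A"): 0.1, ("B", "B"): 0.3,
}

def odds(a, b):
    """Odds ratio q_{a,b} / (p_a * p_b) for one aligned pair."""
    return q[(a, b)] / (p[a] * p[b])

def log_odds(a, b):
    """Log-odds score S_{a,b} in bits (base 2 is an assumption)."""
    return math.log2(odds(a, b))

def alignment_odds(x, y):
    """Total odds ratio: product over columns, assuming independent columns."""
    return math.prod(odds(a, b) for a, b in zip(x, y))

def alignment_score(x, y):
    """Total score S: sum of per-column log-odds for an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(log_odds(a, b) for a, b in zip(x, y))

x, y = "AAB", "ABB"
print("sum of log-odds:  ", alignment_score(x, y))
print("log of total odds:", math.log2(alignment_odds(x, y)))
```

The two printed numbers agree up to floating-point rounding, which is the identity used in the last line of the derivation: the log of a product equals the sum of the logs.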
BLOSUM
- based on local alignments
- developed after PAM matrices -- more data
- derived from BLOCKS database
- BLOCKS Substitution Matrices = "BLOSUM" -- Henikoff and Henikoff 1991, 1992
- differences with PAM matrices:
- About PAM matrices
- PAM matrices attempt to fit an evolutionary model -- i.e. they have the property that repeated application of the matrix (matrix multiplication) yields a matrix for a more evolutionarily distant pair of sequences
- Used to identify conserved regions.
- Open question?! Transitive closure?!
- Why BLOSUM62? -- empirically, it seemed to work best.
BLAST
- different BLAST programs specialize for different kinds of tasks.
PAM
- PAM1 → for highly similar sequences
- PAM250 → for highly divergent sequences
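The idea that repeated application of a PAM matrix models greater evolutionary distance can be sketched by raising a one-step mutation probability matrix to a power. This is a toy illustration under stated assumptions: a two-letter alphabet, a 1% change rate per step and hypothetical helper names. The real PAM1 matrix is a 20x20 amino acid matrix estimated from closely related proteins.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Toy "PAM1"-like matrix (assumed values): row i holds the probabilities that
# residue i is unchanged or replaced after one unit of evolutionary distance.
toy_pam1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

# Applying the one-step matrix 250 times gives the toy analogue of PAM250.
toy_pam250 = mat_pow(toy_pam1, 250)

print("one step: P(unchanged) =", toy_pam1[0][0])
print("250 steps: P(unchanged) =", round(toy_pam250[0][0], 4))
```

After many steps the probability of finding the original residue falls well below the one-step value, which is why a high-numbered matrix such as PAM250 suits divergent sequences and PAM1 suits very similar ones.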