Hidden Markov Models (HMMs) and PSI-BLAST
- both represent sequence profiles
PSI-BLAST and sequence profiles
- detects more remote homologies
- proteins or genes with a more distant common origin that a single BLAST search may miss
- recognition of highly diverse gene families
- all homologues within a given species -- splice isoforms
- novel genes
- performed in five steps
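- select a query, search it against a protein database
- PSI-BLAST constructs an MSA (multiple sequence alignment) from the hits, creates a profile (PSSM, position-specific scoring matrix)
- the PSSM is used as the query for the next search
- PSI-BLAST . . .
- redundant sequences are removed from the MSA to reduce bias in the profile used as the query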
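The redundancy-removal note above can be sketched as a greedy filter over the rows of the alignment. This is only an illustration: the function names and the identity cutoff (`max_identity=0.9`) are assumptions made for this sketch, not PSI-BLAST's actual implementation or default.

```python
def percent_identity(a, b):
    """Fraction of aligned columns where both rows carry the same residue.

    Columns where either row has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def purge_redundant(rows, max_identity=0.9):
    """Greedily keep a row only if it is not too similar to any row already kept.

    max_identity is an illustrative cutoff (assumption), not a PSI-BLAST default.
    """
    kept = []
    for row in rows:
        if all(percent_identity(row, k) <= max_identity for k in kept):
            kept.append(row)
    return kept


rows = ["GTWYA", "GTWYA", "GLWYA", "GEWFS"]
print(purge_redundant(rows))  # the duplicate GTWYA row is dropped
```

Without this step, a family with many near-identical members would dominate the column frequencies and pull the profile toward itself.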
PSSMs
- capture position-specific patterns (compare a WebLogo) -- the conservation and residue preference at each column
- highly conserved positions tend to be functionally important (structurally too?)
- columns with fewer distinct residues carry more information (more bits in the logo) -- we read this as confidence in that position
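To make the "fewer distinct residues, more bits" point concrete, here is a small sketch that computes the information content of each column as log2(20) minus the column's Shannon entropy. The toy alignment is the one used in the HMM notes; the function name is hypothetical, and the calculation omits the small-sample correction that logo tools may apply.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet_size) - Shannon entropy of the column."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


msa = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]
for position, column in enumerate(zip(*msa), start=1):
    print(position, "".join(column), round(column_information(column), 2))
```

The invariant G and W columns reach the maximum of log2(20) bits, while the variable second column scores lowest.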
Calculating PSSMs
- technical issues -- building upon existing score matrices
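- PSSM scores based on weighted BLOSUM matrix
- $m_{u,a} = \sum_b f_{u,b}\, s_{a,b}$
- where $u$ is a given position in the alignment, $f_{u,b}$ is the frequency of residue $b$ at that position, and $s_{a,b}$ is the BLOSUM score for the pair $(a, b)$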
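- possible to use log odds instead -- use the frequencies seen in the MSA, mixed with the background frequency so that unseen residues are still allowed
- $m_{u,a} = \frac{1}{\lambda} \log\left(\frac{q_{u,a}}{p_a}\right)$ and $q_{u,a} = \frac{\alpha f_{u,a} + \beta p_a}{\alpha + \beta}$
- where $p_a$ is the background frequency of residue $a$, and $\alpha$, $\beta$ weight the observed and background frequencies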
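A minimal sketch of the log-odds variant: mix observed and background frequencies as q = (αf + βp) / (α + β), then score each residue as (1/λ) ln(q / p). Everything beyond the formula is an assumption made for illustration: the uniform background, the ungapped toy alignment, the weights `alpha=4.0`, `beta=1.0`, `lam=1.0`, and the function name itself.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_log_odds(msa, background=None, alpha=4.0, beta=1.0, lam=1.0):
    """Return one {residue: score} dict per alignment column.

    score[u][a] = (1 / lam) * ln(q[u][a] / p[a]),
    q[u][a]     = (alpha * f[u][a] + beta * p[a]) / (alpha + beta)
    Assumes an ungapped alignment; alpha, beta and lam are illustrative values.
    """
    if background is None:
        # Assumption: uniform background; real tools use database-derived frequencies.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    pssm = []
    for column in zip(*msa):
        counts = Counter(column)
        n = len(column)
        scores = {}
        for a in AMINO_ACIDS:
            f = counts[a] / n                      # observed frequency f[u][a]
            p = background[a]                      # background frequency p[a]
            q = (alpha * f + beta * p) / (alpha + beta)
            scores[a] = math.log(q / p) / lam
        pssm.append(scores)
    return pssm


msa = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]
pssm = pssm_log_odds(msa)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))  # conserved G vs. unseen A in column 1
```

Because β > 0, a residue never seen in a column still gets a finite (negative) score instead of minus infinity.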
PSSM Corruption
- occurs when a particular non-homologous protein is grouped into the profile
- the erroneous signal is then propagated down into the next iterations
- sequences that may do this -- hydrophobic regions? low complexity regions?
- prevention
- filtering of biased composition regions
- lower the E-value inclusion threshold -- reduce from 0.005 (default) to 0.0005 etc.
- visually inspect output in each iteration -- remove suspicious hits (in PSI-BLAST -- uncheck the box)
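As a rough illustration of filtering biased-composition regions, the sketch below masks any window whose Shannon entropy is low. It is a crude stand-in for the idea, not the filter PSI-BLAST actually uses; the window size, the entropy cutoff, the function names, and the example sequence are all assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue in a low-entropy window with 'X'.

    window and min_entropy are illustrative values (assumptions).
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical sequence with a serine-rich low-complexity stretch in the middle.
seq = "MKTAYIAKQR" + "S" * 15 + "DEGHLNPVWF"
print(mask_low_complexity(seq))
```

Masked residues no longer contribute to hits, so a serine-rich stretch cannot pull unrelated serine-rich proteins into the profile.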
HMM
- the mathematical background is a bit different from PSSMs
- Markov assumption / Markov chain
- deals with gaps better than PSSMs
- requires an MSA as input
- example --
GTWYA
GLWYA
GRWYE
GTWYE
GEWFS
- recall: log odds score => sum over positions of ln(probability of the residue at that position / background probability)
- we also include a background probability to allow for low identity sequences
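The log-odds recall above can be sketched on the example alignment. This covers only the match-state emissions of a profile HMM, with no insert or delete states and no transition probabilities, so it is the gap-free special case. The add-one pseudocount and the uniform background are assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MSA = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def match_emissions(msa, pseudocount=1.0):
    """Per-column emission probabilities, smoothed so unseen residues stay possible."""
    emissions = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        emissions.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return emissions


def log_odds_score(seq, emissions, background=None):
    """Sum over positions of ln(emission probability / background probability)."""
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    if len(seq) != len(emissions):
        raise ValueError("sequence length must match the number of match states")
    return sum(math.log(e[x] / background[x]) for e, x in zip(emissions, seq))


emissions = match_emissions(MSA)
print(round(log_odds_score("GTWYA", emissions), 2))  # family member: positive score
print(round(log_odds_score("PPPPP", emissions), 2))  # unrelated: negative score
```

A full profile HMM adds insert and delete states with their own transition probabilities, which is what lets it handle gaps better than a plain PSSM.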
Course Project
- project proposals due this week
- example projects from two years ago ...
- effectiveness of alignment algorithms on short highly variable 16S gene sequences
- sequence, structural analyses of major outer membrane protein OmpF (Pseudomonas putida UW4)
- in silico analysis of uncharacterized Arabidopsis thaliana protein At1g22850
- searching for glucocorticoid-induced matrix metalloproteinase overexpression in stressed adult zebrafish (microarray data)
- in phylogeny
- phylogenetic reconstruction of Pontoporidea using Bayesian techniques
- analysis of conserved domains of CHK1 for use in cloning CHK1 in Oncorhynchus mykiss
- evolutionary diversification of 5′-Methylthioadenosine nucleosidase (MTN), Methylthioadenosine (MTA) generating enzymes
- annotation of Arabidopsis thaliana loci AT1G05620, AT2G36310
- deliverables
- seminar
- textual project report
Brendan Suggests
- position specific correlation given a structure and a PSSM
- promoter sequence *something* with HMM