Notes 20101109 BIOL 614 McConkey Microarray Data Z-Scores
From SnOwy - Ed's Wiki Notebook
Microarray Data
Inferential and Descriptive Statistics
- FDR (False Discovery Rate) correction
- at a given significance level, we can estimate the expected number of false positives
- inferential statistics
- t-tests, comparisons -- infer about the population given your sample
- test some hypothesis against the null hypothesis
- genes that are differentially expressed
- traditional stats: use some level of α for confidence
- mean between two groups (t-test)
- comparing slides
Inferential Statistics
- if we're only interested in a single gene, then p < 0.05 is reasonable
- if we have 10,000 genes : 5% → about 500 genes are expected to be called differentially expressed by chance alone!
- we should know our experimental error
- see Bonferroni correction (may be too harsh for microarray data)
- Bonferroni correction means we set p < 5 × 10⁻⁶ (0.05 ÷ 10,000 genes) -- something far lower
- FDR = E(count(false positives) ÷ count(total predictions))
- p < 0.05 is well accepted and is thought to be the highest amount of error allowed
- in FDR, we will allow an expected proportion of false positives of 10% among the genes called -- considered OK
- family-wise error rate (FWER)
- control probability of making any false positive calls given desired significance
- FDR
- control proportion of false positive calls in a total number of predictions
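
The arithmetic behind the points above can be sketched as follows. This is a minimal illustration using the figures from these notes (10,000 genes, α = 0.05); the uncorrected FWER line additionally assumes the tests are independent, which real microarray data generally are not.

```python
# Figures taken from the notes: 10,000 genes tested at alpha = 0.05.
n_genes = 10_000
alpha = 0.05

# If every null hypothesis were true, each test has probability alpha of
# producing a false positive, so the expected count is n_genes * alpha.
expected_false_positives = n_genes * alpha

# Bonferroni: divide alpha by the number of tests to control the FWER.
bonferroni_threshold = alpha / n_genes

# Chance of at least one false positive without any correction.
# ASSUMPTION: independent tests (a simplification for illustration).
fwer_uncorrected = 1 - (1 - alpha) ** n_genes

print(expected_false_positives)
print(bonferroni_threshold)
print(fwer_uncorrected)
```

Without correction the family-wise error rate is essentially 1, which is why some form of adjustment is needed once thousands of genes are tested at once.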
FDR
- linear step-up procedure -- see the Benjamini-Hochberg (BH) procedure
- order p-values P(1) ≤ P(2) ≤ P(3) ≤ ... ≤ P(m)
- let k = max{i : P(i) ≤ (i ÷ m)q}
- where q is the target FDR (default 0.10)
- i.e. find the largest rank i whose p-value is still at or below its own threshold (i ÷ m)q
- reject the null hypotheses H(1), H(2), H(3), ..., H(k)
- every hypothesis ranked up to k is rejected, even if its own p-value exceeded its own threshold
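
The step-up procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `benjamini_hochberg` and the p-values in `example_p` are hypothetical and chosen only to show the behaviour.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order: P(1) <= P(2) <= ... <= P(m).
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # k = largest 1-based rank i with P(i) <= (i / m) * q; stays 0 if none qualify.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    # Reject H(1) .. H(k), reported in the original input order.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(example_p, q=0.10)
print(calls)
```

In this example the third and fourth p-values (0.039 and 0.041) sit above their own thresholds (0.03 and 0.04), but the fifth (0.042) is below its threshold (0.05), so k = 5 and all of the first five hypotheses are rejected. That is the "step-up" part of the procedure.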
Example brain and behaviour study
- mouse brain gene expression correlation with behaviour
- 17 behavioural endpoints
- 2.7 × 10⁴ genes
- behaviours are ordered by p-value
- Bonferroni threshold -- 0.05 ÷ 17 ≈ 0.0029 for an overall p-value of 0.05
- FDR threshold -- for the behaviour at rank i, it is 0.0029 × i for i = {1 to 17}
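
The two sets of thresholds for this example can be tabulated as below. This sketch assumes the target FDR in the example is 0.05 (which is what a step of 0.0029 per rank implies for 17 endpoints) rather than the 0.10 default mentioned earlier.

```python
m = 17        # behavioural endpoints, from the notes
alpha = 0.05  # ASSUMPTION: significance level / target FDR implied by the 0.0029 step

# Bonferroni: one fixed threshold for every endpoint.
bonferroni = alpha / m

# BH: the threshold grows linearly with the rank i of the p-value.
bh_thresholds = [i / m * alpha for i in range(1, m + 1)]

print("Bonferroni:", round(bonferroni, 4))
for i, threshold in enumerate(bh_thresholds, start=1):
    print(i, round(threshold, 4))
```

At rank 1 the two thresholds coincide; at rank 17 the BH threshold reaches the full 0.05, which is why FDR control is less harsh than Bonferroni.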
Software
- PaGE -- Pattern of Gene Expression
- Owen: not known
- SAM -- Significance Analysis of Microarrays
- Owen: works well, permutation tests
- LPE -- Local Pooled-Error model test
- Owen: messy, interesting ideas, doesn't always work
- Bioconductor -- Open source toolkit
- Owen: powerful, R, user-unfriendly, developer-friendly
Descriptive Statistics
- finding meaningful patterns in the data
- arrange data in the matrix
- distance metric defines relatedness between points
- Pearson correlation coefficient
- Euclidean distance
- hierarchical clustering -- grouping things in a pairwise fashion
- looking for a logical separation within the microarray data
- agglomerative and divisive clustering ...
- UPGMA
- not as good for phylogenetic trees (it assumes a constant rate of change, i.e. a molecular clock)
- very useful for microarray
- repeatedly merges the two closest clusters; the distance between two clusters is the average of the pairwise distances between their members (agglomerative)
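
The distance metrics and UPGMA merging described above can be sketched in plain Python. This is a small, unoptimised illustration: the function names (`euclidean`, `pearson_distance`, `upgma`) and the four expression profiles in `genes` are hypothetical, and using 1 minus the Pearson correlation as a distance is one common convention, not something stated in the notes.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation, so strongly correlated profiles are close."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)


def upgma(profiles, distance=euclidean):
    """Agglomerative average-linkage clustering; returns the merge history."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance = mean of all pairwise member distances.
                pairs = [distance(profiles[i], profiles[j])
                         for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for pos, c in enumerate(clusters) if pos not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles (rows = genes, columns = slides).
genes = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [5.0, 4.0, 1.0],
    [5.2, 4.1, 0.9],
]
history = upgma(genes)
for left, right, d in history:
    print(left, right, round(d, 3))
```

Passing `distance=pearson_distance` instead groups genes by the shape of their profile rather than by absolute expression level.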
Clustering
- k-means
- microarray branch lengths are not additive
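
For comparison with the hierarchical approach, k-means can be sketched as below. This is a minimal illustration under simplifying assumptions: centroids are seeded with the first k points (real implementations usually use random or smarter seeding), squared Euclidean distance is used, and the `points` data are hypothetical.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, iterations=100):
    # ASSUMPTION: seed the centroids with the first k points (deterministic, for illustration).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional data with two obvious groups.
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.1, 4.9], [0.9, 1.1], [4.8, 5.2]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Unlike hierarchical clustering, k-means needs the number of clusters k chosen in advance and produces a flat partition rather than a tree.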