Notes 20110214 Base Calling (Phred)
From SnOwy - Ed's Wiki Notebook
Base Calling
- automated reading of fluorescent nucleotides
- the original base-calling software is called Phred
Early Phred Paper -- Error Estimation, Confident Base Assignment
doi:10.1101/gr.8.3.186 - 1998
- high speed sequencing limited by data processing
- phred estimates probability of error
- works with phrap (assembly) and consed (finishing program)
- estimate reliability of reads
- position-specific error probabilities
- good assembly software →
- accuracy
- completeness
- sequential repeats
- consensus sequences
- finishing: missing data? additional edits?
- previous work exists for base-calling
- phred -- novel properties
- deriving confidence without a bunch of assumptions
- assignment of confidence given a trace window (not just a single peak)
- maximize discrimination only in confident regions of trace (not equally over entire trace)
- log-transformed error probability
- q = -10 * log10(p) -- p is the estimated probability of error
- error probabilities
- must be derivable without knowing actual sequence
- must be valid -- based on observed error
- instead of treating each base position with the same probability of error
- use several classes of error probabilities
- leads to more accurate results (varies with lower quality reads)
- quality value distribution
- more accurate reads are more important →
- consensus reads
- error probability (at rate) = expected error (at rate) / number of base calls
- error probability (at rate) -- provides confidence in estimation of error at a rate class
- study assigning r ≤ 0.01 (low) derived from expert knowledge (imitates human observer)
- encoded this decision making into algorithm with these parameters ...
- continue at Trace Parameters ...
- Peak spacing
- Uncalled:Called amplitude ratio at one position
- Uncalled:Called amplitude ratio at three adjacent positions
- Peak resolution -- related to distance between most recent unresolved base and current base undergoing resolution
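A minimal sketch of the two formulas noted above: the log-transformed quality value q = -10 * log10(p), and an observed error rate per class (errors divided by number of base calls). The function names and the (class label, is_error) input layout are assumptions made for illustration; this is not Phred's own code, and the class labels would come from reads whose true sequence is known.

```python
import math
from collections import defaultdict


def error_prob_to_quality(p):
    """Quality value for an estimated error probability p: q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse transform: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def observed_error_rate_by_class(calls):
    """Observed error rate per class.

    calls: iterable of (class_label, is_error) pairs (hypothetical layout),
    one pair per base call whose correctness is known.
    Returns {class_label: errors / number of base calls}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for label, is_error in calls:
        totals[label] += 1
        if is_error:
            errors[label] += 1
    return {label: errors[label] / totals[label] for label in totals}
```

Under this transform an error probability of 0.001 corresponds to a quality value of 30, and a quality value of 20 corresponds to an error probability of 0.01.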
General Notes
- base calling requires
- guessing at a given base
- providing an estimation of error
- additional parameters hinted by Dr. Kremer
- knowing when to provide no answer rather than a wrong answer
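The three requirements above (guess a base, estimate the error, decline to answer when unsure) might be sketched as follows. This is only an illustration: the per-base probabilities, the `max_error` threshold, and the function name are hypothetical, and real base callers derive their confidence from trace parameters rather than from a ready-made probability table. "N" is used as the conventional no-call symbol.

```python
def call_base(base_probs, max_error=0.01):
    """Pick the most probable base, or return 'N' when the call is too uncertain.

    base_probs: dict mapping a base letter to its (hypothetical) probability.
    max_error: largest error probability accepted for a confident call (assumed value).
    Returns (called_base, estimated_error_probability).
    """
    best = max(base_probs, key=base_probs.get)
    error = 1.0 - base_probs[best]
    if error > max_error:
        # No answer rather than a likely wrong answer.
        return "N", error
    return best, error
```

For example, `call_base({"A": 0.6, "C": 0.4})` returns a no-call, because an estimated error of 0.4 is far above the threshold.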
Support Vector Machine (SVM) on Next Generation Sequencing (NGS) Data
ISBN 978-1-4244-8302-0 - 2010
- previous work ...
- sequence correction for SAGE data
- SAGE?
- previous software unable to handle the larger amounts of data
- genome read correction
- assume coverage large, thus rare sequences are incorrect
- not as useful for transcriptome, metagenome sequencing
- correct sequences often rare
- sequence correction for SAGE data
- require sequence correction based on entire ensembles of reads
- previous work ...
- FreClu -- iteratively checks errors in clusters of reads
- RECOUNT -- sequencer error modelled as multinomial thinning process
- Expectation-Maximization (EM) framework
- infers true counts → maximize likelihood of observations
- this work ...
- propose method to classify true and false reads
- binary classification problem
- function learned -- feature sets mapped to {0, 1}
- input features ...
- observed count
- estimated true count (← RECOUNT)
- log likelihood (+ entropy penalty)
- log likelihood ratio
- expectation matching score
- self-correctness coefficient
- SVM classifier on the 6 input features above yields 96% accuracy.
- continue at II. Data Simulator ...
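The coverage assumption noted under genome read correction (coverage is large, so rare sequences are treated as errors) can be sketched as a plain count threshold. The function name and the `min_count` value are assumptions for illustration, not taken from any of the tools mentioned. The sketch also shows why the assumption breaks for transcriptome or metagenome data: a genuinely rare sequence lands in the suspect set alongside the sequencing errors.

```python
from collections import Counter


def split_by_count(reads, min_count=2):
    """Separate sequences into kept and suspect sets by observed count.

    reads: iterable of read sequences (strings).
    min_count: sequences seen fewer times than this are flagged (assumed threshold).
    Returns (kept, suspect), each a dict of sequence -> observed count.
    """
    counts = Counter(reads)
    kept = {seq: n for seq, n in counts.items() if n >= min_count}
    suspect = {seq: n for seq, n in counts.items() if n < min_count}
    return kept, suspect
```

The SVM approach in the notes replaces this single threshold with a learned decision over several features, of which the observed count is only one.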
Statistical Learner for Illumina
- doi:10.1186/gb-2009-10-8-r83 - 2009
- start at Statistical learner for Illumina base calling ...
- base caller design
- should mitigate cycle-dependent problems
- could model sequence chemistry
- T accumulation
- intensity fading
- first and last cycle quirks of sequencing
- problems:
- unlikely to know all the reasons for errors
- overcomplication → overfitting the data (poor generalization)
- solution:
- estimate chemistry from data
- this approach does not use intensity correction before learning
- allows statistical learner to work with raw values (for better or for worse)
- trade dependence on intensity correcting model for cognitive overhead
- continue at Statistical learner for Illumina base calling ...
Need for new classifier
- need to eliminate the human being at the end
- increase the throughput
- removes subjectivity
- the person at the top looks at the predictions and the graphs (peaks)
- overrides the software in some cases
- we require examples of the human overriding the software
- put the pdf for the file format (ab1 = Applied Biosystems one)
- ab1 currently gives us ...
- raw signals
- the estimation of the sequence
- probability assigned
- understand the transformation between the raw and cooked signal