Andre/Mike Brain Fart todo
From SnOwy - Ed's Wiki Notebook
Log
A log of work completed on this project. Notice that Andre is working on the Kmer counting, Mike is doing the sequencing and sampling, and I've been charged with the G?SOM.
20100430 To Do
Attached is a runnable JAR and the Eclipse project source. If you want to edit the source, you'll need an additional library...but that's straight forward. Anyway, to run java -jar brainfart.jar $a $k $a = alphabet: nt for nucleotide or p for protein $k = length of kmer. -- --Andre Masella (andre@masella.no-ip.org, masella.no-ip.org, PGP:10ECBAE0) An engineer is someone who does list processing in FORTRAN. 2 attachments — Download all attachments source.tar.gz source.tar.gz 4K Download brainfart.jar brainfart.jar 17K Download
- files are currently in eddiema@Tin:active/BrainFart/andreSrc/
20100426 To Do
- need to clean up mental image of G?SOM operating on k-mer counts-- is each k-mer on one axis while each sample is on the other? -- solidify concept and report.
- clarified: G?SOMs are project high dimensional data onto 2D/3D feature spaces-- datapoints that are similar end up closer together
- will take exponential bite out of the neighbour joining algorithm
- axes are arbitrary on data projection -- i.e. algorithm decides for us
- obvious question: Does mothur already contain k-mer counting options?
- mothur will use kmer counting when the parameters "kmer" and "ksize" are given parameters--
- http://www.mothur.org/wiki/Cluster
- http://www.mothur.org/wiki/Read.dist
- http://www.mothur.org/wiki/Classify.seqs#search.3Dkmer_and_ksize
- this is part of the Align.seqs toolchain-- there is an option to use kmer counting, blastn or suffix trees...
- http://www.mothur.org/wiki/Align.seqs
- the next part is the expensive part: using an alignment algorithm.
- mothur: open source-- would it be good to contribute back G?SOM code to it (-- provided we develop our own code of course).
kmer and ksize in mothur
- example command line -- classify.seqs( ... search=kmer, ksize=8)
- notice that an alternative is to use blastn to find nearest neighbors-- this may have been the source of the efficiency problems to start.
Conclusion re: source of inefficiency
- likely due to distance matrix representation rather than distance matrix construction itself
- solvable with the G?SOM approach-- need to continue with mental clean up task.
Challenges and Recommendations
- Chan et al report that sequences of < 1kb tend not to have sufficient information to do the clustering
- could we then use the SOM clustering as a pre-alignment refining step that reduces the overall dimensionality of clusters submitted to Mothur?
- recall, we aren't interested in novel bioinformatics techniques so much as a way to levy the exponential explosion
- Chan et al recommend that analyses start with 3-mers and increase only as needed
- G?SOM does not offer an implicit way to draw cluster boundaries
20100426 Remarks
- currently reading on Mothur; doi:10.1128/AEM.01541-09 -- originating paper 2009
- self-hooks: phylogenetic construction with k-mers alone? Repeat boundary characterization?
- terminology: Operational Taxonomic Unit (OTU)-- a terminal node of a phylogenetic tree-- organism, species, genome, allele etc.,
- Mothur: three clustering algorithms available: furthest neighbor, nearest neighbor, average neighbor
- Mothur: performs the clustering based on a distance matrix-- phylip tree with branch weights
- Mothur: distance -- results from the Read.dist command...
- terminology: α and β diversity?
- α-diversity: species richness-- number of taxo per number of exemplars
- β-diversity: comparison of local species diversity amongst more than one groups or along a gradient
- managed to get in contact with the GSOM team mentioned earlier-- Chan et al will provide us with some code soon.
- terminology: Ribotype-- an OTU derived by the 16S rRNA sequence
- currently reading on G?SOM (Chan et al, 2008); doi:10.1155/2008/513701 -- binning metagenome
- terminology: Whole Genome Shotgun (WGS)
- Chisel: cluster based on enzyme phylogeny
- metaclust: nucleotide signatures
- TETRA: tetranucleotide z-scores
- PhyloPythia: SVM (supervised, trained with exemplars)
- See Abe et al.-- doi:10.1101/gr.634603 -- http://genome.cshlp.org/content/13/4/693.full.pdf (not in my personal library)
- terminology: Principle Component Analysis (PCA)
- method evaluation-- terminology: IOM and LOM -- intensity of mix (ratio of misclassified points between two clusters); level of mix (pairwise cluster taxonomatic ambiguity)
Training SOM phases
- initialization
- ordering
- tuning
Initialization
- topology decided-- rectangular or hexagonal - User; Chan et al used hexagons (was better able to accommodate shape of data)
- number of nodes decided (i.e. resolution) - User
- weight values per node (a la ANN) - PCA
- aspect ratio of map - PCA
- number of nodes given by the first two principle vectors
- aspect ratio given by ratio of magnitude of first two principle components
Ordering, Fine-Tuning
- input vectors are shown to the map
- map node that is most proximal to input vector "wins"
- "winner" and its neighbours are shifted to accommodate new datapoint
- update rule is...
w(t+1) = w(t) + α*h*[x(k) - w(t)]; let α be the learning rate let h be the neighbourhood kernel function let x at k be the kth input vector let x(k) and w(t) have the same dimensions -- equal to the number of components in the input vector
- do not be confused: α is often the momentum coefficient in other ANN architectures.
- what is the neighbourhood kernel function?
GSOM Modifications
- additional parameter set -- defines spread factor (SF) [0, 1]--
- 0 is coarse, 1 is fine; Chan et al report using 0.4 for di-, 0.6 for tri-, 0.8 for tetra-, 0.9 for penta- (experimentally derived)
- Initialization sees the instantiation of a single lattice-- this is a 4-mer for rectangular maps, a 6-mer for hexagonal maps.
- growth threshold given by...
GT = -D * ln(SF) let 'D' be the dimensionality of the data.
- error is recorded at each node, and is given by:
Ewinner(t + 1) = Ewinner(t) + ||x(k) - wwinner(t)||
- new nodes are added iff a winner is at the boundary of the map and Ewinner > GT
- if E > GT but winner not at edge, error distributed evenly and outward to neighbours instead
Questions for Andre
- How are gaps being treated in k-mer counting? Presumably there aren't gaps, but all gaps should be stripped in case we ever use this software elsewhere.
- Did you use a sliding window that increments by one character? This would be the appropriate design rather than chopping sequences into equal length substrings.
Andre (Poki) GTalk 0426.21:47 I will strip gaps. I strip whitespace, I'll make "-" == " " 0426.21:47 Sliding window.
20100425 Remarks
Competitive Learning incl. G?SOM
- http://homepages.feis.herts.ac.uk/~nngroup/software.php
- http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Mothur
20100423 To Do
- Research previous methods.
Briefings
Having discussed the project with Mike and Andre more extensively, it became a bit clear that we have the opportunity to introduce new clustering techniques to the taxonomy culture.
Taxonomy Culture
- a piece of software, MOTHUR is used often which includes a package of binaries
- in it, the binary that performs the clustering actually performs a multiple sequence alignment
- this multiple sequence alignment allows furthest-neighbour joining clustering which is resource intensive
- background, find two or three representative taxonomy papers that have used MOTHUR in the past
- notice cultural: MOTHUR includes a binary called DOTTUR (sp?)
What we will be offering
- we will offer a less resource intensive expression
- kmer counts can be used as a high precision approximation -- demonstration is below
Kmer counts, SOM
- we intend to use a SOM to cluster the kmer count barcode of each sequence
- we must demonstrate that this approach is not only valid, but is an excellent tool
- first, show that the distance between two members of a single cluster have low distance
- second, show that the distance between two members of different clusters have higher distance
- do this pairwise for a random sampling of sequences until we reach a statistically sound number
- must research the correct sample size to draw from a population to prove such a hypothesis with confidence as alpha < 5%
- additionally, show multiple sequence alignment scores for each of the pairwise mentioned
Other considerations - Naïve Bayesian classifier
- the Naïve Bayesian Classifier as seen here: doi:10.1128/AEM.00062-07 performs one action we will not
- that is a Markov Chain consideration when given a kmer barcode
- we are using however a different means of joint probability analysis-- the regression done by a SOM is arguably just that
- notice this paper refers to this barcoding as "words" and has optimized empirically a kmer size of 8 (having reported lengths 6 to 9)
- objective covered: rapid assignment of rRNA sequences
- sequence length:
- number of sequences: 5014 (Bergey's outline) and tested with 23095 (NCBI higher order taxonomies)
- the above points toward additional external validation
Additional history - GSOM Paper
- GSOM paper here: doi:10.1155/2008/513701 uses Growing Self-Organizing Map
- objective covered: genome-wide binning
- sequence length 10kb
- source: shotgun sequencing
- reported on kmer sizes of 2 to 5 (referred to as di-, tri-, tetra-, penta- -nucleotides)
- metagenomics