Archive for the ‘Thesis Document’ tag
Paper Progress
Brief: Not a lot to report; papers are being edited and written. I’ve just finished the QSAR briefing, NGN design and experimental design sections of the IEEE paper… Thesis corrections are due Monday…
Meeting with Stefan
Brief: Met with Stefan today. The conference paper we’re working on has a limit of eight pages, not six, but I’ve just been told that this must cover everything, including the references and of course the figures. With respect to the thesis, the request-for-examination paperwork has just been completed, and the abstract for the thesis document has been submitted. A few corrections remain to be made to the thesis document. Present work focuses on those corrections, the conference paper and software completion for Chris’ project. There is presently no benefit to working on anything more in parallel.
Thesis Paper, Approved Draft
Brief: Dr. Stacey accepted the draft with minor revision notices. The approval-for-defense documents are to be finished soon, along with the implemented revisions.
More on Data Treatment
Data Ranking instead of Linear Range Normalization
At least one of the papers that I’ve read has done regression by ranking datapoints rather than by normalizing their range linearly. I don’t know if this will perform better, but it’s worth a shot. The normal rules for ranking should be applied. Earlier in my career, I bumped into a very useful ranking definition: all datapoints are sorted by their range value, and points which share the same range value all occupy the same rank. When a different value is finally found, the rank skips ahead by the number of points that shared the previous value. For example, consider the following dataset and its mapping to ranks based on the range element.
- (Apples, 1.0) -> (Apples, #1)
- (Pillow, 1.2) -> (Pillow, #2)
- (Gamma, 1.7) -> (Gamma, #3)
- (Leaves, 1.7) -> (Leaves, #3 [would have been #4])
- (F, 1.7) -> (F, #3 [would have been #5])
- (2.89, 1.7) -> (2.89, #3 [would have been #6])
- (Jerry, 2.6) -> (Jerry, #7)
- (Oswald, 2.8) -> (Oswald, #8)
The ranks themselves would of course need to be linearly normalized. Without actually trying this, it’s not possible to know if it would perform better or worse; still, the fact that some papers use this approach to evaluate a machine learning system suggests that there is the potential for favourable behaviour. For one, this kind of rank-based normalization smooths out bias in data, so that lumpy histograms of data distribution become one long plateau. The downside is that the machine may not draw out the importance of whatever salient features are hidden in the domain.
Thankfully, trying this is cheap: it takes one additional Python script to sort, rank and normalize the data, plus deploying the experiment as a branch alongside the regression experiment proper.
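A minimal sketch of what such a script might look like, using the example dataset above. The function names `competition_ranks` and `normalize_ranks` are made up for illustration, and rescaling the ranks to [0, 1] is an assumption about what "linearly normalized" means here.

```python
def competition_ranks(pairs):
    """Rank (label, value) pairs by value.

    Tied values share a rank; the next distinct value takes the rank it
    would have had without ties, so the rank skips ahead.
    """
    ordered = sorted(pairs, key=lambda pair: pair[1])  # stable sort keeps tie order
    ranked = []
    for position, (label, value) in enumerate(ordered, start=1):
        if ranked and value == ordered[position - 2][1]:
            rank = ranked[-1][1]  # same value as the previous point: reuse its rank
        else:
            rank = position       # new value: rank equals the sorted position
        ranked.append((label, rank))
    return ranked


def normalize_ranks(ranked):
    """Linearly rescale ranks to [0, 1] (assumed normalization range)."""
    ranks = [rank for _, rank in ranked]
    lowest, highest = min(ranks), max(ranks)
    span = highest - lowest
    return [(label, (rank - lowest) / span if span else 0.0)
            for label, rank in ranked]


data = [("Apples", 1.0), ("Pillow", 1.2), ("Gamma", 1.7), ("Leaves", 1.7),
        ("F", 1.7), ("2.89", 1.7), ("Jerry", 2.6), ("Oswald", 2.8)]

ranked = competition_ranks(data)
normalized = normalize_ranks(ranked)
```

Here `ranked` reproduces the mapping in the list above: the four points with value 1.7 all receive rank 3, and Jerry receives rank 7.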
Better Definition of Convergence
The present definition of convergence is not data- or problem-specific at all. A general RMSE (root mean squared error) formula is used; it essentially returns the Euclidean distance between a set of vectors representing a “correct answer” and a set representing an “incorrect answer”. The hyperspace polygon that all of these vectors draw is more correct if its shape is a closer approximation of the desired polygon, OR less correct if it deviates.
Why not try a more problem-specific metric? For classification, concordance (literally the percentage correct) will be the new metric. For regression, the average ABSOLUTE Euclidean distance of each vector from its intended target will be the new metric.
The new regression analysis is MORE specific (generates fewer false positives) because it takes an absolute error for each point in the hyperspace polygon. This is in contrast to some large arbitrary distance that corresponds to the average of ALL points, with the assumption that vectors that are too positive may counteract vectors that are too negative.
In this way, vectors that are “more incorrect” are ALL counted against the system without regard to their directionality.
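The two proposed metrics can be sketched roughly as follows, with a textbook RMSE alongside for comparison. All function names are hypothetical, and representing outputs and targets as lists of equal-length tuples is an assumption, not the layout used in the actual experiment code. Note that the textbook RMSE squares each component error before averaging, so it weights large errors more heavily than the mean distance does.

```python
import math


def rmse(predicted, target):
    """Textbook RMSE over every component of every vector."""
    squared = [(p - t) ** 2
               for p_vec, t_vec in zip(predicted, target)
               for p, t in zip(p_vec, t_vec)]
    return math.sqrt(sum(squared) / len(squared))


def mean_euclidean_error(predicted, target):
    """Average Euclidean distance of each output vector from its target."""
    distances = [math.sqrt(sum((p - t) ** 2 for p, t in zip(p_vec, t_vec)))
                 for p_vec, t_vec in zip(predicted, target)]
    return sum(distances) / len(distances)


def concordance(predicted_labels, target_labels):
    """Fraction of classifications that match the target label."""
    hits = sum(p == t for p, t in zip(predicted_labels, target_labels))
    return hits / len(target_labels)


# Illustrative values only, not experimental data.
outputs = [(1.0, 0.0), (3.0, 4.0)]
targets = [(0.0, 0.0), (0.0, 0.0)]

regression_error = mean_euclidean_error(outputs, targets)   # (1 + 5) / 2
baseline_error = rmse(outputs, targets)                     # sqrt(26 / 4)
hit_rate = concordance(["a", "b", "b", "a"], ["a", "b", "a", "a"])
```

Each distance is non-negative, so every vector that misses its target adds to `regression_error` regardless of direction.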
This feature will take longer to implement, but it will also be available as a command-line switch (the only logical place to put it).
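One way such a switch could be exposed is sketched below with Python's standard `argparse` module. The flag name `--convergence-metric` and its choices are hypothetical placeholders, not the names used in the actual software, and defaulting to RMSE is an assumption based on it being the present definition.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical convergence-metric switch."""
    parser = argparse.ArgumentParser(
        description="Training run with a selectable convergence metric.")
    parser.add_argument(
        "--convergence-metric",
        choices=["rmse", "concordance", "mean-distance"],
        default="rmse",  # assumed default: the present RMSE definition
        help="metric used to decide when training has converged")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--convergence-metric", "concordance"])
selected_metric = args.convergence_metric
```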
Master’s Thesis RC-2 Submitted
An excerpt from an e-mail I sent to Dr. Kremer.
The second release of my thesis has just been committed to the repository. It’s a bit bigger at roughly 110 pages. Note that the front page does not conform to the style guide yet, but that change does not involve any content change, so I’m doing that after.
And something from a previous e-mail…
If time permits, want to add diagrams in Ch3 for Walsh (molecule graph topology network), Mohr (molecule kernel)– Strong desire to add discussion on other molecule kernel techniques (P-SVM compat.), probably only have time to add a few sentences.
As it turns out I never did have a chance to add those other graph kernel references to the thesis.
Ed's Big Plans