Ed's Big Plans

Computing for Science and Awesome

More on Data Treatment

Data Ranking instead of Linear Range Normalization

At least one of the papers I’ve read has done regression by ranking datapoints rather than by normalizing their range linearly. I don’t know if this will perform better, but it’s worth a shot. The normal rules for ranking should be applied. Earlier in my career, I bumped into a very useful ranking definition: all datapoints are sorted on their range value, and points that share the same range value all occupy the same rank. When a different value is finally found, the rank skips ahead by the number of points sharing the previous value. For example, consider the following dataset and its mapping to ranks based on the range element (a short Python sketch of this scheme follows the list).

  • (Apples, 1.0) -> (Apples, #1)
  • (Pillow, 1.2) -> (Pillow, #2)
  • (Gamma, 1.7) -> (Gamma, #3)
  • (Leaves, 1.7) -> (Leaves, #3 [would have been #4])
  • (F, 1.7) -> (F, #3 [would have been #5])
  • (2.89, 1.7) -> (2.89, #3 [would have been #6])
  • (Jerry, 2.6) -> (Jerry, #7)
  • (Oswald, 2.8) -> (Oswald, #8)
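
A short Python sketch of this scheme, using the toy dataset above (this is illustrative, not code from the actual experiment):

    def competition_ranks(pairs):
        """Map (label, value) pairs to (label, rank); ties share the lowest rank."""
        ordered = sorted(pairs, key=lambda p: p[1])
        ranked = []
        for i, (label, value) in enumerate(ordered):
            if i > 0 and value == ordered[i - 1][1]:
                ranked.append((label, ranked[-1][1]))  # tie: reuse the previous rank
            else:
                ranked.append((label, i + 1))          # skips ahead past any tie run
        return ranked

    def normalize_ranks(ranked):
        """Linearly rescale the ranks onto [0, 1]."""
        lo = min(r for _, r in ranked)
        hi = max(r for _, r in ranked)
        span = (hi - lo) or 1
        return [(label, (r - lo) / span) for label, r in ranked]

    data = [("Apples", 1.0), ("Pillow", 1.2), ("Gamma", 1.7), ("Leaves", 1.7),
            ("F", 1.7), ("2.89", 1.7), ("Jerry", 2.6), ("Oswald", 2.8)]
    print(normalize_ranks(competition_ranks(data)))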

The ranks themselves would of course need to be linearly normalized. Without actually trying this, it’s not possible to know whether it would perform better or worse; but the fact that some papers use this approach to evaluate a machine learning system suggests there is potential for favourable behaviour. For one, this kind of rank-based normalization smooths out bias in the data, so that lumpy histograms of the data distribution become one long plateau. The downside is that the machine may not draw out the importance of whatever salient features are hidden in the domain.

Thankfully, trying this is cheap: it consists of one additional Python script to sort, rank and normalize the data, and the deployment of this experiment as a branch alongside the regression experiment proper.

Better Definition of Convergence

The present definition of convergence is not data- or problem-specific at all; a general RMSE (root mean squared error) formula is used. It essentially returns the Euclidean distance between a set of vectors representing the “correct answers” and the set the network actually produces; the hyperspace polygon that these vectors draw is more correct when its shape is a closer approximation of the desired polygon, and less correct when it deviates.

Why not try a more problem-specific answer? For classification, concordance (percentage correct) will be the new metric; for regression, the average ABSOLUTE Euclidean distance of each vector from its intended target will be the new metric.

The new regression analysis is MORE specific (it generates fewer false positives) because it takes an absolute error for each point in the hyperspace polygon. This is in contrast to one large aggregate distance corresponding to the average over ALL points, under which vectors that deviate too far in the positive direction may counteract vectors that deviate too far in the negative direction.

In this way, vectors that are “more incorrect” are ALL counted against the system without regard to their directionality.
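
Here is a minimal Python sketch of the two proposed metrics, with a tiny example showing why directionality matters (illustrative only, not the NGN’s actual code):

    import math

    def concordance(outputs, targets):
        """Classification: the fraction of outputs matching their targets."""
        return sum(1 for o, t in zip(outputs, targets) if o == t) / len(targets)

    def mean_absolute_distance(outputs, targets):
        """Regression: the average Euclidean distance of each output vector
        from its own target, so opposite deviations cannot counteract."""
        total = sum(math.sqrt(sum((oi - ti) ** 2 for oi, ti in zip(o, t)))
                    for o, t in zip(outputs, targets))
        return total / len(targets)

    # Two outputs err in opposite directions: their signed errors average to
    # zero, but the per-vector metric still reports the deviation.
    targets = [(1.0,), (1.0,)]
    outputs = [(1.5,), (0.5,)]
    print(sum(o[0] - t[0] for o, t in zip(outputs, targets)) / 2)  # 0.0 -- looks converged
    print(mean_absolute_distance(outputs, targets))                # 0.5 -- it is not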

This feature will take longer to implement, but will also be available as a commandline switch (the only logical place to put it).

Eddie Ma

May 20th, 2009 at 4:00 pm

NGN Software Updates

Two major items should be completed in the source code toward the next objectives: a cleanup of one legacy argument, and the addition of new training behaviour.

Cleanup Random Number “Argument”

The present NGN binary does not treat the random number seed the same way as other arguments; it takes the seed through stdin, via the pipe operator, at program launch. This should be cleaned up so that the user may specify the random seed with a commandline switch followed by an integer.
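
For illustration, the before and after might look like this from a wrapping Python script; the binary name, data file and --seed flag are all hypothetical:

    import subprocess

    # Current behaviour: the seed arrives on stdin at launch,
    # e.g. `echo 12345 | ./ngn train.dat` on the shell.
    subprocess.run(["./ngn", "train.dat"], input=b"12345\n")

    # Proposed behaviour: a commandline switch followed by an integer.
    subprocess.run(["./ngn", "--seed", "12345", "train.dat"])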

Training Dataset Balancing, Boosting and Verification Dataset

The cleanest way to enable balancing and boosting during training is by altering the binary executable rather than any wrapping script. Both of these options should philosophically be enabled only when either the diskonce or the parseonce option is also enabled; this ensures that the data is already preprocessed and can be referenced in program memory during operation. Balancing requires one additional integer argument and one additional double argument, so that the software understands (1) how many bins to balance against (as it is mathematically impossible to balance over the set of reals) and (2) how much tolerance to give the bins when true balancing is impossible. Balancing consists of the deterministic selection of the first n elements in each bin, until the bins cannot tolerate any more deviation; a training epoch occurs, and new n elements are selected from each bin. This implies that bins with fewer elements will see training more often.
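
A rough Python sketch of the balanced selection (the binary itself is not Python; the bin construction, names, and the omitted tolerance handling are all assumptions for illustration):

    def bin_by_target(dataset, n_bins):
        """Partition (input, target) pairs into n_bins equal-width target bins."""
        lo = min(t for _, t in dataset)
        hi = max(t for _, t in dataset)
        width = (hi - lo) / n_bins or 1.0
        bins = [[] for _ in range(n_bins)]
        for pair in dataset:
            i = min(int((pair[1] - lo) / width), n_bins - 1)
            bins[i].append(pair)
        return bins

    def balanced_selection(bins, n, cursors):
        """Deterministically take the next n elements from each bin; small bins
        wrap around sooner, so their elements see training more often."""
        batch = []
        for i, b in enumerate(bins):
            if not b:
                continue
            batch.extend(b[(cursors[i] + k) % len(b)] for k in range(n))
            cursors[i] = (cursors[i] + n) % len(b)
        return batch

    dataset = [([0.1], 0.2), ([0.3], 0.4), ([0.5], 1.8), ([0.7], 1.9)]
    bins = bin_by_target(dataset, n_bins=2)
    cursors = [0] * len(bins)
    epoch_batch = balanced_selection(bins, n=1, cursors=cursors)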

For boosting, the software will tag each datapoint with the deviation of its NGN activation from its target; a commandline double parameter will indicate what proportion of the datapoints, sorted from greatest to least deviation, will see repetition in training more often.
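
A matching sketch of the boosted selection, where boost_fraction stands in for the proposed commandline double:

    def boosted_selection(dataset, activations, boost_fraction):
        """Return the worst-fit proportion of the data, greatest deviation first,
        so those points can be repeated in training."""
        by_deviation = sorted(
            zip(dataset, activations),
            key=lambda pair: abs(pair[0][1] - pair[1]),  # |target - activation|
            reverse=True,
        )
        cutoff = int(len(dataset) * boost_fraction)
        return [point for point, _ in by_deviation[:cutoff]]

    dataset = [([0.1], 0.2), ([0.3], 0.4), ([0.5], 1.8)]
    activations = [0.25, 1.0, 1.7]  # deviations: 0.05, 0.6, 0.1
    repeat_these = boosted_selection(dataset, activations, boost_fraction=1 / 3)
    # -> [([0.3], 0.4)], the single worst-fit point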

An algorithm that interlaces the two retraining selections will be developed: selection in favour of balancing is given a turn, then selection in favour of boosting, and so on. Every ten epochs or so, a complete training pass over the entire dataset in classic sequence is done to determine the overall RMSE of the system; only then can convergence be determined. That is, the intermittent epochs cannot converge (by definition of this algorithm).
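
A sketch of this schedule; train, rmse and the two selection callables are placeholders for the real NGN routines:

    def interlaced_training(train, rmse, balanced, boosted, full_dataset,
                            threshold, max_epochs=1000):
        for epoch in range(1, max_epochs + 1):
            if epoch % 10 == 0:
                train(full_dataset)           # complete pass in classic sequence
                if rmse(full_dataset) < threshold:
                    return epoch              # only these epochs may converge
            elif epoch % 2 == 1:
                train(balanced())             # balancing's turn
            else:
                train(boosted())              # boosting's turn
        return None                           # never converged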

Verification datasets can be implemented as a feature of the wrapping script or as a feature of the binary; in this case it makes sense to implement them as a feature of the binary. From the prespecified number of bins determined by the balancing argument, one example will be selected out of each bin and isolated for verification. This will allow a converged network to be internally tested prior to being used as a model against an actual external test set (which can only be handled by the wrapping script); this is useful as a means to “proofread” the model, and allows even converged cases to be rejected on suspicion of over- or under-fitting.
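
A sketch of the verification split, assuming the bins from the balancing sketch above:

    def split_verification(bins):
        """Hold the first example of each non-empty bin out for verification."""
        verification, training_bins = [], []
        for b in bins:
            if b:
                verification.append(b[0])   # one example isolated per bin
                training_bins.append(b[1:])
            else:
                training_bins.append(b)
        return verification, training_bins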

Eddie Ma

May 18th, 2009 at 1:03 pm

Meeting with Danielle

Met with Danielle Nash, coordinator at iGEM Waterloo, today. There are three branches of projects this year: first, the completion of last year’s project, and then the introduction of one new project in two parts.

Last year’s project consists of a delivery system, wherein one introduces bacteria that have been modified so that all genomic DNA is lost. The bacteria thus function as subcellular-sized vehicles that are broken down so that some arbitrary payload is released into the patient. The focus of the team working on this subproject is its completion: elucidation and final characterization of the system’s behaviour.

This year’s project consists first of a “foundational advance” submission, which formally defines a consistent means to create cassettes for the exchange of genetic material between some vector and a target bacterial chromosome. This involves the definition of a new mutant strain with homologous recombination sites suitable for the integrase used; a well-defined and consistent cassette, which one would use to enclose the BioBricks or other genes of interest; and a short integrase plasmid, likely with just the gene and a promoter of some arbitrary strength. The second part is an extension to this project, which defines a different chassis (target organism); this time a plant, likely Arabidopsis.

David Monje Johnston immediately comes to mind; he oversaw iGEM Guelph last year, in his plant agriculture lab. His advice could likely benefit the team.

What am I doing?

Well, I’ve contacted Andre Masella, who currently lives and reigns at Laurier. He’s the head of modeling this year for Waterloo, and has summarized the objectives as the creation of software that will anticipate the best sequence of sites needed to ensure the highest probability of consistent exchange between the cassette and the chassis chromosome.

We’ve all settled on the idea that I would be assisting Andre, with the likelihood of coaching an undergraduate in bioinformatics software design, deployment and utility all the while.

I’m very interested in seeing the background work on this, since I have little idea of what the problem constraints are: what makes a given sequence good (high stability, high predictability), and what makes a given sequence bad (high variance, low likelihood of working)? I’ve written back to ask about related papers he’s worked with, seen, or written.

This should be a BLAST.

Eddie Ma

May 15th, 2009 at 9:14 am

Meeting with Chris Cameron

Chris is a student working on a summer project toward starting a Master’s degree soon. We’ve been meeting semi-regularly so that we can discuss his project ideas, and so that I can offer assistance with whatever cheminformatics/bioinformatics software, datasets, and other technologies come up in his work.

Today’s meeting was kind of interesting: it involved revisiting the idea that the NGN can be integrated into a larger expert decision system, and also that the NGN can itself operate as a molecular kernel whose output can then be fed into some other decision machine. Dr. Kremer had a few ideas, but I didn’t quite grasp what was being conveyed; it can however be narrowed down to three specific designs.

  • The NGN output can be used as features in a larger molecular descriptor vector space.
  • A different NGN tuned toward a specific kind of problem is selected from a panel of NGNs trained at different problems; this selection is done by an expert (decision tree, or some other state machine).
  • The NGN can output an x-dimensional array; for instance, a 3D grid of outputs whose three axes are labeled ‘species’, ‘LD50 class’ (lethal dose for 50% of the population), and ‘target organ or tissue’ (see the sketch after this list).
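
A rough sketch of the third design in Python, with axis sizes and labels invented purely for illustration:

    import numpy as np

    species = ["mouse", "rat", "rabbit"]
    ld50_class = ["low", "medium", "high"]
    tissue = ["liver", "kidney", "cns"]

    # One forward pass through such an NGN would fill this grid for one molecule.
    grid = np.zeros((len(species), len(ld50_class), len(tissue)))
    grid[species.index("rat"), ld50_class.index("high"), tissue.index("liver")] = 0.93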

As a side note, we decided it might be good to look at the traditional QSAR task with traditional descriptors and the good ol’ feed-forward neural network. I’ll prepare for Chris the general neural network software that I used in Experimental Design; the first version I sent in had too much stuff in it, so I’ll basically strip it down to “just operational” so he can actually understand the thing.