Ed's Big Plans

Computing for Science and Awesome

More on Data Treatment


Data Ranking instead of Linear Range Normalization

At least one of the papers I’ve read performs regression by ranking datapoints rather than by normalizing their range linearly. I don’t know whether this will perform better, but it’s worth a shot. The usual rules for ranking apply. Earlier in my career, I bumped into a very useful ranking definition: all datapoints are sorted on their range value, and points that share the same value occupy the same rank. When the next distinct value is reached, the rank skips ahead by the number of points that shared the previous value. For example, consider the following dataset and its mapping to ranks based on the range element.

  • (Apples, 1.0) -> (Apples, #1)
  • (Pillow, 1.2) -> (Pillow, #2)
  • (Gamma, 1.7) -> (Gamma, #3)
  • (Leaves, 1.7) -> (Leaves, #3 [would have been #4])
  • (F, 1.7) -> (F, #3 [would have been #5])
  • (2.89, 1.7) -> (2.89, #3 [would have been #6])
  • (Jerry, 2.6) -> (Jerry, #7)
  • (Oswald, 2.8) -> (Oswald, #8)
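
Here is a minimal Python sketch of that ranking scheme; the function name and the (label, value) pair convention are just my own illustration, not code from the actual experiment.

```python
def competition_rank(pairs):
    """Rank (label, value) pairs: tied values share a rank, and the
    next distinct value skips ahead by the number of tied points."""
    ranked = []
    # Sort on the range element (the numeric value).
    for i, (label, value) in enumerate(sorted(pairs, key=lambda p: p[1])):
        if ranked and value == ranked[-1][1]:
            rank = ranked[-1][2]      # share the rank of the tied value
        else:
            rank = i + 1              # skip ahead past all of the ties
        ranked.append((label, value, rank))
    return ranked

data = [("Apples", 1.0), ("Pillow", 1.2), ("Gamma", 1.7), ("Leaves", 1.7),
        ("F", 1.7), ("2.89", 1.7), ("Jerry", 2.6), ("Oswald", 2.8)]
# competition_rank(data) yields ranks 1, 2, 3, 3, 3, 3, 7, 8 as in the example.
```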

The ranks themselves would of course need to be linearly normalized. Without actually trying this, it’s not possible to know whether it would perform better or worse, but the fact that some papers use this approach to evaluate a machine learning system suggests there is potential for favourable behaviour. For one, this kind of rank-based normalization smooths out bias in the data, so that lumpy histograms of the data distribution become one long plateau. The downside is that the machine may not draw out the importance of whatever salient features are hidden in the domain.
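
The linear normalization of those ranks is equally small. A sketch, assuming the (label, value, rank) triples produced above:

```python
def normalize_ranks(ranked):
    """Map each rank linearly onto the interval [0, 1]."""
    ranks = [rank for _, _, rank in ranked]
    lo, hi = min(ranks), max(ranks)
    span = float(hi - lo) or 1.0      # guard against an all-tied dataset
    return [(label, (rank - lo) / span) for label, _, rank in ranked]
```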

Thankfully, trying this is cheap: it consists of one additional Python script to sort, rank and normalize the data, plus deploying the experiment as a branch alongside the regression experiment proper.

Better Definition of Convergence

The present definition of convergence is not data- or problem-specific at all; a general RMSE (root mean squared error) formula is used. It essentially returns the euclidean distance between the set of vectors representing the “correct answer” and the set representing the machine’s answer; the hyperspace polygon that all of these vectors draw is more correct if its shape is a closer approximation of the desired polygon, or less correct if it deviates.
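
For reference, this is roughly what that general RMSE check amounts to in Python; the shapes and names are illustrative only, with targets and outputs as parallel lists of equal-length vectors.

```python
from math import sqrt

def rmse(targets, outputs):
    """Root mean squared error over every component of every vector;
    up to a constant factor, the euclidean distance between the two sets."""
    diffs = [t - o
             for tv, ov in zip(targets, outputs)
             for t, o in zip(tv, ov)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))
```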

Why not try a more problem-specific answer? For classification, concordance (percentage correct) will be the new metric. For regression, the average ABSOLUTE euclidean distance of each vector from its intended target will be the new metric.
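
Sketches of the two proposed metrics, again with illustrative names and the same parallel-lists convention as above; class labels are compared directly, and regression vectors are compared one target at a time.

```python
from math import sqrt

def concordance(targets, outputs):
    """Classification: fraction of outputs that match their target labels."""
    agree = sum(1 for t, o in zip(targets, outputs) if t == o)
    return float(agree) / len(targets)

def mean_vector_distance(targets, outputs):
    """Regression: average euclidean distance of each output vector
    from its intended target vector."""
    dists = [sqrt(sum((t - o) ** 2 for t, o in zip(tv, ov)))
             for tv, ov in zip(targets, outputs)]
    return sum(dists) / len(dists)
```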

The new regression analysis is MORE specific (it generates fewer false positives) because it takes an absolute error for each point in the hyperspace polygon. This is in contrast to a single large, arbitrary distance that averages over ALL points, under the assumption that vectors that are too positive may counteract vectors that are too negative.

In this way, vectors that are “more incorrect” are ALL counted against the system without regard to their directionality.

This feature will take longer to implement, but will also be available as a command-line switch (the only logical place to put it).
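
Something along these lines is what the switch might look like; the flag name and the metric choices here are hypothetical, not a final interface.

```python
from optparse import OptionParser

parser = OptionParser()
# Hypothetical switch: choose how convergence is measured.
parser.add_option("--convergence", type="choice",
                  choices=["rmse", "concordance", "mean-distance"],
                  default="rmse",
                  help="metric used to decide convergence")
(options, args) = parser.parse_args()
```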

Eddie Ma

May 20th, 2009 at 4:00 pm

Posted in Machine Learning