Archive for May, 2009
Balancing and Boosting a Dataset
Two things that I had wanted to try but didn’t get around to while working on my thesis are balancing the training data and boosting the data. I’ll explain each below, and likely commit the text to a wiki entry later.
Balancing a training set means making the distribution over the range (the target outputs) as even as possible. While an evenly distributed range matters most, an attempt should also be made to even out the domain (the inputs) where possible, so that an inductive learning machine has a broad enough collection of exemplars to work with. Where this isn’t possible, a stochastic treatment is better than no treatment (i.e. pick out several combinations randomly).
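As a rough sketch of what balancing the range could look like, the snippet below randomly undersamples each output class down to the size of the rarest one. The function name `balance_by_label`, the `(inputs, label)` pair layout, and the choice of undersampling (rather than, say, duplicating rare exemplars) are all assumptions made for illustration, not something I actually ran for the thesis.

```python
import random
from collections import defaultdict

def balance_by_label(exemplars, seed=0):
    """Randomly undersample so every label appears equally often.

    `exemplars` is a list of (inputs, label) pairs. This layout and the
    undersampling strategy are assumptions made for this sketch.
    """
    rng = random.Random(seed)

    # Group exemplars by their label (the "range" of the training set).
    groups = defaultdict(list)
    for inputs, label in exemplars:
        groups[label].append((inputs, label))

    # Every class is cut down to the size of the rarest class.
    smallest = min(len(group) for group in groups.values())

    balanced = []
    for group in groups.values():
        # Stochastic treatment: pick the survivors at random.
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)
    return balanced

# A lopsided toy dataset: 90 exemplars of "a", 10 of "b".
data = [((i,), "a") for i in range(90)] + [((i,), "b") for i in range(10)]
balanced = balance_by_label(data)
```

Undersampling throws data away, so with a small training set it may be preferable to repeat exemplars from the rare classes instead; the grouping step stays the same either way.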
Boosting a dataset refers to a change in training algorithm. Under normal circumstances, a neural network is trained with a static sequence of exemplars every single epoch; a boosted algorithm sees a dynamic treatment instead. In this treatment, the amount of exposure an exemplar gets during training is inversely proportional to the accuracy of the prediction made for it; an exemplar on which the neural network performs poorly is shown to the network more often.
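One way to picture the dynamic treatment is to redraw the training sequence every epoch, with each exemplar's sampling weight set to the inverse of its current accuracy. The sketch below does only that; `sample_epoch`, the `accuracies` list, and the `floor` parameter are hypothetical names, and the network that would produce the accuracies is left out entirely.

```python
import random

def sample_epoch(exemplars, accuracies, rng, size=None, floor=0.01):
    """Draw one epoch's training sequence, favouring poorly predicted exemplars.

    accuracies[i] is the latest accuracy on exemplars[i], between 0 and 1.
    The sampling weight is 1 / accuracy; `floor` (an assumption of this
    sketch) avoids dividing by zero when an exemplar is always wrong.
    """
    if size is None:
        size = len(exemplars)
    weights = [1.0 / max(acc, floor) for acc in accuracies]
    # random.choices samples with replacement, so a hard exemplar can
    # show up several times in one epoch and an easy one not at all.
    return rng.choices(exemplars, weights=weights, k=size)

rng = random.Random(0)
exemplars = ["easy", "medium", "hard"]
accuracies = [0.9, 0.5, 0.1]  # made-up values for illustration
epoch = sample_epoch(exemplars, accuracies, rng)
```

In a real training loop the accuracies would be re-measured after each epoch and fed back into the next call, so the sequence keeps shifting toward whatever the network currently gets wrong.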
In the abstract, balancing reduces the probability that a neural network appears to work only because of a chance distribution of range elements. In an extreme case, one could argue for an extension that tests the robustness of a method by reversing the balance that naturally occurs in the training set.
Boosting, by contrast, increases system bias. Care must be taken in each epoch so that the final algorithm used does not overwhelm the system with data points that are known to be flaky (unreliable).
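One simple guard against flaky exemplars taking over, offered only as a sketch, is to cap every sampling weight at a fixed multiple of the smallest weight before drawing an epoch. The helper `cap_weights` and its `max_ratio` parameter are hypothetical, and the weights are assumed to be strictly positive.

```python
def cap_weights(weights, max_ratio=5.0):
    """Limit every weight to at most max_ratio times the smallest weight.

    Assumes all weights are positive. No exemplar can then be shown more
    than max_ratio times as often (in expectation) as the best-learned one.
    """
    ceiling = max_ratio * min(weights)
    return [min(w, ceiling) for w in weights]

weights = [1.0, 2.0, 50.0]  # the last exemplar is suspiciously hard
capped = cap_weights(weights)
# capped == [1.0, 2.0, 5.0]
```

The right value for `max_ratio` would depend on how noisy the data is; 5.0 here is an arbitrary placeholder.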
Ed's Big Plans