Ed's Big Plans

Computing for Science and Awesome

NGN Software Updates

Two major items remain for the source code toward completing the next objectives: a cleanup of one legacy argument, and the addition of new training behaviour.

Cleanup Random Number “Argument”

The present NGN binary does not treat the random number seed the same way as the other arguments; the seed is piped in through stdin at program launch. This should be cleaned up so that the user specifies the random seed with a command-line switch followed by an integer.
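A quick sketch of the proposed interface in Python (the actual switch name is not fixed yet; "--seed" is just a placeholder, and the real binary would do this in its own argument parser):

```python
import argparse
import random

# Hypothetical "--seed" switch replacing the old stdin pipe at launch.
parser = argparse.ArgumentParser(description="NGN training (sketch)")
parser.add_argument("--seed", type=int, required=True,
                    help="random number seed, given as an integer")

# Equivalent of invoking e.g.:  ngn --seed 42
args = parser.parse_args(["--seed", "42"])
random.seed(args.seed)
print(args.seed)
```

The point is simply that the seed becomes a first-class argument rather than a special case read from stdin.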

Training Dataset Balancing, Boosting and Verification Dataset

The cleanest way to enable balancing and boosting during training is by altering the binary executable rather than any wrapping script. Philosophically, both of these options should be enabled only when either the diskonce or parseonce option is also enabled; this ensures that the data is already preprocessed and can be referenced in program memory during operation. Balancing requires one additional integer argument and one additional double argument so that the software understands 1: how many bins to balance against (as it is mathematically impossible to balance in the set of reals) and 2: how much tolerance to give the bins when true balancing is impossible. Balancing means the deterministic selection of the first n elements in each bin, until the bins cannot tolerate any more deviation. A training epoch occurs, and a new n elements are selected for each bin; this implies that bins with fewer elements will see training more often.
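The selection scheme above can be sketched like so (a minimal Python illustration of the idea only: tolerance handling is omitted, and the bin contents here are arbitrary stand-ins for preprocessed datapoints):

```python
from itertools import cycle, islice

def make_balanced_sampler(bins):
    """Return an epoch() function that deterministically takes the next
    n elements from each bin. One cycling iterator per bin means smaller
    bins wrap around sooner, so their elements recur across more epochs."""
    iters = [cycle(b) for b in bins]

    def epoch(n_per_bin):
        batch = []
        for it in iters:
            batch.extend(islice(it, n_per_bin))
        return batch

    return epoch

bins = [[1, 2, 3, 4], [10, 20]]   # hypothetical pre-binned datapoints
epoch = make_balanced_sampler(bins)
print(epoch(2))  # [1, 2, 10, 20]
print(epoch(2))  # [3, 4, 10, 20] -- the smaller bin has already wrapped
```

This captures the consequence noted above: the two-element bin is revisited every epoch, so its members see training more often.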

For boosting, the software will tag each datapoint by the deviation of its NGN activation from its target, from greatest to smallest; a double-valued command-line parameter will indicate what proportion of the datapoints, sorted from greatest to least deviation, will see repetition in training.
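As a sketch, the tagging and selection step might look like this in Python (function and parameter names are illustrative, not the actual NGN interface):

```python
def boost_selection(activations, targets, proportion):
    """Tag each datapoint by |activation - target| and return the indices
    of the worst `proportion` of the set, greatest deviation first."""
    order = sorted(
        range(len(targets)),
        key=lambda i: abs(activations[i] - targets[i]),
        reverse=True,
    )
    k = max(1, int(round(proportion * len(targets))))
    return order[:k]

acts    = [0.9, 0.1, 0.4, 0.7]
targets = [1.0, 0.0, 1.0, 0.0]
print(boost_selection(acts, targets, 0.5))  # [3, 2] -- the two worst points
```

The returned indices are the datapoints that would be repeated in subsequent training turns.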

An algorithm will be developed to interlace retraining in favour of balancing and in favour of boosting: selection in favour of balancing is given a turn, then selection in favour of boosting. Every ten epochs or so, a complete pass over the entire dataset in classic sequence is done to determine the overall RMSE of the system; it is only then that convergence can be determined. That is, the intermittent epochs cannot converge (by definition of this algorithm).
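The shape of that schedule, sketched in Python (the alternation and the every-tenth full pass come from the plan above; everything else, including which strategy takes odd versus even turns, is an assumption for illustration):

```python
import math

def rmse(outputs, targets):
    """Root mean squared error over a full-dataset pass."""
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets))
                     / len(targets))

def train_schedule(n_epochs, full_every=10):
    """Alternate balancing- and boosting-favoured epochs; every tenth
    epoch is a classic sequential pass where RMSE is measured and
    convergence may be declared."""
    plan = []
    for epoch in range(1, n_epochs + 1):
        if epoch % full_every == 0:
            plan.append("full")      # whole dataset; the only convergence check
        elif epoch % 2 == 1:
            plan.append("balance")   # balancing-favoured turn (assumed odd)
        else:
            plan.append("boost")     # boosting-favoured turn (assumed even)
    return plan

print(train_schedule(10))
```

By construction, only the "full" epochs compute a system-wide RMSE, so the interlaced epochs can never themselves trigger convergence.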

Verification datasets can be implemented as a feature of the wrapping script or as a feature of the binary; in this case it makes sense to implement them as a feature of the binary. From the prespecified number of bins determined in the balancing argument, one example will be selected out and isolated to be used for verification. This will allow a converged network to be internally tested prior to being used as a model against an actual external test set (which can only be handled by the wrapping script); this is useful as a means to “proofread” the model, and allows even converged cases to be rejected on suspicion of over- or under-fitting.
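Reading this as one held-out example per bin (an assumption; the post does not fully pin this down), the split might be sketched as follows, with the choice of which element to hold out left arbitrary here (the first of each bin):

```python
def split_verification(bins):
    """Hold out one example from each non-empty bin for internal
    verification; the remainder stays in the training pool."""
    verification = [b[0] for b in bins if b]
    training = [b[1:] for b in bins]
    return training, verification

bins = [[1, 2, 3], [10, 20], [100]]   # hypothetical pre-binned datapoints
train, verify = split_verification(bins)
print(verify)  # [1, 10, 100]
print(train)   # [[2, 3], [20], []]
```

The verification pool gives the binary an internal "proofreading" check after convergence, before the wrapping script ever runs an external test set.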

Eddie Ma

May 18th, 2009 at 1:03 pm