Archive for the ‘Neural Grammar Network’ tag
Well, the convergence tests are underway on SharcNet (the whale serial cluster) for the new datasets (two Aqueous Solubility, one Melting Point). I'm currently running them only for SMILES, due to a problem with one of the Aqueous Solubility sets. I've decided to use a "ridiculously high" number of parameters (12 hidden units per hidden layer) and a "ridiculously low" learning rate and momentum (0.3 each).
A convergence test is basically used to see whether the NGN can operate on the chosen data and parameters at all. For me, the test set and training set are identical in convergence testing.
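As a rough illustration of what a convergence test checks, here is a minimal Python sketch with a toy one-weight model standing in for the NGN (the model, data, and epoch budget are all made up): train and evaluate on the same data, and ask whether the chosen learning rate and momentum can drive RMSE under a threshold at all.

```python
import math

def rmse(w, data):
    # Root mean squared error of the one-weight model y = w * x.
    return math.sqrt(sum((w * x - y) ** 2 for x, y in data) / len(data))

def convergence_test(data, lr=0.3, momentum=0.3, threshold=0.01, max_epochs=5000):
    # Train and evaluate on the SAME data: can this optimiser, with this
    # learning rate and momentum, fit the data within the epoch budget?
    w, v = 0.0, 0.0
    for epoch in range(1, max_epochs + 1):
        # full-batch gradient of mean squared error w.r.t. the lone weight w
        g = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        v = momentum * v + lr * g
        w -= v
        if rmse(w, data) < threshold:
            return epoch, w      # converged within budget
    return None, w               # these parameters can't fit this data

# Toy data the model can fit exactly (y = 2x), so the test should pass.
data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]
epoch, w = convergence_test(data)
```

A failed convergence test (a `None` epoch) would suggest the parameters, not the model family, are at fault.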
I'll probably rerun this with an even lower learning rate (train = 0.15 / momentum = 0.45). Why? Because I have a feeling that the large number of hidden units increases the chance of accidentally falling into the right neighbourhood of the search space, while the minute update steps allow those accidents to be gently brushed into being.
I'm presently rerunning all the regression data I've ever run in the past, this time in both classic normalization and rank normalization. Finally, one of the Aqueous Solubility datasets came with malformed SMILES: they can't be parsed by the grammars I wrote, and they can't be parsed by OpenBabel either, so conversion to InChI wasn't possible in this round, let alone operation by my system. I'm going to presume those entries are broken and skip them, which still leaves well over a thousand exemplars in that set; it will have to be rerun with the corrected data soon.
The convergence RMSE threshold has been set to 1%, meaning I now have a stricter idea of what counts as a deviation from correct. When this is done, I want to plot the results as a "target on actual" residuals line plot to give me an idea of how these parameters worked on self-comparison.
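A minimal sketch of how the plot's series could be assembled, assuming hypothetical (target, predicted) pairs from a converged run:

```python
# Hypothetical (target, predicted) pairs from a converged self-comparison run.
pairs = [(3.9, 4.0), (2.1, 2.0), (6.0, 5.9)]
pairs.sort(key=lambda p: p[0])                 # x-axis ordered by target value
residuals = [(t, p - t) for t, p in pairs]     # y-axis: predicted minus target
```

Plotting `residuals` as a line should hover near zero everywhere if the parameters handled self-comparison well.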
Of course, everything here is subject to editing as I figure out exactly how to implement and deploy the previously mentioned changes to how regression experiments are defined (for convergence, and for boosting/balancing/verification).
Most immediate next steps:
- Select the well-formed Aqueous Solubility SMILES and run.
- Rerun everything with InChI-NGN.
Options, options. One of the problems I'm encountering now is that I've hit the edge of the knowledge that both Stefan and I are capable of accommodating. When it comes to cheminformatics, probing the edges of this space has me realizing I'd probably need some help. If performance is good enough in the next experiments, I think we would both benefit from bringing in a third person involved with chemistry to help author the paper. Basically, we've got experience writing for the machine learning and computer science crowd, but I think a potential paper is most viable with the chemistry and biomedical crowd; just look at the number of ACS publications with the phrase "QSAR" in the tagline. Someone who has written chemistry papers in the past and can figure out what looks best in that culture would improve the odds of an accepted paper. This of course opens the door to gearing a more general publication toward the comp sci crowd once a different problem area has been formalized.
Just a thought.
I mentioned in a very hand-wavy fashion that the NGN can also be used in image recognition. I think after the regression work is done I should really brainstorm with Stefan about what other applications would be quick to implement with good results.
Finally done sorting the data for the next experiment in regression. Checking for NGN compatibility next, then moving on to trials. These sets relate to aqueous solubility (two sets) and melting point (one set).
Two major changes should be made to the source code toward completing the next objectives: a cleanup of one legacy argument, and the addition of new training behaviour.
Cleanup Random Number “Argument”
The present generated NGN binary does not treat the random number seed the same way as other arguments; it takes it in through stdin via the pipe operator at program launch. This should be cleaned up so that the user may specify the random seed with a command-line switch followed by an integer.
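A sketch of what the cleaned-up interface might look like, here in Python with `argparse` standing in for the binary's real argument parser (the `--seed` flag name is my own invention):

```python
import argparse
import random

# Hypothetical sketch: the seed arrives via a --seed switch rather than
# being piped through stdin at program launch.
parser = argparse.ArgumentParser(description="NGN trainer (sketch)")
parser.add_argument("--seed", type=int, default=0,
                    help="random number seed (previously read from stdin)")
args = parser.parse_args(["--seed", "42"])  # stands in for a real command line
rng = random.Random(args.seed)              # same seed, same training run
```

The same pattern translates directly to `getopt`-style parsing in the actual binary's language.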
Training Dataset Balancing, Boosting and Verification Dataset
The cleanest way to enable balancing and boosting during training is by altering the binary executable rather than any wrapping script. Both options should philosophically be enabled only when either the diskonce or parseonce option is also enabled; this ensures that the data is already preprocessed and can be referenced in program memory during operation. Balancing requires one additional integer argument and one additional double argument, so that the software understands 1) how many bins to balance against (as it is mathematically impossible to balance over the set of reals) and 2) how much tolerance to give the bins when true balancing is impossible. Balancing sees the deterministic selection of the first n elements in each bin, until the bins cannot tolerate any more deviation. A training epoch occurs, and new n elements are selected for each bin; this implies that bins with fewer elements will see their members trained on more often.
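A minimal Python sketch of the balancing scheme as I read it, with made-up names; the tolerance argument is left out for brevity, and equal-width binning of the target values is assumed:

```python
from itertools import cycle

def make_bins(targets, n_bins):
    # Equal-width binning of target values; returns lists of dataset indices.
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for i, y in enumerate(targets):
        b = min(int((y - lo) / width), n_bins - 1)
        bins[b].append(i)
    return [b for b in bins if b]   # drop empty bins

class BalancedSampler:
    # Each epoch draws the same number of items from every bin, cycling
    # within each bin, so sparse bins repeat their members more often.
    def __init__(self, targets, n_bins, per_bin):
        self.cursors = [cycle(b) for b in make_bins(targets, n_bins)]
        self.per_bin = per_bin

    def epoch(self):
        return [next(c) for c in self.cursors for _ in range(self.per_bin)]

sampler = BalancedSampler([0.1, 0.2, 0.3, 0.9], n_bins=2, per_bin=2)
first = sampler.epoch()   # the lone high-target point (index 3) appears twice
```

The cycling cursors are what produce the stated behaviour: a one-member bin contributes its member every epoch, while a large bin rotates through its members over several epochs.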
For boosting, the software will rank each datapoint by the deviation of its NGN activation from its target, from greatest to smallest; a double-valued command-line parameter will indicate what proportion of that ranking, taken from the top, will see more frequent repetition in training.
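A sketch of the boosting selection under the same assumptions (hypothetical names; the per-datapoint deviations are supplied from outside):

```python
def boosted_selection(errors, proportion):
    # Rank datapoints from greatest to least absolute deviation, and return
    # the worst `proportion` of them for extra training passes.
    order = sorted(range(len(errors)), key=lambda i: abs(errors[i]), reverse=True)
    k = max(1, round(proportion * len(errors)))
    return order[:k]

# e.g. activation-minus-target deviations for four datapoints:
worst = boosted_selection([0.1, -0.9, 0.5, 0.05], proportion=0.5)
```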
An algorithm interleaving the two will be developed: selection in favour of balancing is given a turn, then selection in favour of boosting. Every ten epochs or so, a complete pass over the entire dataset in classic sequence is done to determine the overall RMSE of the system; it is only then that convergence can be determined. That is, the intervening epochs cannot converge (by definition of this algorithm).
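The interleaving schedule can be sketched as follows; the alternation rule and the every-ten-epochs full pass come from the text above, while the exact ordering is my guess:

```python
def training_schedule(n_epochs, full_every=10):
    # Alternate balancing- and boosting-favoured epochs; every `full_every`
    # epochs, run the whole dataset in classic sequence to measure overall
    # RMSE, which is the only point where convergence may be declared.
    plan = []
    for epoch in range(1, n_epochs + 1):
        if epoch % full_every == 0:
            plan.append("full")
        elif epoch % 2 == 1:
            plan.append("balance")
        else:
            plan.append("boost")
    return plan

plan = training_schedule(10)
```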
Verification datasets can be implemented as a feature of the wrapping script or as a feature of the binary; in this case it makes sense to implement them in the binary. From the number of bins prespecified in the balancing argument, one example will be selected out of each bin and isolated for verification. This will allow a converged network to be internally tested prior to being used as a model against an actual external test set (which can only be managed by the wrapping script); this is useful as a means to "proofread" the model, and allows even converged cases to be rejected on suspicion of over- or under-fitting.
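The per-bin holdout can be sketched in a few lines, assuming the bins of dataset indices already exist from the balancing step:

```python
# Example bins of dataset indices, grouped by target value as in balancing.
bins = [[0, 1, 2], [3, 4], [5]]

verify = [b[0] for b in bins]             # one held-out example per bin
train = [i for b in bins for i in b[1:]]  # everything else stays trainable
```

Because one example is drawn from every bin, the verification set inherits the balanced spread of target values.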
Chris is a student working on a summer project toward starting a Master's degree soon. We've been meeting semi-regularly to discuss his project ideas, and so that I can offer assistance with the cheminformatics/bioinformatics software, datasets, and other technologies that come up in his work.
Today's meeting was kind of interesting: it involved revisiting the idea that the NGN can be integrated into a larger expert decision system, and also that the NGN can itself operate as a molecular kernel whose output is then fed into some other decision machine. Dr. Kremer had a few ideas, and while I didn't quite grasp everything being conveyed, they can be narrowed down to three specific designs.
- The NGN output can be used as features in a larger molecular descriptor vector space.
- A different NGN tuned toward a specific kind of problem is selected from a panel of NGNs trained at different problems; this selection is done by an expert (decision tree, or some other state machine).
- The NGN can output an n-dimensional array; for instance, a 3D grid of outputs whose three axes are labeled 'species', 'LD50 class' (lethal dose for 50% of the population), and 'target organ or tissue'.
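The second design can be sketched as follows; the panel contents, task names, and constant predictions are all stand-ins, and a dictionary lookup takes the place of the real expert:

```python
# Hypothetical panel of problem-specific models (stand-ins for trained NGNs),
# keyed by the kind of problem each was trained on.
panel = {
    "solubility": lambda smiles: -2.5,     # made-up constant predictions
    "melting_point": lambda smiles: 128.0,
}

def expert_select(task):
    # A real expert might be a decision tree or other state machine;
    # a dictionary lookup stands in for it here.
    return panel[task]

prediction = expert_select("melting_point")("CCO")  # SMILES in, prediction out
```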
As a side note, we decided it might be good to look at the traditional QSAR task with traditional descriptors and the good ol' feed-forward neural network. I'll prepare for Chris the general neural network software I used in Experimental Design; the first version I sent had too much stuff in it, so I'll basically strip it down to "just operational" so he can actually understand the thing.