Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Neural Networks’ tag

NGN Software Updates

Two major changes should be made to the source code to complete the next objectives: a cleanup of one legacy argument, and the addition of new training behaviour.

Cleanup Random Number “Argument”

The present NGN binary does not treat the random number seed the same way as its other arguments; it reads the seed from stdin, supplied through a pipe at program launch. This should be cleaned up so that the user can specify the random seed with a command-line switch followed by an integer.
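As a rough illustration of the intended behaviour, here is a minimal sketch in Python rather than in the NGN's own source language. The switch name `--seed` and the helper `parse_seed` are assumptions made for the example; the actual switch has not been chosen.

```python
import argparse
import random


def parse_seed(argv):
    """Return the integer given after the (hypothetical) --seed switch, or None."""
    parser = argparse.ArgumentParser(description="seed-handling sketch")
    # --seed is an assumed name; the post does not specify the real switch.
    parser.add_argument("--seed", type=int, default=None,
                        help="integer random seed")
    # parse_known_args ignores the program's other arguments instead of failing.
    args, _unused = parser.parse_known_args(argv)
    return args.seed


# The seed is now an ordinary argument, not something piped in on stdin.
seed = parse_seed(["--seed", "42", "other-argument"])
rng = random.Random(seed)  # the same seed reproduces the same sequence
```

Leaving the switch out yields `None`, which lets the program fall back to whatever default seeding it prefers.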

Training Dataset Balancing, Boosting and Verification Dataset

The cleanest way to enable balancing and boosting during training is by altering the binary executable rather than any wrapping script. In principle, both options should be enabled only when either the diskonce or parseonce option is also enabled; this ensures that the data is already preprocessed and can be referenced in program memory during operation. Balancing requires one additional integer argument and one additional double argument so that the software knows (1) how many bins to balance against (as it is mathematically impossible to balance over the set of reals) and (2) how much tolerance to give bins when true balancing is impossible. Balancing deterministically selects the first n elements in each bin, until the bins cannot tolerate any more deviation. A training epoch occurs, and a new set of n elements is selected for each bin; this implies that elements in bins with fewer members will be trained on more often.
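The selection described above could be sketched as follows in Python. This is only one possible reading: equal-width bins, a rotating window per bin, and a tolerance interpreted as "each bin may contribute up to (1 + tolerance) times the size of the smallest non-empty bin" are all assumptions, as are the function names.

```python
def assign_bins(targets, n_bins):
    """Group datapoint indices into n_bins equal-width bins over the target range."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all targets match
    bins = [[] for _ in range(n_bins)]
    for index, target in enumerate(targets):
        # the maximum target would land in bin n_bins, so clamp it to the last bin
        b = min(int((target - lo) / width), n_bins - 1)
        bins[b].append(index)
    return bins


def balanced_epoch(bins, epoch, tolerance):
    """Pick the indices to train on this epoch, taking a rotating window per bin."""
    smallest = min(len(members) for members in bins if members)
    # Assumed meaning of tolerance: allowed excess over the smallest bin.
    per_bin = max(1, int(smallest * (1.0 + tolerance)))
    selection = []
    for members in bins:
        if not members:
            continue
        take = min(per_bin, len(members))
        start = (epoch * take) % len(members)
        # wrap around so small bins are simply reused every epoch
        selection.extend(members[(start + k) % len(members)] for k in range(take))
    return selection


bins = assign_bins([1, 2, 3, 4, 9], n_bins=2)
first = balanced_epoch(bins, epoch=0, tolerance=1.0)
second = balanced_epoch(bins, epoch=1, tolerance=1.0)
```

With this toy data the lone member of the sparse bin appears in every epoch while the dense bin rotates through its members, which is the "fewer elements, trained more often" effect.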

For boosting, the software will rank each datapoint by the deviation of its NGN activation from its target, from greatest to smallest; a command-line parameter (a double) will indicate what proportion of the datapoints, taken from the top of this ranking, will be repeated more often in training.
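A minimal Python sketch of that ranking step, assuming absolute deviation as the measure and rounding the proportion up to a whole number of datapoints; the function name is made up for the example.

```python
import math


def boosted_subset(activations, targets, proportion):
    """Return indices of the worst-predicted datapoints, largest deviation first."""
    deviations = [abs(a - t) for a, t in zip(activations, targets)]
    # sort indices by deviation, greatest to smallest
    order = sorted(range(len(deviations)),
                   key=lambda i: deviations[i], reverse=True)
    # assumed rounding rule: always round the count up
    count = math.ceil(proportion * len(order))
    return order[:count]


worst = boosted_subset(activations=[0.9, 0.2, 0.5, 0.4],
                       targets=[1.0, 1.0, 0.5, 0.0],
                       proportion=0.5)
```

Here the two datapoints with the largest deviations (indices 1 and 3) would be the ones repeated.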

A scheduling algorithm will be developed that interlaces the two treatments: a turn of selection in favour of balancing is followed by a turn in favour of boosting. Every ten epochs or so, a complete pass over the entire dataset in classic sequence is done to determine the overall RMSE of the system. Only then can convergence be determined; that is, the intervening epochs cannot converge (by definition of this algorithm).
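One way to picture the interlacing is the Python sketch below. The strict balance/boost alternation, the fixed period of ten, and the names `epoch_schedule` and `rmse` are assumptions for illustration; the real algorithm is still to be designed.

```python
import math


def epoch_schedule(total_epochs, full_pass_every=10):
    """Label each epoch as a 'balance', 'boost' or 'full' training pass."""
    plan = []
    turn = 0
    for epoch in range(1, total_epochs + 1):
        if epoch % full_pass_every == 0:
            # classic pass over the whole dataset; convergence is checked here only
            plan.append("full")
        else:
            plan.append("balance" if turn % 2 == 0 else "boost")
            turn += 1
    return plan


def rmse(predictions, targets):
    """Root mean squared error over the whole dataset."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


plan = epoch_schedule(10)
```

Only the epochs labelled `full` would compute `rmse` and compare it against a convergence threshold.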

Verification datasets can be implemented as a feature of the wrapping script or as a feature of the binary; in this case it makes sense to implement them as a feature of the binary. From the prespecified number of bins determined by the balancing argument, one example will be selected and isolated for verification. This will allow a converged network to be tested internally before being used as a model for an actual external test set (which can only be determined by the wrapping script); this is useful as a means to “proofread” the model, and allows even converged cases to be rejected on suspicion of over- or under-fitting.
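A small Python sketch of the isolation step, reading the text as "one example per bin". That reading, the choice of the first member of each bin, and the decision to leave single-member bins entirely in the training set are all assumptions.

```python
def hold_out_one_per_bin(bins):
    """Split binned datapoint indices into training and verification lists."""
    training, verification = [], []
    for members in bins:
        if len(members) > 1:
            verification.append(members[0])  # assumed: isolate the first member
            training.extend(members[1:])
        else:
            # assumed: a bin with a single example stays in training
            training.extend(members)
    return training, verification


training, verification = hold_out_one_per_bin([[0, 1, 2, 3], [4], [5, 6]])
```

The verification indices are never shown during training, so the RMSE on them gives an internal check before the external test set is touched.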

Written by Eddie Ma

May 18th, 2009 at 1:03 pm

Windows Basic NN Port

The Windows port of that ancient basic neural network library has just been cleaned up. A copy was sent to Chris, as he has needed a generic neural net implementation to analyze and interpret for some time. The cleanup involved stripping away all of the custom exception handling I had put in, which realistically doesn’t need to be there. The software now softly crashes instead of spouting a bunch of warnings that no one would understand (and by now, that includes me, as I’ve forgotten many of the conventions I put in).

I suppose the prudent thing to do is to slap in a readme.txt, and post it in the wiki for public access.

On to better, more impending things…

Written by Eddie Ma

May 16th, 2009 at 2:48 pm

Posted in Machine Learning

Meeting with Chris Cameron

Chris is a student working on a summer project towards starting a Master’s degree soon. We’ve been meeting semi-regularly so that we can discuss his project ideas and so that I can offer assistance with whatever cheminformatics / bioinformatics software, datasets and any other technologies that have come up in his work.

Today’s meeting was kind of interesting: it involved revisiting the idea that the NGN can be integrated into a larger expert decision system, and also that the NGN can itself operate as a molecular kernel whose output can then be fed into some other decision machine. Dr. Kremer had a few ideas, and while I didn’t quite grasp everything that was being conveyed, it can be narrowed down to three specific designs.

  • The NGN output can be used as features in a larger molecular descriptor vector space.
  • A different NGN tuned toward a specific kind of problem is selected from a panel of NGNs trained at different problems; this selection is done by an expert (decision tree, or some other state machine).
  • The NGN can output an x-dimensional array; for instance, a 3D grid of outputs whose three axes are labeled ‘species’, ‘LD50 class’ (LD50 being the lethal dose for 50% of the population), and ‘target organ or tissue’.

As a side note, we decided it might be good to look at the traditional QSAR task with traditional descriptors and the good ol’ feed forward neural network. I’ll prepare the general neural network software for Chris that I used in Experimental Design; the first version I sent in had too much stuff in it, so I’ll basically strip it down to “just operational” so he can actually understand the thing.

Next Experiments

New Architecture

In a steep recursive architecture, there is a likelihood that LSTMs can assist; however, this hypothesis is unstable in my mind because I’m torn between two opposing rationales: first, that steep trees cause signal decay, and second, that a steady razor has culled overly complicated designs since the beginning of human intelligence. Neatly re-expressed, the conflict is whether or not LSTM cells in the NGN would survive this culling.

New Data and Retry Crashed Data

The next runs should cover regression datasets and retry the classification tasks that led to suspected stack smashing. While I was running FXa and DHFR data, the software would crash unexpectedly even though all memory allocations seemed correct (i.e. the crash report was not consistent with a segmentation fault). I think that the string buffer used to read the chemical string was too short; interestingly, this is the only non-dynamic memory element that should technically be variable in size. Increasing it to 4096 bytes should be enough and should not negatively impact performance at all. For a system running 100 trials concurrently, that is at most about 400 KB of memory in total (100 × 4096 bytes), which is by today’s standards “trivial”.

Input String Manipulation

One of the items that I didn’t consider is input string manipulation. Machine learning system performance is usually attributed to how well the architecture suits the task; however, input normalization and representation are also very important.

Below are some possible manipulations I can use for the NGN.

Different deterministic orders of precedence in the rules of SMILES generation can lead to a variety of different strings for the same molecule; as long as a data set is expressed in the same order of precedence, the NGN is useful.

Below is a demonstration, originally posted at Blueobelisk, of how to manipulate the traversal order using RDKit in Python.

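from rdkit import Chem

# Parse the molecule once, then write one SMILES string per choice of root atom.
m = Chem.MolFromSmiles('Cc1nc(N)ccc1')
for i in range(m.GetNumAtoms()):
    # rootedAtAtom makes the traversal start at the atom with index i
    print(Chem.MolToSmiles(m, rootedAtAtom=i))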

Cc1nc(N)ccc1
c1(C)nc(N)ccc1
n1c(C)cccc1N
c1(N)nc(C)ccc1
Nc1nc(C)ccc1
c1ccc(C)nc1N
c1cc(C)nc(N)c1
c1ccc(N)nc1C

Directive: Investigate the different traversal algorithms.

SMILES Isomerism and InChI layer selection…

Although there is a rationale for using isomeric SMILES strings, it has never been concretely demonstrated that this is required; what would happen to the performance of the NGN if isomerism were discarded from the training data? InChI layers are currently accepted indiscriminately, but there is no rationale for that other than the suspicion that hidden relationships emerge between segments, which would then require multiple segments. This is addressed along with the idea of hybridizing InChI and SMILES strings.

When embedding objects into InChI as a layer, use a new ‘’ switch! It’s pretty extensible, almost like XML tags, except that closing tags are implied by opening a new tag (hence, no nesting structure).

  • SMILES layers will use ‘SMILES’ as a switch
  • Isomeric SMILES layers will use ‘ISOSMILES’ as a switch
  • Permutations of SMILES will use ‘SMILES(smiles)’, such that ‘smiles’ is replaced by a natural number.

Directive: Use sub-traversals and hybrid traversals as follows. Test the hypothesis that each of these layers is in fact needed; test the hypothesis that no additional information is needed for either InChIs or SMILES to operate.

A list of Possible Simple, Hybrid or Concatenated Strings

  • {SMILES} – input on its own.
  • {InChI … + ISOSMILES} – are each two descriptors on their own.
  • {InChI … + SMILES} – are each two descriptors on their own.
  • {SMILES(0) + SMILES(1) … SMILES(n-1)} – neural network where each descriptor is an NGN.
  • ISOSMILES, SMILES, SMILES(0), SMILES(1) … SMILES(n-1) – each permutation on its own.
  • {ISOSMILES, SMILES, SMILES(0 … n), InChI, (InChI [Formula and Connection Table Only])}
  • InChI, (InChI [Formula and Connection Table Only]) – InChI with non-molecular-structure layers removed.
  • InChI, (InChI [Formula and Connection Table Removed]) – InChI with only non-molecular structure layers.
  • {(InChI [Formula and Connection Table Removed]) / SMILES} – InChI where formula and connection table are removed and replaced with SMILES; this is actually SMILES with some redundant information; ensure that layers that refer to the connection table like the hydrogen layer are also removed as they would make no sense.
  • {InChI / ISOSMILES}
  • {InChI / SMILES}
  • {InChI / SMILES(0 / … / n)}
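The two InChI variants in the list above (“Formula and Connection Table Only” and “Formula and Connection Table Removed”) can be sketched with plain string handling, since InChI layers are separated by ‘/’ and each layer after the formula starts with a prefix letter. This is an illustrative Python sketch only: the function names are made up, it assumes the formula is always the first layer after the version prefix, and the outputs are model inputs rather than valid InChI strings. The example is the standard InChI for L-alanine.

```python
def structure_only(inchi):
    """Keep the version prefix, the formula and the connection ('c') layer."""
    prefix, formula, *rest = inchi.split("/")
    kept = [layer for layer in rest if layer.startswith("c")]
    return "/".join([prefix, formula] + kept)


def structure_removed(inchi, also_drop=("h",)):
    """Drop the formula and connection layer, plus layers that depend on them.

    By default the hydrogen ('h') layer is dropped too, since it refers to
    atom numbers from the connection table.
    """
    prefix, _formula, *rest = inchi.split("/")
    dropped = ("c",) + tuple(also_drop)
    kept = [layer for layer in rest if not layer.startswith(dropped)]
    return "/".join([prefix] + kept)


alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
only = structure_only(alanine)        # formula and connection table
removed = structure_removed(alanine)  # stereo layers only
```

Whether the “only” variant should also keep the hydrogen layer is an open choice; here it is left out.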

Further, more imaginative items exist, but they convolute and contaminate the hypothesis space of input string exploration and require modification of the architecture; for instance, what if each traversal had its own dedicated layer? This leads to a linear growth in the number of weight layers, where the coefficient is the number of such traversals. This would play into the expert panel system more.

For layer separation, staying true to the grammar design, we would use ‘!’ to separate each subvector not designated as an InChI layer; this ‘!’ character is not used by the existing SMILES or InChI strings, but may lead to a legacy issue if it is adopted by either party in future (doubtful / easily fixed).

Written by Eddie Ma

May 12th, 2009 at 5:55 pm

Balancing and Boosting a Dataset

Two things that I had wanted to try but didn’t get around to while working on my thesis are balancing the training data and boosting the data. I’ll explain each below, and likely commit the text to a wiki entry later.

Balancing a training set consists of ensuring an even distribution in the range of the training set as much as possible. Notice that while it’s important to have an evenly distributed range, where possible an attempt should be made to even out the domain as well to ensure that an inductive learning machine has a broad enough collection of exemplars to work with. Where this isn’t possible, a stochastic treatment is better than no treatment (i.e. pick out several combinations randomly).

Boosting a dataset refers to a change in training algorithm. Under normal circumstances, a neural network is trained with a static sequence of exemplars every single epoch; a boosted algorithm sees a dynamic treatment instead. In this treatment, the amount of exposure each exemplar gets to the inference machine in training is inversely proportional to the accuracy of the prediction made for that exemplar; an exemplar on which a neural network performs poorly is shown to the network more often.
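A minimal Python sketch of such a dynamic treatment, using error-weighted random sampling as a stand-in: exposure is made proportional to each exemplar’s current error rather than literally to the inverse of its accuracy. The function names, the small `floor` that keeps well-predicted exemplars in play, and the fixed seed are all assumptions.

```python
import random


def exposure_weights(errors, floor=1e-6):
    """Turn per-exemplar errors into sampling weights that sum to 1."""
    # floor keeps perfectly predicted exemplars from vanishing entirely
    total = sum(errors) + floor * len(errors)
    return [(e + floor) / total for e in errors]


def boosted_sequence(errors, length, seed=0):
    """Draw the exemplar indices for one epoch, favouring high-error exemplars."""
    rng = random.Random(seed)
    weights = exposure_weights(errors)
    return rng.choices(range(len(errors)), weights=weights, k=length)


errors = [0.1, 0.8, 0.0, 0.4]  # made-up errors from the previous epoch
weights = exposure_weights(errors)
sequence = boosted_sequence(errors, length=8)
```

The errors would be recomputed after each epoch, so the sequence changes as the network improves on the exemplars it used to get wrong.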

Balancing abstractly reduces the probability that a neural network is working by chance distribution of range elements; in an extreme case, one could argue that an extension can be done to test the robustness of a method by reversing the balance that naturally occurs in the training set.

Boosting by contrast increases system bias. Care must be taken in each epoch so that the final algorithm used does not overwhelm the system with data points that are known to be flaky (unreliable).

Written by Eddie Ma

May 10th, 2009 at 5:42 pm

Posted in Machine Learning
