Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Machine Learning’ Category

Windows Basic NN Port

The Windows port of that ancient basic neural network library has just been cleaned up. A copy was sent to Chris, as he has needed a generic neural net implementation to analyze and interpret for some time. The cleanup involved stripping away all of the custom exception handling I had put in, which realistically doesn’t need to be there. The software now crashes softly instead of spouting a bunch of warnings that no one would understand (and by now, that includes me, as I’ve forgotten many of the conventions I put in).

I suppose the prudent thing to do is to slap in a readme.txt, and post it in the wiki for public access.

On to better, more impending things…

Written by Eddie Ma

May 16th, 2009 at 2:48 pm

Posted in Machine Learning

Next Experiments

New Architecture

In a steep recursive architecture, there is a likelihood that LSTMs can assist. However, I hold this hypothesis loosely because I’m torn between two opposing rationales: first, that steep trees cause signal decay; and second, that a steady razor has culled overly complicated designs since the beginning of human intelligence. Put neatly, the conflict is whether or not LSTM cells in the NGN would survive this culling.
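The signal-decay rationale can be made concrete with a rough bound. The sketch below assumes, purely for illustration, that each level of the tree passes the error signal back through one logistic sigmoid unit with a single weight; the NGN's actual units and weights are not described in this post, and `backprop_signal_bound` is a hypothetical helper.

```python
import math


def sigmoid_derivative(x: float) -> float:
    """Derivative of the logistic sigmoid at x; it peaks at 0.25 when x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def backprop_signal_bound(depth: int, weight: float = 1.0) -> float:
    """Best-case magnitude of an error signal after `depth` sigmoid levels.

    Assumes one unit per level and the same weight magnitude at every level.
    """
    per_level = abs(weight) * sigmoid_derivative(0.0)  # at most 0.25 * |w|
    return per_level ** depth


for depth in (1, 5, 10, 20):
    print(depth, backprop_signal_bound(depth))
```

Under these assumptions the signal shrinks geometrically with depth unless the weights are large, which is the usual argument for gated cells such as LSTMs in deep or steep structures.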

New Data and Retry Crashed Data

The next runs will add regression datasets and retry the classification tasks that led to suspected stack smashing. While I was running the FXa and DHFR data, the software would crash unexpectedly even though all memory allocations seemed correct (i.e. the crash report was not consistent with a segmentation fault). I think that the string buffer used to read the chemical string was too short; interestingly, this is the only non-dynamic memory element that should technically be of variable size. Increasing it to 4096 bytes should be enough and should not negatively impact performance at all. For a system running 100 trials concurrently, assuming one such buffer per trial, that is at most about 400 KB of additional memory in total (100 × 4096 bytes), which by today’s standards is “trivial”.

Input String Manipulation

One of the items that I didn’t consider is input string manipulation. The performance of a machine learning system depends on how well the architecture suits the task; however, input normalization and representation are also very important.

Below are some possible manipulations I can use for the NGN.

Different deterministic orders of precedence in the rules of SMILES generation can lead to a variety of different strings for the same molecule; as long as a data set is expressed in the same order of precedence, the NGN is useful.

Below is a demonstration, originally posted at Blueobelisk, of how to manipulate the traversal order using RDKit in Python.

from rdkit import Chem

# Parse the molecule once, then write one SMILES string rooted at each atom in turn.
m = Chem.MolFromSmiles('Cc1nc(N)ccc1')
for i in range(m.GetNumAtoms()):
    print(Chem.MolToSmiles(m, rootedAtAtom=i))

Cc1nc(N)ccc1
c1(C)nc(N)ccc1
n1c(C)cccc1N
c1(N)nc(C)ccc1
Nc1nc(C)ccc1
c1ccc(C)nc1N
c1cc(C)nc(N)c1
c1ccc(N)nc1C

Directive: Investigate the different traversal algorithms.

SMILES Isomerism and InChI layer selection…

Although there is a rationale for using isomeric SMILES strings, it has never been concretely demonstrated that they are required; what would happen to the performance of the NGN if isomerism were discarded from the training data? InChI layers are currently accepted indiscriminately, but there is no rationale for that, other than the suspicion that hidden relationships evolve between multiple segments and so require them. This is addressed along with the idea of hybridizing InChI and SMILES strings.

When embedding objects into InChI as a layer, use a new ‘’ switch! The format is pretty extensible, almost like XML tags, except that opening a new tag implies closing the previous one (hence there is no nesting structure).

  • SMILES layers will use ‘SMILES’ as a switch
  • Isomeric SMILES layers will use ‘ISOSMILES’ as a switch
  • Permutations of SMILES will use ‘SMILES(smiles)’, such that ‘smiles’ is replaced by a natural number.

Directive: Use sub-traversals and hybrid traversals as follows. Test the hypothesis that each of these layers is in fact needed; test the hypothesis that no additional information is needed for either InChIs or SMILES to operate.

A list of Possible Simple, Hybrid or Concatenated Strings

  • {SMILES} – input on its own.
  • {InChI … + ISOSMILES} – are each two descriptors on their own.
  • {InChI … + SMILES} – are each two descriptors on their own.
  • {SMILES(0) + SMILES(1) … SMILES(n-1)} – neural network where each descriptor is an NGN.
  • ISOSMILES, SMILES, SMILES(0), SMILES(1) … SMILES(n-1) – each permutation on its own.
  • {ISOSMILES, SMILES, SMILES(0 … n), InChI, (InChI [Formula and Connection Table Only])}
  • InChI, (InChI [Formula and Connection Table Only]) – InChI with the non-molecular-structure layers removed.
  • InChI, (InChI [Formula and Connection Table Removed]) – InChI with only non-molecular structure layers.
  • {(InChI [Formula and Connection Table Removed]) / SMILES} – InChI where formula and connection table are removed and replaced with SMILES; this is actually SMILES with some redundant information; ensure that layers that refer to the connection table like the hydrogen layer are also removed as they would make no sense.
  • {InChI / ISOSMILES}
  • {InChI / SMILES}
  • {InChI / SMILES(0 / … / n)}
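The two reduced InChI variants in the list above can be sketched as simple layer filters. This assumes the usual InChI layout: fields separated by ‘/’, the version prefix first, the formula second, and every later layer starting with a one-letter prefix (‘c’ for the connection table, ‘h’ for hydrogens). The function names are hypothetical, and the L-alanine string is only an illustrative input.

```python
def split_inchi(inchi):
    """Return (version_prefix, formula, [layer strings]) for an InChI string."""
    version, formula, *layers = inchi.split("/")
    return version, formula, layers


def formula_and_connections_only(inchi):
    """Keep the formula and the connection table ('c') layer, drop the rest."""
    version, formula, layers = split_inchi(inchi)
    kept = [layer for layer in layers if layer.startswith("c")]
    return "/".join([version, formula] + kept)


def without_structure_layers(inchi, drop_prefixes=("c", "h")):
    """Drop the formula plus the layers that refer to the connection table.

    The hydrogen layer is dropped by default because it makes no sense
    without the connection table it refers to.
    """
    version, _formula, layers = split_inchi(inchi)
    kept = [layer for layer in layers if not layer.startswith(drop_prefixes)]
    return "/".join([version] + kept)


alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
print(formula_and_connections_only(alanine))
print(without_structure_layers(alanine))
```

InChIs that lack a formula field would need special handling, which this sketch does not attempt.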

Further, more imaginative items exist, but they would complicate and contaminate the hypothesis space of input string exploration, and would require modifying the architecture. For instance, what if each traversal had its own dedicated layer? This leads to linear growth in the number of weight layers, where the coefficient is the number of such traversals. This would play into the expert panel system more.

Layer separation: staying true to the grammar design, we would use ‘!’ to separate each subvector not designated as an InChI layer. This ‘!’ character is not used by existing SMILES or InChI strings, but it may become a legacy issue if either party adopts it in future (doubtful, and easily fixed).
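A minimal sketch of the ‘!’ convention, assuming the subvectors are plain strings: the joiner refuses any component that already contains the separator, which is the legacy issue mentioned above. The helper names `join_subvectors` and `split_subvectors` are hypothetical.

```python
SEPARATOR = "!"


def join_subvectors(parts):
    """Concatenate descriptor strings with '!' between them."""
    for part in parts:
        if SEPARATOR in part:
            # A component that already uses '!' would make the split ambiguous.
            raise ValueError("component already contains %r: %r" % (SEPARATOR, part))
    return SEPARATOR.join(parts)


def split_subvectors(joined):
    """Recover the individual descriptor strings."""
    return joined.split(SEPARATOR)


# Two traversals of the same molecule, taken from the RDKit output above.
hybrid = join_subvectors(["Cc1nc(N)ccc1", "c1(C)nc(N)ccc1"])
print(hybrid)
print(split_subvectors(hybrid))
```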

Written by Eddie Ma

May 12th, 2009 at 5:55 pm

Balancing and Boosting a Dataset

Two things that I had wanted to try but didn’t get around to while working on my thesis are balancing the training data and boosting the data. I’ll explain each below, and likely commit the text to a wiki entry later.

Balancing a training set consists of ensuring an even distribution in the range of the training set as much as possible. Notice that while it’s important to have an evenly distributed range, where possible an attempt should be made to even out the domain as well to ensure that an inductive learning machine has a broad enough collection of exemplars to work with. Where this isn’t possible, a stochastic treatment is better than no treatment (i.e. pick out several combinations randomly).
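One way to balance the range, sketched under the assumption that exemplars are (input, target) pairs with numeric targets: bin the targets, then randomly resample every non-empty bin to the same size. The function name, the equal-width binning, and the parameters `n_bins` and `per_bin` are illustrative choices, not something the thesis work used.

```python
import random
from collections import defaultdict


def balance_by_target(exemplars, n_bins, per_bin, seed=0):
    """Resample (input, target) pairs so each non-empty target bin has per_bin items."""
    rng = random.Random(seed)
    targets = [t for _, t in exemplars]
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all targets match

    bins = defaultdict(list)
    for x, t in exemplars:
        index = min(int((t - lo) / width), n_bins - 1)  # clamp the top edge
        bins[index].append((x, t))

    balanced = []
    for index in sorted(bins):  # empty bins never appear, so they are skipped
        members = bins[index]
        if len(members) >= per_bin:
            balanced.extend(rng.sample(members, per_bin))      # undersample
        else:
            balanced.extend(rng.choices(members, k=per_bin))   # oversample
    return balanced


skewed = [("a%d" % i, 0.1) for i in range(90)] + [("b%d" % i, 0.9) for i in range(10)]
balanced = balance_by_target(skewed, n_bins=2, per_bin=20)
print(len(balanced))
```

Oversampling repeats exemplars rather than creating new ones, so it evens out the range without broadening the domain.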

Boosting a dataset refers to a change in training algorithm. Under normal circumstances, a neural network is trained with a static sequence of exemplars every single epoch; a boosted algorithm sees a dynamic treatment instead. In this treatment, the amount of exposure each exemplar gets during training is inversely proportional to the accuracy of the prediction made for it; an exemplar on which a neural network performs poorly is shown to the network more often.

In the abstract, balancing reduces the probability that a neural network appears to work only because of a chance distribution of range elements; in an extreme case, one could argue for an extension that tests the robustness of a method by reversing the balance that naturally occurs in the training set.

Boosting by contrast increases system bias. Care must be taken in each epoch so that the final algorithm used does not overwhelm the system with data points that are known to be flaky (unreliable).
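The boosted treatment can be sketched as error-weighted resampling of each epoch. This is a simplification and not AdaBoost or any other specific published algorithm. It assumes a per-exemplar error is available after each epoch, and the `max_ratio` cap is a hypothetical guard that limits how much more often the worst exemplar can appear than the best one, so that flaky points cannot take over.

```python
import random


def exposure_weights(errors, max_ratio=5.0):
    """Sampling weights that grow with error but stay within a max_ratio spread."""
    worst = max(errors)
    floor = worst / max_ratio if worst > 0 else 1.0  # uniform when nothing is wrong
    return [max(error, floor) for error in errors]


def boosted_epoch(exemplars, errors, epoch_size, max_ratio=5.0, seed=None):
    """Draw one epoch's training sequence, favouring poorly predicted exemplars."""
    rng = random.Random(seed)
    weights = exposure_weights(errors, max_ratio)
    return rng.choices(exemplars, weights=weights, k=epoch_size)


epoch = boosted_epoch(["easy", "hard"], [0.0, 1.0], epoch_size=10, max_ratio=4.0, seed=1)
print(epoch)
```

Recomputing the errors and redrawing the sequence every epoch gives the dynamic treatment described above; lowering `max_ratio` moves it back toward the static one.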

Written by Eddie Ma

May 10th, 2009 at 5:42 pm

Posted in Machine Learning
