Ed's Big Plans

Computing for Science and Awesome

Next Experiments

New Architecture

In a steep recursive architecture, there is a good chance that LSTMs can help. However, the hypothesis feels unstable to me because I’m torn between two opposing rationales: first, that steep trees cause signal decay; and second, that a steady razor has culled overly complicated designs since the beginning of human intelligence. Put neatly, the conflict is whether or not LSTM cells in the NGN would survive this culling.

New Data and Retry Crashed Data

The next runs will cover regression datasets and retry the classification tasks which led to suspected stack smashing. While I was running the FXa and DHFR data, the software would crash unexpectedly even though all memory allocations seemed correct (i.e. the crash report was not consistent with a segmentation fault). I think that the string buffer used to read the chemical string was too short; interestingly, this is the only non-dynamic memory element that should technically be of variable size. Increasing it to 4096 bytes should be enough and should not negatively impact performance at all. For a system running 100 trials concurrently, that is at most about 400 KB of memory in total (100 × 4096 bytes), which by today’s standards is “trivial”.
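As a quick sanity check on that memory cost, here is a back-of-envelope calculation. It assumes one fixed 4096-byte read buffer per trial and 100 concurrent trials; the variable names are mine and are not taken from the NGN source.

```python
BUFFER_BYTES = 4096      # proposed size of the fixed string buffer
CONCURRENT_TRIALS = 100  # assumption: one such buffer per running trial

# Total memory held by the buffers across all trials
total_bytes = BUFFER_BYTES * CONCURRENT_TRIALS
total_kib = total_bytes / 1024

print(f"{total_bytes} bytes = {total_kib:.0f} KiB")
```

Even treating the whole buffer as new memory, the total stays well under a megabyte.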

Input String Manipulation

One of the items that I didn’t consider is input string manipulation. Machine learning system performance can largely be inferred from the suitability of the task and architecture; however, input normalization and representation are also very important.

Below are some possible manipulations I can use for the NGN.

Different deterministic orders of precedence in the rules of SMILES generation can lead to a variety of different strings for the same molecule; as long as a data set is expressed in the same order of precedence throughout, the NGN remains useful.

Below is a demonstration, originally posted at Blueobelisk, of how to manipulate the traversal order using RDKit in Python.

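from rdkit import Chem

# Parse the molecule once, then write one SMILES string per choice of root atom.
m = Chem.MolFromSmiles('Cc1nc(N)ccc1')

for i in range(m.GetNumAtoms()):
    # rootedAtAtom makes the SMILES traversal start at atom i
    print(Chem.MolToSmiles(m, rootedAtAtom=i))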

Cc1nc(N)ccc1
c1(C)nc(N)ccc1
n1c(C)cccc1N
c1(N)nc(C)ccc1
Nc1nc(C)ccc1
c1ccc(C)nc1N
c1cc(C)nc(N)c1
c1ccc(N)nc1C

Directive: Investigate the different traversal algorithms.
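A possible first step for this directive, sketched under the assumption that RDKit is installed: collect the distinct rooted strings for a molecule and confirm that they all collapse back to one canonical form. The helper names `rooted_smiles` and `canonical` are hypothetical, and the exact strings produced may differ between RDKit versions.

```python
from rdkit import Chem


def rooted_smiles(smiles):
    """Return the distinct SMILES strings obtained by rooting at each atom."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    variants = []
    for i in range(mol.GetNumAtoms()):
        s = Chem.MolToSmiles(mol, rootedAtAtom=i)
        if s not in variants:  # keep first-seen order, drop repeats
            variants.append(s)
    return variants


def canonical(smiles):
    """Round-trip a SMILES string through RDKit's default canonical output."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))


variants = rooted_smiles('Cc1nc(N)ccc1')
canonical_forms = {canonical(s) for s in variants}
print(len(variants), "rooted variants,", len(canonical_forms), "canonical form(s)")
```

If every variant maps to a single canonical form, the variants are safe to treat as alternative spellings of the same molecule when building permuted training sets.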

SMILES Isomerism and InChI layer selection…

Although there is a rationale for using isomeric SMILES strings, it has never been concretely demonstrated that they are required; what would happen to the performance of the NGN if isomerism were discarded from the training data? InChI layers are currently accepted indiscriminately, but there is no rationale for that, save the suspicion that hidden relationships evolve between segments and so require several of them. This is addressed along with the idea of hybridizing InChI and SMILES strings.
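One way to prepare the "isomerism discarded" variant of a data set is to re-write each string without stereo markers. The sketch below assumes RDKit and relies on the `isomericSmiles` parameter of `Chem.MolToSmiles`; the helper name `strip_isomerism` and the two alanine enantiomers are only illustrative.

```python
from rdkit import Chem


def strip_isomerism(smiles):
    """Return a SMILES string for the same molecule with stereo information dropped."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, isomericSmiles=False)


# The two enantiomers of alanine, differing only in the chirality marker
enantiomer_a = 'C[C@H](N)C(=O)O'
enantiomer_b = 'C[C@@H](N)C(=O)O'

# With isomerism kept, the two strings stay distinct
iso_a = Chem.MolToSmiles(Chem.MolFromSmiles(enantiomer_a), isomericSmiles=True)
iso_b = Chem.MolToSmiles(Chem.MolFromSmiles(enantiomer_b), isomericSmiles=True)

# With isomerism stripped, both collapse to the same string
flat_a = strip_isomerism(enantiomer_a)
flat_b = strip_isomerism(enantiomer_b)

print(iso_a, iso_b, flat_a, flat_b)
```

Training once on the isomeric strings and once on the stripped ones would give a direct comparison for the question above.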

When embedding objects into InChI as a layer, use a new ‘’ switch! The scheme is pretty extensible, almost like XML tags, except that a closing tag is implied by opening a new tag (hence there is no nesting structure).

  • SMILES layers will use ‘SMILES’ as a switch
  • Isomeric SMILES layers will use ‘ISOSMILES’ as a switch
  • Permutations of SMILES will use ‘SMILES(smiles)’, such that ‘smiles’ is replaced by a natural number.

Directive: Use sub-traversals and hybrid traversals as follows. Test the hypothesis that each of these layers is in fact needed; test the hypothesis that no additional information is needed for either InChIs or SMILES to operate.

A list of Possible Simple, Hybrid or Concatenated Strings

  • {SMILES} – input on its own.
  • {InChI … + ISOSMILES} – are each two descriptors on their own.
  • {InChI … + SMILES} – are each two descriptors on their own.
  • {SMILES(0) + SMILES(1) … SMILES(n-1)} – neural network where each descriptor is an NGN.
  • ISOSMILES, SMILES, SMILES(0), SMILES(1) … SMILES(n-1) – each permutation on its own.
  • {ISOSMILES, SMILES, SMILES(0 … n), InChI, (InChI [Formula and Connection Table Only])}
  • InChI, (InChI [Formula and Connection Table Only]) – InChI with non-molecular structure layers removed.
  • InChI, (InChI [Formula and Connection Table Removed]) – InChI with only non-molecular structure layers.
  • {(InChI [Formula and Connection Table Removed]) / SMILES} – InChI where formula and connection table are removed and replaced with SMILES; this is actually SMILES with some redundant information; ensure that layers that refer to the connection table like the hydrogen layer are also removed as they would make no sense.
  • {InChI / ISOSMILES}
  • {InChI / SMILES}
  • {InChI / SMILES(0 / … / n)}
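The two reduced InChI variants in the list above can be produced with plain string handling. This is a minimal sketch, assuming standard InChI strings in which the formula is the first segment after the version and the connection table and hydrogen layers start with `c` and `h`; the function names are hypothetical, and the outputs are network inputs rather than valid InChIs. The L-alanine InChI is only an illustration.

```python
def split_inchi(inchi):
    """Split an InChI into (version, formula, remaining layers)."""
    parts = inchi.split("/")
    if len(parts) < 2:
        raise ValueError(f"no formula segment in {inchi!r}")
    return parts[0], parts[1], parts[2:]


def formula_and_connections_only(inchi):
    """Keep only the formula and the connection table ('c') layer."""
    version, formula, layers = split_inchi(inchi)
    kept = [layer for layer in layers if layer.startswith("c")]
    return "/".join([version, formula] + kept)


def without_formula_and_connections(inchi, drop_hydrogens=True):
    """Drop the formula and connection table; optionally drop the 'h' layer too,
    since it refers to atom numbers from the connection table."""
    version, _formula, layers = split_inchi(inchi)
    dropped = ("c", "h") if drop_hydrogens else ("c",)
    kept = [layer for layer in layers if not layer.startswith(dropped)]
    return "/".join([version] + kept)


alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
structure_only = formula_and_connections_only(alanine)
non_structure = without_formula_and_connections(alanine)
print(structure_only)
print(non_structure)
```

The `drop_hydrogens` flag exists because only the hybrid item explicitly asks for layers that refer to the connection table to be removed as well.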

Further, more imaginative items exist, but they convolute and contaminate the hypothesis space of input string exploration and require modification of the architecture; for instance, what if each traversal had its own dedicated layer? This leads to linear growth in the number of weight layers, where the coefficient is the number of such traversals. This would play into the expert panel system more.

Layer separation, staying true to the grammar design: we would use ‘!’ to separate each subvector not designated as an InChI layer. This ‘!’ character is not used by the existing SMILES or InChI strings, but may lead to a legacy issue if in future it is adopted by either party (doubtful / easily fixed).
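To make the separator idea concrete, here is a sketch of building and splitting a hybrid string. The exact layout is my assumption, not a settled format: each extra sub-string is written as ‘!’ followed directly by its switch name (SMILES, ISOSMILES or SMILES(n)) and then the payload, mirroring how InChI prefixes its own layers. The names `build_hybrid` and `split_hybrid` are hypothetical.

```python
import re

SEPARATOR = "!"  # separates sub-strings that are not InChI layers
# Assumed switch names: ISOSMILES, SMILES, or SMILES(n) for the n-th permutation
SWITCH_PATTERN = re.compile(r"^(ISOSMILES|SMILES(?:\(\d+\))?)(.*)$")


def build_hybrid(inchi, layers):
    """Join an InChI and a list of (switch, payload) pairs into one input string."""
    for text in [inchi] + [payload for _switch, payload in layers]:
        if SEPARATOR in text:
            raise ValueError(f"{SEPARATOR!r} already occurs in {text!r}")
    return inchi + "".join(SEPARATOR + switch + payload for switch, payload in layers)


def split_hybrid(hybrid):
    """Recover the InChI and the (switch, payload) pairs from a hybrid string."""
    inchi, *chunks = hybrid.split(SEPARATOR)
    layers = []
    for chunk in chunks:
        match = SWITCH_PATTERN.match(chunk)
        if match is None:
            raise ValueError(f"unknown switch in {chunk!r}")
        layers.append((match.group(1), match.group(2)))
    return inchi, layers


# Ethanol, with two alternative SMILES spellings as permutations
ethanol_inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ethanol_layers = [("SMILES", "CCO"), ("SMILES(0)", "OCC"), ("SMILES(1)", "C(C)O")]

hybrid = build_hybrid(ethanol_inchi, ethanol_layers)
print(hybrid)
```

The check inside `build_hybrid` guards against the legacy issue mentioned above: if ‘!’ ever did appear in an input string, the build would fail loudly instead of producing an ambiguous hybrid.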

Written by Eddie Ma

May 12th, 2009 at 5:55 pm