Meeting with Chris Cameron
Chris is a student working on a summer project towards starting a Master’s degree soon. We’ve been meeting semi-regularly so that we can discuss his project ideas and so that I can offer assistance with whatever cheminformatics / bioinformatics software, datasets and any other technologies that have come up in his work.
Today’s meeting was kind of interesting: it involved revisiting the idea that the NGN can be integrated into a larger expert decision system, and also that the NGN can itself operate as a molecular kernel whose output is then fed into some other decision machine. Dr. Kremer had a few ideas, and although I didn’t quite grasp everything that was being conveyed, it can be narrowed down to three specific designs.
- The NGN output can be used as features in a larger molecular descriptor vector space.
- A different NGN tuned toward a specific kind of problem is selected from a panel of NGNs trained at different problems; this selection is done by an expert (decision tree, or some other state machine).
- The NGN can output an n-dimensional array; for instance, a 3D grid of outputs whose three axes are labeled ‘species’, ‘LD50 class’ (LD50 being the lethal dose for 50% of the population), and ‘target organ or tissue’.
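To make the third design a bit more concrete, here is a minimal sketch of how a flat output vector could be read as a labeled 3D grid. The axis labels and the `label_outputs` helper are hypothetical placeholders of mine; the real categories would come from whatever dataset is used.

```python
from itertools import product

# Hypothetical axis labels (assumptions for illustration only).
SPECIES = ["rat", "mouse"]
LD50_CLASSES = ["low", "medium", "high"]
TISSUES = ["liver", "kidney"]


def label_outputs(flat_outputs):
    """Map a flat output vector onto (species, LD50 class, tissue) cells.

    The last axis varies fastest, matching a row-major 3D array.
    """
    expected = len(SPECIES) * len(LD50_CLASSES) * len(TISSUES)
    if len(flat_outputs) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(flat_outputs)}")
    cells = product(SPECIES, LD50_CLASSES, TISSUES)
    return dict(zip(cells, flat_outputs))


# Twelve made-up output activations, one per grid cell.
grid = label_outputs([i / 12 for i in range(12)])
print(grid[("mouse", "high", "kidney")])
```

Each output unit then answers one specific question (this species, this LD50 class, this tissue), at the cost of an output layer that grows with the product of the axis sizes.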
As a side note, we decided it might be good to look at the traditional QSAR task with traditional descriptors and the good ol’ feed forward neural network. I’ll prepare the general neural network software for Chris that I used in Experimental Design; the first version I sent in had too much stuff in it, so I’ll basically strip it down to “just operational” so he can actually understand the thing.
Next Experiments
New Architecture
In a steep recursive architecture, there is a likelihood that LSTMs can assist; however, this hypothesis is unstable in my mind because I’m torn between two opposing rationales: first, that steep trees cause signal decay, and second, that a steady razor has culled overly complicated designs since the beginning of human intelligence. Neatly re-expressed, the conflict is whether or not LSTM cells in the NGN would survive this culling.
New Data and Retry Crashed Data
Run the regression datasets, and retry the classification tasks which led to suspected stack smashing. While I was running the FXa and DHFR data, the software would crash unexpectedly even though all memory allocations seemed correct (i.e. the crash report was not consistent with a segmentation fault). I think that the string buffer used to read the chemical string was too short; interestingly, this is the only non-dynamic memory element that should technically be variable size. Increasing it to 4096 bytes should be enough and should not negatively impact performance at all. For a system running 100 trials concurrently, that is at most about 400 KB of memory in total (100 × 4096 bytes), which is “trivial” by today’s standards.
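A quick check of the buffer arithmetic, plus a length guard that could be run over a dataset before a trial. This is only a sketch: it assumes one fixed buffer per concurrent trial and a C-style, NUL-terminated string, and the helper names are mine rather than anything in the actual software.

```python
BUFFER_BYTES = 4096      # proposed fixed buffer size, from the text above
CONCURRENT_TRIALS = 100  # trials running at once, from the text above


def total_buffer_kib(buffer_bytes=BUFFER_BYTES, trials=CONCURRENT_TRIALS):
    """Total memory taken by the read buffers, in KiB."""
    return buffer_bytes * trials / 1024


def fits_buffer(chem_string, buffer_bytes=BUFFER_BYTES):
    """True if the string fits, leaving one byte for the terminating NUL."""
    return len(chem_string.encode("ascii")) + 1 <= buffer_bytes


print(f"{total_buffer_kib():.0f} KiB across {CONCURRENT_TRIALS} trials")
```

Running `fits_buffer` over every SMILES or InChI string in the FXa and DHFR sets beforehand would confirm or rule out the overflow hypothesis without waiting for a crash.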
Input String Manipulation
One of the items that I didn’t consider is input string manipulation. How well a machine learning system performs follows largely from how well the architecture suits the task; however, input normalization and representation are also very important.
Below are some possible manipulations I can use for the NGN.
Different deterministic orders of precedence in the rules of SMILES generation can lead to a variety of different strings for the same molecule; as long as a data set is expressed in the same order of precedence, the NGN is useful.
Below is a demonstration, originally posted at Blueobelisk, of how to manipulate the traversal order using RDKit in Python.
from rdkit import Chem

# Parse the molecule once, then write one SMILES string rooted at each atom in turn.
m = Chem.MolFromSmiles('Cc1nc(N)ccc1')
for i in range(m.GetNumAtoms()):
    print(Chem.MolToSmiles(m, rootedAtAtom=i))
Cc1nc(N)ccc1
c1(C)nc(N)ccc1
n1c(C)cccc1N
c1(N)nc(C)ccc1
Nc1nc(C)ccc1
c1ccc(C)nc1N
c1cc(C)nc(N)c1
c1ccc(N)nc1C
Directive: Investigate the different traversal algorithms.
SMILES Isomerism and InChI layer selection…
Although there is a rationale for using isomeric SMILES strings, it has never been concretely demonstrated that they are required; what would happen to the performance of the NGN if isomerism were discarded from the training data? InChI layers are currently accepted indiscriminately, but there is no rationale for that, save the suspicion that hidden relationships evolve between segments and require multiple segments. This is addressed along with the idea of hybridizing InChI and SMILES strings together.
When embedding objects into InChI as a layer, use a new ‘’ switch! It’s pretty extensible, almost like XML tags, except that closing tags are implied by opening a new tag (hence, no nesting structure).
- SMILES layers will use ‘SMILES’ as a switch
- Isomeric SMILES layers will use ‘ISOSMILES’ as a switch
- Permutations of SMILES will use ‘SMILES(smiles)’, such that ‘smiles’ is replaced by a natural number.
Directive: Use sub-traversals and hybrid traversals as follows. Test the hypothesis that each of these layers is in fact needed; test the hypothesis that no additional information is needed for either InChI or SMILES strings to operate.
A list of Possible Simple, Hybrid or Concatenated Strings
- {SMILES} – input on its own.
- {InChI … + ISOSMILES} – are each two descriptors on their own.
- {InChI … + SMILES} – are each two descriptors on their own.
- {SMILES(0) + SMILES(1) … SMILES(n-1)} – neural network where each descriptor is an NGN.
- ISOSMILES, SMILES, SMILES(0), SMILES(1) … SMILES(n-1) – each permutation on its own.
- {ISOSMILES, SMILES, SMILES(0 … n), InChI, (InChI [Formula and Connection Table Only])}
- InChI, (InChI [Formula and Connection Table Only]) – InChI with the non-molecular-structure layers removed.
- InChI, (InChI [Formula and Connection Table Removed]) – InChI with only non-molecular structure layers.
- {(InChI [Formula and Connection Table Removed]) / SMILES} – InChI where formula and connection table are removed and replaced with SMILES; this is actually SMILES with some redundant information; ensure that layers that refer to the connection table like the hydrogen layer are also removed as they would make no sense.
- {InChI / ISOSMILES}
- {InChI / SMILES}
- {InChI / SMILES(0 / … / n)}
Other, more imaginative items exist, but they further convolute and contaminate the hypothesis space of input string exploration and would require modification of the architecture; for instance, what if each traversal had its own dedicated layer? This leads to a linear growth in the number of weight layers, where the coefficient is the number of such traversals. This would play into the expert panel system more.
Layer separation: staying true to the grammar design, we would use ‘!’ to separate each subvector not designated as an InChI layer. This ‘!’ character is not used by the existing SMILES or InChI strings, but may lead to a legacy issue if it is adopted by either party in future (doubtful / easily fixed).
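A minimal sketch of how such a hybrid string could be assembled and taken apart again. The exact switch syntax isn't pinned down above, so this assumes each non-InChI subvector is written as ‘!’ + switch name + ‘:’ + payload; the ‘:’ delimiter and both function names are my own placeholders. The ethanol SMILES permutations are written by hand for illustration, not generated by a toolkit.

```python
SEP = "!"  # subvector separator proposed above


def build_hybrid(inchi, layers):
    """Append (switch, payload) subvectors to an InChI string."""
    parts = [inchi]
    for switch, payload in layers:
        if SEP in switch or SEP in payload:
            raise ValueError("separator must not occur inside a subvector")
        parts.append(f"{switch}:{payload}")  # ':' is an assumed delimiter
    return SEP.join(parts)


def split_hybrid(hybrid):
    """Recover the InChI string and the list of (switch, payload) pairs."""
    inchi, *chunks = hybrid.split(SEP)
    layers = []
    for chunk in chunks:
        # Split on the first ':' only; payloads may contain ':' themselves.
        switch, _, payload = chunk.partition(":")
        layers.append((switch, payload))
    return inchi, layers


ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHANOL_LAYERS = [("SMILES(0)", "CCO"), ("SMILES(1)", "C(C)O"), ("SMILES(2)", "OCC")]

hybrid = build_hybrid(ETHANOL_INCHI, ETHANOL_LAYERS)
print(hybrid)
```

Because ‘!’ occurs in neither notation, splitting on it is unambiguous, which is what lets the closing tags stay implicit.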
Master’s Thesis RC-2 Submitted
An excerpt from an e-mail I sent to Dr. Kremer.
The second release of my thesis has just been committed to the repository– It’s a bit bigger at roughly 110 pages. Note that the front page does not conform to the style guide yet, but that change does not involve any content change so I’m doing that after.
And something from a previous e-mail…
If time permits, want to add diagrams in Ch3 for Walsh (molecule graph topology network), Mohr (molecule kernel)– Strong desire to add discussion on other molecule kernel techniques (P-SVM compat.), probably only have time to add a few sentences.
As it turns out I never did have a chance to add those other graph kernel references to the thesis.
Balancing and Boosting a Dataset
Two things that I had wanted to try but didn’t get around to while working on my thesis are balancing the training data and boosting the data. I’ll explain each below, and likely commit the text to a wiki entry later.
Balancing a training set consists of ensuring an even distribution in the range of the training set as much as possible. Notice that while it’s important to have an evenly distributed range, where possible an attempt should be made to even out the domain as well to ensure that an inductive learning machine has a broad enough collection of exemplars to work with. Where this isn’t possible, a stochastic treatment is better than no treatment (i.e. pick out several combinations randomly).
Boosting a dataset refers to a change in training algorithm. Under normal circumstances, a neural network is trained with a static sequence of exemplars every single epoch; a boosted algorithm sees a dynamic treatment instead. In this treatment, the amount of exposure an exemplar gets during training is inversely proportional to the accuracy of the prediction made for it; an exemplar on which a neural network performs poorly is shown to the network more often.
Balancing abstractly reduces the probability that a neural network is working by chance distribution of range elements; in an extreme case, one could argue that an extension can be done to test the robustness of a method by reversing the balance that naturally occurs in the training set.
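As a rough sketch of balancing the range for a classification set, the function below undersamples every class down to the size of the rarest one, picking members at random (the stochastic treatment mentioned above). The function name and the toy data are hypothetical; for a regression target, the values would first have to be binned into classes.

```python
import random
from collections import defaultdict


def balance_by_class(exemplars, label_of, rng=None):
    """Undersample so every class has as many exemplars as the rarest class."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for exemplar in exemplars:
        by_class[label_of(exemplar)].append(exemplar)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Random pick without replacement within each class.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: 3 actives and 7 inactives (made-up identifiers).
data = [(f"mol{i}", "active" if i < 3 else "inactive") for i in range(10)]
balanced = balance_by_class(data, label_of=lambda ex: ex[1], rng=random.Random(0))
print(len(balanced))
```

Undersampling throws data away; oversampling the rare class is the obvious alternative when the set is small.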
Boosting by contrast increases system bias. Care must be taken in each epoch so that the final algorithm used does not overwhelm the system with data points that are known to be flaky (unreliable).
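One way the boosted treatment could look, as a sketch only: each epoch is drawn with replacement, with sampling weights that grow with each exemplar's error from the previous epoch. The `floor` and `cap` parameters are my own additions, so that well-learned exemplars are never dropped entirely and flaky ones cannot swamp the epoch; their default values are arbitrary.

```python
import random


def exposure_weights(errors, floor=0.05, cap=5.0):
    """Turn per-exemplar errors into sampling weights relative to the mean error."""
    mean_error = sum(errors) / len(errors)
    if mean_error == 0:
        return [1.0] * len(errors)  # nothing to emphasise yet
    return [min(max(e / mean_error, floor), cap) for e in errors]


def draw_epoch(exemplars, errors, rng=None):
    """Sample one epoch's worth of exemplars, favouring poorly predicted ones."""
    rng = rng or random.Random()
    weights = exposure_weights(errors)
    return rng.choices(exemplars, weights=weights, k=len(exemplars))


# Made-up errors: the third exemplar is predicted worst and is drawn most often.
epoch = draw_epoch(["a", "b", "c"], [0.0, 0.1, 0.9], rng=random.Random(0))
print(epoch)
```

The errors would be recomputed after every epoch, so the exposure shifts as the network improves.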
Internet Connection Sharing
I’ve found that the implementation of NAT (network address translation, also called internet connection sharing [ICS] (OSX) and network bridging (Windows)) on Tin is pretty decent. The most common use of NAT from a laptop is to convert the laptop into a software wireless router: the laptop is connected to an actual router via wire, and transmits the network wirelessly. I needed to do the reverse: to receive a wireless signal and then share the address with a desktop that doesn’t have a wireless interface.
This worked until the wireless connection inexplicably stopped resolving: connecting to any IP address from either the desktop or Tin resulted in server timeouts. This happened over and over, and usually required about ten minutes of waiting after each failure and reconnection to the wireless router.
I’ve read a bit more, and I’m really no closer to figuring out why this happened, but I have a few hypotheses. First, it could be that the NAT implementation isn’t as bulletproof as I had hoped; second, it could be that the router somehow doesn’t like what I’m doing; and third, my ISP may see the setup as somehow disruptive to other clients and so disconnected me.
SSH, (S)FTP, VNC Tests Etc.
With the present setup, my D-Link forwards TCP/80 requests to Tin on 8080, which is where Apache lives. It made sense to set up SSH and FTP as well (SFTP uses the same port as SSH AFAIK). This turned out to be remarkably easy, allowing FTP to pass through TCP/21 and SSH to pass through TCP/22, each to Tin. VNC generally uses UDP 5900 and TCP 5900~5902, but opening just UDP+TCP/5900 did the trick.
Of course, as there are _many_ sensitive items on Tin, I can’t afford to just blast everything online so I’ve deactivated the port forwarding for those services after the test was done.
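A small sketch of how the forwards could be checked from outside, and how to confirm they are closed again afterwards. It only tests whether a TCP connection can be opened; it says nothing about UDP or about whether the right service answers. The function names are mine, and the host passed to `check_forwards` would be whatever public address the router has.

```python
import socket

# TCP ports forwarded by the router in the setup described above.
FORWARDED_TCP_PORTS = {"HTTP": 80, "FTP": 21, "SSH/SFTP": 22, "VNC": 5900}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_forwards(host, ports=FORWARDED_TCP_PORTS):
    """Map each service name to whether its forwarded TCP port is reachable."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_forwards` with the router's public address from a machine outside the LAN should report `True` for each service while the forwards are active, and `False` once they have been deactivated.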
Whenever it is that I get the new web serving hardware, I’ll have to create a lower privilege account to host everything. The plan is to have a box that doesn’t have any kind of interface except for its network connection so that everything is controlled over SSH/SFTP/VNC etc. wherever I am physically in the world.
Edit: Oh! I’ve just learned this is called running a machine headless. I’ve heard that term before, I just didn’t put it together.
Ed's Big Plans