Archive for the ‘SharcNet’ tag
sqkillall.py
Brief: I forgot all about sqkillall.py! It’s a convenience script for killing all of the SharcNet jobs belonging to you! (More about it; Source code).
Meeting with Chris
Brief: Met with Chris last week. Chris finished with the convergence tests and some cross validation sets on his descriptors and recommended his own design for 80/20 prediction tests… Meanwhile, I’ve updated the InChI grammar used for the NGN to work with the new data, and have set up experiments to run convergence tests using the SMILES-NGN and InChI-NGN on the eight possible QSAR datasets on SharcNet (16 processes total)… Next on the list– create a script to evaluate his preliminary cross validation experiments (based on Neural Network predicted vs. target values) and provide instructions for running the convergence tests with my NGN software… Will need to pull up an old nugget.py to wrap the convergence test (current one doesn’t halt and always runs 100 trials).
Soon: Port everything to Ubuntu Linux, care of a Sun VirtualBox VM, so that we can maintain compatibility without further porting… Meeting again tomorrow…
Python 2.3, 2.6 and 3.0
The HPC serial cluster “Whale” on SharcNet has installed Python 2.3 for all of its nodes; in fact, I can almost guarantee that Python 2.3 is standard across SharcNet.
This means that there is no subprocess module and no top-level exit() function (use sys.exit() instead). A few iterator constructs and other top-level functions are changed, notably the semantics of zip(), range() and xrange(). Interestingly, on review, I *should* have used enumerate() in these incompatible cases anyway. But then again, this now-legacy code was written long before I even knew about that function.
This is actually NOT quite as bad as it sounds, since there aren’t too many features that can’t be backported to work with Python 2.3 (SharcNet) and Python 2.6 (Tin & Pewter) simultaneously. In future, when I switch to Python 3.0, I’ll have to take even more care since syntax changes and some major revisions have been adopted (for one, “print”, formerly a language construct, is now print(), a normal top-level function)… In reality, I’m likely to have both installed and will treat them as separate languages (a sane way to manage legacy code, I’m told).
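To make the two workarounds mentioned above concrete, here is a minimal sketch. The function names index_pairs() and finish() are made up for illustration and are not part of the NGN code; the sketch sticks to constructs that, as far as I know, already exist in Python 2.3 (enumerate(), list comprehensions, sys.exit()).

```python
import sys

def index_pairs(items):
    """Pair each item with its position using enumerate().

    Hypothetical helper: stands in for loops that were previously
    written with zip() and range()/xrange().
    """
    return [(i, item) for i, item in enumerate(items)]

def finish(code=0):
    """Exit through sys.exit() rather than the bare exit() helper,
    which is not available on the older interpreter."""
    sys.exit(code)
```

Calling index_pairs(['C', 'N', 'O']) gives a list of (position, item) tuples, which is usually all the old zip()/range() loops were after.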
Today, I’m working on backporting the NGN to be 100% compatible with SharcNet. It takes far too long to compile, zip, upload, decompress, compile, run; the cycle can be expedited by having just a single compile procedure. So far, the problem can be traced to some issues with iterators implemented early on… I can likely take this opportunity to make some of the code more efficient as well.
UPDATE: The convergence was not as straightforward as I had hoped, so I’ll still rely on two compile steps. As it turns out, there’s a str.partition() step which is impossible to ignore. It’s possible to do a str.split(), and then rejoin each subsequent segment together, but the amount of testing needed… I’ll need to return to updating the NGN code later. For now, I’m able to at least cleanly break apart the compilation into a “source generation” step, then a “compilation” step.
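For reference, a rough sketch of what such a replacement might look like. This is not the NGN code, and the function name partition() is just an illustration. It assumes that limiting str.split() to a single split is acceptable, which sidesteps the rejoining step; it would still need the testing mentioned above.

```python
def partition(text, sep):
    """Mimic str.partition() using only str.split().

    Returns (head, sep, tail) when sep is found, otherwise
    (text, '', ''), the same shape str.partition() returns.
    """
    if not sep:
        raise ValueError("empty separator")
    # maxsplit=1 splits at the first occurrence only, so the tail
    # keeps any later separators and nothing has to be rejoined.
    pieces = text.split(sep, 1)
    if len(pieces) == 1:
        return (text, "", "")
    return (pieces[0], sep, pieces[1])
```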
Convergence Tests
Well, the convergence tests are well underway running on SharcNet (whale serial cluster) for the new datasets (2 Aqueous Solubility, 1 Melting Point)– I’m currently only running them for SMILES due to a problem with one of the Aqueous Solubility sets. I’ve decided to use a “ridiculously high” number of parameters (12 hidden units per hidden layer), and a “ridiculously low” learning rate / momentum (0.3 each).
A convergence test is basically used to see if the data and parameters chosen can be operated on by the NGN. For me, the test set and training sets are identical in convergence testing.
I’ll probably rerun this with a yet lower learning rate (train = 0.15 / momentum = 0.45). Why? Because I have a feeling that the number of hidden units actually enables a greater chance of accidentally falling into the close search space, while the minute math arguments allow these accidents to be gently brushed into being.
I’m presently running all the regression data I’ve ever run in the past, this time in both classic normalization and rank-normalization. Finally, one of the Aqueous Solubility datasets came with poorly formed SMILES: they can’t be parsed by the grammars that I made up, and they can’t be parsed by OpenBabel, so conversion to InChI wasn’t possible in this round, let alone operation by my system. I’m going to presume they’re broken and skip those entries, which still leaves well over a thousand exemplars in that set. This set will have to be rerun with the correct data soon.
Convergence RMSE has been set to 1%, meaning that I have a stricter idea of what I think is a deviation from correct. I want to plot the results as a “target on actual” residuals line plot when this is done to give me an idea of how these parameters worked on self-comparison.
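As a rough illustration of the two normalizations, here is a sketch under my own assumptions: “classic” is taken to mean min-max scaling into [0, 1], and “rank” is taken to mean replacing each value by its position in sorted order, scaled into [0, 1]. Neither definition is confirmed by the experiment setup, both function names are hypothetical, and tied values simply receive distinct ranks in order of appearance. The code is written for a current Python rather than 2.3.

```python
def minmax_normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

def rank_normalize(values):
    """Replace each value by its rank in sorted order, scaled to [0, 1]."""
    n = len(values)
    if n < 2:
        return [0.0 for _ in values]
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, index in enumerate(order):
        ranks[index] = rank / float(n - 1)
    return ranks
```

The practical difference shows up with outliers: a single extreme target squashes everything else under min-max scaling, while rank scaling keeps the values evenly spaced.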
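A minimal sketch of how such a cutoff could be checked, assuming targets are normalized to [0, 1] so that “1%” corresponds to an RMSE of 0.01. That reading of the threshold, and the names rmse() and has_converged(), are my assumptions for illustration and not the NGN implementation.

```python
import math

def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    total = 0.0
    for p, t in zip(predicted, target):
        total += (p - t) ** 2
    return math.sqrt(total / len(target))

def has_converged(predicted, target, threshold=0.01):
    """True when the RMSE is at or below the threshold (assumed 0.01 for '1%')."""
    return rmse(predicted, target) <= threshold
```

The same per-exemplar differences (p - t) are the residuals that would go into the “target on actual” plot.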
Of course, everything here is subject to editing, as I figure exactly how to implement and deploy the previously mentioned changes to how regression experiments are defined (for convergence, for boosting/balancing/verification).
Most immediate next steps:
- Select OK Aqueous SMILES and run.
- Rerun all with InChI-NGN
Ed's Big Plans