Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Quantitative Structure-Activity Relationship’ tag

Meeting with Chris

Brief: Met with Chris last week. Chris finished the convergence tests and some cross-validation sets on his descriptors, and recommended his own design for 80/20 prediction tests… Meanwhile, I’ve updated the InChI grammar used for the NGN to work with the new data, and have set up experiments to run convergence tests using the SMILES-NGN and InChI-NGN on the eight possible QSAR datasets on SharcNet (16 processes total)… Next on the list: create a script to evaluate his preliminary cross-validation experiments (based on neural network predicted vs. target values) and provide instructions for running the convergence tests with my NGN software… Will need to pull up an old nugget.py to wrap the convergence test (the current one doesn’t halt and always runs 100 trials).
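For the record, here is roughly what I have in mind for the evaluation script: a minimal sketch, assuming the predicted and target values have already been read into two equal-length lists. The function name `evaluate` and the choice of RMSE plus Pearson correlation are my own assumptions for illustration; nothing here is settled.

```python
import math


def evaluate(predicted, target):
    """Return (rmse, pearson_r) for paired predicted and target values."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(predicted)

    # Root mean squared error between prediction and target.
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)

    # Pearson correlation coefficient, computed from centred sums.
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        r = float("nan")  # correlation is undefined for a constant series
    else:
        r = cov / math.sqrt(var_p * var_t)
    return rmse, r
```

Running this once per cross-validation fold and averaging the results would give one summary number per descriptor set.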

Soon: Port everything to Ubuntu Linux, courtesy of a Sun VirtualBox VM, so that we can maintain compatibility without further porting… Meeting again tomorrow…

NNcmk: A Neural Network (Win32 & OSX)

Okay, I managed to finish that 3-layer neural network implementation the other day. Actually, it was a while ago, but I was too busy to post about it. It’s a pretty standard network, but I’m proud to say it’s small and works on OSX and Win32. I’ll have to put in a few #define directives to have it work on Linux as well.

I will have to document it too when I get a chance. I made a brand new executable (instead of using the source from my previous projects) because I needed something that takes launch-time parameters, so that it doesn’t need to be recompiled each time someone uses the binary on a new dataset with a different number of inputs. Right now, hardly any parameters are hard-coded; almost everything can be set at launch-time.

The NNcmk (Neural Network – Cameron, Ma, Kremer) package is C compilable, uses the previously developed in-house library for the NGN and will be available shortly after I’m satisfied that I’ve squashed all the bugs, fixed the output and have documented the thing completely. I think Chris has difficulty with it right now mostly because I didn’t specify exactly what parameters do what– I did at least provide a (DOS) batch file with an example run-in-train-mode / run-in-test-mode sequence…
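To illustrate the launch-time parameter idea (and only the idea), here is a small Python sketch. NNcmk itself is written in C, and none of the flag names below (`--mode`, `--inputs`, `--hidden`, `--outputs`, `--data`) are its real options; they are hypothetical stand-ins until I document the actual ones.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical 3-layer network launcher."""
    parser = argparse.ArgumentParser(
        description="Hypothetical launcher; not the real NNcmk interface")
    parser.add_argument("--mode", choices=["train", "test"], required=True,
                        help="run in train mode or test mode")
    parser.add_argument("--inputs", type=int, required=True,
                        help="number of input units (dataset dependent)")
    parser.add_argument("--hidden", type=int, default=10,
                        help="number of hidden units")
    parser.add_argument("--outputs", type=int, default=1,
                        help="number of output units")
    parser.add_argument("--data", required=True,
                        help="path to the dataset file")
    return parser


def parse_settings(argv):
    """Parse argv and reject layer sizes smaller than one."""
    args = build_parser().parse_args(argv)
    if min(args.inputs, args.hidden, args.outputs) < 1:
        raise ValueError("layer sizes must be positive")
    return args
```

The point is that a new dataset with a different number of inputs only changes the command line, for example `--mode train --inputs 12 --data some_file`, rather than forcing a recompile.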

Back to work on that paper right now though…

QSAR Descriptors – Chris’ Project

Wow! Go Chris!

With Chris’ dataset collection finished, it’s time to convert every single one into a nice suite of descriptor sets. Chris has gone and learned the CDK inside and out. What started as a complicated and frustrating struggle against Java in Windows and the technicalities of class paths etc. has turned into a fruitful and promising endeavour. While he was working on that, I stumbled onto Bioclipse, which uses CDK internally for some of its molecule data conversions. Interestingly, its QSAR feature is not yet complete, or is experimental. It looks promising for the future, but I received an e-mail from Chris telling me all was figured out with CDK before I could figure out Bioclipse’s QSAR feature.

The Windows NN port that I’ve been working on is almost done. I’d say I’ll want two more days: today to finish the port and test on a Windows box, and tomorrow to figure out how to chain everything together with a Windows port of GNU make and some Windows variant of GCC.

We’ve decided that the format to express the molecules will be as follows…

  • Each descriptor set has its own file
  • Each file corresponds to a target species and a target organ
  • Each row in each file corresponds to a molecule
  • Each row has the columns <Comma Delimited QSAR Descriptor Elements>; Range (LD50); Species; Organ
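A reader for one row of this format could look something like the sketch below. It assumes the descriptor elements are all numeric and that the LD50 range is kept as a raw text label; the names `MoleculeRow` and `parse_row`, and the sample values in the usage note, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MoleculeRow:
    descriptors: list   # numeric QSAR descriptor elements
    ld50_range: str     # LD50 range, kept as the raw label from the file
    species: str
    organ: str


def parse_row(line):
    """Split one row into descriptors; range; species; organ."""
    fields = [field.strip() for field in line.strip().split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 ';'-separated fields, got {len(fields)}")
    # The first field holds the comma-delimited descriptor elements.
    descriptors = [float(value) for value in fields[0].split(",")]
    return MoleculeRow(descriptors, fields[1], fields[2], fields[3])
```

For instance, `parse_row("0.5, 1.25, 3; high; mouse; liver")` yields three descriptor values plus the three trailing labels.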

If we keep on track, we’ll have the first experiments running on SharcNet by early next week.

Eddie Ma

June 25th, 2009 at 3:45 pm

Meeting with Chris

Brief: We’ve taken on a new strategy: Chris is building a novel database of LD50 values for a great many compounds. We’ll be generating descriptors with some free software (JOELib and CDK). Eventually, the fixed-width descriptor vectors will be used in Neural Networks, and their SMILES and InChI counterparts in NGNs; the ultimate goal is the development of either a nested neural decision tree whose subtrees are the descriptor network and NGNs… OR, the nesting of the descriptor network inside an NGN… OR, the creation of an expert voting system where each decision system gets to vote on a particular molecule of interest. With the Windows NN software draft and the NGN ready for SharcNet, preliminary trials can start soon.
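The voting option is the easiest of the three to picture. Here is a minimal sketch, assuming each decision system (descriptor network, SMILES-NGN, InChI-NGN) has already produced a class label for the molecule; the plain majority rule and the tie handling are my assumptions, not a design we have agreed on.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label, or None when the top labels tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # A tie between the two leading labels means no clear winner.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_label
```

With three voters and two classes a tie cannot happen, but returning `None` keeps the sketch honest for other panel sizes.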

Meeting with Chris

Chris’ project has grown to data sets of roughly three hundred exemplars for each of the mouse and rat data sets. These are the sets that map molecules to some physiological defect, by organ or tissue. I think he’ll be onto his next phase shortly: taking the data and applying some machine learning construct to it.

I’ve recommended four papers for him to read. Three of them discuss QSAR in general and compare the performance of different approaches. The last paper explicitly uses neural networks on descriptors for the regression of melting points. The use of neural networks or similar technology is something that he’s expressed a lot of interest in, so I think this selection fits in well. I’ve provided him with an adapted version of the melting point dataset where the domain is re-expressed as SMILES and InChI.

I think it might be good to set him up with NGNs for those items as well as NNs for the descriptor vector used in the melting point paper.

Meeting with Chris Cameron

Chris is a student working on a summer project towards starting a Master’s degree soon. We’ve been meeting semi-regularly so that we can discuss his project ideas and so that I can offer assistance with whatever cheminformatics / bioinformatics software, datasets and any other technologies that have come up in his work.

Today’s meeting was kind of interesting– it involved revisiting the idea that the NGN can be integrated in a larger expert decision system, and also that the NGN can itself operate as a molecular kernel whose output can be then fed into some other decision machine. Dr. Kremer had a few ideas, but I didn’t quite grasp what was being conveyed– it can however be narrowed down to three specific designs.

  • The NGN output can be used as features in a larger molecular descriptor vector space.
  • A different NGN tuned toward a specific kind of problem is selected from a panel of NGNs trained at different problems; this selection is done by an expert (decision tree, or some other state machine).
  • The NGN can output an n-dimensional array; for instance, a 3D grid of outputs whose three axes are labeled ‘species’, ‘LD50 class’ (LD50 being the lethal dose for 50% of the population), and ‘target organ or tissue’.
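To make the third design a little more concrete, here is a sketch of how a flat vector of network outputs could be read as a 3D grid. The axis labels in `SPECIES`, `LD50_CLASSES` and `ORGANS` are placeholder values I picked for illustration (only mouse and rat come from the actual data sets), and the row-major layout is an assumption.

```python
# Hypothetical axis labels; the real classes would come from the datasets.
SPECIES = ["mouse", "rat"]
LD50_CLASSES = ["low", "medium", "high"]
ORGANS = ["liver", "kidney"]


def flat_index(species, ld50_class, organ):
    """Map (species, LD50 class, organ) to a row-major flat output index."""
    s = SPECIES.index(species)
    c = LD50_CLASSES.index(ld50_class)
    o = ORGANS.index(organ)
    return (s * len(LD50_CLASSES) + c) * len(ORGANS) + o


def lookup(outputs, species, ld50_class, organ):
    """Read one cell of the 3D grid from a flat list of network outputs."""
    expected = len(SPECIES) * len(LD50_CLASSES) * len(ORGANS)
    if len(outputs) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(outputs)}")
    return outputs[flat_index(species, ld50_class, organ)]
```

With these placeholder axes the network would need 2 × 3 × 2 = 12 output units, one per grid cell.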

As a side note, we decided it might be good to look at the traditional QSAR task with traditional descriptors and the good ol’ feed-forward neural network. I’ll prepare the general neural network software that I used in Experimental Design for Chris; the first version I sent had too much stuff in it, so I’ll strip it down to “just operational” so he can actually understand the thing.