Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Quantitative Structure-Activity Relationship’ tag

Meeting with Chris

Brief: Met with Chris last week. Chris finished the convergence tests and some cross validation sets on his descriptors, and recommended his own design for 80/20 prediction tests… Meanwhile, I’ve updated the InChI grammar used for the NGN to work with the new data, and have set up experiments to run convergence tests using the SMILES-NGN and InChI-NGN on the eight possible QSAR datasets on SharcNet (16 processes total)… Next on the list: create a script to evaluate his preliminary cross validation experiments (based on Neural Network predicted vs. target values) and provide instructions for running the convergence tests with my NGN software… I will need to pull up an old nugget.py to wrap the convergence test (the current one doesn’t halt and always runs 100 trials).

Soon: Port everything to Ubuntu Linux so that we can maintain compatibility without further porting, courtesy of a Sun VirtualBox VM… Meeting again tomorrow…
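A minimal sketch of what such an evaluation script might compute, assuming the network's predictions and the target values are available as two equal-length lists of numbers. The function names, the choice of metrics (RMSE, MAE, R²) and the plain random 80/20 split are all assumptions made for illustration; Chris' own 80/20 design is not described here and may differ.

```python
import math
import random


def evaluate_predictions(predicted, target):
    """Return RMSE, MAE and R^2 for paired predicted/target values."""
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(target)
    errors = [p - t for p, t in zip(predicted, target)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_t = sum(target) / n
    ss_tot = sum((t - mean_t) ** 2 for t in target)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when every target value is identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def split_80_20(rows, seed=0):
    """Shuffle rows reproducibly and split them 80% train / 20% test.

    Assumption: a plain random split, not any particular published design.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(0.8 * len(rows)))
    return rows[:cut], rows[cut:]
```

Fixing the seed keeps the split reproducible, so the same held-out 20% can be reused across descriptor sets.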

Written by Eddie Ma

July 27th, 2009 at 12:35 am

Convergences Detected!

Good News

Chris’ project has come back to the forefront– after I defend my thesis on Wednesday, it’ll certainly have all of my attention.

We will at least be meeting on Monday though to discuss what can be done in the interim.

Convergence Tests Went Fine

We decided that it would be good for Chris to run a few convergence tests on the datasets he put together across each of the available descriptor sets. So far, many have come back converged, meaning that it would be good to proceed. I have two concerns. First, do we want to melt only the converged descriptors together, or melt all of the descriptors together regardless of convergence? Second, if we don’t melt them now, can we do it after the fact and argue that neural network convergence is a good indicator of which descriptors are correlated with the results we care about?

Melting Descriptors

To clarify– I mean “concatenating” real value vectors when I say “melting”. This means that we splice together a few linear arrays of numbers and come up with a new longer array that’s still fixed length.
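As a small illustration of "melting", here is a sketch that concatenates several fixed-length descriptor vectors into one longer fixed-length vector. The function name `melt` and the example values are made up for this sketch.

```python
def melt(*descriptor_vectors):
    """Concatenate several real-valued descriptor vectors into one list.

    If every input has a fixed length, the output length is fixed too:
    it is simply the sum of the input lengths.
    """
    melted = []
    for vec in descriptor_vectors:
        melted.extend(float(x) for x in vec)
    return melted


# Two hypothetical descriptor sets for the same molecule.
set_a = [0.5, 1.0, 2.0]
set_b = [7.0, 8.0]
combined = melt(set_a, set_b)  # length 3 + 2 = 5
```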

The second argument only holds if it turns out that the selected, melted, converged descriptors have better predictive power than all of the descriptors melted together. It’s an even stronger (and more practical) case if it turns out that the descriptors behave better in concert than any particular subset on its own.

That would be an interesting case. Running an additional eight or sixteen experiments to test that hypothesis is cheap to set up and cheap to run.

Alternatives– A Faster Solution

An alternative approach is to naïvely forget about descriptor space reduction / augmentation for now, and just go on and create training and test sets, or cross validation sets. I think with the strained timelines, this would be the wiser objective to knock down first. I’ll make a ticket for myself for both of these. I should also look up how to use the Tanimoto coefficient, since that will assist in the design of “maximum dissimilarity” test sets to ensure we have good predictive / extrapolation power.
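A rough sketch of how the Tanimoto coefficient could feed a "maximum dissimilarity" selection, assuming each molecule is represented as a set of "on" fingerprint bits. The fingerprint representation, the function names and the greedy MaxMin strategy are assumptions for illustration, not something settled in this post.

```python
def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints match
    return len(a & b) / union


def max_dissimilarity_pick(fingerprints, k):
    """Greedily pick k indices so each new pick is as far as possible
    (distance = 1 - Tanimoto) from its nearest already-picked neighbour.

    Assumption: the first molecule seeds the selection.
    """
    if k <= 0 or not fingerprints:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(fingerprints)):
        best_i, best_d = None, -1.0
        for i, fp in enumerate(fingerprints):
            if i in chosen:
                continue
            # Distance to the closest molecule that is already chosen.
            d = min(1.0 - tanimoto(fp, fingerprints[j]) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen


# Hypothetical fingerprints: the third shares no bits with the first two.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
picked = max_dissimilarity_pick(fps, 2)
```

The picked indices could serve as a test set that is deliberately unlike the remaining training molecules.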

And the NGN…

Finally, I need to go back and uncover a working version of the NGN to use with Chris’ data. I don’t think that InChI is possible, but we’ll try anyway. The SMILES strings are already here, so I can certainly at least run a few convergence tests of my own. That constitutes eighty runs at worst (8 datasets × 10 trials, if every trial fails) and eight runs at best (8 datasets × 1 trial, if each converges on the first try). I am going to leave this in Unix compatible form because there isn’t enough time to complete the Windows port of the NGN.

This should be OK though since everything will be set up for SharcNet.
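The run counts above follow from a loop that stops at the first converged trial. Here is a sketch of such a wrapper, assuming a hypothetical callable `train_once(dataset, trial)` that returns True on convergence; the real NGN and nugget.py interfaces are not shown in this post and will differ.

```python
def run_convergence_test(train_once, dataset, max_trials=10):
    """Run up to max_trials trials, halting at the first that converges.

    Returns the 1-based trial number that converged, or None if none did.
    """
    for trial in range(1, max_trials + 1):
        if train_once(dataset, trial):
            return trial
    return None


def count_runs(datasets, train_once, max_trials=10):
    """Total number of training runs used across all datasets."""
    runs = 0
    for dataset in datasets:
        converged_at = run_convergence_test(train_once, dataset, max_trials)
        runs += converged_at if converged_at is not None else max_trials
    return runs


datasets = ["dataset_%d" % i for i in range(8)]  # placeholder names
best_case = count_runs(datasets, lambda d, t: True)    # all converge at once
worst_case = count_runs(datasets, lambda d, t: False)  # nothing converges
```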

Written by Eddie Ma

July 11th, 2009 at 7:00 pm

NNcmk: A Neural Network (Win32 & OSX)

Okay, I managed to finish that 3-layer neural network implementation the other day. Actually, it was a while ago, but I didn’t post about it because I was busy. It’s a pretty standard network, but I’m proud to say it’s small and works for OSX and Win32. I have to put in a few #define directives to have it work with Linux as well.

I will have to document it too when I get a chance. The reason why I made a brand new executable (instead of using the source from my previous projects) is that I needed something that would take in launch-time parameters, so that it doesn’t need to be recompiled each time someone decides to use the binary on a new dataset with a different number of inputs. Right now, the thing has barely any hard-coded parameters that can’t be set at launch-time.

The NNcmk (Neural Network – Cameron, Ma, Kremer) package compiles as C, uses the previously developed in-house library for the NGN, and will be available shortly after I’m satisfied that I’ve squashed all the bugs, fixed the output and documented the thing completely. I think Chris has difficulty with it right now mostly because I didn’t specify exactly which parameters do what. I did at least provide a (DOS) batch file with an example run-in-train-mode / run-in-test-mode sequence…

Back to work on that paper right now though…

QSAR Descriptors – Chris’ Project

Wow! Go Chris!

With Chris’ dataset collection finished, it’s time to convert every single one into a nice suite of descriptor sets. Chris has gone and learned the CDK inside and out. What started as a complicated and frustrating struggle against Java in Windows and the technicalities of class paths etc. has turned into a fruitful and promising endeavour. While he was working with that, I stumbled onto Bioclipse, which utilizes CDK internally for some of its molecule data conversions. Interestingly, the QSAR feature is not yet complete, or is experimental. It looks promising for the future, but I received an e-mail from Chris telling me all was figured out with CDK before I could figure out Bioclipse’s QSAR feature.

The Windows NN port that I’ve been working on is almost done. I’d say I’ll want two more days: today to finish the port and test on a Windows box, and tomorrow to figure out how to chain everything together with a Windows port of GNU make and some Windows variant of GCC.

We’ve decided that the format to express the molecules will be as follows…

  • Each descriptor set has its own file
  • Each file corresponds to a target species and a target organ
  • Each row in each file corresponds to a molecule
  • Each row has the columns <Comma Delimited QSAR Descriptor Elements>; Range (LD50); Species; Organ
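For illustration, here is a sketch of a reader for one row in the layout listed above, assuming semicolons separate the four columns and commas separate the descriptor elements. The example row, its values and the names `MoleculeRow` and `parse_row` are made up; the LD50 range is kept as text because its exact notation is not specified here.

```python
from typing import List, NamedTuple


class MoleculeRow(NamedTuple):
    descriptors: List[float]  # comma-delimited QSAR descriptor elements
    ld50_range: str           # kept as text; notation is an assumption
    species: str
    organ: str


def parse_row(line):
    """Parse '<d1,d2,...>; Range (LD50); Species; Organ' into a MoleculeRow."""
    fields = [f.strip() for f in line.strip().split(";")]
    if len(fields) != 4:
        raise ValueError("expected 4 semicolon-separated columns, got %d"
                         % len(fields))
    descriptors = [float(x) for x in fields[0].split(",") if x.strip()]
    return MoleculeRow(descriptors, fields[1], fields[2], fields[3])


# Hypothetical row; the values are illustrative only.
row = parse_row("0.5, 1.25,3; 50-300; rat; liver")
```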

If we keep on track, we’ll have the first experiments running on SharcNet by early next week.

Written by Eddie Ma

June 25th, 2009 at 3:45 pm

Meeting with Chris

Brief: We’ve taken on a new strategy: Chris is building a novel database of LD50 values for a large number of compounds. We’ll be generating descriptors with some free software (JoeLib and CDK). Eventually, the fixed-width descriptor vectors will be used in Neural Networks, and their SMILES and InChI counterparts in NGNs. The ultimate goal is the development of either a nested neural decision tree whose subtrees are the descriptor network and NGNs… OR, the nesting of the descriptor network inside an NGN… OR, the creation of an expert voting system where each decision system gets to vote on a particular molecule of interest. With the Windows NN software draft and NGN ready for SharcNet, preliminary trials can start soon.
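Of the three options, the expert voting system is the simplest to sketch. The version below assumes each decision system outputs a categorical label (for example an LD50 range) and that a plain majority decides; the labels, the function name and the tie-breaking rule are assumptions, not a settled design.

```python
from collections import Counter


def vote(predictions):
    """Return the label predicted by the most decision systems.

    Ties go to the label that appears first in the input order.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from a descriptor network, a SMILES-NGN and an InChI-NGN.
decision = vote(["high", "low", "high"])
```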

Written by Eddie Ma

June 12th, 2009 at 4:41 pm

Meeting with Chris

Chris’ project has grown to data sets of roughly three hundred exemplars for each of the mouse and rat data sets. These are the sets that map molecules to some physiological defect, by organ or tissue. I think he’ll be onto his next phase shortly: taking the data and applying some machine learning construct to it.

I’ve recommended four papers for him to read. Three of them discuss QSAR in general and compare the performance of different approaches. The last paper explicitly uses neural networks with descriptors for the regression of melting points. The use of neural networks or similar technology is something that he’s expressed a lot of interest in, so I think this selection fits in well. I’ve provided him with an adapted version of the melting point dataset where the domain is re-expressed as SMILES and InChI.

I think it might be good to set him up with NGNs for those items as well as NNs for the descriptor vector used in the melting point paper.

Written by Eddie Ma

June 5th, 2009 at 10:27 am
