Archive for the ‘Neural Grammar Network’ tag
Brief: The BIBM09 conference was the very first conference I had ever attended. I learned a lot from the various speakers and poster sessions–
I thought it was really interesting how the trend is now to study and manipulate large interaction pathways in silico– a recurring theme was the integration of many different data sources (chemical, drug and free text) as well as the connection of physical protein interaction pathways to gene expression pathways. There was even a project which dealt with the alignment of pathway graphs (topology).
Dealing with pathways by hand, especially in the form of a picture, is probably the bane of many biologists’ existence– I think that the solutions we’ll see in the next few years will turn this task into simple data-in-data-out software components, much like the ones we already have for sequence alignments.
And now, back to the real world!
Addendum: My talk went very well.
And here are my slides with a preview below.
Brief: I’m particularly happy with this diagram… I had something along these lines in my head for a while, but I could never figure out how to draw it correctly. It never occurred to me that simplifying it to three easy steps was the smarter thing to do.
Brief: The paper I codenamed MSc-X2 / MSc-IEEE was just accepted… Final version due on Sept. 1st? Conference in November? But I’m going on vacation! I’ll patch what I can and figure out with Stefan what else needs to be done!
Brief: Met with Chris last week. Chris finished the convergence tests and some cross validation sets on his descriptors, and recommended his own design for 80/20 prediction tests… Meanwhile, I’ve updated the InChI grammar used by the NGN to work with the new data, and have set up experiments to run convergence tests using the SMILES-NGN and InChI-NGN on the eight possible QSAR datasets on SharcNet (16 processes total)… Next on the list– create a script to evaluate his preliminary cross validation experiments (based on Neural Network predicted vs. target values) and provide instructions for running the convergence tests with my NGN software… Will need to pull up an old nugget.py to wrap the convergence test (the current one doesn’t halt and always runs 100 trials).
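That evaluation script doesn’t exist yet, but its core is simple: score each set of predictions by comparing the network’s outputs against the target values. A minimal sketch– the function name and the example numbers are mine, not from the actual codebase:

```python
import math

def evaluate(pairs):
    """Score (predicted, target) pairs with RMSE and Pearson's r."""
    n = len(pairs)
    preds = [p for p, _ in pairs]
    targets = [t for _, t in pairs]
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in pairs)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in preds))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets))
    r = cov / (sd_p * sd_t) if sd_p and sd_t else float("nan")
    return rmse, r

# Toy run on three made-up (predicted, target) pairs:
rmse, r = evaluate([(1.1, 1.0), (2.0, 2.2), (2.9, 3.0)])
```

Reporting both numbers is deliberate: RMSE measures absolute error in the units of the target property, while Pearson’s r measures how well the ranking is preserved, so neither can hide a problem the other would catch.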
Okay okay okay– so the defense was about four days ago, and I need to get myself organized… these are the projects I can afford to participate in, or that will yield the most return in terms of enjoyment or some other intangible value.
- Get married– with the wedding now only two weeks away, I need to pull everything I need together and really support Cara in the remaining tasks. As far as I understand, almost everything is in order already and it’s down to ballroom dance practicing etc. and getting our lines memorized. Well, there’s probably a lot more in terms of communicating with the flower girl and ring bearer– and my family– so hopefully we’ll magically finish just on time. Plus there’s a week in August where we’ll be at Disney World for the honeymoon, so I’ll certainly find myself in a bubble away from anything else.
Long Term Focus
- The PhD– I’ve already written to Liz and Brendan about the summer– or rather, about the changes to this summer. Earlier on, we had agreed it would be good if I got a head start on the PhD project in the summer. At the very least, I’d refine the problem space and be able to formalize my interests. It looks like the most rational thing to do now is to make a normal clean start in the Fall, since there are a few outstanding things I need to tackle back at Guelph.
Important Intermediate Projects (In Randomly Generated Order)
- Graduating and all its caveats (Must Finish)– I need to fix the thesis, which means I need to get the examiners’ notes back. I also need to finish some paperwork — I have one stack of signed documents, which I shouldn’t lose, that has to be handed in with the final thesis, and another stack that has to be handed in with the department keys. This item doesn’t stress me out as much as it probably should…
- iGEM (Would Be Nice)– Nothing has been delegated to me right now, so I actually should chase down Andre next week when we morph from developers to end users. Yes, that’s right– we shall suffer the glory of using our own software creation. I really ought to check out the repository to see if I can understand the code logic after everyone’s touched it (the modules are more or less mature at this point). Reading the code would be a start — after that, the modeling team will probably break apart and merge into other teams in UWiGEM. Plus there’s mathematical modeling, planning for next year etc. The iGEM project has the potential to be paper worthy if we manage to get some decent results… it’ll be a feat of an interdisciplinary team, which makes me happy to be a part of it all.
- Chris’ Project (Must Finish)– So, this item is always running in the background since I’m not the primary owner of this project. Chris has done an excellent job with the science so far, which motivates me to offer him the support he needs to finish… this one’s another potential paper– but again, we need to get results and actually have something interesting to say. He’s actually doing five-fold and leave-one-out cross validation schemes right now, but that’s a post for another time.
- MSc-X3 (Would Be Nice)– The math paper! Stefan and I decided a long time ago that a math paper would be good to complete the story of the NGN. This would actually be slightly more comprehensive than the thesis in explaining the math, including things like run time costs and analysis, plus all of the equations that got lost in translation.
- Andre’s Mystery MSc-??? (Would Be Nice)– I have little detail about this, but it’s another gene management system with the additional feature that fragments can be checked out and added to end users’ own databases. I’m really curious to learn more. The last I heard, Andre managed to churn out code that didn’t have any bugs– panicked at the lack of bugs– then found some bugs– and was relieved.
- I’m going to stop short of listing three additional projects I had been working on previously– I have realized that I don’t have time, and the organizations (and one person) to whom these projects belong are probably aware that I managed to get buried in work… I really want to revisit these items in the future, but I am unsure when time will become available again.
Chris’ project has come back to the forefront– after I defend my thesis on Wednesday, it’ll certainly have all of my attention.
We will at least be meeting on Monday though to discuss what can be done in the interim.
Convergence Tests Went Fine
We decided that it would be good for Chris to run a few convergence tests on the datasets he put together across each of the available descriptor sets. So far, many have come back converged, meaning that it would be good to proceed. There are two concerns I have. First, do we want to melt only the converged descriptors together, or do we want to melt all of the descriptors together regardless of convergence? Second, if we don’t– can we do it after the fact and argue that neural network convergence is a good determiner of which descriptors are correlated with the results we care about?
To clarify– I mean “concatenating” real-valued vectors when I say “melting”. This means that we splice together a few linear arrays of numbers and come up with a new, longer array that’s still fixed length.
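In code, melting is nothing more than list concatenation– a toy sketch, with descriptor names and numbers made up purely for illustration:

```python
# "Melting" descriptor sets: concatenate fixed-length real-valued vectors
# into one longer, still fixed-length vector per molecule.
topological = [0.12, 3.40, 1.00]  # a hypothetical 3-number descriptor set
electronic = [7.70, 0.05]         # a hypothetical 2-number descriptor set

melted = topological + electronic  # every molecule gets a 3 + 2 = 5 number vector
```

The only real constraint is that every molecule’s melted vector must have the same length, so each source descriptor set has to be fixed length to begin with.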
The second argument only holds if it turns out that the selected melted converged descriptors have better predictive power than when all descriptors are melted together– it’s an even stronger case (and more practical) if it turns out that the descriptors behave better in concert than any particular subset does on its own.
That would be an interesting case. Running an additional eight or sixteen experiments to test that hypothesis is cheap to set up and cheap to do.
Alternatives– A Faster Solution
An alternative approach is to naïvely set aside descriptor space reduction / augmentation for now, and just go on and create training and test sets– or cross validation sets– I think with the strained timelines, this would be the wiser objective to knock down first. I’ll make a ticket for myself for both of these. I should also look up how to use the Tanimoto coefficient– it will assist in the design of “maximum dissimilarity” test sets to ensure we have good predictive / extrapolation power.
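As a note to my future self: on binary fingerprints, the Tanimoto coefficient is just the number of shared on-bits over the number of on-bits in either fingerprint. A quick sketch, with made-up fingerprints (represented here as sets of on-bit indices):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between binary fingerprints, each given as
    the set of its on-bit indices: T = |A & B| / |A | B|.
    1.0 means identical fingerprints, 0.0 means no bits shared."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

# Two hypothetical molecules sharing 2 of their 4 distinct on-bits:
similarity = tanimoto({1, 4, 9}, {1, 4, 16})  # 2 / 4 = 0.5
```

A maximum-dissimilarity test set could then be built greedily: repeatedly pick the compound whose highest Tanimoto similarity to the compounds already selected is lowest.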
And the NGN…
Finally, I need to go back and uncover a working version of the NGN to use with Chris’ data– I don’t think that InChI is possible, but we’ll try anyway. The SMILES strings are already here, so I can certainly at least run a few convergence tests of my own. That constitutes eighty runs at worst (all ten trials fail on each of the eight datasets) and eight runs at best (each dataset converges on the first trial). I am going to leave this in Unix-compatible form because there isn’t enough time to complete the Windows port of the NGN.
This should be OK though since everything will be set up for SharcNet.