Ed's Big Plans

Computing for Science and Awesome

NGN Upgrade Status

The NGN is being worked on again. In implementing the Bagging/Balancing/Boosting/Verifying activity of the software, I’ve decided to break everything apart into smaller modules – that gives me a better picture of progress while letting me stay motivated on smaller tasks.

Realistically, only three functions are ever dynamically generated by the BNF generator; they all perform logic related to the weight layers of the neural network. I think it is now feasible to completely close off and encapsulate the dynamically generated content in its own source files. Linking it in is probably philosophically more correct, and cleaner in implementation, than having the BNF generator pump out all of the template items that do not change between grammars.
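
To make that concrete, the boundary I have in mind looks roughly like the header sketched below – the ngn_gen_* names and the opaque ngn_net handle are placeholders for this post, not the actual signatures the generator emits. The hand-written modules would include only this header, so swapping grammars means regenerating and relinking a single translation unit.

    /* ngn_generated.h -- hypothetical interface to the BNF-generated code.
     * Only these three weight-layer functions change between grammars;
     * everything else lives in static, hand-written modules. */
    #ifndef NGN_GENERATED_H
    #define NGN_GENERATED_H

    typedef struct ngn_net ngn_net;  /* opaque network handle (placeholder) */

    /* Allocate weight layers sized for the grammar this file was generated from. */
    int ngn_gen_alloc_weights(ngn_net *net);

    /* Forward pass through the grammar-specific weight layers. */
    void ngn_gen_forward(ngn_net *net, const double *input, double *output);

    /* Backpropagate error through the grammar-specific weight layers. */
    void ngn_gen_backprop(ngn_net *net, const double *target,
                          double rate, double momentum);

    #endif /* NGN_GENERATED_H */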

In retrospect, I might drop the Verifying activity soon depending on how similar it is to Balancing – if it’s similar enough to be expressed with many of the same inner functions, then I’ll keep it. Verifying will probably be so weak that it could be replaced by an ancillary test run after training, prior to the actual test run. My concern is that in the stark majority of cases, verification would mislead the system into false convergence or infinite non-convergence. In a good system, with plenty of data distributed between training and verification and with the right tolerances, this is not the case – but I have neither the luxury of time nor copious amounts of data. I’ll leave it in for now and figure out what I want to do with it later.
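
For reference, the kind of verification gate I’m worried about looks something like the sketch below – all of the types and helper calls are placeholders rather than the real NGN routines. With only a sliver of data held out for verification, the verification error is noisy enough to dip under tolerance by luck (false convergence) or hover above it forever (non-convergence).

    /* Sketch of a verification-gated training loop (placeholder names). */
    typedef struct ngn_net ngn_net;
    typedef struct dataset dataset;

    void   train_one_epoch(ngn_net *net, const dataset *d);   /* assumed helper */
    double evaluate_rmse(ngn_net *net, const dataset *d);     /* assumed helper */

    /* Returns the epoch at which the verification error first dropped below
     * tolerance, or -1 if it never did within the epoch budget. */
    int train_with_verification(ngn_net *net,
                                const dataset *train, const dataset *verify,
                                double tolerance, int max_epochs)
    {
        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            train_one_epoch(net, train);
            if (evaluate_rmse(net, verify) < tolerance)
                return epoch;    /* "converged" -- possibly by luck */
        }
        return -1;
    }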

In terms of completion, the way I’ve broken down the work has me coding out C modules until at least Monday, whereupon I must fix any API discrepancies between the modules and magically anneal them together. And that’s still before making the hard-coded changes to the NGN generator.

I think making the deadline is still possible, but keeping each task as small as possible, time-wise, is essential.

Eddie Ma

June 10th, 2009 at 2:49 pm

Posted in Machine Learning

Meeting with Chris

Chris’ project has grown to roughly three hundred exemplars for each of the mouse and rat data sets – these are the sets that map molecules to some physiological defect, by organ or tissue. I think he’ll be onto his next phase shortly – taking the data and applying some machine learning construct to it.

I’ve recommended four papers for him to read – three of which discuss QSAR in general and compare the performance of different approaches. The last paper explicitly uses neural networks on molecular descriptors for regression of melting points. The use of neural networks or similar technology is something he’s expressed a lot of interest in, so I think this selection fits in well. I’ve provided him with an adapted version of the melting point dataset in which the domain is re-expressed as SMILES and InChI.

I think it might be good to set him up with NGNs for those items as well as NNs for the descriptor vector used in the melting point paper.

Zinc the Mac Mini

Zinc on a Shelf

Zinc the Mac Mini has arrived! I’ve set it up as the new webserver currently spewing forth this website. Migration and setup were incredibly easy with MAMP – copying the application folder meant moving the application along with the Apache settings and SQL database. Copying the Sites folder from one machine to another was a breeze too. Of course I’ll need to upgrade the backend software some time in the future, but I’m very happy with how fast it was this time around. It means very little downtime should another migration be needed within the same LAN.

Zinc also offers an additional two cores in case I need to offload experiments. I’ve read conflicting reports about whether Mac Minis overheat – so I’ll keep an eye on the thing until I know for sure. It runs very quietly next to the router right now. Headless functionality is exactly as I would have hoped and imagined.

So – for anyone wanting to do some webhosting, a nice small box next to the router gets my vote.

Eddie Ma

June 2nd, 2009 at 11:44 pm

OGem – Ontario iGem Mini-Conference

The Southern Ontario university iGem teams came together on May 29th for a miniature conference at the University of Waterloo. We had members from Guelph, Toronto, Mississauga (U of T’s West Campus), Queen’s and Ottawa. The basic trend of the show was finding the fun and profit in synthetic biology. I could only stay for the morning and early afternoon segments – but I really would have loved to stay for dinner.

Dave Johnston – our very own team leader from last year at Guelph – showed up with Brendan. It was kind of neat to see them again, and more so since they didn’t know I had betrayed them and joined the Waterloo team this year (amicably of course).

Meeting with like-minded individuals is a bit of a relief. It is good to have to argue, convince and learn from others in science – and outside of it… but when it comes to something as difficult for outsiders to enjoy as synthetic biology, sometimes it’s a nice break to just discuss the facets within the discipline rather than abstractly and vaguely defending it against misunderstandings. Fittingly, one of the standing objectives we discussed was improving public image.

Along with the theme of the fun and profit of the beast came the odd realization that what we’re studying now is likely to become obsolete within the decade – however, with that risk comes the potential for each of us attending to contribute something truly worthwhile on short notice.

One key idea that stuck with me was the development and deployment of nanometre-scale sensors to detect changes in magnetic flux as molecules are moved within a single cell. I’d imagine that one would need to be well versed in trigonometry and calculus to write software that solves the diffraction patterns of the fields in real time. It might look something like a dynamic, real-time x-ray crystallographic analysis. Another key idea is the 1Mbp/1hr/$100 device. If DNA can be printed at the rate of one million base pairs per hour at the price of a hundred bucks – it wouldn’t matter what currency that’s in, it would win.

Culture was something else I noticed about the group. There was the ever-present odd scientist humour. A running joke about the phrase “Killer App” started as soon as it was accidentally used to refer to engineered microbes.

All in all, it was a good conference.

Something that I’ll need to follow up on is the idea of doing a group / mass booking for a tour bus from Southern Ontario down to Boston come iGem conference time. This, along with a group / mass hotel booking, would solve a lot of the travel and accommodation fragmentation everyone experienced last year.

Eddie Ma

June 1st, 2009 at 1:01 am

Convergence Tests

Well, the convergence tests are well underway on SharcNet (the whale serial cluster) for the new datasets (two Aqueous Solubility, one Melting Point) – I’m currently only running them on SMILES due to a problem with one of the Aqueous Solubility sets. I’ve decided to use a “ridiculously high” number of parameters (12 hidden units per hidden layer) and a “ridiculously low” learning rate and momentum (0.3 each).

A convergence test is basically used to see whether the NGN can operate on the chosen data and parameters at all. For convergence testing, the test set and training set are identical.
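
In rough C, that amounts to something like the sketch below – the types and helpers stand in for the real NGN calls, and the important detail is that the very same dataset is used for training and for measuring the error. For the current runs, the rate and momentum would both be 0.3.

    /* Convergence test sketch: train and evaluate on the *same* set.
     * The only question is whether the NGN can drive its error on this
     * data below the convergence RMSE at all -- not whether it generalizes. */
    #include <stdio.h>

    typedef struct ngn_net ngn_net;
    typedef struct dataset dataset;

    void   ngn_train_epoch(ngn_net *net, const dataset *d,
                           double rate, double momentum);   /* assumed helper */
    double ngn_rmse(ngn_net *net, const dataset *d);         /* assumed helper */

    int convergence_test(ngn_net *net, const dataset *data,
                         double rate, double momentum,
                         double target_rmse, int max_epochs)
    {
        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            ngn_train_epoch(net, data, rate, momentum);
            double rmse = ngn_rmse(net, data);   /* test set == training set */
            if (rmse <= target_rmse) {
                printf("converged at epoch %d (rmse %.4f)\n", epoch, rmse);
                return 1;
            }
        }
        return 0;   /* failed to converge within the budget */
    }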

I’ll probably rerun this with an even lower learning rate (train = 0.15 / momentum = 0.45). Why? Because I have a feeling that the large number of hidden units increases the chance of accidentally falling into the right neighbourhood of the search space, while the tiny numerical arguments let those accidents be gently brushed into being.

I’m presently running all the regression data I’ve ever run in the past – this time with both classic normalization and rank-normalization. Finally, one of the Aqueous Solubility datasets came with poorly formed SMILES – they can’t be parsed by the grammars I made up, and they can’t be parsed by OpenBabel either, so conversion to InChI wasn’t possible in this round, let alone operation by my system. I’m going to presume they’re broken and skip those entries, which still leaves well over a thousand exemplars in that set – it will have to be rerun with corrected data soon.
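
The skipping itself is nothing fancier than a filter pass along these lines – ngn_parse_smiles stands in for whichever grammar check ends up doing the real work:

    /* Sketch: drop exemplars whose SMILES the grammar can't parse. */
    #include <stdio.h>

    int ngn_parse_smiles(const char *smiles);   /* assumed: nonzero == parsed cleanly */

    /* Copies the usable exemplars into keep_smiles/keep_targets and
     * returns how many survived; malformed entries are logged and skipped. */
    int filter_exemplars(const char *const *smiles, const double *targets, int n,
                         const char **keep_smiles, double *keep_targets)
    {
        int kept = 0;
        for (int i = 0; i < n; ++i) {
            if (ngn_parse_smiles(smiles[i])) {
                keep_smiles[kept]  = smiles[i];
                keep_targets[kept] = targets[i];
                ++kept;
            } else {
                fprintf(stderr, "skipping malformed SMILES: %s\n", smiles[i]);
            }
        }
        return kept;
    }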

The convergence RMSE has been set to 1%, meaning that I have a stricter idea of what counts as a deviation from correct. When this is done, I want to plot the results as a “target on actual” residuals line plot to give me an idea of how these parameters worked on self-comparison.
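
The plot itself only needs a table of target/actual pairs per exemplar, so a dump along these lines would do – ngn_predict is a placeholder for the real prediction call:

    /* Sketch: write target, predicted value and residual for each exemplar
     * so the "target on actual" plot can be drawn with any plotting tool. */
    #include <stdio.h>

    typedef struct ngn_net ngn_net;

    double ngn_predict(ngn_net *net, const char *smiles);   /* assumed helper */

    int dump_residuals(ngn_net *net, const char *const *smiles,
                       const double *targets, int n, const char *path)
    {
        FILE *out = fopen(path, "w");
        if (!out) return -1;
        fprintf(out, "target\tactual\tresidual\n");
        for (int i = 0; i < n; ++i) {
            double actual = ngn_predict(net, smiles[i]);
            fprintf(out, "%.6f\t%.6f\t%.6f\n",
                    targets[i], actual, targets[i] - actual);
        }
        fclose(out);
        return 0;
    }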

Of course, everything here is subject to editing as I figure out exactly how to implement and deploy the previously mentioned changes to how regression experiments are defined (for convergence, for boosting/balancing/verification).

Most immediate next steps:

  • Select OK Aqueous SMILES and run.
  • Rerun all with InChI-NGN.

Eddie Ma

May 31st, 2009 at 9:52 am

Integrase Problem Introduction

In a meeting with iGemmers @ Waterloo today – specifically the Modeling team headed by Andre and supplemented by core members Sheena and John – we discussed this year’s modeling project. We’re currently interested in creating a solver that will yield an arrangement of attX sites on a chassis bacterial host chromosome that can accommodate several rounds of deterministic recombination.

Plainly, we need to write software that will produce a solution in the form of a DNA sequence – one arranged so that specific sites operated on by the enzyme integrase are ordered to accept several loops of artificial DNA to recombine with. In this design, we’re interested in a sequence for the host chassis, a sequence for the artificial loops, and another loop from which integrase is produced at some arbitrary tonic level inside the cell.

The first step is to mathematically formalize the problem – and along with it, to write some working particles of software that successfully model the problem space. The solver is a more abstract piece of software that will use these particles in its solution. This is similar to designing the notion of integers and arithmetic operators before using those components to solve algebra.
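
To give a flavour of what I mean by particles, a first rough cut might look like the sketch below. Every name, field and rule here is provisional – and the B/P-to-L/R integration check is deliberately simplified – but it is the level of object I want pinned down before the solver exists.

    /* Provisional "software particles" for the integrase problem.
     * The host chromosome is an ordered list of att sites; an incoming
     * artificial loop carries one att site and a payload.  Recombination
     * of a matched attB/attP pair rewrites the sites to attL/attR. */

    typedef enum { ATT_B, ATT_P, ATT_L, ATT_R } att_type;

    typedef struct {
        att_type type;
        int      identity;    /* which attX variant / integrase specificity */
    } att_site;

    typedef struct {
        att_site sites[64];   /* ordered sites along the chassis chromosome */
        int      n_sites;
    } chromosome;

    typedef struct {
        att_site site;        /* the loop's single att site */
        int      payload_id;  /* what this artificial loop carries */
    } plasmid_loop;

    /* A loop can integrate at a chromosomal site when the pair is a
     * B/P match of the same identity (a deliberately simplified rule). */
    int can_integrate(const att_site *chrom_site, const plasmid_loop *loop)
    {
        if (chrom_site->identity != loop->site.identity) return 0;
        return (chrom_site->type == ATT_B && loop->site.type == ATT_P)
            || (chrom_site->type == ATT_P && loop->site.type == ATT_B);
    }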

This description is very coarse– I’ll refine it in a later post after I’ve had some time to analyze the problem constraints and what software particles are important to set down on paper.

Eddie Ma

May 25th, 2009 at 12:35 am