Ed's Big Plans

Computing for Science and Awesome

Big Bang Day! A Recombinatron Story

without comments

Matthew also has a post about Recombinatron & Big Bang Day here.

Integrase Enzyme Alphabet

‘Big Bang Day’ was this awesome Saturday morning where a bunch of the UWiGEM modeling folks came together and integrated our modules. We ended up delegating more work, understanding the problem better and fixing up some of the logic in the big picture. After everyone had filed in and we managed to figure out how to synchronize with the SVN… and after I managed to break the SVN and fix it again (thankfully!), it was time to get to work. I think the project as it stands right now is more or less done unless Giant Scaffold manages to find something that needs fixing.

Big Picture Drawn on the Chalk Board

The big picture was simplified to three giant objects that pass a big bag of DNA to one another in sequence.

The DNA bag is actually a list of DNAClass objects (DNAObjects). We decided not to create our own collection… there’s already a Python list. The Giant Scaffold module controls the movement of the DNA bag, or a subset thereof, from storage to Operators to Filters.
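Here's a minimal sketch of that flow, assuming the bag is a plain Python list and that operators and filters are ordinary functions. The names run_round, split_in_half and long_enough are hypothetical, and plain strings stand in for DNAObjects; none of this is the actual Giant Scaffold code.

```python
def run_round(bag, operator, keep):
    """Pass every strand in the bag through an operator, then filter the products.

    bag      -- a plain Python list standing in for the DNA bag
    operator -- function taking one strand and returning a list of product strands
    keep     -- predicate returning True for strands that survive filtering
    """
    products = []
    for strand in bag:
        products.extend(operator(strand))
    return [strand for strand in products if keep(strand)]


# Hypothetical stand-ins: strings play the role of DNAObjects.
def split_in_half(strand):
    middle = len(strand) // 2
    return [strand[:middle], strand[middle:]]


def long_enough(strand):
    return len(strand) >= 2


bag = ["ABCD", "EFG"]
next_bag = run_round(bag, split_in_half, long_enough)
```

Because the bag is just a list, the result of one round can be fed straight back in as the input to the next.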

I was working with the Operators team– Basically, Matthew finished our team’s work because I managed to get swamped with thesis defense preparations, and Andre managed to take down the UWiGEM server.

(Andre incidentally has a post about taking down servers and making backups here.)

The Operators team ended up producing two big functions (and their little internal functions) and one support function.

A Couple of Filters

reactOneStrand(DNAObject) – Produces a list of resultant DNAObjects when the integrase enzyme is used on a single strand of DNA– this function may produce a one-list, two-list or three-list of DNAObjects. One-lists result from inversion (indirect) reactions, two-lists result from excision (direct) reactions, and three-lists result from palindromic operators.

reactTwoStrands(DNAObject, DNAOther) – Produces a list of resultant DNAObjects when two strands of DNA are reacted together with integrase.
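To make the one-list and two-list cases concrete, here's a toy sketch. It assumes plain nucleotide letters and reaction sites given as slice positions, which is not how the real reactOneStrand finds its sites; invert and excise are hypothetical helper names, and the palindromic three-list case is left out.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence):
    """Reverse the strand and swap each base for its partner."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def invert(sequence, start, end):
    """Inversion: flip sequence[start:end] in place, giving a one-list."""
    flipped = reverse_complement(sequence[start:end])
    return [sequence[:start] + flipped + sequence[end:]]


def excise(sequence, start, end):
    """Excision: cut sequence[start:end] out, giving a two-list."""
    return [sequence[:start] + sequence[end:], sequence[start:end]]


one_list = invert("AAACGTTT", 2, 5)   # the segment "ACG" is flipped
two_list = excise("AAACGTTT", 2, 5)   # the segment "ACG" is cut out
```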

The Filters team split their filters into three big enclosing functions– these three functions are equivalent to categories based on the likelihood that a given event would happen in a cell (Frequent, Moderate, Infrequent).

I unfortunately had to leave roughly 2.5 hours into Big Bang Day on other business but was happy to continue the madness online and on a subsequent Monday.

And now… some more photos…

Look! Everyone’s on their laptops– gee, I didn’t know they had Python on computers now!

Wylee, Brandon, Jordan

Wylee, Brandon and Jordan are all part of the Filters team.

Chong, Matthew, Andre

Chong is part of the Giant Scaffold team. Matthew and Andre are part of the Operators team.

Eddie Ma

July 3rd, 2009 at 1:12 pm

Operator Group Meeting


The Operator Group (UWiGem/Modeling/Operators) had a meeting about a week ago– the meeting ended up being between three people: Matthew, Andre and me at the iGem office. We’ve basically figured out everything we needed to in terms of raw interfaces between our module and the remaining two modules (Filtering Group and Giant Scaffold Group). The DNAClass was updated with the needs Andre presented– one of which is the ability to iterate over a DNAObject while yielding both the token index and token in a duplet: (index, token).

The implementation of the enumerate() built-in in Python (PEP 279) doesn’t allow for abstract function overriding. It simply counts upward as it iterates over a collection, starting from zero by default (Python 2.6 added an optional start argument, but the count still only goes up). Ideally, the count should reflect the index on the circular DNA strand, which means that it should be able to count forward or backward (iterating as the reverse complement), and count from any arbitrary position in the loop.
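A small generator can do the kind of counting described above. This is only a sketch with a hypothetical name (circular_enumerate), working on a plain string rather than a DNAObject, and it walks the tokens backward without complementing them.

```python
def circular_enumerate(tokens, start=0, reverse=False):
    """Yield (index, token) pairs once around a circular sequence.

    The index is the token's real position in the loop, so it wraps
    around the end instead of counting up from zero.
    """
    length = len(tokens)
    step = -1 if reverse else 1
    for offset in range(length):
        index = (start + step * offset) % length
        yield index, tokens[index]


forward = list(circular_enumerate("ABCD", start=2))
backward = list(circular_enumerate("ABCD", start=1, reverse=True))
```

Here forward visits positions 2, 3, 0, 1 and backward visits 1, 0, 3, 2, each exactly once.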

Note that the reverse complement copy constructor (DNAObject.rc()) does not cause indexes to be reversed… It actually produces a reverse complement strand and doesn’t do anything special with the indices (i.e. the new strand increments positively as it iterates forward). This behaviour is being debated now– On the one hand, it’s correct because a reverse complement strand is a new strand; however, it is not a strand de novo– it came from a positive sequence.
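The two sides of that debate can be shown side by side. This sketch assumes plain nucleotide letters, and fresh_indices and source_indices are hypothetical names for the two conventions; it is not the DNAObject.rc() code.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence):
    """Reverse the strand and swap each base for its partner."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def fresh_indices(sequence):
    """Treat the reverse complement as a new strand: indices count up from 0."""
    return list(enumerate(reverse_complement(sequence)))


def source_indices(sequence):
    """Keep each token's position on the original strand: indices count down."""
    last = len(sequence) - 1
    return [(last - i, base)
            for i, base in enumerate(reverse_complement(sequence))]


as_new_strand = fresh_indices("AACG")
as_derived_strand = source_indices("AACG")
```

The first convention is what rc() does at the moment; the second remembers that the strand came from a positive sequence.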

I’m now waiting for Andre to let me know about the functions and data frameworks needed for the Operators module; my feeling is that the functions will be the straightforward integrase enzyme actions and that the data framework will simply be a Python list.

Eddie Ma

June 26th, 2009 at 2:42 pm

QSAR Descriptors – Chris’ Project


Wow! Go Chris!

With Chris’ dataset collection finished, it’s time to convert every single molecule into a nice suite of descriptor sets. Chris has gone and learned the CDK inside and out. What started as a complicated and frustrating struggle against Java in Windows and the technicalities of class paths etc. has turned into a fruitful and promising endeavour. While he was working with that, I stumbled onto Bioclipse, which utilizes CDK internally for some of its molecule data conversions. Interestingly, the QSAR feature is not yet complete– or is experimental. It looks promising for the future, but I received an e-mail from Chris telling me all was figured out with CDK before I could figure out Bioclipse’s QSAR feature.

The Windows NN port that I’ve been working on is almost done. I’d say I’ll want two more days– today to finish the port and test on a Windows box– and tomorrow to figure out how to chain everything together with a Windows port of GNU make and some Windows variant of GCC.

We’ve decided that the format to express the molecules will be as follows…

  • Each descriptor set has its own file
  • Each file corresponds to a target species and a target organ
  • Each row in each file corresponds to a molecule
  • Each row has the columns <Comma Delimited QSAR Descriptor Elements>; Range (LD50); Species; Organ
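Reading one row of that layout could look something like the sketch below. It assumes the descriptor elements are numeric and that semicolons separate the four columns exactly as listed; parse_row is a hypothetical name and the sample row is made up purely for illustration.

```python
def parse_row(line):
    """Split one row into its descriptor vector and its three trailing columns."""
    descriptors_text, ld50_range, species, organ = [
        field.strip() for field in line.split(";")
    ]
    # Assumption: every descriptor element parses as a float.
    descriptors = [float(value) for value in descriptors_text.split(",")]
    return {
        "descriptors": descriptors,
        "ld50_range": ld50_range,
        "species": species,
        "organ": organ,
    }


# Made-up example row, not real data.
row = parse_row("0.5,1.25,3.0; 50-300; rat; liver")
```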

If we keep on track, we’ll have the first experiments running on SharcNet by early next week.

Eddie Ma

June 25th, 2009 at 3:45 pm

Yesterday and Today on UWiGem Recombinatron


So, I’ve been assigned the Recombinatron DNA submodule and spent a good part of yesterday morning and afternoon working on it at the UWiGem office. I’ve brainstormed what features the submodule should have and have finished a sizeable chunk of it.

While doing so, I managed to learn how to use the Python yield keyword (along with raise StopIteration), and all about abstract functions (what the Python docs call special methods) and how to manipulate them underneath the hood. Abstract functions include items like “len()”, “some_collection[5:7]” and “some_object >= another_object”, which map to __len__, __getitem__ and __ge__ respectively.

Basically, the DNAClass submodule will be the atomic type that will be passed between the different larger modules of the project. Each DNAClass instance (DNAObject) encapsulates a read-only string that is to be iterated in a loop either forward or backward, along with the ability to be sliced as a string. This all must be transparent to the user.
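Roughly, those requirements could be met with something like the sketch below. CircularDNA and loop are hypothetical names, and this is a guess at the shape of the thing under stated assumptions (a non-empty sequence, single-character tokens), not the DNAClass that was committed.

```python
class CircularDNA(object):
    """Hypothetical stand-in for DNAClass: a read-only, circular token string."""

    def __init__(self, sequence):
        self._sequence = str(sequence)

    @property
    def sequence(self):
        # No setter is defined, so assigning to .sequence raises AttributeError.
        return self._sequence

    def __len__(self):
        return len(self._sequence)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slices behave exactly like string slices (None bounds included).
            return self._sequence[key]
        # Integer positions wrap around the loop.
        return self._sequence[key % len(self._sequence)]

    def __iter__(self):
        # One forward trip around the loop, starting at position 0.
        return self.loop()

    def loop(self, start=0, reverse=False):
        """Yield each token once, forward or backward, from any start position."""
        step = -1 if reverse else 1
        for offset in range(len(self)):
            yield self[start + step * offset]


dna = CircularDNA("ABCD")
forward_from_two = "".join(dna.loop(start=2))
backward_from_one = "".join(dna.loop(start=1, reverse=True))
```

Defining __iter__ explicitly matters here: without it, a for loop would fall back on __getitem__ and never stop, since wrapped indices never raise IndexError.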

I might go into detail later, but for now, here’s a good resource: the Ordered Dictionary [odict] class by Nicola Larosa and Michael Foord. The odict source is actually an excellent primer; it contains many useful comments that have really helped me figure out iterators, the slice object and abstract functions.

Finally, I only really have two items left that are mandatory– fixing up slices when a “None” object is used, and then the iterator-iterators…

The iterator-iterators (terrible idea actually) would be a list of iterators, so that each iterator will start at a different position in the loop of DNA all of which correspond to the same token.

I’m thinking now that I should replace it with an accessor that returns a list of positive integers corresponding to the positions of the tokens of interest instead; this can still be transparent to the user AND have the benefit of not being unwieldy to implement. Having nested yield statements is just asking for trouble.
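The accessor idea is about as small as it sounds. A sketch on a plain string, with positions_of as a hypothetical name:

```python
def positions_of(sequence, token):
    """Return every index in the sequence where the token of interest occurs."""
    return [index for index, current in enumerate(sequence) if current == token]


starts = positions_of("ABCAB", "A")
```

Each returned position can then be used as the starting point for an ordinary trip around the loop, so no iterator ever has to hold other iterators.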

Update: Done. An initial version has been committed to the repository.

Eddie Ma

June 16th, 2009 at 11:24 am

Meeting with Chris


Brief: We’ve taken on a new strategy– Chris is building a novel database of LD50 values for many, many compounds. We’ll be generating descriptors with some free software (JoeLib and CDK). Eventually, the fixed-width descriptor vectors will be used in Neural Networks, and their SMILES and InChI counterparts in NGNs; the ultimate goal is the development of either a nested neural decision tree whose subtrees are the descriptor network and NGNs… OR, the nesting of the descriptor network inside an NGN… OR, the creation of an expert voting system where each decision system gets to vote on a particular molecule of interest. With the Windows NN software draft and NGN ready for SharcNet, preliminary trials can start soon.
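The voting option is the easiest of the three to picture. Here's a bare-bones sketch, assuming each decision system returns one class label per molecule; majority_vote and the labels are hypothetical, and nothing like this has been built yet.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that the most decision systems voted for.

    Ties are not handled specially; Counter.most_common picks one of the
    tied labels.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Made-up labels from three hypothetical decision systems.
decision = majority_vote(["high", "low", "high"])
```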

Modeling Meeting


Modeling Team Selection with Flush();


A modeling meeting occurred on Wednesday. Andre led off the discussion and revisited the entire program layout in a nice chalkboard cartoon. Unfortunately, Andre generally doesn’t push down hard enough or make wide enough lines with the chalk to produce a high enough contrast image against the blackboard for photography (i.e. faint drawing => no photos, sorry).

The discussion saw the formalization and division of the programming problem into three distinct software components as follows.

  • Genetic Fragment Operators
  • Genetic Fragment Filters
  • Overall Program Logic

Genetic Fragment Operators

These are the functions that represent reverse-complementation, enzyme activity etc.

Genetic Fragment Filters

These are functions that represent removing uninteresting, ‘inert’, undesirable and fatal fragments of DNA. This definition will become more precise once we’ve worked on the project a bit and better understand the philosophical correctness of each of these notions.

Overall Program Logic

The overall program logic will constitute producing some structure that represents a Big Bag of DNA (as opposed to a cell), communication between this Big Bag, the Operator module and the Filter module and of course– our main program loop.

What I’m doing…

I’ve been tasked with producing a universal representation of DNA which includes a circular iterator on a loop of DNA with an arbitrary starting position. This is OK to do in Python with the use of the ‘yield’ keyword. I will be borrowing from Jordan / Brendan / my own previous ideas for this representation– we want to have an easy single-letter-token system and for the moment are happy with the single byte space ASCII has to offer.

Eddie Ma

June 12th, 2009 at 8:41 am