Archive for September, 2009
Removable storage as software modules
Brief: I had this idea a long time ago and didn’t bother to implement it. What if I placed logical volumes of data each on their own USB key? For example, I would place the htdocs root for an Apache installation on a USB stick, so that migrating this logical tree from one host to another would just involve pulling the stick and plugging it into another machine. The same could be done for our giant binary SQL databases, SVN repositories, virtual machine disk images, etc.
Protein Databases and Parsability
Brief: Parsability is essential for fast machine-assisted analysis of vast databases… So, I got lucky with SCOP since the entire protein hierarchy is offered exactly like that… The entries are even linked to ASTRAL pdb-like structure files. Something similar is given by CATH, but I don’t comprehend it yet.
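To make the parsability point concrete, here is a minimal sketch of reading one SCOP classification record. It assumes the tab-separated layout used by SCOP's classification flat file (domain id, PDB id, region, sccs string, numeric id, then comma-separated `key=value` hierarchy nodes). The names `ScopRecord` and `parse_cla_line` are hypothetical, and the sample record is only an illustration of the layout, so its identifiers should not be relied on.

```python
from typing import NamedTuple, Optional


class ScopRecord(NamedTuple):
    sid: str      # domain identifier
    pdb_id: str   # PDB entry the domain comes from
    region: str   # chain/region specification
    sccs: str     # dotted class.fold.superfamily.family string
    sunid: int    # numeric identifier of this record
    nodes: dict   # hierarchy level -> numeric identifier


def parse_cla_line(line: str) -> Optional[ScopRecord]:
    """Parse one record; return None for blank lines and '#' comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumed layout: exactly six tab-separated fields.
    sid, pdb_id, region, sccs, sunid, hierarchy = line.split("\t")
    nodes = {}
    for pair in hierarchy.split(","):
        key, value = pair.split("=")
        nodes[key] = int(value)
    return ScopRecord(sid, pdb_id, region, sccs, int(sunid), nodes)


# Illustrative record in the assumed layout (not authoritative data).
SAMPLE = (
    "d1dlwa_\t1dlw\tA:\ta.1.1.1\t14982\t"
    "cl=46456,cf=46457,sf=46458,fa=46459,dm=46460,sp=46461,px=14982"
)

record = parse_cla_line(SAMPLE)
print(record.sid, record.sccs, record.nodes["cf"])
```

One line per domain and one field per level is what makes a hierarchy like this easy to load into a dictionary or a database table without any hand editing.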
Aside: I haven’t really given enough credence to PFAM yet– I should spend a little time figuring out how useful it is. From what I understand, it doesn’t classify proteins by structure so it may be more useful in secondary and later analyses.
Aside: Hey look! A big giant page of alignment tools courtesy of ExPASy. Goody. Reinventing the wheel as seldom as possible is certainly a good modus operandi.
Scholarship Application Time!
Brief: It’s that time of the year again! Yesterday was the department deadline for NSERC applications. Everything’s in order and accounted for more or less. OGS applications are coming up early October. This is truly the first year that I’ve applied where I feel like a worthy candidate for the prize. Let’s hope the judges feel the same way.
TIM Barrels and 4-Alpha Helix Bundles
Beta-Alpha TIM Barrels and 4-Alpha Helix bundles are the first of the major folds I’ll be looking at here at Waterloo…
As with all academic projects, the probability of goal, approach and method mutation is high. Looking at the above protein folds serves as an excellent starting point as I’ll be applying some of the established methods that Andrew and Aaron have developed.
Alpha helices and Beta sheets are objects that any high school biologist is acquainted with. To recapitulate, alpha helices are sequences of amino acids arranged so that the alpha carbon of each amino acid falls along the path of a helix. The number of amino acids per turn in this peptide and the regularity of the helix are determined by both the sequence and the environment that the peptide finds itself in. Amino acids in the beta sheet conformation are arranged so that their alpha carbons zigzag. The result is a nice wide and flat shape, schematically drawn as a sheet. Alpha helices and Beta sheets are collectively called secondary structural elements.
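The helix description above can be sketched numerically. This is a minimal illustration that assumes idealized textbook values for an alpha helix (3.6 residues per turn, 1.5 Å rise per residue, alpha carbons about 2.3 Å from the axis); real helices deviate from these depending on sequence and environment, as noted. The function name `ideal_helix_ca` is hypothetical.

```python
import math

# Idealized textbook parameters (assumptions; real helices vary).
RESIDUES_PER_TURN = 3.6
RISE_PER_RESIDUE = 1.5   # angstroms along the helix axis
CA_RADIUS = 2.3          # angstroms from the axis to each alpha carbon


def ideal_helix_ca(n_residues):
    """Return (x, y, z) alpha-carbon positions along an ideal helix on the z axis."""
    step = 2 * math.pi / RESIDUES_PER_TURN  # rotation per residue (100 degrees)
    return [
        (
            CA_RADIUS * math.cos(i * step),
            CA_RADIUS * math.sin(i * step),
            i * RISE_PER_RESIDUE,
        )
        for i in range(n_residues)
    ]


coords = ideal_helix_ca(10)
# Distance between neighbouring alpha carbons, in angstroms.
neighbour_distance = math.dist(coords[0], coords[1])
print(round(neighbour_distance, 2))
```

With these parameters, neighbouring alpha carbons come out roughly 3.8 Å apart, which matches the usual spacing between consecutive residues in a peptide chain.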
So, folds are these giant overarching classifications of proteins. Folds themselves are inherently structural, so classifying them, or using them as classifications, is only relevant in structural studies and in web databases like SCOP and CATH. In these databases, classification of similarly structured proteins starts by determining whether the protein contains mostly alpha helices; mostly beta sheets; beta sheets and alpha helices alternately and irregularly; or beta sheets and alpha helices in distinct regions of the protein. In SCOP, further classification is done by manually assigning proteins to smaller and smaller categories. I first thought that in CATH these classifications were done by a hidden Markov model and then manually inspected (or not); it turns out that CATH uses a similar manual approach and uses HMMs only to assist. Contrast this with PFAM, which actually relies on HMMs for the majority of the work, with humans verifying the results afterward.
Certain folds like the beta-Trefoil and TIM Barrel benefit from containing only proteins that cleanly fit into some subcategory or several subcategories– it is then possible to just drill into the right level of categorization and pull out all of the beta-Trefoils and TIM Barrels we want.
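Drilling into the right level can be as simple as a prefix match on the dotted classification string. The sketch below assumes SCOP-style sccs strings and that, as I understand SCOP's numbering, `c.1` is the TIM barrel fold and `b.42` is the beta-Trefoil fold. The domain names and the helper names `in_fold` and `select_fold` are made up for illustration.

```python
def in_fold(sccs, fold):
    """True if an sccs string such as 'c.1.8.1' lies under a fold such as 'c.1'."""
    # Require a '.' after the prefix so that 'c.1' does not match 'c.10.2.1'.
    return sccs == fold or sccs.startswith(fold + ".")


def select_fold(records, fold):
    """Return the domain ids whose classification falls under the given fold."""
    return [sid for sid, sccs in records if in_fold(sccs, fold)]


# Hypothetical (domain id, sccs) pairs, for illustration only.
RECORDS = [
    ("domA", "c.1.8.1"),
    ("domB", "c.10.2.1"),
    ("domC", "b.42.1.1"),
    ("domD", "c.1.2.3"),
    ("domE", "a.24.1.1"),
]

tim_barrels = select_fold(RECORDS, "c.1")
trefoils = select_fold(RECORDS, "b.42")
print(tim_barrels, trefoils)
```

This only works because every member of the fold sits under one node; a fold scattered across several nodes would need a list of prefixes instead.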
The 4-Alpha-Helix Bundle is a fold that manages to be spread around the databases, being a very common secondary structural repeat and also a very small one when compared to the two giants above. The two cases make an interesting contrast: both machine and human intelligence pull TIM Barrels together while sprinkling alpha-helix bundles across databases and subcategories. And yes, the size difference helps too.
So, I’m starting with a structural then sequence based alignment for single domain TIM Barrels and alpha-helix bundles; to be completely focused, an objective is named: To identify where sequence repeats occur in each individual protein.
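As a rough picture of what "where sequence repeats occur" means computationally, here is a minimal sketch that reports every exact substring of length k occurring more than once in a single sequence. Exact k-mer matching is a deliberately crude stand-in chosen for illustration; it is not the structural or sequence alignment method described in this post, and the toy sequence is invented.

```python
from collections import defaultdict


def repeated_kmers(sequence, k):
    """Map each length-k substring that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


# Invented toy sequence containing one repeated stretch (MKVLAA).
TOY = "MKVLAAGGSTMKVLAAPQ"

repeats = repeated_kmers(TOY, 4)
for kmer, starts in sorted(repeats.items()):
    print(kmer, starts)
```

Real protein repeats diverge over time, so a practical version would score approximate matches (for example with a substitution matrix) rather than demand identical substrings.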
Knots were the wrong math
The Knot Math was eventually understood to be the wrong kind of math to model our problem on.
Knots take the form of a circle that has been broken and rejoined at a point on its circumference after being wrapped about itself an arbitrary number of times. What we’re working on doesn’t utilize any function that twists loops of DNA the same way. The knot maths provide a way to real-value-vectorize these shapes, but do not provide an easy way to insert our own data. There are two properties that relate to the incompatibility. The first is that knot maths consider two knots equal if their topology with respect to the number of twists they have is identical. Our problem does not consider these two knots equal, as distance and sequence specificity (imagine each particle on the rope circle was labeled) are required. Second, what we produce overlaps arbitrarily by laying a circle segment on top of another circle segment, whereas the knot maths produce overlaps with twists. While I think there could be a clever way to identify our problem with the knot math, I don’t think there is a feasible or cost (time) effective way to do this.
Brain continues to storm.
I did manage to uncover some very exciting papers, however. One of them was on a piece of software called TangleSolve, which does do site-specific recombination and visualization of DNA knots; reading about this software was actually instrumental in understanding why our problem was not identifiable here. Side note: topoisomerase is an enzyme involved in DNA knot formation and supercoiling relaxation.
I’m a T.A. Now.
Brief: Analytical Methods in Molecular Biology is the course that I’ll be TAing this term. It looks like it’ll be a lot of fun. I’m surprised at the amount I remember from my courses at Guelph; I’m also surprised by the amount I’m relearning.
Update: I’ll outline the course here– we discuss why and how to use synthetic biology in order to identify and characterize genes. Characterization is an intentionally broad word indicating the determination of the putative DNA sequence in question; the range of phenotypes its alleles produce; the mass, charge and solubility of the protein it produces, along with its catalytic and structural activity; and finally some profile generated using various bioinformatics scoring for the labeling of homologues, related structures and sequences, etc. Well, that’s my take on it so far, although I will likely revise my understanding of the nature of this course as I progress through it.
My TA partner is Ariana Marcassa who finds herself in her third year of undergrad. I think we’ll make a good team.
I know what you mean about the learning; I’m discovering the same thing as I TA for the first time this semester.
I’m actually very happy about this course… I’ll have to come bother you about which course you TA. Oh hey, that’s right– you’re at Guelph! So you must be doing a computer science course.
Ed's Big Plans
I can one up you: why don’t you put the software that runs it on the USB key as well? That way, it becomes more self-contained.
We actually did something like this at Environment Canada. Well, we started to, anyway. We have a Storage Area Network (SAN). So, we create volumes on various hard disks and then can present them to machines via a fibre network. The idea was to create volumes that represented applications and their data. Applications would be made not to litter their data on the system, so it would be easy to move an application from one system to another. It also meant that the disk on machine X wouldn’t get full because of application Y. Application Y filled application Y’s disk.
The only impracticality of this system is the heterogeneity of the hosts you are using. As they become more heterogeneous, it becomes more difficult to shuttle your data between them. If they are all roughly the same and upgraded in sync, it’s not so bad. Oracle, being a picky bitch of a piece of software, made the upgrading in sync difficult at EC.
That’s great… actually, I _meant_ to go in that direction in my post, as evidenced by the title– but I ended up going a different way because I didn’t have good examples.
I would do this, but I don’t really have too much software that would benefit from that. Wait a second… all this new command line bioinformatics stuff and associated data… that would be an excellent candidate. We should sit down and chat about protein folding some time…