Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Brendan McConkey’ tag

fsMSA Algorithm Context

What started as a meeting between me and my advisors ended up being a ball of unresolved questions about the cultural context of multiple sequence alignment and phylogenetic trees. While I had a good idea of what the field and its researchers had looked into and developed, I hadn’t a grasp of how far along we were. The result is the presentation I’ve just finished. In it, I discuss what I consider to be a representative sampling of the alignment and phylogenetic tree building algorithms available right now, at this very instant.

(PDF not posted, contact me if interested.)

Written by Eddie Ma

January 13th, 2010 at 12:12 pm

fsMSA Algorithm… Monkeys in a β-Barrel…

I’ve finally finished documenting my foil-sensitive protein sequence algorithm… This is part of Monkeys in a β-Barrel, a work in progress, this time continuing more on Andrew’s half of the problem rather than Aron’s.

I’ve decided on using the word “foil” to mean “internal repeat” since it’s easier to say and less awkward in written sentences. Andre suggested it after “trefoil” and “cinquefoil”, the plant.

Thumbnails below (if you are curious about the full slide show, contact me :D ).

Protein Project Progress…

Last week, Liz, Aron, Andrew, Brendan and I sat down to discuss the beta-trefoil project. It was a good chance for me to understand the methods used and the kinds of results we are interested in for my own TIM Barrel project.

Continuing on with the structural repeat problem, today I’ll be writing a short FSA (finite state automaton) parser that can handle DSSP or DSS output. Simply put, a very primitive machine will imitate a human’s visual inspection of repeated secondary structural elements in given proteins. This is in line with the work I did manually, staring at structures to get a grasp of how to look at protein models, and also in line with the objective of automating much of this work. Prior to that step, I reduced the probability of doing redundant work by using BLASTCLUST and selecting only a few known structures in each cluster to inspect… a sequence-based alignment for each cluster will inform me of where my manually detected repeat boundaries map to the remaining sequences.
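As a rough illustration of the kind of primitive machine described above, here is a minimal Python sketch. It is not the parser from this project: it assumes the secondary structure has already been pulled out of the DSSP file as one letter per residue, and the names `COARSE`, `segments`, `beta_alpha_units` and the `min_len` cutoff are all made up for the example. It lumps the DSSP helix codes (H, G, I) into "a" and the strand codes (E, B) into "b", then walks the elements with two states: waiting for a strand, and waiting for the helix that follows it.

```python
from itertools import groupby

# Coarse mapping of DSSP one-letter codes (an assumption of this sketch):
# H, G, I are helix types; E, B are strand/bridge; everything else is loop.
COARSE = {"H": "a", "G": "a", "I": "a", "E": "b", "B": "b"}


def segments(dssp, min_len=3):
    """Collapse a per-residue DSSP string into (class, start, end) runs.

    Positions are 0-based and `end` is exclusive. Runs shorter than
    `min_len` residues are ignored (a hypothetical noise filter).
    """
    coarse = [COARSE.get(code, "-") for code in dssp]
    runs, pos = [], 0
    for cls, group in groupby(coarse):
        length = len(list(group))
        if cls != "-" and length >= min_len:
            runs.append((cls, pos, pos + length))
        pos += length
    return runs


def beta_alpha_units(dssp, min_len=3):
    """Two-state machine that reports strand-then-helix units.

    State 1: no strand seen yet (helices here are skipped).
    State 2: a strand has been seen; the next helix closes the unit.
    """
    units, strand = [], None
    for cls, start, end in segments(dssp, min_len):
        if cls == "b":
            strand = (start, end)           # most recent strand wins
        elif strand is not None:            # cls == "a" with a strand pending
            units.append((strand[0], end))  # unit spans strand start to helix end
            strand = None                   # back to waiting for a strand
    return units


example = "--EEEEE--HHHHHHHH---EEEE-TT-HHHHHH--HHHH--"
print(beta_alpha_units(example))  # [(2, 17), (20, 34)]
```

The last helix in the example is not preceded by its own strand, so the machine skips it. Helices like that, which break the beta-alpha pattern, are the ones a human would flag during visual inspection.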

Oddity: If you BLASTCLUST all the “FULL” (not “SEED”) sets of TIM Barrel sequences for the entire fold from PFAM along with the sequences of known TIM Barrel fold structures of SCOP, you’ll find that cluster fifteen (as of today) has these elements:

1YBE_A A6U5X9 A6WV52 Q2KDT0 Q2YNV6 Q6G0X7 Q6G5H6 Q8UIS9 Q8YEP2 Q92S49 Q98D24 A1UUA2

In the above listing, 1YBE:A (PDB code) is the sole known PDB structure, while the remainder are putative TIM Barrels (UniProt accessions) as determined by the HMM from PFAM.
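Sorting a cluster line like this into known structures and putative members can be done mechanically. The sketch below is only an illustration, not part of the original workflow. It assumes each cluster is one line of space-separated identifiers, that PDB chains are written as a four-character code plus `_` and a chain ID (as in `1YBE_A` above), and it uses the accession pattern published by UniProt. The function name `split_cluster` is hypothetical.

```python
import re

# PDB chain as written in the listing: four characters starting with a digit,
# an underscore, then the chain identifier (assumed format).
PDB_CHAIN = re.compile(r"[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+")

# UniProt accession pattern as documented by UniProt.
UNIPROT = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def split_cluster(line):
    """Split one cluster line into (known structures, putative members, other)."""
    known, putative, other = [], [], []
    for ident in line.split():
        if PDB_CHAIN.fullmatch(ident):
            known.append(ident)
        elif UNIPROT.fullmatch(ident):
            putative.append(ident)
        else:
            other.append(ident)  # anything unrecognised is kept, not dropped
    return known, putative, other


cluster = ("1YBE_A A6U5X9 A6WV52 Q2KDT0 Q2YNV6 Q6G0X7 Q6G5H6 "
           "Q8UIS9 Q8YEP2 Q92S49 Q98D24 A1UUA2")
known, putative, other = split_cluster(cluster)
print(known)          # ['1YBE_A']
print(len(putative))  # 11
```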

The enzyme 1YBE looks like this…

[Image: 1YBE_A]

It’s an oddity because of the number of alpha-helices inserted within what is usually a hydrophobic beta barrel– the red pieces of ribbon should form a hollow cylinder, but it’s split apart for 1YBE and accommodates a bunch of cyan helices. Labeled in white are helices that break with the beta-alpha repeating secondary structural element (SSE) pattern by occurring before the first repeat. Labeled in green are breaks between beta-alpha SSE patterns.

Reference

Seetharaman, J., Swaminathan, S., Crystal Structure of a Nicotinate phosphoribosyltransferase [To be Published]

Written by Eddie Ma

October 28th, 2009 at 9:35 am

TIM Barrels and 4-Alpha Helix Bundles

Beta-Alpha TIM Barrels and 4-Alpha Helix bundles are the first of the major folds I’ll be looking at here at Waterloo…

As with all academic projects, the probability of goal, approach and method mutation is high. Looking at the above protein folds serves as an excellent starting point as I’ll be applying some of the established methods that Andrew and Aaron have developed.

Alpha helices and Beta sheets are objects that any high school biologist is acquainted with. To recapitulate, alpha helices are sequences of amino acids arranged so that the alpha carbon of each amino acid falls along the path of a helix. The number of amino acids per turn in this peptide and the regularity of the helix are determined by both the sequence and the environment that peptide finds itself in. Amino acids in the beta sheet conformation are arranged so that their alpha carbons zigzag. The result is a nice wide and flat shape schematically drawn as a sheet. Alpha helices and Beta sheets are collectively called secondary structural elements.

So, folds are these giant overarching classifications of proteins. Folds themselves are inherently structural, so classifying them OR using them as classifications is only relevant in structural studies and in web databases like SCOP and CATH. In these databases, classification of similarly structured proteins starts by determining whether the protein contains mostly alpha helices; mostly beta sheets; beta sheets and alpha helices alternately and irregularly; or beta sheets and alpha helices in distinct regions of the protein. In SCOP, further classification is done by manually assigning proteins to smaller and smaller categories. I had understood CATH to do these classifications with a hidden Markov model and then inspect them manually (or not), but it turns out that CATH uses a similar manual approach and uses HMMs only to assist; contrast this with PFAM, which actually uses HMMs for the majority of the work and is verified afterward by humans.
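To make that first, coarse level of classification concrete, here is a toy Python sketch. It is not how SCOP or CATH actually assign classes (those assignments rely on curation). It assumes the secondary structure is already available as a per-residue string of "a" (helix), "b" (strand) and "-" (other), counts elements rather than residues, and uses two invented thresholds, `minor` and `switch_rate`, to separate the four cases named above.

```python
from itertools import groupby


def sse_order(ss):
    """Reduce a per-residue string to its element order, e.g. 'bb--aaa-bb' -> 'bab'."""
    return "".join(cls for cls, _ in groupby(ss) if cls != "-")


def coarse_class(ss, minor=0.15, switch_rate=0.5):
    """Assign one of four coarse classes from the order of helices and strands.

    `minor` is the share of elements below which a type is treated as negligible,
    and `switch_rate` is the fraction of neighbouring elements that must differ
    for the protein to count as alternating. Both values are illustrative guesses.
    """
    order = sse_order(ss)
    if not order:
        return "no regular structure"
    helix_share = order.count("a") / len(order)
    if helix_share >= 1 - minor:
        return "mostly alpha"
    if helix_share <= minor:
        return "mostly beta"
    # Mixed case: both types are present, so there are at least two elements.
    switches = sum(x != y for x, y in zip(order, order[1:]))
    if switches / (len(order) - 1) >= switch_rate:
        return "alternating alpha/beta"
    return "segregated alpha+beta"


print(coarse_class("bbb--aaaa--bbb--aaaa--bbb--aaaa"))  # alternating alpha/beta
print(coarse_class("aaaa--aaaa--aaaa--bbb--bbb--bbb"))  # segregated alpha+beta
```

A TIM Barrel, with its repeated strand-then-helix units, would land in the alternating case under this scheme, while a 4-alpha-helix bundle would land in the mostly alpha case.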

Certain folds like the beta-Trefoil and TIM Barrel benefit from containing only proteins that cleanly fit into some subcategory or several subcategories– it is then possible to just drill into the right level of categorization and pull out all of the beta-Trefoils and TIM Barrels we want.

The 4-Alpha-Helix Bundle is a fold that manages to be spread around the databases, being a very common secondary structural repeat, and also a very small one when compared to the two giants above. The TIM Barrel and the helix bundle make an interesting contrast too: both machine and human intelligence pull TIM Barrels together while sprinkling alpha-helix bundles across databases and subcategories. And yes, the size difference helps too.

So, I’m starting with a structural, then sequence-based, alignment for single domain TIM Barrels and alpha-helix bundles; to be completely focused, an objective is named: to identify where sequence repeats occur in each individual protein.

Written by Eddie Ma

September 24th, 2009 at 2:09 pm

Near and Far Goals

Okay okay okay– so the defense was about four days ago, I need to get myself organized… these are the projects I can afford to participate in OR are the projects that will yield the most return in terms of enjoyment or some other intangible value.

Immediate Focus

  • Get married– with the wedding now only two weeks away, I need to pull everything I need together and really support Cara in the remaining tasks. As far as I understand, almost everything is in order already and it’s down to ballroom dance practicing etc. and getting our lines memorized. Well, there’s probably a lot more in terms of communicating with the flower girl and ring bearer– and my family– so hopefully we’ll magically finish just on time. Plus there’s a week in August where we’ll be at Disney World for the honeymoon, so I’ll certainly find myself in a bubble away from anything else.

Long Term Focus

  • The PhD– I’ve already written to Liz and Brendan about the summer, or actually– the changes about this summer. Earlier on, we had agreed it would be good if I got a head start on the PhD project in summer. At the very least, I’d refine the problem space and be able to formalize my interests. It looks like the most rational thing to do now is to do a normal clean start in Fall since there are a few outstanding things I need to tackle back at Guelph.

Important Intermediate Projects (In Randomly Generated Order)

  • Graduating and all its caveats (Must Finish)– I need to fix the thesis, which means I need to get the examiners’ notes back. I also need to finish some paperwork: I have a stack of signed documents that I shouldn’t lose that has to be handed in with the final thesis, and another stack of signed documents that has to be handed in with the department keys. This item doesn’t stress me out as much as it probably should…
  • iGEM (Would Be Nice)– I’ve been delegated nothing right now, so I actually should chase down Andre next week when we morph from developers to end users. Yes, that’s right– we shall suffer the glory of using our own software creation. I really ought to check out the repository to see if I can understand the code logic after everyone’s touched it (the modules are more or less mature at this point). Reading the code would be a start :D After that, the modeling team will probably break apart and merge into other teams in UWiGEM. Plus there’s mathematical modeling, planning for next year etc. The iGEM project has the potential to be paper-worthy if we manage to get some decent results… it’ll be a feat of an interdisciplinary team, which makes me happy to be a part of it all.
  • Chris’ Project (Must Finish)– So, this item is always running in the background since I’m not the primary owner of this project. Chris has done an excellent job with the science so far, which motivates me to offer him the support he needs to finish… this one’s another potential paper– but again, we need to get results and actually have something interesting to say. He’s actually doing five-fold and leave-one-out cross validation schemes right now, but that’s a post for another time.
  • MSc-X3 (Would Be Nice)– The math paper! Stefan and I decided a long time ago that a math paper would be good to complete the story of the NGN. This would actually be slightly more comprehensive than the thesis in explaining the math, including things like run time costs and analysis plus all of the equations that got lost in the translation.
  • Andre’s Mystery MSc-??? (Would Be Nice)– I have little detail about this, but it’s another gene management system with the additional feature that fragments can be checked out and added to end users’ own databases. I’m really curious to learn more. The last I heard, Andre managed to churn out code that didn’t have any bugs– panicked at the lack of bugs– then found some bugs– and was relieved.

Other Concerns

  • I’m going to stop short of listing three additional projects I had been working on previously– I have realized that I don’t have time, and the organizations (and one person) to whom these projects belong are probably aware that I managed to get buried in work… I really want to revisit these items in future, but I am unsure when time will become available again.

Written by Eddie Ma

July 18th, 2009 at 12:26 pm

Meeting with Liz and Brendan

Met with Dr. Meiering (Liz) and Dr. McConkey (Brendan) on Tuesday.

We’ve basically decided on the courses that I’m best suited for in the role of a TA. They recommended that I try out for Molecular Biotechnology since it relates especially to the thesis I’ll be working on. I’m tempted to agree, especially since I do have a bit of history with the stuff in my undergraduate project as well as iGEM last year. As a secondary choice, I chose Cell Biology as it’s something I’m very familiar with and use in day-to-day activities.

Written by Eddie Ma

June 10th, 2009 at 2:46 pm