# Ed's Big Plans

## Virus capsids are pretty

Brief: The majority of virus protein coats (capsids) are in the shape of an icosahedron — a figure with twenty equilateral triangles. The first time I saw this rendered was in a paper by David S. Goodsell. In it, Goodsell describes proteins with structural symmetries. Four viruses are used as examples — they are tobacco necrosis virus (2BUK), tomato bushy stunt virus (2TBV), bluetongue virus (3IYK) and simian virus 40 (1SVA) linked here to their RSCB PDB entries.

Pretty, aren’t they? Very pretty.

If you follow the PDB links, you can take a look at how a single tessellation unit appears, how long a chain is and how massive the capsid is.

Notice: The icosahedron (twenty equilateral triangles) must not be confused with the dodecahedron (twenty points).

Eddie Ma

September 27th, 2010 at 1:00 pm

## Idea: Delaunay Simplex Graph Grammar

The Structural Bioinformatics course I’m auditing comes with an independent project for graduate students. I’ve decided to see how feasible and meaningful it is to create a graph rewriting grammar for proteins that have been re-expressed as a Delauney Tessellation.

I was first introduced to the Delauney Tessellation about half a year ago. Such a tessellation is composed of irregular three dimensional tetrahedrons where each vertex corresponds to an amino acid. A hypothetical sphere that is defined by the four points of such a tetrahedron cannot be crossed by a line segment that does not belong to said tetrahedron.

An alphabet in formal languages is a finite set of arbitrarily irreducible tokens that composes the inputs of a language. In this project, I want to see if I can discover a grammar for the language of Delauney protein simplex graphs. Graph rewriting is likened to the collapse of neighbouring tetrahedrons. The tetrahedrons selected are either functionally important, stability important or have a strangely high probability of occurrence. This definition is recursively applied so that previously collapsed points are subject to further collapse in future passes of the algorithm.

When a subgraph is rewritten, two things happen. Some meaning is lost from the original representation of the protein, but that same meaning is captured on a stack of the changes made to the representation. In this way, the protein graph is iteratively simplified, while a stack that records the simplifications indicates all of the salient grammatical productions that have been used.

This stack is what my project is really after. Can a stack based on grammatical production rules for frequency of occurrence render any real information, or is it just noise? I can’t even create a solid angle to drive my hypothesis at this point. … “Yes … ?” …

I’ve seen a lot of weird machine learning algorithms in my line of work… and I attest that it’s hard for a novice to look at a description and decide whether or not it derives anything useful. Keep in mind that the literature is chuck full of things that DO work, and none of the things that didn’t make it. I conjecture that this representation has made me optimistically biased.

This method however IS feasible to deploy on short notice in the scope of an independent project ðŸ˜€

Eddie Ma

March 8th, 2010 at 6:31 pm