Virus capsids are pretty
Brief: The majority of virus protein coats (capsids) are in the shape of an icosahedron, a figure with twenty equilateral triangular faces. The first time I saw this rendered was in a paper by David S. Goodsell. In it, Goodsell describes proteins with structural symmetries. Four viruses are used as examples: tobacco necrosis virus (2BUK), tomato bushy stunt virus (2TBV), bluetongue virus (3IYK) and simian virus 40 (1SVA), linked here to their RCSB PDB entries.
Pretty, aren’t they? Very pretty.
If you follow the PDB links, you can take a look at how a single tessellation unit appears, how long a chain is and how massive the capsid is.
Notice: The icosahedron (twenty triangular faces, twelve vertices) must not be confused with the dodecahedron (twelve pentagonal faces, twenty vertices).
C# & Bioinformatics: Indexers & Substitution Matrices
I’ve recently come to appreciate the convenience of C# indexers. Indexers allow you to subscript an object using the familiar bracket notation. I’ve used them for substitution matrices as part of my phylogeny project. Indexers are essentially syntactic sugar that obeys the rules of method overloading. I first describe what I think are useful substitution matrix indexers and then a bare bones substitution matrix class (you could use your own). The indexer implementation is discussed last, so feel free to skip the preamble if you’re able to deduce what I’m doing.
Note: I’ve only discussed accessors (getters) and not mutators (setters) today.
Some Reasonable Substitution Matrix Indexers
This is the usage you might expect from indexers in C#.
// Let there be a class called SubMatrix which contains data from BLOSUM62.
var sm = new SubMatrix( ... );
// I'll assume you already have some constructors.
int index_of_proline = sm['P'];
// Returns the row or column that corresponds to proline, 14.
char token_at_three = sm[3];
// Returns the amino acid at position three, aspartate.
int score_proline_to_aspartate = sm['P', 'D'];
// Returns the score for a mutation from proline to aspartate, -1.
int score_aspartate_to_proline = sm[3, 14];
// Returns the score for a mutation from aspartate to proline, -1.
An Example Bare Bones Substitution Matrix Class
Let’s say you’ve loaded up the BLOSUM62 and are representing it internally in some 2D array…
// We're keying the rows and columns in the order given by BLOSUM62:
// ARNDCQEGHILKMFPSTWYVBZX* (24 rows, columns)
int[,] imatrix;
For convenience, let’s say you’ll also keep a dictionary to map at which array position one finds each amino acid…
// Keys = amino acid letters, Values = row or column index
Dictionary<char, int> indexOfAA;
Finally, we’ll put these two elements into a class and assume that you’ve already written your own constructors that will take care of the above two items — either from accepting arrays and strings as arguments or by reading from raw files. If this isn’t true and you need more help, feel free to leave a comment and I’ll extend this bare bones example.
// Bare bones class ...
public partial class SubMatrix {
    // Fields ...
    private int[,] imatrix;                   // Scores, keyed by row and column index
    private Dictionary<char, int> indexOfAA;  // Amino acid letter -> row or column index
    private string key;                       // Row/column order, e.g. "ARNDCQEGHILKMFPSTWYVBZX*"
    // Properties ...
    public int Width { // Returns number of rows of the matrix.
        get {
            return imatrix.GetLength(0);
        }
    }
    // Constructors ...
    ...
}
I’ve added the read-only property “Width” above. Properties are C# members that provide encapsulation, a public face to some arbitrary backend data; you’ve been using these all along when you’ve called “List.Count” or “Array.Length”.
Substitution Matrix Indexer Implementation
You can implement the example substitution matrix indexers as follows. Notice the use of the “this” keyword and square[] brackets[] to specify[] arguments[].
// And finally ...
public partial class SubMatrix {
    // Indexers ...
    // Give me the row or column index where I can find the token "aa".
    public int this[char aa] {
        get {
            return this.indexOfAA[aa];
        }
    }
    // Give me the amino acid at the row or column index "index".
    // Assumes a string field "key" that holds the row/column order,
    // e.g. "ARNDCQEGHILKMFPSTWYVBZX*".
    public char this[int index] {
        get {
            return this.key[index];
        }
    }
    // Give me the score for mutating token "row" to token "column".
    // Returns int to match the int[,] backing array and the usage shown earlier.
    public int this[char row, char column] {
        get {
            return this.imatrix[this.indexOfAA[row], this.indexOfAA[column]];
        }
    }
    // Give me the score for mutating token at index "row" to index "column".
    public int this[int row, int column] {
        get {
            return this.imatrix[row, column];
        }
    }
}
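To see the four indexers working together, here is a minimal, self-contained sketch. The class name SubMatrixSketch, its constructor and the SubMatrixDemo helper are assumptions made up for illustration (the constructors were left out above), and the matrix is only the top-left 4x4 corner of BLOSUM62 (A, R, N, D) rather than the full table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SubMatrix; the name and constructor are assumptions.
public class SubMatrixSketch {
    private readonly string key;                       // Row/column order
    private readonly int[,] imatrix;                   // Scores
    private readonly Dictionary<char, int> indexOfAA;  // Letter -> index

    public SubMatrixSketch(string key, int[,] scores) {
        this.key = key;
        this.imatrix = scores;
        this.indexOfAA = new Dictionary<char, int>();
        for (int i = 0; i < key.Length; i++) {
            this.indexOfAA[key[i]] = i;
        }
    }

    // The overloads are told apart by their parameter types alone.
    public int this[char aa] { get { return indexOfAA[aa]; } }
    public char this[int index] { get { return key[index]; } }
    public int this[char row, char column] {
        get { return imatrix[indexOfAA[row], indexOfAA[column]]; }
    }
    public int this[int row, int column] { get { return imatrix[row, column]; } }
}

public static class SubMatrixDemo {
    // Top-left 4x4 corner of BLOSUM62, rows and columns in the order A, R, N, D.
    public static SubMatrixSketch Corner() {
        return new SubMatrixSketch("ARND", new int[,] {
            {  4, -1, -2, -2 },
            { -1,  5,  0, -2 },
            { -2,  0,  6,  1 },
            { -2, -2,  1,  6 }
        });
    }

    public static void Main() {
        var sm = Corner();
        Console.WriteLine(sm['D']);       // prints 3
        Console.WriteLine(sm[2]);         // prints N
        Console.WriteLine(sm['N', 'D']);  // prints 1
        Console.WriteLine(sm[0, 1]);      // prints -1
    }
}
```

Because a char argument matches the char overload exactly and an int literal matches the int overload exactly, the compiler never has to guess which indexer you meant.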
Similar constructs are available in Python and Ruby but not Java. I’ll likely cover those later as well as how to set values too.
Edit number of rows shown for ‘Visitor Maps and Who’s Online’
Brief: For those of you who use WordPress, a handy plug-in to see who’s been viewing your site is Visitor Maps and Who’s Online. You’ll notice that the plugin’s management pages don’t offer a way to change the number of entries (rows) displayed in the Who’s Online and the Who’s Been Online pages. In order to change that, you’re going to dive a little deeper. Here are step by step instructions on how to increase the number of displayed visitors for version 1.5.2.
Requirements: Must have “WordPress” 3.x installed with the “Visitor Maps and Who’s Online” 1.x plugin installed.
(1) Log into your dashboard (wp-admin).
(2) Click on the Plugins administration page — it’s in the far left column.
(3) Scroll down the page and look for Visitor Maps and Who’s Online.
(4) Click on Edit (near Deactivate and Settings).
(5) In the far right column named Plugin Files, click on “visitor-maps/class-wo-been.php”.

(6) In the text area named “Editing visitor-maps/class-wo-been.php (inactive)”, search for the string “$rows_per_page = 25;”.

(7) Change the integer “25” to any whole number you want. I chose “100”.
(8) Click on the button below labelled Update File.

And you’re done! To see the results, click on “Who’s Been Online” and check out the extended list of visitors per page.
Still Unsolved: There’s still one item I haven’t really looked into. The display that you get back does not actually contain the number of rows defined by $rows_per_page; instead, this variable only tells the plugin the number of entries to load. This means that turning on a filter like “Show Bots: No” in “Who’s Been Online” loads $rows_per_page entries with bots included, then removes the bots from the display. You end up with $rows_per_page minus the number of bots displayed, instead of a total of $rows_per_page non-bot rows. I’ll wait till the next version; perhaps the author is already working on it. For now, this quick fix should help.
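The difference between "load, then filter" and "filter, then load" can be sketched in a few lines. This is not the plugin's code (the plugin is PHP and I haven't traced its query); it is a toy C# model, and the visitor log, class name and method names are all hypothetical.

```csharp
using System;
using System.Linq;

public static class PagingSketch {
    // Hypothetical visitor log: true marks a bot, false marks a human visitor.
    private static readonly bool[] IsBot = { false, true, false, true, false, false };

    // Load a page of rows first, then hide the bots (the behaviour described above).
    public static int LimitThenFilter(int rowsPerPage) {
        return IsBot.Take(rowsPerPage).Count(bot => !bot);
    }

    // Hide the bots first, then cut the page (what one would probably expect).
    public static int FilterThenLimit(int rowsPerPage) {
        return IsBot.Where(bot => !bot).Take(rowsPerPage).Count();
    }

    public static void Main() {
        Console.WriteLine(LimitThenFilter(4));  // prints 2
        Console.WriteLine(FilterThenLimit(4));  // prints 4
    }
}
```

With four rows per page, the first ordering loads two humans and two bots and shows only the two humans, while the second fills the page with four humans.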
C# & Bioinformatics: Align Strings to Edit Strings
This post follows roughly from the e-strings (R. Edgar, 2004) topic that I posted about here. The previous source code was listed in Ruby and JS, but this time I’ve used C#.
In this post, I discuss alignment strings (a-strings), then why and how to convert them into edit strings (e-strings). Incidentally, I can’t seem to recover where I first saw alignment strings so tell me if you know.
Alignment strings
When you perform a pairwise sequence alignment, the transformation that you perform on the two sequences is finite. You can record precisely where insertions, deletions and substitutions (or matches) are made. This is useful if you want to retain the original sequences, or later on build a multiple sequence alignment while keeping the history of modifications. There’s a data structure I’ve seen described called alignment strings; in it, you list out the characters ‘I’, ‘D’ and ‘M’ to describe the pairwise alignment.
Consider the example two protein subsequences below.
gi|115372949: GQAGDIRRLQSFNFQTYFVRHYN
gi|29827656:  SLSTGVSRSFQSVNYPTRYWQ
A global alignment of the two using the BLOSUM62 substitution matrix with the gap penalties -12 to open, -1 to extend yields the following.
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ
The corresponding alignment string looks like this…
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ
Align String: MMDDDDMIIIIIMMMMMMMMMDDDMMMM
Remember, the alignment is described such that the top string is modelled as occurring earlier than the bottom string, which occurs later. This is why a gap in the top string is an insertion (new material is inserted in the later, bottom string) while a gap in the bottom string is a deletion (that material is deleted to make the later string). In reality, it doesn’t really matter what’s on top and what’s on the bottom; the important thing is that the alignment now contains gaps.
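The top/bottom convention above can be written down directly. The helper below is a sketch rather than code from my project: the class and method names are made up, and it assumes the gap character is ‘-’ and that no column is a gap in both rows.

```csharp
using System;
using System.Text;

public static class AlignStringSketch {
    // Hypothetical helper: derives the a-string from two gapped rows of equal length.
    public static string FromAlignment(string top, string bottom) {
        if (top.Length != bottom.Length) {
            throw new ArgumentException("Aligned rows must have equal length.");
        }
        var sb = new StringBuilder(top.Length);
        for (int i = 0; i < top.Length; i++) {
            bool gapTop = top[i] == '-';
            bool gapBottom = bottom[i] == '-';
            if (gapTop && gapBottom) {
                throw new ArgumentException("Column " + i + " is a gap in both rows.");
            }
            // Gap on top = insertion, gap on bottom = deletion, otherwise match/substitution.
            sb.Append(gapTop ? 'I' : gapBottom ? 'D' : 'M');
        }
        return sb.ToString();
    }

    public static void Main() {
        Console.WriteLine(FromAlignment(
            "GQAGDIR-----RLQSFNFQTYFVRHYN",
            "SL----STGVSRSFQSVNYPT---RYWQ"));
        // prints MMDDDDMIIIIIMMMMMMMMMDDDMMMM
    }
}
```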
Why we should probably use e-strings instead
An alignment string describes the relationship between two strings and their gaps. We are actually recording some redundant information if we only want to take one string at a time into consideration, along with the path needed to construct the profiles it participates in. The top and bottom sequences are also treated differently: both ‘M’ and ‘D’ indicate retention of the original characters for the top string, and both ‘M’ and ‘I’ indicate retention for the bottom string; the remaining character of course implies the inclusion of a gap character. A pair of e-strings would give each of these sequences their own data structure and allow us to cleanly render the sequences as they appear in the deeper nodes of a phylogenetic tree using the multiply operation described last time.
Here are the corresponding e-strings for the above examples.
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
e-string:     < 7, -5, 16 >
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ
e-string:     < 2, -4, 15, -3, 4 >
Recall that an e-string is a list of alternating positive and negative integers; positive integers mean to retain a substring of the given length from the originating sequence, and negative integers mean to place in a gap of the given length.
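As a quick illustration of that rule, here is a small sketch that renders a sequence through one e-string. The class and method names are hypothetical, and the gap character is assumed to be ‘-’.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EStringSketch {
    // Hypothetical helper: applies one e-string to an ungapped sequence.
    public static string Apply(string seq, IEnumerable<int> estring) {
        var sb = new StringBuilder();
        int pos = 0;  // Next unread position in the original sequence
        foreach (int n in estring) {
            if (n > 0) {
                sb.Append(seq, pos, n);  // Retain n characters of the sequence
                pos += n;
            } else {
                sb.Append('-', -n);      // Place a gap of length |n|
            }
        }
        return sb.ToString();
    }

    public static void Main() {
        Console.WriteLine(Apply("GQAGDIRRLQSFNFQTYFVRHYN", new[] { 7, -5, 16 }));
        // prints GQAGDIR-----RLQSFNFQTYFVRHYN
    }
}
```

A handy sanity check: the positive entries of an e-string should sum to the length of the original sequence (here 7 + 16 = 23).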
Converting a-strings to e-strings
Below is a C# code listing for an implementation I used in my project to convert from a-strings to the more versatile e-strings. The thing is, I really don’t use a-strings to begin with anymore. In earlier versions of my project, I used to keep track of my movement across the score matrix using an a-string by dumping down a ‘D’ for a vertical hop, an ‘I’ for a horizontal hop and an ‘M’ for a diagonal hop. I now just count the number of relevant steps to generate the matching pair of e-strings.
The function below, astring_to_estring takes an a-string (string u) as an argument and returns two e-strings in a list (List<List<int>>) — don’t let that type confuse you, it simply breaks down to a list with two elements in it: the e-string for the top sequence (List<int> v), and the e-string for the bottom sequence (List<int> w).
// Requires: using System.Collections.Generic;
public static List<List<int>> astring_to_estring(string u) {
    /* Defined elsewhere are the constant characters ...
    EDITSUB = 'M';
    EDITINS = 'I';
    EDITDEL = 'D';
    */
    var v = new List<int>(); // Top e-string
    var w = new List<int>(); // Bottom e-string
    foreach(var uu in u) {
        if(uu == EDITSUB) { // If we receive an 'M' ...
            if(v.Count == 0) { // Working with e-string v (top, keep)
                v.Add(1);
            } else if(v[v.Count -1] <= 0) {
                v.Add(1);
            } else {
                v[v.Count -1] += 1;
            }
            if(w.Count == 0) { // Working with e-string w (bottom, keep)
                w.Add(1);
            } else if(w[w.Count -1] <= 0) {
                w.Add(1);
            } else {
                w[w.Count -1] += 1;
            }
        } else if(uu == EDITINS) { // If we receive an 'I' ...
            if(v.Count == 0) { // Working with e-string v (top, gap)
                v.Add(-1);
            } else if(v[v.Count -1] >= 0) {
                v.Add(-1);
            } else {
                v[v.Count -1] -= 1;
            }
            if(w.Count == 0) { // Working with e-string w (bottom, keep)
                w.Add(1);
            } else if(w[w.Count -1] <= 0) {
                w.Add(1);
            } else {
                w[w.Count -1] += 1;
            }
        } else if(uu == EDITDEL) { // If we receive a 'D' ...
            if(v.Count == 0) { // Working with e-string v (top, keep)
                v.Add(1);
            } else if(v[v.Count -1] >= 0) {
                v[v.Count -1] += 1;
            } else {
                v.Add(1);
            }
            if(w.Count == 0) { // Working with e-string w (bottom, gap)
                w.Add(-1);
            } else if(w[w.Count -1] >= 0) {
                w.Add(-1);
            } else {
                w[w.Count -1] -= 1;
            }
        }
    }
    var vw = new List<List<int>>(); // Set up return list ...
    vw.Add(v); // Top e-string v added ...
    vw.Add(w); // Bottom e-string w added ...
    return vw;
}
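The six near-identical branches above all do the same thing: extend the last run if its sign matches, otherwise start a new run. Here is a more compact sketch of that idea. It is an alternative I’m offering for illustration, not the listing from my project; the names AStringSketch, Extend and ToEStrings are made up, and unlike the listing above it throws on characters other than ‘M’, ‘I’ and ‘D’ instead of silently skipping them.

```csharp
using System;
using System.Collections.Generic;

public static class AStringSketch {
    // Extends the last run of an e-string by one step, or starts a new run
    // when the sign changes. sign is +1 (keep a residue) or -1 (emit a gap).
    private static void Extend(List<int> e, int sign) {
        int last = e.Count - 1;
        if (last >= 0 && Math.Sign(e[last]) == sign) {
            e[last] += sign;
        } else {
            e.Add(sign);
        }
    }

    // Returns { top e-string, bottom e-string } for the given a-string.
    public static List<List<int>> ToEStrings(string astring) {
        var v = new List<int>(); // Top e-string
        var w = new List<int>(); // Bottom e-string
        foreach (char c in astring) {
            switch (c) {
                case 'M': Extend(v, +1); Extend(w, +1); break; // keep, keep
                case 'I': Extend(v, -1); Extend(w, +1); break; // gap,  keep
                case 'D': Extend(v, +1); Extend(w, -1); break; // keep, gap
                default: throw new ArgumentException("Unexpected character: " + c);
            }
        }
        return new List<List<int>> { v, w };
    }

    public static void Main() {
        var vw = ToEStrings("MMDIIM");
        Console.WriteLine(string.Join(", ", vw[0])); // prints 3, -2, 1
        Console.WriteLine(string.Join(", ", vw[1])); // prints 2, -1, 3
    }
}
```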
The conversion back from e-strings to a-strings is also easy, but I don’t cover that today. Enjoy and happy coding!
Wanted: Semiotics Search Tool
Brief: One of the problems that I’ve encountered is the complete and utter inability to search for symbols online. One can enter keywords, but there doesn’t seem to be a good generative grammar or stick-figure search to specify the symbol that you’ve seen so that you can ask “what is this figure called”, “what does this figure mean?”, “who does it belong to?” — the closest I’ve found has been this reference, fittingly called symbols.com — but it only offers you a symbol whose name you already know. There are also a few tools that will present several chemical compounds that match a query sketch the user inputs — here’s a structure search by PubChem and another one by eMolecules.
So what figures do I want to be able to search for? Here are a few example queries …
- A circle is drawn with a freehand curve separating the figure roughly into two halves: should turn up the Yin Yang along with a few other circular bipartitioned figures.
- Two arcs are drawn beside one another concave inward with a staff in the middle: should turn up the symbol for Sikhism and the Caduceus.
- Two to four stick figure humans are placed inside a box: should turn up the symbol for an elevator, washrooms, etc.
If anyone knows of a good sketch-based semiotic search tool, please let me know. Or conversely, if anyone’s interested in having one developed, I’d be interested in helping.
Hi Ed,
I’ve been following your blog for a little while now, good stuff.
I found something on LifeHacker for you: a Hitachi-powered search called GazoPa.
Here’s your first example, it works pretty well! http://bit.ly/yy001
I was searching for something less relevant and found GazoPa instead. I was looking for a project where you could draw some stick figures in a scene (for example, a man on a house) and it would be rendered with real images. I think it was a limited research project, perhaps by Adobe, never released.
That’s really neat actually. I bet GazoPa and the project you mentioned could use the same image-recognition-to-descriptor front end. There was a grad student at Guelph that was working on the opposite half of what we’re after: she was working on that terribly intractable space of describing the components of an image for images returned from a search engine. I haven’t caught up with her yet, I think her name was Melanie Veltman.
Frequent typos of mine
Brief: There are a few typos that I consistently make. I have concluded that these things have been trained into my brain somehow. No matter how much I want to correct them, they just keep showing up. Even conscious efforts to “not type it wrongly this time” are only partially successful. It’s … like a speech impediment for my fingers.
Part of me wants to conjecture about motion planning, the cerebellum, buffer overruns and the QWERTY layout — but a larger part would rather not.
Here’s a list of words that get these automatic typos — Yes, I know I use four fingers on my left hand plus three fingers on my right hand to type — I think this scheme was inherited from an obsession with the DooM series of first person shooters when I was a primordial computer user.
- total → totoal — redundantly drummed ‘o’ with right middle finger.
- schematics → schemaitcs — incorrect priority given to right middle finger ‘i’ over left index ‘t’.
- blast → blasy — incorrectly increased reaching distance between left middle finger ‘s’ to left index ‘y’.
- desktop → dekstop — incorrect priority given to right middle finger ‘k’ over left index ‘s’.
- people → poeple — incorrect priority given to right middle finger ‘o’ over left middle ‘e’.
If you have a patch for my brain, please let me know.
Ed's Big Plans