Archive for August, 2010
C# & Bioinformatics: Align Strings to Edit Strings
This post follows roughly from the e-strings (R. Edgar, 2004) topic that I posted about here. The previous source code was listed in Ruby and JS, but this time I’ve used C#.
In this post, I discuss alignment strings (a-strings), then why and how to convert them into edit strings (e-strings). Incidentally, I can’t seem to recover where I first saw alignment strings so tell me if you know.
Alignment strings
When you perform a pairwise sequence alignment, the transformation that you perform on the two sequences is finite. You can record precisely where insertions, deletions and substitutions (or matches) are made. This is useful if you want to retain the original sequences, or later on build a multiple sequence alignment while keeping the history of modifications. There’s a datastructure I’ve seen described called alignment strings, and in it you basically list out the characters ‘I’, ‘D’ and ‘M’ to describe the pairwise alignment.
Consider the example two protein subsequences below.
gi|115372949: GQAGDIRRLQSFNFQTYFVRHYN gi|29827656: SLSTGVSRSFQSVNYPTRYWQ
A global alignment of the two using the BLOSUM62 substitution matrix with the gap penalties -12 to open, -1 to extend yields the following.
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN gi|29827656: SL----STGVSRSFQSVNYPT---RYWQ
The corresponding alignment string looks like this…
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN gi|29827656: SL----STGVSRSFQSVNYPT---RYWQ Align String: MMDDDDMIIIIIMMMMMMMMMDDDMMMM
Remember, the alignment is described such that the top string is modelled as occurring earlier than the bottom string which occurs later — this is why a gap in the top string is an insertion (that new material is inserted in the later, bottom string) while a gap in the bottom string is a deletion (that material is deleted to make the later string). Notice in reality, it doesn’t really matter what’s on top and what’s on bottom– the important thing is that the alignment now contains gaps.
Why we should probably use e-strings instead
An alignment string describes the relationship between two strings and their gaps — we are actually recording some redundant information if we only want to take one string at a time into consideration and the path needed to construct profiles it participates in. The top and bottom sequences are also treated differently, where both ‘M’ and ‘D’ indicates retention of the original characters for the top string and both ‘M’ and ‘I’ indicates retention for the bottom string; the remaining character of course implies the inclusion of a gap character. A pair of e-strings would give each of these sequences their own data structure and allow us to cleanly render the sequences as they appear in the deeper nodes of a phylogenetic tree using the multiply operation described last time.
Here are the corresponding e-strings for the above examples.
gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN e-string: < 7, -5, 16 > gi|29827656: SL----STGVSRSFQSVNYPT---RYWQ e-string: < 2, -4, 16, -3, 4 >
Recall that an e-string is a list of alternating positive and negative integers; positive integers mean to retain a substring of the given length from the originating sequence, and negative integers mean to place in a gap of the given length.
Converting a-strings to e-strings
Below is a C# code listing for an implementation I used in my project to convert from a-strings to the more versatile e-strings. The thing is — I really don’t use a-strings to begin with anymore. In earlier versions of my project, I used to keep track of my movement across the score matrix using an a-string by dumping down a ‘D’ for a vertical hop, a ‘I’ for a horizontal hop and a ‘M’ for a diagonal hop. I now just count the number of relevant steps to generate the matching pair of e-strings.
The function below, astring_to_estring takes an a-string (string u) as an argument and returns two e-strings in a list (List<List<int>>) — don’t let that type confuse you, it simply breaks down to a list with two elements in it: the e-string for the top sequence (List<int> v), and the e-string for the bottom sequence (List<int> w).
public static List<List<int>> astring_to_estring(string u) { /* Defined elsewhere are the constant characters ... EDITSUB = 'M'; EDITINS = 'I'; EDITDEL = 'D'; */ var v = new List<int>(); // Top e-string var w = new List<int>(); // Bottom e-string foreach(var uu in u) { if(uu == EDITSUB) { // If we receive a 'M' ... if(v.Count == 0) { // Working with e-string v (top, keep) v.Add(1); } else if(v[v.Count -1] <= 0) { v.Add(1); } else { v[v.Count -1] += 1; } if(w.Count == 0) { // Working with e-string w (bottom, gap) w.Add(1); } else if(w[w.Count -1] <= 0) { w.Add(1); } else { w[w.Count -1] += 1; } } else if(uu == EDITINS) { // If we receive a 'I' ... if(v.Count == 0) { // Working with e-string v (top, gap) v.Add(-1); } else if(v[v.Count -1] >= 0) { v.Add(-1); } else { v[v.Count -1] -= 1; } if(w.Count == 0) { // Working with e-string w (bottom, keep) w.Add(1); } else if(w[w.Count -1] <= 0) { w.Add(1); } else { w[w.Count -1] += 1; } } else if(uu == EDITDEL) { // If we receive a 'D' ... if(v.Count == 0) { // Working with e-string v (top, keep) v.Add(1); } else if(v[v.Count -1] >= 0) { v[v.Count -1] += 1; } else { v.Add(1); } if(w.Count == 0) { // Working with e-string w (bottom, keep) w.Add(-1); } else if(w[w.Count -1] >= 0) { w.Add(-1); } else { w[w.Count -1] -= 1; } } } var vw = new List<List<int>>(); // Set up return list ... vw.Add(v); // Top e-string v added ... vw.Add(w); // Bottom e-string w added ... return vw; }
The conversion back from e-strings to a-strings is also easy, but I don’t cover that today. Enjoy and happy coding 😀