Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Alignment Strings’ tag

C# & Bioinformatics: Align Strings to Edit Strings

without comments

This post follows roughly from the e-strings (R. Edgar, 2004) topic that I posted about here. The previous source code was listed in Ruby and JS, but this time I’ve used C#.

In this post, I discuss alignment strings (a-strings), then why and how to convert them into edit strings (e-strings). Incidentally, I can’t seem to recover where I first saw alignment strings so tell me if you know.

Alignment strings

When you perform a pairwise sequence alignment, the transformation that you perform on the two sequences is finite. You can record precisely where insertions, deletions and substitutions (or matches) are made. This is useful if you want to retain the original sequences, or later on build a multiple sequence alignment while keeping the history of modifications. There’s a datastructure I’ve seen described called alignment strings, and in it you basically list out the characters ‘I’, ‘D’ and ‘M’ to describe the pairwise alignment.

Consider the example two protein subsequences below.

gi|115372949: GQAGDIRRLQSFNFQTYFVRHYN
gi|29827656:  SLSTGVSRSFQSVNYPTRYWQ

A global alignment of the two using the BLOSUM62 substitution matrix with the gap penalties -12 to open, -1 to extend yields the following.

gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ

The corresponding alignment string looks like this…

gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ
Align String: MMDDDDMIIIIIMMMMMMMMMDDDMMMM

Remember, the alignment is described such that the top string is modelled as occurring earlier than the bottom string which occurs later — this is why a gap in the top string is an insertion (that new material is inserted in the later, bottom string) while a gap in the bottom string is a deletion (that material is deleted to make the later string). Notice in reality, it doesn’t really matter what’s on top and what’s on bottom– the important thing is that the alignment now contains gaps.

Why we should probably use e-strings instead

An alignment string describes the relationship between two strings and their gaps — we are actually recording some redundant information if we only want to take one string at a time into consideration and the path needed to construct profiles it participates in. The top and bottom sequences are also treated differently, where both ‘M’ and ‘D’ indicates retention of the original characters for the top string and both ‘M’ and ‘I’ indicates retention for the bottom string; the remaining character of course implies the inclusion of a gap character. A pair of e-strings would give each of these sequences their own data structure and allow us to cleanly render the sequences as they appear in the deeper nodes of a phylogenetic tree using the multiply operation described last time.

Here are the corresponding e-strings for the above examples.

gi|115372949: GQAGDIR-----RLQSFNFQTYFVRHYN
e-string:     < 7, -5, 16 >
gi|29827656:  SL----STGVSRSFQSVNYPT---RYWQ
e-string:     < 2, -4, 16, -3, 4 >

Recall that an e-string is a list of alternating positive and negative integers; positive integers mean to retain a substring of the given length from the originating sequence, and negative integers mean to place in a gap of the given length.

Converting a-strings to e-strings

Below is a C# code listing for an implementation I used in my project to convert from a-strings to the more versatile e-strings. The thing is — I really don’t use a-strings to begin with anymore. In earlier versions of my project, I used to keep track of my movement across the score matrix using an a-string by dumping down a ‘D’ for a vertical hop, a ‘I’ for a horizontal hop and a ‘M’ for a diagonal hop. I now just count the number of relevant steps to generate the matching pair of e-strings.

The function below, astring_to_estring takes an a-string (string u) as an argument and returns two e-strings in a list (List<List<int>>) — don’t let that type confuse you, it simply breaks down to a list with two elements in it: the e-string for the top sequence (List<int> v), and the e-string for the bottom sequence (List<int> w).

public static List<List<int>> astring_to_estring(string u) {
    /* Defined elsewhere are the constant characters ...
       EDITSUB = 'M';
       EDITINS = 'I';
       EDITDEL = 'D';
    */
    var v = new List<int>(); // Top e-string
    var w = new List<int>(); // Bottom e-string
    foreach(var uu in u) {
        if(uu == EDITSUB) { // If we receive a 'M' ...
            if(v.Count == 0) { // Working with e-string v (top, keep)
                v.Add(1);
            } else if(v[v.Count -1] <= 0) {
                v.Add(1);
            } else {
                v[v.Count -1] += 1;
            }
            if(w.Count == 0) { // Working with e-string w (bottom, gap)
                w.Add(1);
            } else if(w[w.Count -1] <= 0) {
                w.Add(1);
            } else {
                w[w.Count -1] += 1;
            }
        } else if(uu == EDITINS) { // If we receive a 'I' ...
            if(v.Count == 0) { // Working with e-string v (top, gap)
                v.Add(-1);
            } else if(v[v.Count -1] >= 0) {
                v.Add(-1);
            } else {
                v[v.Count -1] -= 1;
            }
            if(w.Count == 0) { // Working with e-string w (bottom, keep)
                w.Add(1);
            } else if(w[w.Count -1] <= 0) {
                w.Add(1);
            } else {
                w[w.Count -1] += 1;
            }
        } else if(uu == EDITDEL) { // If we receive a 'D' ...
            if(v.Count == 0) { // Working with e-string v (top, keep)
                v.Add(1);
            } else if(v[v.Count -1] >= 0) {
                v[v.Count -1] += 1;
            } else {
                v.Add(1);
            }
            if(w.Count == 0) { // Working with e-string w (bottom, keep)
                w.Add(-1);
            } else if(w[w.Count -1] >= 0) {
                w.Add(-1);
            } else {
                w[w.Count -1] -= 1;
            }
        }
    }
    var vw = new List<List<int>>(); // Set up return list ...
    vw.Add(v); // Top e-string v added ...
    vw.Add(w); // Bottom e-string w added ...
    return vw;
}

The conversion back from e-strings to a-strings is also easy, but I don’t cover that today. Enjoy and happy coding 😀

Eddie Ma

August 21st, 2010 at 9:03 pm

Bioinformatics: Edit Strings (Ruby, JavaScript)

without comments

>>> Attached: ( editStringAdder.rb — in Ruby | editStrings.js — in JavaScript ) <<<

I wanted to share something neat that I’ve been using in the phylogeny analyzing software I’ve been developing. In the MUSCLE3 BMC Bioinformatics paper (Robert Edgar, 2004), Edgar describes a way to describe all the changes required to get from one raw sequence to how it appears somewhere in an internal node, riddled with gaps. He calls the data structures Edit Strings (e-strings).

I started using these things in my visualizer because it is a convenient way to describe how to render a profile of sequences at internal nodes of my phylogenetic trees. Rather than storing a copy of the sequence with gaps at each of its ancestors, e-strings allow you to store a discrete pathway of the changes your alignment algorithm made to a sequence at each node.

You can chain e-strings together with a multiply operation so that the profile of any node in the tree can be created out of just the original primary sequences of the dataset (the taxa) and the set of corresponding e-strings for the given tree. You can choose to store the alphabet-distribution profiles this way at each node too, but I decided to use the e-strings only for my visualizer (where seeing each sequence within a profile is important).

Edit strings take the form of a sequence of alternating positive and negative integers; positive integers mean “retain a substring of this length” while negative integers mean “insert gaps of this length” — here’s a few example applications of edit strings on the sequence “ABCDEFGHI“:

<10>("ABCDEFGHIJ") = "ABCDEFGHIJ"
<10 -3>("ABCDEFGHIJ") = "ABCDEFGHIJ---"
<4, -2, 6>("ABCDEFGHIJ") = "ABCD--EFGHIJ"
<2, -1, 5, -4, 3>("ABCDEFGHIJ") = "AB-CDEFG----HIJ"

A few particulars should be mentioned. An application of an e-string onto a sequence is only defined when the sum of the positive integers equals the length of the sequence. The total number of negative integers is arbitrary, and refers to any number of inserted gaps. Finally, a digit zero is meaningless. Edit strings can be multiplied together. If you apply two edit strings in succession onto a sequence, the result is the same as if you had applied the product of two edit strings onto that same sequence. Here are some examples of edit strings multipled together.

<10> * <10> = <10>
<10> * <7, -1, 2> = <7, -1, 2>
<6, -1, 4> * <7, -1, 3, -2, 1> = <6, -2, 3, -2, 1>
<6, -1, 4> * <3, -1, 3, -1, 3, -1, 2> = <3, -1, 3, -2, 2, -1, 2>
<6, -1, 4> * <2, -1, 6, -5, 3> = <2, -1, 4, -1, 1, -5, 3>

The multiply function is a bit tricky to implement at first, but one can work backward from the result to get the intermediate step that’s performed mentally. The last two examples above can manually calculated if we imagine an intermediate step as follows. We convert the first e-string to its application on a sequence of ‘+’ symbols– in both the below cases, this results in ten ‘+’ symbols with one ‘-‘ (gap) inserted after the sixth position. We then apply the second edit string to the result of the first application. Finally, we write out the total changes as an e-string on the original sequence of ten ‘+’ symbols. Don’t worry, we don’t really do this in real life software — it’s just a good way for human brains to comprehend the overall operation.

Second last example above — expanded out…

<6, -1, 4> * <3, -1, 3, -1, 3, -1, 2> = ...
++++++-++++ * <3, -1, 3, -1, 3, -1, 2> = ...
+++-+++--++-++ = <3, -1, 3, -2, 2, -1, 2>

Last example above — expanded out…

<6, -1, 4> * <2, -1, 6, -5, 3> = ...
++++++-++++ * <2, -1, 6, -5, 3> = ...
++-++++-+-----+++ = <2, -1, 4, -1, 1, -5, 3>

Here’s the code in Ruby for my take on the multiply function. I derived my version of the function based on the examples from Edgar’s BMC paper (mentioned before). When you think about it, the runtime is the sum of the number of elements of the two e-strings being multiplied. This becomes obvious when you realize that the only computation that really occurs is when a positive integer is encountered in the second e-string. The function is called estring_product(), it takes two e-strings as arguments (u, v) and returns a single e-string (w). This function internally calls an estring_collapse() function because we intermediately create an e-string that may have several positive integers or several negative integers in a row (when this happens, it’s usually just two in a row). Consecutive same-signed integers of an e-string are added together.

def estring_product(u, v)
    w = []
    uu_replacement = nil
    ui = 0
    vi = 0
    v.each do |vv|
        if vv < 0
            w << vv
        elsif vv > 0
            vv_left = vv
            uu = uu_replacement != nil ? uu_replacement : u[ui]
            uu_replacement = nil
            until vv_left == 0
                if vv_left >= uu.abs
                    w << uu
                    vv_left -= uu.abs
                    ui += 1
                    uu = u[ui]
                else
                    if uu > 0
                        w << vv_left
                        uu -= vv_left
                    elsif uu < 0
                        w << -vv_left
                        uu += vv_left
                    end
                    vv_left = 0
                    uu_replacement = uu
                end
            end
        end
        vi += 1
    end
    return estring_collapse(w)
end

If you aren’t familiar with Ruby, the “<<” operator means “append the right value to the left collection”. Everything else should be pretty self-explanatory (leave a comment if it’s not).

def estring_collapse(u)
    v = []
    u.each do |uu|
        if v[-1] == nil
            v << uu
        elsif (v[-1] <=> 0) == (uu <=> 0)
            v[-1] += uu
        else
            v << uu
        end
    end
    return v
end

The “<=>” lovingly referred to as the spaceship operator compares the right value to the left value; if the right value is greater, then the result is 1.0; if the left value is greater, then the result is -1.0; if the left and right values are the same, then the result is 0.0.

A Few Notes on the Attached Files (Ruby, Javascript)

The Ruby source has a lot of comments in it that should help you understand when and where to use the included two functions. The Javascript is actually used in a visualizer I’ve deployed now so it has a few more practical functions in it. A function called sequence_estring() is included that takes a $sequence and applies an $estring, then returns a new gapped sequence. A utility signum() function is included which takes the place of the spaceship operator in the Ruby version. A diagnostic function, p_uvw() is included that just uses document.write() to print out whatever you give as (estring) $u, (estring) $v and (estring) $w. Finally, the JavaScript functions sequence_estring() and estring_product() will print out error messages with document.write() when the total sequence length specified in the first argument is not the same as the total sequence length implied by the second argument. Remember, the sum of the absolute values of the first argument describes the total length of the sequence– each of these tokens must be accounted for by the positive values of the second argument, the estring that will operate on it.

Updates to this post:

  1. Two hyphens in a row were displayed as a single dash — this has been fixed.
  2. Code listing plug-in uses literal “<” and “>” instead of the entity codes “&lt;” and “&gt;” — fixed.