Bioinformatics at Ed's Big Plans

Archive for the ‘Bioinformatics’ tag

C & Bioinformatics: ASCII Nucleotide Comparison Circuit

Here’s a function I developed for Andre about a week ago. The C function takes two arguments. Both arguments are C characters. The first argument corresponds to a degenerate nucleotide as in the below table. The second argument corresponds to a non-degenerate nucleotide {‘A’, ‘C’, ‘G’, ‘T’} or any nucleotide ‘N’. The function returns zero if the logical intersection between the two arguments is zero. The function returns a one-hot binary encoding for the logical intersection if it exists so that {‘A’ = 1, ‘C’ = 2, ‘G’ = 4, ‘T’ = 8} and {‘N’ = 15}. All of this is done treating the lower 5-bits of the ASCII encoding for each character as wires of a circuit.

Character	ASCII (Low 5-Bits)	Represents	One Hot	Equals
A	00001	A	0001	1
B	00010	CGT	1110	14
C	00011	C	0010	2
D	00100	AGT	1101	13
G	00111	G	0100	4
H	01000	ACT	1011	11
K	01011	GT	1100	12
M	01101	AC	0011	3
N	01110	ACGT	1111	15
R	10010	AG	0101	5
S	10011	GC	0110	6
T	10100	T	1000	8
V	10110	ACG	0111	7
W	10111	AT	1001	9
Y	11001	CT	1010	10

The premise is that removing all of the logical branching and using only binary operators would make things a bit faster — I’m actually not sure if the following solution is faster because there are twelve variables local to the function scope — we can be assured that at least half of these variables will be stored outside of cache and will have to live in memory. We’d get a moderate speed boost if at all.

/*
    f():
        Bitwise comparison circuit that treats nucleotide and degenerate
        nucleotide ascii characters as mathematical sets.
        The operation performed is set i intersect j.
    arg char i:
        Primer -- accepts ascii |ABCDG HKMNR STVWY| = 15.
    arg char j:
        Sequence -- accepts ascii |ACGTN| = 5.
    return char (k = 0):
        false -- i intersect j is empty.
    return char (k > 0):
        1 -- the intersection is 'A'
        2 -- the intersection is 'C'
        4 -- the intersection is 'G'
        8 -- the intersection is 'T'
       15 -- the intersection is 'N'
    return char (undefined value):
        ? -- if any other characters are placed in i or j.
*/
char f(char i, char j) {

    // break apart Primer into bits ...
    char p = (i >> 4) &1;
    char q = (i >> 3) &1;
    char r = (i >> 2) &1;
    char s = (i >> 1) &1;
    char t =  i       &1;

    // break apart Sequence into bits ...
    char a = (j >> 4) &1;
    char b = (j >> 3) &1;
    char c = (j >> 2) &1;
    char d = (j >> 1) &1;
    char e =  j       &1;

    return

    ( // == PRIMER CIRCUIT ==
        ( // -- A --
            ((p|q|r|s  )^1) &         t |
            ((p|q|  s|t)^1) &     r     |
            ((p|  r|s|t)^1) &   q       |
            ((p|    s  )^1) &   q&r&  t |
            ((p|      t)^1) &   q&r&s   |
            ((  q|    t)^1) & p&s       |
            ((  q      )^1) & p&r&s
        )
        |
        ( // -- C --
            ((p|q|r    )^1) &       s   |
            ((p|  r|s|t)^1) &   q       |
            ((p|    s  )^1) &   q&r&  t |
            ((p|      t)^1) &   q&r&s   |
            ((  q|r    )^1) &       s&t |
            ((  q|    t)^1) & p  &r&s   |
            ((    r|s  )^1) & p&q&    t
        ) << 1
        |
        ( // -- G --
            ((  q|r|  t)^1) &       s   |
            ((p|q|  s|t)^1) &     r     |
            ((p|q      )^1) &     r&s&t |
            ((p|  r    )^1) &   q&  s&t |
            ((p|      t)^1) &   q&r&s   |
            ((  q|r    )^1) & p&    s   |
            ((  q|    t)^1) & p&    s
        ) << 2
        |
        ( // -- T --
            ((p|q|r|  t)^1) &       s   |
            ((  q|  s|t)^1) &     r     |
            ((p|  r|s|t)^1) &   q       |
            ((p|  r    )^1) &   q&  s&t |
            ((p|      t)^1) &   q&r&s   |
            ((  q      )^1) & p&  r&s&t |
            ((    r|s  )^1) & p&q&    t
        ) << 3
    )
    &
    ( // == SEQUENCE CIRCUIT ==
        ( // -- A --
            ((a|b|c|d  )^1) &         e |
            ((a|      e)^1) &   b&c&d
        )
        |
        ( // -- C --
            ((a|b|c    )^1) &       d&e |
            ((a|      e)^1) &   b&c&d
        ) << 1
        |
        ( // -- G --
            ((a|b      )^1) &     c&d&e |
            ((a|      e)^1) &   b&c&d
        ) << 2
        |
        ( // -- T --
            ((a|      e)^1) &   b&c&d   |
            ((  b|  d|e)^1) & a&  c
        ) << 3
    );
}

Andre’s eventual solution was to use a look-up table which very likely proves faster in practice. At the very least, this was a nice refresher and practical example for circuit logic using four sets of minterms (one for each one-hot output wire).

Should you need this logic to build a fast physical circuit or have some magical architecture with a dozen registers (accessible to the compiler), be my guest 😀

Eddie Ma

February 27th, 2011 at 11:49 pm

Posted in Computational Biology,Featured

Tagged with Andre Masella, ASCII, Binary Logic, Bioinformatics, Circuit, Degenerate Nucleotide, linkedin

C# & Bioinformatics: Edit Strings and Local Alignment Subprofiles

with 2 comments

Alternatively: The awful finite string manipulation bug that took me forever to fix.

Update: Code last update 2010 Dec. 12 (branch logic fixed: gave default “else” case highest priority instead).

Every so often, a bug creeps in during the implementation of a utility function that halts an entire project until it can be fixed. I bumped into one that was so insidious — so retrospectively obvious — that I must share it with you. This particular bug has to do with the way Edit Strings behave when you use them to represent changes done during a Local Alignment rather than a Global Alignment.

Those of you who have looked into MUSCLE may have seen the 2004 BMC paper that goes into great detail about its implementation. In it, Dr. Edgar describes Edit Strings (e-strings) — these are integer strings which keep track of the edits performed on a sequence in order to insert gaps in the correct locations corresponding to an alignment.

I’ve previously discussed and explained e-strings and their use in sequence alignment here and also how to convert alignment strings (a recording of site-to-site insertions/deletions/substitutions) to edit strings here.

E-Strings in Local Alignments

E-strings weren’t originally designed with local alignments in mind — but that’s OK. We can make a few adjustments that will let us keep track of gaps the way we intend them. In a global alignment, all tokens (characters) in each of the aligned sequences are considered. In a local alignment, we need to be able to specify two aligned subsequences rather than the alignment of two full sequences. We can add a simple integer offset to indicate the number of positions since the beginning of aligned sequences to specify the start position of each subsequence.

These are two sequences s and t — (to keep things easy to visualize, I’ve just used characters from the English alphabet).

s = DEIJKLMNOPQRSTUVWXYZ
t = ABCDEFGHIJKLMNOUVWXY

Assume that the following is the optimal global alignment of s and t.

s(global) = ---DE---IJKLMNOPQRSTUVWXYZ
t(global) = ABCDEFGHIJKLMNO-----UVWXY-

The corresponding edit strings for s and t are e_s and e_t respectively.

e(s, global) = < -3, 2, -3, 18 >
e(t, global) = < 15, -5, 5, -1 >

Remember, we can take the sum of the absolute values in two edit strings corresponding to the same alignment to be sure that they contain the number of sites. This is true because a site in a given sequence or profile must be either an amino acid (in a sequence) or a distribution of amino acids (in a profile) or a gap (in either). In this case, the absolute values of both e_s and e_t correctly add up to 26.

Now let’s adapt the above to local alignments.

The corresponding optimal local alignment for s and t is as follows.

s(local) = DE---IJKLMNOPQRSTUVWXY
t(local) = DEFGHIJKLMNO-----UVWXY

The edit strings only need an offset to specify the correct start position in each sequence. Remember that our indexing is zero based, so the position for the character “A” was at position zero.

e(s, local) = 3 + < 2, -3, 17 >
e(t, local) = 3 + < 12, -5, 5 >

It’s a pretty straight forward fix. Notice that the absolute values of the integers inside the vectors now only sum up to 22 (corresponding to the sites stretching from “D” to “Y”).

Local Alignment E-String Subprofiles

The bug I faced wasn’t in the above however. It only occurred when I needed to specify subprofiles of local alignments. Imagine having a local alignment of two sequences in memory. Imagine now that there is a specific run of aligned sites therein that you are interested in and want to extract. That specific run should be specifiable by using the original sequences and a corresponding pair of e-strings and pair of offsets. Ideally, the new e-strings and offsets should be derived from the e-strings and offsets we had gathered from performing the original local alignment. We call this specific run of aligned sites a subprofile.

As you will see, this bug is actually a result of forgetting to count the number of gaps needed that occur after the original local alignment offset and before where the new subprofile should start. Don’t worry, this will become clearer with an example.

For the local alignment of s and t, I want to specify a subprofile stretching from the site at letter “I” to the site at letter “V”.

s(local) = DE---IJKLMNOPQRSTUVWXY → s(subprofile) = IJKLMNOPQRSTUV
t(local) = DEFGHIJKLMNO-----UVWXY → t(subprofile) = IJKLMNO-----UV

What do you suppose the edit string to derive would be?

Let’s puzzle this out — we want to take a substring from each s and t from position 5 to position 18 of the local alignment as shown below …

Position:       0    0            1  2
                0    5            8  1
     s(local) = DE---IJKLMNOPQRSTUVWXY →
s(subprofile) =      IJKLMNOPQRSTUV
                |    |            |  |
     t(local) = DEFGHIJKLMNO-----UVWXY →
t(subprofile) =      IJKLMNO-----UV
                     |            |
                     |<-  Want  ->|

Let’s break this down further and get first the vector part of the e-string and then the offset part of the e-string.

Getting the Vector part of the Edit String

Remember, we’re trying to get an edit string that will operate on the original sequences s, t and return s_subprofile, t_subprofile. Getting the vector part of the estring is straight forward, you literally find all of the sites specified in the subsequence and save them. Since we want positions [5, 18] in the local edit strings, we retrieve the parts of the edit strings which correspond to those sites (we’ll ignore offsets for now). To visualize this site-to-site mapping, let’s decompose our local estrings into something I call unit-estrings — an equivalent estring which only uses positive and negative ones (given as “+” and “-” below in the unit vectors u_s, u_t for s, t respectively).

Position:                                0         0                         1     2
                                         0         5                         8     1
    s(local) =                           D E - - - I J K L M N O P Q R S T U V W X Y
e(s, vector) = < 2, -3, 17 > -> u(s) = < + + - - - + + + + + + + + + + + + + + + + + >
e(s, newvec) = < 14 >        <-        <           + + + + + + + + + + + + + +       >
                                                   |<-        want!        ->|
e(t, newvec) = < 7, -5, 2 >  <-        <           + + + + + + + - - - - - + +       >
e(t, vector) = < 12, -5, 5 > -> u(t) = < + + + + + + + + + + + + - - - - - + + + + + >
    t(local) =                           D E F G H I J K L M N O - - - - - U V W X Y

We retrieve as our new vectors e_s_,_newvec = < 14 > and e_t_,_newvec = < 7, -5, 2 >. Notice that the integer vector elements are just a count of the runs of positive and negative ones in the subprofile region that we want.

Getting the Offset part of the Edit String (the Right way)

This part gets a bit tricky. The offset is a sum of three items.

The offset from the original sequence to the local alignment.
The offset from the local alignment to the subprofile.
The negative of all of the gap characters that are inserted in the local alignment until the start of the subprofile.

The below diagram shows where each of these values come from.

            s =    DEIJKLMNOPQRSTUVWXYZ
     s(local) =    DE---IJKLMNOPQRSTUVWXY -> +0 -- offset from original to local
s(subprofile) =    .....IJKLMNOPQRSTUV    -> +5 -- offset from local to subprofile
                     ---                  -> -3 -- gaps: local start to subprofile start
                                          => +2 == TOTAL
                        |            |
            t = ABCDEFGHIJKLMNOUVWXY
     t(local) = ...DEFGHIJKLMNO-----UVWXY -> +3 -- offset from original to local
t(subprofile) =    .....IJKLMNO-----UV    -> +5 -- offset from local to subprofile
                                          -> -0 -- gaps: local start to subprofile start
                                          => +8 == TOTAL

You might be wondering about the weird gap subtraction for sequence s. This is the solution to my bug! We need this gap subtraction! When you forget to subtract these gaps, you end up with an alignment that is incorrect where the subprofile of s starts too late. But why is this so? When we transform a sequence from the original to the subprofile, the gaps that we’ve taken for granted in the local alignment don’t exist yet. This way, we would be mistakenly pushing our subprofile into a position where non-existent gapped sites are virtually aligned to existing amino acid sites in the alignment partner t. This was a very odd concept for me to grasp, so let yourself think on this for a bit if it still doesn’t make sense. … You could always take it for granted too once I show you below that this is indeed the correct solution 😛

The completed edit strings to specify the subprofile from “I” to “V” for the local alignment of s and t are thus:

e(s, subprofile) = 2 + < 14 >
e(t, subprofile) = 8 + < 7, -5, 2 >

So let’s make sure these work — let’s apply e_s_{, subprofile} and e_{t, subprofile} onto s and t to see if we get back the correct subprofiles.

>>> apply offsets ...

s = DEIJKLMNOPQRSTUVWXYZ : +2 remaining ...
s = IJKLMNOPQRSTUVWXYZ

t = ABCDEFGHIJKLMNOUVWXY : +8 remaining ...
t = IJKLMNOUVWXY

>>> apply e-strings ...

s = IJKLMNOPQRSTUVWXYZ : < 14 > remaining ...
s = IJKLMNOPQRSTUV

t = IJKLMNOUVWXY : < 7, -5, 2 > remaining ...
t = IJKLMNO : UVWXY : < -5, 2 > remaining ...
t = IJKLMNO----- : UVWXY : < 2 > remaining ...
t = IJKLMNO-----UV

s = IJKLMNOPQRSTUV -- selected subprofile ok.
t = IJKLMNO-----UV -- selected subprofile ok.

The above application results in the desired subprofile.

The SubEString Function in C#

The below source code gives the correct implementation in order to find the e-string corresponding to a subprofile with the following arguments. It is an extension function for List<int> — a natural choice since I am treating lists of integer type as edit strings in my project.

this List<int> u — the local alignment e-string from which to create the subprofile e-string.
int start — the offset from the left specifying the location of the subprofile with respect to the local alignment.
int end_before — the indexed site before which we stop; this is different from length since start and end_before are zero-based and are specified from the very start of the local alignment.
out int lostGaps — returns a negative value to add to the offset — this is the number of gaps that occur between the start of the local alignment and the start of the subprofile.

This function returns a new List<int> e-string as you would expect.

public static List<int> SubEString(
     this List<int> u, int start, int end_before, out int lostGaps) {

    if(end_before < start) {
        Console.Error.WriteLine("Error: end_before < start?! WHY?");
        throw new ValueOutOfRange();
    }

    var retObj = new List<int>();
    var consumed = 0; // the amount "consumed" so far.
    var remaining_start = start; // subtract "remaining_start" to zero.
    var ui = 0; // the current index in the e-string
    var uu = 0; // the current element of the e-string
    var uu_abs = 0; // the absolute value of the current element
    lostGaps = 0; // the number of gaps that is subtracted from the offset.

    //START -- dealing with the "start" argument ...
    for(; ui < u.Count; ui ++) {
        uu = u[ui];
        uu_abs = Math.Abs(uu);
        if(uu_abs < remaining_start) {
            if(uu < 0) lostGaps += uu; //LOSTGAPS
            // add uu this time
            // it's the fraction of remaining_start we care about.
            consumed += uu_abs; // uu smaller than start amount, eat uu.
            remaining_start -= uu_abs; // start shrinks by uu.
            uu_abs = 0; // no uu left.
            uu = 0; // no uu left.
            // not done -- go back and get more uu for me to eat.
        }
        else if(consumed + uu_abs == remaining_start) {
            if(uu < 0) lostGaps += -remaining_start; //LOSTGAPS
            // -- if thus uu is equal to the start amount,
            // count them into LOSTGAPS
            // we want to get everything occurring before the subestring start
            consumed += remaining_start; // same size -- consume start amount.
            uu_abs = 0; // MAD (mutually assured destruction 0 -> 0)
            remaining_start = 0; // MAD
            uu = 0; // MAD
            ui ++; break; // done.
        }
        else if(consumed + uu_abs > remaining_start) {
            if(consumed + uu_abs > end_before && consumed < remaining_start) {
                // easier to say ui == u.Count-1?
                if(uu < 0) lostGaps += -remaining_start; //LOSTGAPS
                uu_abs = (end_before - (consumed + remaining_start));
                uu = uu > 0 ? uu_abs : -uu_abs;
                retObj.Add(uu);
                return retObj;
            } else if(consumed + uu_abs < end_before) {
                consumed += uu_abs;
                // Seemingly wrong -- however:
                // push all the way to the end; subtract the beginning of uu_abs
                // ensures correct calculation of remaining string
                uu_abs -= remaining_start; // remove only start amount.
                // we're getting the end half of uu_abs!
                if(uu < 0) lostGaps += -remaining_start; //LOSTGAPS
                remaining_start = 0; // no start left.
                uu = uu > 0 ? uu_abs : -uu_abs; // remaining uu at this ui.
                if(uu != 0) retObj.Add(uu); // fixes a zero-entry bug
                ui ++; break; // done.
            } else {
                if(uu < 0) lostGaps += -remaining_start; //LOSTGAPS
                uu_abs = (end_before - remaining_start);
                uu = uu > 0 ? uu_abs : -uu_abs;
                retObj.Add(uu);
                return retObj;
            }
        }
    }

    //END_BEFORE -- dealing with the "end_before" argument ...
    for(; ui < u.Count; ui ++) {
        uu = u[ui];
        uu_abs = Math.Abs(uu);
        if((consumed + uu_abs) == end_before) {
            retObj.Add(uu);
            break;
        } else if((consumed + uu_abs) < end_before) {
            retObj.Add(uu);
            consumed += uu_abs;
        } else /*if((consumed + uu_abs) > end_before)*/ {
            uu_abs = end_before - consumed;
            uu = uu > 0 ? uu_abs : -uu_abs;
            retObj.Add(uu);
            break;
        }
    }
    return retObj;
}

I’ve really sent the above code through unit tests and also in practical deployment so it should be relatively bug free. I hope that by posting this here, you can tease out the logic from both the code and the comments and port it to your preferred language. In retrospect, maybe this bug wasn’t so obvious after all. When you think about it, each of the pieces of code and logic are doing exactly what we asked them to. It was only in chaining together an e-string subprofile function along with a local alignment that this thing ever even popped out. Either way, the fix is here. Hopefully it’ll be helpful.

Two Important Notes

Since global alignments represent the aligned sites of all sequences involved, no offset is used and hence this bug would never occur when specifying a subprofile of a global alignment.
Gaps that occur in the original sequence don’t affect retrieving the subprofile e-string calculation at all because those gaps DO exist and can be authentically counted into an offset.

Next

I haven’t had a need yet for porting the multiply function (earlier post) from global alignment logic to local alignment logic. If there are any insidious logical flaws like the ones I encountered this time, I’ll be sure to make a post.

Eddie Ma

November 18th, 2010 at 1:58 pm

Posted in Computational Biology

Tagged with Bioinformatics, C Sharp, Edit Strings, linkedin, Local Alignment, Sequence Alignment, Software Bugs, String Manipulation

Ed's Big Plans

Archive for the ‘Bioinformatics’ tag

C & Bioinformatics: ASCII Nucleotide Comparison Circuit

C# & Bioinformatics: Edit Strings and Local Alignment Subprofiles

Eddie Yee Tak Ma

Navigation

Profiles

Colleagues

License