Ed's Big Plans

C & Bioinformatics: ASCII Nucleotide Comparison Circuit

Here’s a function I developed for Andre about a week ago. The C function takes two arguments. Both arguments are C characters. The first argument corresponds to a degenerate nucleotide as in the below table. The second argument corresponds to a non-degenerate nucleotide {‘A’, ‘C’, ‘G’, ‘T’} or any nucleotide ‘N’. The function returns zero if the logical intersection between the two arguments is zero. The function returns a one-hot binary encoding for the logical intersection if it exists so that {‘A’ = 1, ‘C’ = 2, ‘G’ = 4, ‘T’ = 8} and {‘N’ = 15}. All of this is done treating the lower 5-bits of the ASCII encoding for each character as wires of a circuit.

 Character ASCII (Low 5-Bits) Represents One Hot Equals A 00001 A 0001 1 B 00010 CGT 1110 14 C 00011 C 0010 2 D 00100 AGT 1101 13 G 00111 G 0100 4 H 01000 ACT 1011 11 K 01011 GT 1100 12 M 01101 AC 0011 3 N 01110 ACGT 1111 15 R 10010 AG 0101 5 S 10011 GC 0110 6 T 10100 T 1000 8 V 10110 ACG 0111 7 W 10111 AT 1001 9 Y 11001 CT 1010 10

The premise is that removing all of the logical branching and using only binary operators would make things a bit faster — I’m actually not sure if the following solution is faster because there are twelve variables local to the function scope — we can be assured that at least half of these variables will be stored outside of cache and will have to live in memory. We’d get a moderate speed boost if at all.

```/*
f():
Bitwise comparison circuit that treats nucleotide and degenerate
nucleotide ascii characters as mathematical sets.
The operation performed is set i intersect j.
arg char i:
Primer -- accepts ascii |ABCDG HKMNR STVWY| = 15.
arg char j:
Sequence -- accepts ascii |ACGTN| = 5.
return char (k = 0):
false -- i intersect j is empty.
return char (k > 0):
1 -- the intersection is 'A'
2 -- the intersection is 'C'
4 -- the intersection is 'G'
8 -- the intersection is 'T'
15 -- the intersection is 'N'
return char (undefined value):
? -- if any other characters are placed in i or j.
*/
char f(char i, char j) {

// break apart Primer into bits ...
char p = (i >> 4) &1;
char q = (i >> 3) &1;
char r = (i >> 2) &1;
char s = (i >> 1) &1;
char t =  i       &1;

// break apart Sequence into bits ...
char a = (j >> 4) &1;
char b = (j >> 3) &1;
char c = (j >> 2) &1;
char d = (j >> 1) &1;
char e =  j       &1;

return

( // == PRIMER CIRCUIT ==
( // -- A --
((p|q|r|s  )^1) &         t |
((p|q|  s|t)^1) &     r     |
((p|  r|s|t)^1) &   q       |
((p|    s  )^1) &   q&r&  t |
((p|      t)^1) &   q&r&s   |
((  q|    t)^1) & p&s       |
((  q      )^1) & p&r&s
)
|
( // -- C --
((p|q|r    )^1) &       s   |
((p|  r|s|t)^1) &   q       |
((p|    s  )^1) &   q&r&  t |
((p|      t)^1) &   q&r&s   |
((  q|r    )^1) &       s&t |
((  q|    t)^1) & p  &r&s   |
((    r|s  )^1) & p&q&    t
) << 1
|
( // -- G --
((  q|r|  t)^1) &       s   |
((p|q|  s|t)^1) &     r     |
((p|q      )^1) &     r&s&t |
((p|  r    )^1) &   q&  s&t |
((p|      t)^1) &   q&r&s   |
((  q|r    )^1) & p&    s   |
((  q|    t)^1) & p&    s
) << 2
|
( // -- T --
((p|q|r|  t)^1) &       s   |
((  q|  s|t)^1) &     r     |
((p|  r|s|t)^1) &   q       |
((p|  r    )^1) &   q&  s&t |
((p|      t)^1) &   q&r&s   |
((  q      )^1) & p&  r&s&t |
((    r|s  )^1) & p&q&    t
) << 3
)
&
( // == SEQUENCE CIRCUIT ==
( // -- A --
((a|b|c|d  )^1) &         e |
((a|      e)^1) &   b&c&d
)
|
( // -- C --
((a|b|c    )^1) &       d&e |
((a|      e)^1) &   b&c&d
) << 1
|
( // -- G --
((a|b      )^1) &     c&d&e |
((a|      e)^1) &   b&c&d
) << 2
|
( // -- T --
((a|      e)^1) &   b&c&d   |
((  b|  d|e)^1) & a&  c
) << 3
);
}
```

Andre’s eventual solution was to use a look-up table which very likely proves faster in practice. At the very least, this was a nice refresher and practical example for circuit logic using four sets of minterms (one for each one-hot output wire).

Should you need this logic to build a fast physical circuit or have some magical architecture with a dozen registers (accessible to the compiler), be my guest 😀

Eddie Ma

February 27th, 2011 at 11:49 pm

J .A. Brown says...

Those are the same degenerate motifs I was asking you about the other day. I have some cool things to show you next time I see you then.

Eddie Ma says...

I thought you were talking about these — ya, we should discuss the Regex goodness of it all 😀

Andre Masella says...

You know what was forgotten? Multiplication and addition. Both of those scramble bits in a useful way. That might have yielded a solution with fewer shifts.

Eddie Ma says...

Hehe — it’ll be some time before I sit down to work out how to play with addition and multiplication. Might be useful — certainly will be convoluted. We’d be playing up the advantages of using C instead of a literal circuit.