C & Bioinformatics: ASCII Nucleotide Comparison Circuit
Here’s a function I developed for Andre about a week ago. The C function takes two arguments. Both arguments are C characters. The first argument corresponds to a degenerate nucleotide as in the below table. The second argument corresponds to a non-degenerate nucleotide {‘A’, ‘C’, ‘G’, ‘T’} or any nucleotide ‘N’. The function returns zero if the logical intersection between the two arguments is zero. The function returns a one-hot binary encoding for the logical intersection if it exists so that {‘A’ = 1, ‘C’ = 2, ‘G’ = 4, ‘T’ = 8} and {‘N’ = 15}. All of this is done treating the lower 5-bits of the ASCII encoding for each character as wires of a circuit.
| Character | ASCII (Low 5-Bits) |
Represents | One Hot | Equals |
| A |
00001 |
A | 0001 |
1 |
| B | 00010 | CGT | 1110 |
14 |
| C | 00011 | C | 0010 |
2 |
| D | 00100 | AGT | 1101 |
13 |
| G | 00111 | G | 0100 |
4 |
| H | 01000 | ACT | 1011 |
11 |
| K |
01011 | GT | 1100 |
12 |
| M | 01101 | AC | 0011 |
3 |
| N | 01110 | ACGT | 1111 |
15 |
| R | 10010 | AG | 0101 | 5 |
| S | 10011 | GC | 0110 |
6 |
| T | 10100 | T |
1000 |
8 |
| V | 10110 | ACG |
0111 |
7 |
| W | 10111 | AT |
1001 |
9 |
| Y | 11001 | CT |
1010 |
10 |
The premise is that removing all of the logical branching and using only binary operators would make things a bit faster — I’m actually not sure if the following solution is faster because there are twelve variables local to the function scope — we can be assured that at least half of these variables will be stored outside of cache and will have to live in memory. We’d get a moderate speed boost if at all.
/*
f():
Bitwise comparison circuit that treats nucleotide and degenerate
nucleotide ascii characters as mathematical sets.
The operation performed is set i intersect j.
arg char i:
Primer -- accepts ascii |ABCDG HKMNR STVWY| = 15.
arg char j:
Sequence -- accepts ascii |ACGTN| = 5.
return char (k = 0):
false -- i intersect j is empty.
return char (k > 0):
1 -- the intersection is 'A'
2 -- the intersection is 'C'
4 -- the intersection is 'G'
8 -- the intersection is 'T'
15 -- the intersection is 'N'
return char (undefined value):
? -- if any other characters are placed in i or j.
*/
char f(char i, char j) {
// break apart Primer into bits ...
char p = (i >> 4) &1;
char q = (i >> 3) &1;
char r = (i >> 2) &1;
char s = (i >> 1) &1;
char t = i &1;
// break apart Sequence into bits ...
char a = (j >> 4) &1;
char b = (j >> 3) &1;
char c = (j >> 2) &1;
char d = (j >> 1) &1;
char e = j &1;
return
( // == PRIMER CIRCUIT ==
( // -- A --
((p|q|r|s )^1) & t |
((p|q| s|t)^1) & r |
((p| r|s|t)^1) & q |
((p| s )^1) & q&r& t |
((p| t)^1) & q&r&s |
(( q| t)^1) & p&s |
(( q )^1) & p&r&s
)
|
( // -- C --
((p|q|r )^1) & s |
((p| r|s|t)^1) & q |
((p| s )^1) & q&r& t |
((p| t)^1) & q&r&s |
(( q|r )^1) & s&t |
(( q| t)^1) & p &r&s |
(( r|s )^1) & p&q& t
) << 1
|
( // -- G --
(( q|r| t)^1) & s |
((p|q| s|t)^1) & r |
((p|q )^1) & r&s&t |
((p| r )^1) & q& s&t |
((p| t)^1) & q&r&s |
(( q|r )^1) & p& s |
(( q| t)^1) & p& s
) << 2
|
( // -- T --
((p|q|r| t)^1) & s |
(( q| s|t)^1) & r |
((p| r|s|t)^1) & q |
((p| r )^1) & q& s&t |
((p| t)^1) & q&r&s |
(( q )^1) & p& r&s&t |
(( r|s )^1) & p&q& t
) << 3
)
&
( // == SEQUENCE CIRCUIT ==
( // -- A --
((a|b|c|d )^1) & e |
((a| e)^1) & b&c&d
)
|
( // -- C --
((a|b|c )^1) & d&e |
((a| e)^1) & b&c&d
) << 1
|
( // -- G --
((a|b )^1) & c&d&e |
((a| e)^1) & b&c&d
) << 2
|
( // -- T --
((a| e)^1) & b&c&d |
(( b| d|e)^1) & a& c
) << 3
);
}
Andre’s eventual solution was to use a look-up table which very likely proves faster in practice. At the very least, this was a nice refresher and practical example for circuit logic using four sets of minterms (one for each one-hot output wire).
Should you need this logic to build a fast physical circuit or have some magical architecture with a dozen registers (accessible to the compiler), be my guest 😀
Ed's Big Plans
Those are the same degenerate motifs I was asking you about the other day. I have some cool things to show you next time I see you then.
I thought you were talking about these — ya, we should discuss the Regex goodness of it all 😀
You know what was forgotten? Multiplication and addition. Both of those scramble bits in a useful way. That might have yielded a solution with fewer shifts.
Hehe — it’ll be some time before I sit down to work out how to play with addition and multiplication. Might be useful — certainly will be convoluted. We’d be playing up the advantages of using C instead of a literal circuit.