Archive for the ‘Degenerate Nucleotide’ tag
C & Bioinformatics: ASCII Nucleotide Comparison Circuit
Here’s a function I developed for Andre about a week ago. The C function takes two arguments. Both arguments are C characters. The first argument corresponds to a degenerate nucleotide as in the below table. The second argument corresponds to a non-degenerate nucleotide {‘A’, ‘C’, ‘G’, ‘T’} or any nucleotide ‘N’. The function returns zero if the logical intersection between the two arguments is zero. The function returns a one-hot binary encoding for the logical intersection if it exists so that {‘A’ = 1, ‘C’ = 2, ‘G’ = 4, ‘T’ = 8} and {‘N’ = 15}. All of this is done treating the lower 5-bits of the ASCII encoding for each character as wires of a circuit.
Character | ASCII (Low 5-Bits) |
Represents | One Hot | Equals |
A |
00001 |
A | 0001 |
1 |
B | 00010 | CGT | 1110 |
14 |
C | 00011 | C | 0010 |
2 |
D | 00100 | AGT | 1101 |
13 |
G | 00111 | G | 0100 |
4 |
H | 01000 | ACT | 1011 |
11 |
K |
01011 | GT | 1100 |
12 |
M | 01101 | AC | 0011 |
3 |
N | 01110 | ACGT | 1111 |
15 |
R | 10010 | AG | 0101 | 5 |
S | 10011 | GC | 0110 |
6 |
T | 10100 | T |
1000 |
8 |
V | 10110 | ACG |
0111 |
7 |
W | 10111 | AT |
1001 |
9 |
Y | 11001 | CT |
1010 |
10 |
The premise is that removing all of the logical branching and using only binary operators would make things a bit faster — I’m actually not sure if the following solution is faster because there are twelve variables local to the function scope — we can be assured that at least half of these variables will be stored outside of cache and will have to live in memory. We’d get a moderate speed boost if at all.
/* f(): Bitwise comparison circuit that treats nucleotide and degenerate nucleotide ascii characters as mathematical sets. The operation performed is set i intersect j. arg char i: Primer -- accepts ascii |ABCDG HKMNR STVWY| = 15. arg char j: Sequence -- accepts ascii |ACGTN| = 5. return char (k = 0): false -- i intersect j is empty. return char (k > 0): 1 -- the intersection is 'A' 2 -- the intersection is 'C' 4 -- the intersection is 'G' 8 -- the intersection is 'T' 15 -- the intersection is 'N' return char (undefined value): ? -- if any other characters are placed in i or j. */ char f(char i, char j) { // break apart Primer into bits ... char p = (i >> 4) &1; char q = (i >> 3) &1; char r = (i >> 2) &1; char s = (i >> 1) &1; char t = i &1; // break apart Sequence into bits ... char a = (j >> 4) &1; char b = (j >> 3) &1; char c = (j >> 2) &1; char d = (j >> 1) &1; char e = j &1; return ( // == PRIMER CIRCUIT == ( // -- A -- ((p|q|r|s )^1) & t | ((p|q| s|t)^1) & r | ((p| r|s|t)^1) & q | ((p| s )^1) & q&r& t | ((p| t)^1) & q&r&s | (( q| t)^1) & p&s | (( q )^1) & p&r&s ) | ( // -- C -- ((p|q|r )^1) & s | ((p| r|s|t)^1) & q | ((p| s )^1) & q&r& t | ((p| t)^1) & q&r&s | (( q|r )^1) & s&t | (( q| t)^1) & p &r&s | (( r|s )^1) & p&q& t ) << 1 | ( // -- G -- (( q|r| t)^1) & s | ((p|q| s|t)^1) & r | ((p|q )^1) & r&s&t | ((p| r )^1) & q& s&t | ((p| t)^1) & q&r&s | (( q|r )^1) & p& s | (( q| t)^1) & p& s ) << 2 | ( // -- T -- ((p|q|r| t)^1) & s | (( q| s|t)^1) & r | ((p| r|s|t)^1) & q | ((p| r )^1) & q& s&t | ((p| t)^1) & q&r&s | (( q )^1) & p& r&s&t | (( r|s )^1) & p&q& t ) << 3 ) & ( // == SEQUENCE CIRCUIT == ( // -- A -- ((a|b|c|d )^1) & e | ((a| e)^1) & b&c&d ) | ( // -- C -- ((a|b|c )^1) & d&e | ((a| e)^1) & b&c&d ) << 1 | ( // -- G -- ((a|b )^1) & c&d&e | ((a| e)^1) & b&c&d ) << 2 | ( // -- T -- ((a| e)^1) & b&c&d | (( b| d|e)^1) & a& c ) << 3 ); }
Andre’s eventual solution was to use a look-up table which very likely proves faster in practice. At the very least, this was a nice refresher and practical example for circuit logic using four sets of minterms (one for each one-hot output wire).
Should you need this logic to build a fast physical circuit or have some magical architecture with a dozen registers (accessible to the compiler), be my guest 😀