Ed's Big Plans

Computing for Science and Awesome

Archive for the ‘Multiple Sequence Alignment’ tag

Simple Interactive Phylogenetic Tree Sketches JS+SVG+XHTML

without comments

>>> Attached: ( parse_newick_xhtml.js | draw_phylogeny.js — implementation in JS | index.html — demo with IL-1 ) <<<
>>> Attached: ( auto_phylo_diagram.tgz — includes all above, a python script, a demo makefile and IL-1 data ) <<<

In this post, I discuss a Python script I wrote that will convert a Newick Traversal and a FASTA file into a browser-viewable interactive diagram using SVG+JS+XHTML. This method is compatible with WebKit (Safari, Chrome) and Mozilla (Firefox) renderers but not Opera. XHTML doesn’t work with IE, so we’ll have to wait for massive adoption of HTML5 before I can make this work for everyone — don’t worry, it’s coming.

To the left is what the rendered tree looks like (shown as a screen captured PNG).

I recommend looking at the demo index.xhtml above if your browser supports it now. The demo will do a far better job than text in explaining exactly what features are possible using this method.

Try clicking on the different nodes and hyperlinks in the demo — notice that the displayed alignment strips away sites (columns) that are 100% gaps so that each node displays only a profile relevant for itself.

I originally intended to explain all the details of the JavaScripts that render and drive the diagram, but I think it’d be more useful to first focus on the Python script and the data input and output from the script.

The Attached Python Script to_xhtml.py found in auto_phylo_diagram.tgz

I’ll explain how to invoke to_xhtml.py with an example below.

python to_xhtml.py IL1fasta.txt IL1tree2.txt "IL-1 group from NCBI CDD (15)" > index.xhtml

Arguments …

  1. Plain Text FASTA file — IL1fasta.txt
  2. Plain Text Newick Traversal file — IL1tree2.txt
  3. Title of the generated XHTML — “IL-1 group from NCBI CDD (15)”

The data I used was generated with MUSCLE with default parameters. I specified the output FASTA with -out and the output second iteration tree with -tree2.

The Attached makefile found in auto_phylo_diagram.tgz

The makefile has two actual rules. If index.html does not exist, it will be regenerated with the above Python script and IL1fasta.txt and IL1tree2.txt. If either IL1fasta.txt or IL1tree2.txt or both do not exist, these files are regenerated using MUSCLE with default parameters on the included example data file ncbi.IL1.cl00094.15.fasta.txt. Running make on the included package shouldn’t do anything since all of the files are included.

The Example Data

The example data comes from the Interleukin-1 group (cd00100) from the NCBI Conserved Domains Database (CDD).

Finally, I’ll discuss briefly the nitty gritty of the JavaScripts found as stand-alones above that are also included in the package.

Part Two of Two

As is the nature of programatically drawing things, this particular task comes with its share of prerequisites.

In a previous — Part 1, I covered Interactive Diagrams Using Inline JS+SVG+XHTML — see that post for compatibility notes and the general method (remember, this technique is a holdover that will eventually be replaced with HTML5 and that the scripts in this post are not compatible with Opera anymore).

In this current post — Part 2, I’ll also assume you’ve seen Parsing a Newick Tree with recursive descent (you don’t need to know it inside and out, since we’ll only use these functions) — if you look at the example index.html attached, you’ll see that we actually parse the Newick Traversal using two functions in parse_newick_xhtml.jstree_builder() then traverse_tree(). The former converts the traversal to an in-memory binary tree while the latter accepts the root node of said tree and produces an array. This array is read in draw_phylogeny.js by fdraw_tree_svg() which does the actual rendering along with attaching all of the dynamic behaviour.

Parameters ( What are we drawing? )

To keep things simple, let’s focus on three requirements: (1) the tree will me sketched top-down with the root on top and its leaves below; (2) the diagram must make good use of the window width and avoid making horizontal scroll bars (unless it’s absolutely necessary); (3) the drawing function must make the diagram compatible with the overall document object model (DOM) so that we can highlight nodes and know what nodes were clicked to show the appropriate data.

Because of the second condition, all nodes in a given depth of the tree will be drawn at the same time. A horizontal repulsion will be applied so that each level will appear to be centre-justified.

General Strategy

Because we are centre-justifying the tree, we will be drawing each level of the tree simultaneously to calculate the offsets we want for each node. We perform this iteratively for each level of depth of the tree in the function fdraw_tree_svg(). Everything that is written dynamically by JavaScript goes to a div with a unique id. All of the SVG goes to id=”tree_display” and all of the textual output including raster images, text and alignments goes to id=”thoughts”. To attach behaviour to hyperlinks, whether it’s within the SVG or using standard anchor tags, the “onclick” property is defined. In this case, we always call the same function “onclick=javascript:select_node()“. This function accepts a single parameter: the integer serial number that has been clicked.

Additional Behaviour — Images and Links to Databases

After I wrote the first drafts of the JavaScripts, I decided to add two more behaviours in the current version. First, if a particular node name has the letters “GI” in it, the script will attempt to hyperlink it with the proceeding serial number against NCBI. Second, if a particular node name has the letters “PDB” in it, the script will attempt to hyperlink the following identifier against RSCB Protein Database and also pull the first thumbnail with an img tag for the 3D protein structure.

Enjoy!

This was done because I wanted to show a custom alignment algorithm to my advisors. In this demo, I’ve stripped away the custom alignment leaving behind a more general yet pleasing visualizer. I hope that the Python script and JavaScripts are useful to you. Enjoy!

Eddie Ma

January 1st, 2011 at 11:55 pm

Bioinformatics: Edit Strings (Ruby, JavaScript)

without comments

>>> Attached: ( editStringAdder.rb — in Ruby | editStrings.js — in JavaScript ) <<<

I wanted to share something neat that I’ve been using in the phylogeny analyzing software I’ve been developing. In the MUSCLE3 BMC Bioinformatics paper (Robert Edgar, 2004), Edgar describes a way to describe all the changes required to get from one raw sequence to how it appears somewhere in an internal node, riddled with gaps. He calls the data structures Edit Strings (e-strings).

I started using these things in my visualizer because it is a convenient way to describe how to render a profile of sequences at internal nodes of my phylogenetic trees. Rather than storing a copy of the sequence with gaps at each of its ancestors, e-strings allow you to store a discrete pathway of the changes your alignment algorithm made to a sequence at each node.

You can chain e-strings together with a multiply operation so that the profile of any node in the tree can be created out of just the original primary sequences of the dataset (the taxa) and the set of corresponding e-strings for the given tree. You can choose to store the alphabet-distribution profiles this way at each node too, but I decided to use the e-strings only for my visualizer (where seeing each sequence within a profile is important).

Edit strings take the form of a sequence of alternating positive and negative integers; positive integers mean “retain a substring of this length” while negative integers mean “insert gaps of this length” — here’s a few example applications of edit strings on the sequence “ABCDEFGHI“:

<10>("ABCDEFGHIJ") = "ABCDEFGHIJ"
<10 -3>("ABCDEFGHIJ") = "ABCDEFGHIJ---"
<4, -2, 6>("ABCDEFGHIJ") = "ABCD--EFGHIJ"
<2, -1, 5, -4, 3>("ABCDEFGHIJ") = "AB-CDEFG----HIJ"

A few particulars should be mentioned. An application of an e-string onto a sequence is only defined when the sum of the positive integers equals the length of the sequence. The total number of negative integers is arbitrary, and refers to any number of inserted gaps. Finally, a digit zero is meaningless. Edit strings can be multiplied together. If you apply two edit strings in succession onto a sequence, the result is the same as if you had applied the product of two edit strings onto that same sequence. Here are some examples of edit strings multipled together.

<10> * <10> = <10>
<10> * <7, -1, 2> = <7, -1, 2>
<6, -1, 4> * <7, -1, 3, -2, 1> = <6, -2, 3, -2, 1>
<6, -1, 4> * <3, -1, 3, -1, 3, -1, 2> = <3, -1, 3, -2, 2, -1, 2>
<6, -1, 4> * <2, -1, 6, -5, 3> = <2, -1, 4, -1, 1, -5, 3>

The multiply function is a bit tricky to implement at first, but one can work backward from the result to get the intermediate step that’s performed mentally. The last two examples above can manually calculated if we imagine an intermediate step as follows. We convert the first e-string to its application on a sequence of ‘+’ symbols– in both the below cases, this results in ten ‘+’ symbols with one ‘-‘ (gap) inserted after the sixth position. We then apply the second edit string to the result of the first application. Finally, we write out the total changes as an e-string on the original sequence of ten ‘+’ symbols. Don’t worry, we don’t really do this in real life software — it’s just a good way for human brains to comprehend the overall operation.

Second last example above — expanded out…

<6, -1, 4> * <3, -1, 3, -1, 3, -1, 2> = ...
++++++-++++ * <3, -1, 3, -1, 3, -1, 2> = ...
+++-+++--++-++ = <3, -1, 3, -2, 2, -1, 2>

Last example above — expanded out…

<6, -1, 4> * <2, -1, 6, -5, 3> = ...
++++++-++++ * <2, -1, 6, -5, 3> = ...
++-++++-+-----+++ = <2, -1, 4, -1, 1, -5, 3>

Here’s the code in Ruby for my take on the multiply function. I derived my version of the function based on the examples from Edgar’s BMC paper (mentioned before). When you think about it, the runtime is the sum of the number of elements of the two e-strings being multiplied. This becomes obvious when you realize that the only computation that really occurs is when a positive integer is encountered in the second e-string. The function is called estring_product(), it takes two e-strings as arguments (u, v) and returns a single e-string (w). This function internally calls an estring_collapse() function because we intermediately create an e-string that may have several positive integers or several negative integers in a row (when this happens, it’s usually just two in a row). Consecutive same-signed integers of an e-string are added together.

def estring_product(u, v)
    w = []
    uu_replacement = nil
    ui = 0
    vi = 0
    v.each do |vv|
        if vv < 0
            w << vv
        elsif vv > 0
            vv_left = vv
            uu = uu_replacement != nil ? uu_replacement : u[ui]
            uu_replacement = nil
            until vv_left == 0
                if vv_left >= uu.abs
                    w << uu
                    vv_left -= uu.abs
                    ui += 1
                    uu = u[ui]
                else
                    if uu > 0
                        w << vv_left
                        uu -= vv_left
                    elsif uu < 0
                        w << -vv_left
                        uu += vv_left
                    end
                    vv_left = 0
                    uu_replacement = uu
                end
            end
        end
        vi += 1
    end
    return estring_collapse(w)
end

If you aren’t familiar with Ruby, the “<<” operator means “append the right value to the left collection”. Everything else should be pretty self-explanatory (leave a comment if it’s not).

def estring_collapse(u)
    v = []
    u.each do |uu|
        if v[-1] == nil
            v << uu
        elsif (v[-1] <=> 0) == (uu <=> 0)
            v[-1] += uu
        else
            v << uu
        end
    end
    return v
end

The “<=>” lovingly referred to as the spaceship operator compares the right value to the left value; if the right value is greater, then the result is 1.0; if the left value is greater, then the result is -1.0; if the left and right values are the same, then the result is 0.0.

A Few Notes on the Attached Files (Ruby, Javascript)

The Ruby source has a lot of comments in it that should help you understand when and where to use the included two functions. The Javascript is actually used in a visualizer I’ve deployed now so it has a few more practical functions in it. A function called sequence_estring() is included that takes a $sequence and applies an $estring, then returns a new gapped sequence. A utility signum() function is included which takes the place of the spaceship operator in the Ruby version. A diagnostic function, p_uvw() is included that just uses document.write() to print out whatever you give as (estring) $u, (estring) $v and (estring) $w. Finally, the JavaScript functions sequence_estring() and estring_product() will print out error messages with document.write() when the total sequence length specified in the first argument is not the same as the total sequence length implied by the second argument. Remember, the sum of the absolute values of the first argument describes the total length of the sequence– each of these tokens must be accounted for by the positive values of the second argument, the estring that will operate on it.

fsMSA Algorithm Context

without comments

What started as a meeting between me and my advisors ended up being a ball of unresolved questions about the cultural context of multiple sequence alignment and phylogenetic trees. While I had a good idea of what the field and its researchers had looked into and developed, I hadn’t a grasp of how far along we were. The result is the presentation I’ve just finished. In it, I discuss what I consider to be a representative sampling of the alignment and phylogenetic tree building algorithms available right now, at this very instant.

(PDF not posted, contact me if interested.)

Eddie Ma

January 13th, 2010 at 12:12 pm