Notes 20100817 McConkey Meeting
From SnOwy - Ed's Wiki Notebook
Committee Members
- Contact Dr. David Rose
- Contact Dr. Dan Brown (and ask about getting him cross appointed)
Thesis Proposal Details
- change the objective function from length to PSP
- figure out how to threshold or evaluate
Table 3
- language: what am I saying when I use "related"? Change it to an indication of how similar the repeats are
- look for spurious repeats as opposed to repeats being inherited (a novel repeat occurs in a child not available in parent)
- see table sent in e-mail -- there are four cases to consider instead
- continue on with the plan we had discussed --- it asks the same question
Examples
- three-fold repeat monstrosities...
- last example has an interesting anomaly -- the right child has an alignment that's off by half a period
- remember it's looking for the highest possible score -- that'll probably be something longer than one period
- we should consider how different the piece of the profile belonging to each child is:
- that is, in parent ABCDEFG with child ABC and child DEFG, how different is {ABC ∈ ABCDEFG}'s repeat vs ABC's?
- similarly, how different is {DEFG ∈ ABCDEFG}'s repeat from DEFG's?
- when you use Waterman-Eggert, how do you know where the boundaries of B are?
- are you sure that we can just use the two edges of B?
- can we recalculate the remaining score matrix after a round of zeroing in the Waterman-Eggert?
- should consider the diagonal extending from the zeroed row?
- does zeroing make a difference?
- can I just evaluate the need for this?
- recall: we need to remember that I'm transitioning between the original boundary snipping version and the Waterman-Eggert version
- can WE do these?
- detect three repeats
- off by fractional period repeats
- if so, good!
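One way to evaluate whether zeroing makes a difference is to try it on small inputs. The sketch below is a simplified, Waterman-Eggert-style loop and not the real algorithm: it recomputes the whole score matrix after each round instead of only the affected region, and it zeroes every cell on the previous alignment path. The function names, the match/mismatch/gap values, and the linear gap penalty are all assumptions made for illustration.

```python
def score_matrix(a, b, forbidden, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman matrix in which cells listed in `forbidden` stay at zero."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if (i, j) in forbidden:
                continue  # zeroed by an earlier round
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H


def traceback(H, a, b, i, j, match=2, mismatch=-1, gap=-2):
    """Follow the alignment ending at (i, j) back to its start; returns 1-based cells."""
    path = []
    while i > 0 and j > 0 and H[i][j] > 0:
        path.append((i, j))
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def top_local_alignments(a, b, k=3, match=2, mismatch=-1, gap=-2):
    """Return up to k non-intersecting local alignments as (score, path) pairs."""
    forbidden = set()
    results = []
    for _ in range(k):
        H = score_matrix(a, b, forbidden, match, mismatch, gap)
        # max() keeps the first maximal cell in row-major order.
        score, i, j = max(
            ((H[i][j], i, j) for i in range(len(a) + 1) for j in range(len(b) + 1)),
            key=lambda t: t[0],
        )
        if score <= 0:
            break
        path = traceback(H, a, b, i, j, match, mismatch, gap)
        results.append((score, path))
        forbidden.update(path)  # the "round of zeroing"
    return results


results = top_local_alignments("ABCD", "ABCDXXABCD", k=3)
```

On this toy input the first round finds ABCD against the first copy, the second round finds it against the second copy, and the third round finds nothing. Comparing the paths returned with and without the `forbidden.update(path)` line is one cheap way to see whether zeroing changes the answer on real data.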
Equation 1
- need to fix and clarify
From the e-mail
Comments
Table 1 - If the diagonal of a self-alignment is set to zero in the Smith-Waterman array, does the highest scoring cell in the remaining array identify the highest scoring repeat? I think it might… is this the basis of the IRF algorithm?
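The question about Table 1 can be tried out on toy sequences. The following is a minimal sketch, assuming a simple match/mismatch score and a linear gap penalty (the values and the function name are illustrative, not taken from the proposal): it fills only the upper triangle of the self-alignment array, leaves the main diagonal at zero, and reports the highest scoring cell. It says nothing about whether IRF works this way.

```python
def self_alignment_best_cell(seq, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman of seq against itself with the main diagonal held at zero.

    Returns (score, i, j) for the best cell with i < j, using 1-based positions.
    Only the upper triangle is filled, since the array is symmetric.
    """
    n = len(seq)
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):  # j > i, so H[i][i] is never written
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


best = self_alignment_best_cell("ABCDXABCD")
```

Here the best cell is where the two copies of ABCD end. Note that nothing in this sketch stops the two aligned segments from overlapping, so on tandem repeats the top cell may cover more than one period.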
p. 7 – “Given a protein with rotational symmetry that has had its three dimensional structure solved” this can limit the problem – what about repeats within a larger structure or sequence? These may not have easily identifiable rotational symmetry or a solved structure.
Finding the best repeat for each profile: is this set up in such a way that it doesn’t matter if there isn’t a repeat, e.g. sequence A has a repeat but sequence B does not?
p. 11 – Identifyrepeat function
- this doesn’t really have a basis in theory. Is it good to include if it’s a dead end?
This has the same issue raised on p. 7: what if only one of the sequences contains a repeat? As currently written, if any sequence has a repeat length < 20, the algorithm returns false. Is relatedness entirely based on the length of the profiles? Likely not the best approach… ‘relatedness’ is likely not a good choice of terminology, since relatedness implies homology(?). But we expect all sequences to be related; is this addressing common repeats? I’ll assume Relatedness = true means the sequences share a common repeat(?).
Table 3 “A heuristic to guess where” - to approximate, to estimate? A heuristic implies it does not necessarily produce optimal results; a guess implies there is no formal procedure.
The Boolean output of the Identifyrepeat function may be leading down a path that isn’t ideal.
Here, the output of the heuristic should ideally be an alignment, incorporating each of the profiles. There are four cases:
- no novel repeats: the best alignment is the current parent alignment
- novel repeat in Child1: profile for parent and Child1 adjusted
- novel repeat in Child2: profile for parent and Child2 adjusted
- novel repeats in Child1 and Child2: all profiles adjusted
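As a small illustration of replacing the Boolean output with the four cases, the sketch below maps two flags (is there a novel repeat in each child?) to a case label and the profiles that would need adjusting. The enum, the function name, and the profile labels are hypothetical; how a novel repeat is actually detected is left open.

```python
from enum import Enum


class NovelRepeatCase(Enum):
    NONE = "no novel repeats"
    CHILD1 = "novel repeat in Child1"
    CHILD2 = "novel repeat in Child2"
    BOTH = "novel repeats in Child1 and Child2"


def classify(novel_in_child1, novel_in_child2):
    """Return (case, profiles_to_adjust) for the four cases from the e-mail."""
    if novel_in_child1 and novel_in_child2:
        return NovelRepeatCase.BOTH, ("parent", "child1", "child2")
    if novel_in_child1:
        return NovelRepeatCase.CHILD1, ("parent", "child1")
    if novel_in_child2:
        return NovelRepeatCase.CHILD2, ("parent", "child2")
    # No novel repeats: keep the current parent alignment as it is.
    return NovelRepeatCase.NONE, ()


case, to_adjust = classify(True, False)
```

Returning a case plus the affected profiles keeps the door open for the heuristic to return an alignment later, which a single true/false value cannot do.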
e.g. example node 5 in ricins:
parent length: 35 ; left length : 26 ; right length : 36 ;
>>> I think that the parent repeat is related to the left repeat.
>>> I think that the parent repeat is related to the right repeat.
>>> I don't think that the left repeat is related to the right repeat.
>>> The repeat found in the parent has been inherited by both siblings.
>>> But the repeat has changed since then.
5 (selected)
3 : -------------------------------------------------------THSCLDSNAQGQVYTLGCNQGNYQHWVYAAGNDGVRLRNAQTNNCVGSRANPAP
3 : DGTRYQGTVYAIGCDGGAAQLWTTSSDGAGMTFRNAATGECLDSNADGRVYTQGCNHGDYQRWG--
4 : MNTLTKLTIGAVALTGSFLAAAPASAAPAADTTASPALGSQVSAQFASVTIRNAQTGRLLDSNYNGNVYTLPANGGNYQRWT--GPGDGT-VRNAQTGRCLDS------
4 : ---NYDGAVYTLPCNGGSYQKWLFYSNGY---IQNVETGRVLDSNYNGNVYTLPANGGNYQKWYTG
3 : YP_710563 (left child)
3 : THSCLDSNAQGQVYTLGCNQGNYQHWVYAAGNDGVRLRNAQTNNCVGSRANPAPDGTR
3 : YQGTVYAIGCDGGAAQLWTTSSDGAGMTFRNAATGECLDSNADGRVYTQGCNHGDYQRWG
4 : Q9KWN0 (right child)
4 : MNTLTKLTIGAVALTGSFLAAAPASAAPAADTTASPALGSQVSAQFASVTIRNAQTGRLLDSNYNGNVYTLPANGGNYQRWTGPGDGTVRNAQTGRCLDSN
4 : YDGAVYTLPCNGGSYQKWLFYSNGYIQNVETGRVLDSNYNGNVYTLPANGGNYQKWYTG
The most likely conclusion is that the repeat is common to both, i.e. no new repeats for either child (the most frequent expected result in a large tree?)
In node 6, the repeat for the child node 2 has very different boundaries in the parent alignment than in the child. The parent alignment looks better and is more likely; this is a result of the three-fold repeats, but also is a case where the length assessment produces an incorrect result. (This raises other questions – should repeats be constrained so they must be adjacent? Should three-fold repeats be addressed directly, and not as two separate events?)
[aside- a simple scoring function for the repeat alignment could address this]
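The aside does not say which scoring function is meant. As one possible reading, the sketch below scores a repeat alignment by the fraction of identical residues over the columns where neither repeat unit has a gap. The function name, the gap character, and the choice of percent identity are assumptions; a substitution-matrix or sum-of-pairs score would slot into the same place.

```python
def repeat_identity(unit_a, unit_b, gap="-"):
    """Fraction of identical residues over ungapped columns of two aligned repeat units."""
    if len(unit_a) != len(unit_b):
        raise ValueError("aligned repeat units must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(unit_a, unit_b):
        if x == gap or y == gap:
            continue  # skip columns where either unit has a gap
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


score = repeat_identity("ABCD-", "ABXDE")
```

A score like this could be compared between the parent's repeat boundaries and the child's to decide which boundaries look better, instead of relying on repeat length alone.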
p. 13:
these three examples correspond to the three possible cases of inheritance: (1) that the repeat was inherited by both children, (2) that the repeat was inherited only by one child, (3) that the repeat was not inherited.
does this address the case of a novel repeat in a child? I think this is what we are looking for…
p. 15: the issue here is the boundaries have been selected differently – on inspection, the repeats are present in all sequences. We should discuss three-fold repeat detection and boundary cases. Repeat in node 27 looks better than either in node 18 or 28.
p. 16 – wow this is messy. A better means of assessment is needed to distinguish these repeats – it looks like there are several, and different ones are identified in each subset. [node 15 looks like it has alignment issues outside of the defined repeat area]
we should discuss eq’n 1 – there may be a nomenclature issue(?).