CS798 Final Project Log 20100810
From SnOwy - Ed's Wiki Notebook
20100812.230514
- The 84 DATE TIM barrels ...
- http://www.mrc-lmb.cam.ac.uk/genomes/date/
- 82 structures were retrieved from the PDB (two have been purged and their serial numbers retired)
- http://toolkit.tuebingen.mpg.de/hhrepid -- repeat finding
- http://www-lbit.iro.umontreal.ca/DILTAG/ -- repeat history construction
- Okay, here's the scoop: anywhere on this page where I say I used HMMrep, I actually used HMMrepID. HMMrep is somehow broken on the server side :( Additionally, I can't seem to reproduce the alignments given in the paper; I must be missing some important algorithm parameters.
- So far, it looks like there's insufficient signal strength.
Ideal Topology Template: BA BA BA BA BA BA BA BA
- B = beta strand in template
- A = alpha helix in template
- b = beta strand outside of template
- a = alpha helix outside of template
- o = "random coil"
- ? = "missing density"
- d = "domain"
- 0 = not feasible
- 1 = feasible
- 2 = expedite
- firebrick = todo
- blue = checkout
- forestgreen = done
- v:w:x:y:z => beta start, non-sse start, alpha start, non-sse start, end (exclusive).
- BETA NON-SSE ALPHA NON-SSE
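As a minimal sketch of how the v:w:x:y:z notation could be applied mechanically, the Python below splits one repeat unit into its four parts. It assumes the boundaries are 1-indexed residue positions with an exclusive end, which is consistent with the 1AJ0 entry in this log (16:...:54 covers a 38-residue segment). The function name `split_unit` and the dictionary keys are hypothetical, chosen only for illustration.

```python
def split_unit(bounds, segment):
    """Split one repeat unit using a 'v:w:x:y:z' boundary string.

    Assumption: v..z are 1-indexed residue positions in the full chain,
    z is exclusive, and `segment` holds residues v..z-1 of that chain.
    """
    v, w, x, y, z = (int(n) for n in bounds.split(":"))
    if not v <= w <= x <= y <= z:
        raise ValueError("boundaries must be non-decreasing: " + bounds)
    if len(segment) != z - v:
        raise ValueError("segment length does not match boundaries")
    # Convert chain positions to offsets within the segment.
    cut = [n - v for n in (v, w, x, y, z)]
    names = ("beta", "non_sse_1", "alpha", "non_sse_2")
    return {name: segment[cut[i]:cut[i + 1]] for i, name in enumerate(names)}


# First repeat unit of 1AJ0 as recorded in this log.
parts = split_unit("16:23:35:51:54", "HVMGILNVTPDSFSDGGTHNSLIDAVKHANLMINAGAT")
print(parts)
```

An empty string for a part (as with 179:186:194:211:211) corresponds to the "no non-sse tail" case noted for 1AJ0.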
- 1 : 1A4M.pdb (7 ; mono; 352) - BaaaA BA BA BA BA BA BaA BAaa
- 1 : 1A53.pdb (60; mono; 248) - aaBbbA BA BA BA BA BbbA BA BA
- 1 : 1A80.pdb (72; mono; 277) - bbBA BA BA BA BA BA BaA BAa
- 1 : 1ADS.pdb (68; mono; 315) - bbBA BA BA BA BA BA BAa BAao
- 1 : 1AFS.pdb (42; mono; 322) - bbBA BA BA BA BA BA BaA BAa
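To compare entries against the ideal template, one option is to count the secondary-structure elements that fall outside it. The sketch below is an assumption about how that comparison could be done, not the procedure used in this log: `topology_summary` is a hypothetical helper that treats every character other than B and A as an off-template element.

```python
def topology_summary(topology):
    """Summarise a topology string such as 'BaaaA BA BA BA BA BA BaA BAaa'."""
    units = topology.split()
    # A unit is complete when a template strand is followed by a template helix.
    complete = sum(1 for u in units if "B" in u and "A" in u[u.index("B"):])
    # Everything that is not a template B or A counts as off-template.
    extras = sum(1 for u in units for ch in u if ch not in "BA")
    return {"units": len(units), "complete": complete, "extras": extras}


print(topology_summary("BaaaA BA BA BA BA BA BaA BAaa"))  # 1A4M
print(topology_summary("BA BA BA BA BA BA BA BA"))        # ideal template
```

Entries with eight complete units and few extras are the ones closest to the template.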
1AJ0.pdb
- 2 : 1AJ0.pdb (29; hodi; 282) - bbBA BA BA BA BA BA BA BA
- (1:16 MKLFAQGTSLDLSHP)
- 16:23:35:51:54 HVMGILNVTPDSFSDGGTHNSLIDAVKHANLMINAGAT
- 54:59:71:89:92 IIDVGGESTRPGAAEVSVEEELQRVIPVVEAIAQRFEV
- 92:97:100:109:113 WISVDTSKPEVIRESAKVGAH
- 113:117:125:133:135 IINDIRSLSEPGALEAAAETGL
- 135:140:157:175:179 PVCLMHMQGNPKTMQEAPKYDDVFAEVNRYFIEQIARCEQAGIA
- 179:186:194:211:211 KEKLLLDPGFGFGKNLSHNYSLLARLAEFHHF
- Chimera labels KEK as helix-like
- 211:211 => no non-sse tail
- 211:222:222:229:253 NLPLLVGMSRKSMIGQLLNVGPSERLSGSLACAVIAAMQGAH
- Chimera labels 211:222 as non-sse-like -- is straight chain (X), occupies the space where one would expect a strand
- 222:222 => no non-sse between X and helix
- underline used to delineate between two otherwise consecutive bold regions :(
- 232:250 are labelled as helix-like -- likely function as non-sse-like
- 253:257:259:276:283 IIRVHDVKETVEAMRVVEATLSAKENKRYE
- (1:16
__prefix__ = MKLFAQGTSLDLSHP HVMGILNVTPDSFSDGGTHNSLIDAVKHANLMINAGAT IIDVGGESTRPGAAEVSVEEELQRVIPVVEAIAQRFEV WISVDTSKPEVIRESAKVGAH IINDIRSLSEPGALEAAAETGL PVCLMHMQGNPKTMQEAPKYDDVFAEVNRYFIEQIARCEQAGIA KEKLLLDPGFGFGKNLSHNYSLLARLAEFHHF NLPLLVGMSRKSMIGQLLNVGPSERLSGSLACAVIAAMQGAH IIRVHDVKETVEAMRVVEATLSAKENKRYE
- The above images are broken apart by visual inspection into eight and four parts respectively.
- Trying HMMrep to do the cutting now since it's been known to do well with TIM Barrels.
- HMMrep output below...
Repeats 3 P-value 0.00028 Length 60 Offset 66
ID  Probab  P-value  RepScore  RepScoreNorm  Cols  Query HMM  Template HMM
A1  78.50   2.4e-05  16.14     0.42          38    27-65      86-124
A2  89.00   5.2e-08  34.90     0.58          60    66-125     66-125
A3  81.42   1.7e-04  17.74     0.57          32    141-175    75-106

A1 Mon_Aug_16_14:   27-65  +0 --------------------KHANLMINAGA...T-IIDVGGEStRPGAAEVSVEEELQRVIP-...............
A2 Mon_Aug_16_14:  66-125 +15 VVEAIAQRFEVWISVDTSKPEVIRESAKVGA...HIINDIRSLS.EPGALEAAAETGLPVCLMHmqgnpktmqeapkyd
A3 Mon_Aug_16_14: 141-175  +0 ---------DVFAEVNRYFIEQIARCEQAGIakeKLLLDPGFGF.-------------------...............
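The repeat table above is whitespace-delimited, so its rows can be read into records and screened by P-value. This is a sketch under assumptions: the column order is taken from the header shown in this log, and the 1e-3 cutoff and the names `RepeatHit` and `parse_repeat_rows` are hypothetical choices for illustration, not values or functions belonging to the repeat finder.

```python
from typing import NamedTuple


class RepeatHit(NamedTuple):
    ident: str
    probab: float
    p_value: float
    rep_score: float
    rep_score_norm: float
    cols: int
    query: tuple     # (start, end) in the query HMM
    template: tuple  # (start, end) in the template HMM


def _span(text):
    start, end = text.split("-")
    return int(start), int(end)


def parse_repeat_rows(text):
    """Parse rows like 'A1 78.50 2.4e-05 16.14 0.42 38 27-65 86-124'."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an 8-field data row.
        if len(fields) != 8 or fields[0] == "ID":
            continue
        hits.append(RepeatHit(fields[0], float(fields[1]), float(fields[2]),
                              float(fields[3]), float(fields[4]), int(fields[5]),
                              _span(fields[6]), _span(fields[7])))
    return hits


# Rows copied from the 1AJ0 output in this log.
rows = """ID Probab P-value RepScore RepScoreNorm Cols Query HMM Template HMM
A1 78.50 2.4e-05 16.14 0.42 38 27-65 86-124
A2 89.00 5.2e-08 34.90 0.58 60 66-125 66-125
A3 81.42 1.7e-04 17.74 0.57 32 141-175 75-106"""
hits = parse_repeat_rows(rows)
strong = [h.ident for h in hits if h.p_value < 1e-3]  # hypothetical cutoff
print(strong)
```

Keeping the query spans as integer pairs makes it easy to check how much of the chain the detected repeats cover.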
- Alignment:
- Trace:
- DILTAG with new alignment:
- 1 : 1AK5.pdb (27; tetr; 503) - obbBA BA BA BA BA BA BA BbbAa
- 1 : 1AQ0.pdb (4 ; mono; 312) - BA BA BaA BA BA BA BA BA
- 0 : 1AQM.pdb (70; mono; 669) - BA BbbA BbbbbbbA BA Bo Ba BA BbbAd
1AW5.pdb
- 2 : 1AW5.pdb (36; octo; 342) - oBboA BA BA BA BA BA BA BA
- (1:43 MHTAEFLETEPTEISSVLAGGYNHPLLRQWQSERQLTKNMLI)
- 43:49:71:81:85 FPLFISDNPDDFTEIDSAPNINRIGVNRLKDYLKPLVAKGLR
- underlined segments are a pair of matching anti-parallel strands
- 85:92:111:123:127 SVILFGVPLIPGTKDPVGTAADDPAGPVIQGIRFIREKFPEL
- underlined segment is a spurious short helix region
- 127:133:154:172:175 YIICDVCLCEYTSHGHCGVLYDDGTINRERSVSRLAAVAVNYAKAGAH
- CVAPSDMIDGRIRDIKRGLINANLAHK
- TFVLSYAAKFSGNLYGPARDAACSAPSNGDRKCYQLPPAGRGLARRALERDMSEGAD
- not a beta-strand, but is straight and fits into template: TFVLSYAAKFSG
- tertiary structure for this segment is unsolved: RDAACSAPSNGDRK
- GIIVKPSTFYLDIVRDASEICKDL
- PICAYHVSGEYAMLHAAAEKGVVDLKTIAFESHQGFLRAGAR
- skewed helix: SGEYAMLHAAAE
- template fitting helix: LKTIAFESHQGFLRA
- LIITYLAPEFLDWLDE
- (1:43
__prefix__ = MHTAEFLETEPTEISSVLAGGYNHPLLRQWQSERQLTKNMLI FPLFISDNPDDFTEIDSAPNINRIGVNRLKDYLKPLVAKGLR SVILFGVPLIPGTKDPVGTAADDPAGPVIQGIRFIREKFPEL YIICDVCLCEYTSHGHCGVLYDDGTINRERSVSRLAAVAVNYAKAGAH CVAPSDMIDGRIRDIKRGLINANLAHK TFVLSYAAKFSGNLYGPARDAACSAPSNGDRKCYQLPPAGRGLARRALERDMSEGAD GIIVKPSTFYLDIVRDASEICKDL PICAYHVSGEYAMLHAAAEKGVVDLKTIAFESHQGFLRAGAR LIITYLAPEFLDWLDE
Repeats 3 P-value 0.0091 Length 77 Offset 128
ID  Probab  P-value  RepScore  RepScoreNorm  Cols  Query HMM  Template HMM
A1  89.51   9.6e-10  43.19     0.56          77    128-204    128-204
A2  81.72   2.0e-03  23.65     0.33          65    205-284    135-204
A3  86.25   1.7e-05  24.55     0.61          43    285-327    135-177

A1 Mon_Aug_16_21: 128-204 +0 IICDVCLCEYTS...HGHCGVLY............DDGTINRERSVSRLAAVAVNYAKAGAHCVAPSDMIDGRIRDIKRG
A2 Mon_Aug_16_21: 205-284 +0 -------LSYAAkfsGNLYGPARdaacsapsngdrKCYQLPPAGRGLARRA-LERDMSEGADGIIVKPSTF-YLDIVRDA
A3 Mon_Aug_16_21: 285-327 +0 -------CAYHV...SGEYAMLH............AAAEKGVVDLKTIAFESHQGFLRAGARLII---------------

A1 Mon_Aug_16_21: 128-204 +0 LINANLAHKTFV
A2 Mon_Aug_16_21: 205-284 +0 SE-I-C-KDLPI
A3 Mon_Aug_16_21: 285-327 +0 ------------
1B54.pdb
- 2 : 1B54.pdb (2 ; mono; 257) - BA BA BA BA BA BA BA BA
__prefix__ = MSTGITYDEDRKTQLIAQYESVREVVNAEAKNVHVNENASKI LLLVVSKLKPASDIQILYDHGVR EFGENYVQELIEKAKLLPDDI KWHFIGGLQTNKCKDLAKVPN LYSVETIDSLKKAKKLNESRAKFQPDCNP ILCNVQINTSHEDQKSGLNNEAEIFEVIDFFLSEECKY IKLNGLMTIGSWNVSHEDSKENRDFATLVEWKKKIDAKFGTSL KLSMGMSADFREAIRQGTA EVRIGTDIFGARPPKNEARII
1B5T.pdb
- 2 : 1B5T.pdb (39; hote; 296) - BA BA BA BA BA BA BaoaoA BA
- in the seventh strand: IIPGILPVSNFKQAKKFADMTNVRIPAWMAQMFDGLDDDAETRKLVGANIAMDMVKILSREGVK
- non-template helix: NFKQAKKFADM
- non-template helix: PAWMAQMFD
- correct helix: DAETRKLVGANIAMDMVKILSRE
__prefix__ = GQI NVSFEFFPPRTSEMEQTLWNSIDRLSSLKPK FVSVTYGANSGERDRTHSIIKGIKDRTGLE AAPHLTCIDATPDELRTIARDYWNNGIR HIVALRGDLPPGSGKPEMYASDLVTLLKEVADF DISVAAYPEVHPEAKSAQADLLNLKRKVDAGAN RAITQFFFDVESYLRFRDRCVSAGIDVE IIPGILPVSNFKQAKKFADMTNVRIPAWMAQMFDGLDDDAETRKLVGANIAMDMVKILSREGVK DFHFYTLNRAEMSYAICHTLGVRPA
1BD0.pdb
- 2 : 1BD0.pdb (34; dime; 388) - aBA BA BA BA BA BA BA BAd
__prefix__ = MNDFHRDTWAEVDLDAIYDNVENLRRLLPDDT HIMAVVKANAYGHGDVQVARTALEAGAS RLAVAFLDEALALREKGIEA PILVLGASRPADAALAAQQR IALTVFRSDWLEEASALYSGPFP IHFHLKMDTGMGRLGVKDEEETKRIVALIERHPH FVLEGLYTHFATADEVNTDYFSYQYTRFLHMLEWLPSRPP LVHCANSAASLRFPDRTFN MVRFGIAMYGLAPSPGIKPLLPYPLKEA __suffix__ = FSLHSRLVHVKKLQPGEKVSYGATYTAQTEEWIGTIPIGYADGWLRRLQHFHVLVDGQKAPIVGRICMDQCMIRLPGPLP __suffix__ = VGTKVTLIGRQGDEVISIDDVARHLETINYEVPCTISYRVPRIFFRHKRIMEVRNAIGAGESSA
- 1 : 1BF2.pdb (26; mono; 776) - dBA BA BA BA Bo BaoA BA BAd
- 1 : 1BG4.pdb (53; mono; 302) - BA BA BA BA BA BboA BboA BboA
- 1 : 1BGG.pdb (8 ; octa; 448) - bBaoA BA BA BaoA BAaao BbbA BA BbbA
1BQC.pdb
- 2 : 1BQC.pdb (17; mono; 279) - bbBA BA BA BA BA BA BA BA
- : 1BQG.pdb () - BA BA BA BA BA BA BA BA
- 1 : 1BYA.pdb (83; mono; 495) - BA BA BoaoA BoaoA BoaoA BA BA BAo
- : 1C7S.pdb () - BA BA BA BA BA BA BA BA
- : 1C9W.pdb () - BA BA BA BA BA BA BA BA
- 1 : 1CB7.pdb (50; hete) - aaaBA BoaoA BA BA BA BA BA BAd
- 1 : 1CIU.pdb (10; mono; 710) - BA BA BA BA BA BaA BA BAd
1CNV.pdb
- 2 : 1CNV.pdb (47; mono; 324) - BA BA BA BA BA BA BA BA
- 0 : 1CTN.pdb (59; mono; 563) - dBo BaoA BaA BA BA BA BdA BA
- : 1D8C.pdb () - BA BA BA BA BA BA BA BA
1D9E.pdb
- 2 : 1D9E.pdb (75; hotr; 284) - bbBA BA BA BA BA BA BA BA
1DBT.pdb
- 2 : 1DBT.pdb (5 ; hodi; 239) - BA BA BA BA BA BA BA BA
- 1 : 1DE5.pdb (77; hote; 419) - aaBA BA BA BA BA BA BA BAoaaaa
- : 1DHP.pdb () - BA BA BA BA BA BA BA BA
- : 1DJX.pdb () - BA BA BA BA BA BA BA BA
- : 1DXE.pdb () - BA BA BA BA BA BA BA BA
- : 1ECE.pdb () - BA BA BA BA BA BA BA BA
- 1 : 1EGM.pdb (80; cplx; 224) - dBA BA BA BAa BA BA BA BAd
- 1 : 1EZW.pdb (78; ? ; 348) - BA BA BA BA BA BA BAd BA
- : 1F2J.pdb () - BA BA BA BA BA BA BA BA
- : 1F61.pdb () - BA BA BA BA BA BA BA BA
- : 1F6Y.pdb () - BA BA BA BA BA BA BA BA
- : 1FCB.pdb () - BA BA BA BA BA BA BA BA
- 0 : 1FIY.pdb (12; hote; 883) - dBaaA BAaaaa BoaaA BA BA BA BaoA BAd
- 1 : 1FRB.pdb ( 5; mono; 315) - bbBA BA BA BA BA BA BaA BAao
- : 1FWJ.pdb () - BA BA BA BA BA BA BA BA
- : 1GHS.pdb () - BA BA BA BA BA BA BA BA
- : 1GOX.pdb () - BA BA BA BA BA BA BA BA
- : 1LUC.pdb () - BA BA BA BA BA BA BA BA
- : 1MNS.pdb () - BA BA BA BA BA BA BA BA
- : 1MUC.pdb () - BA BA BA BA BA BA BA BA
- : 1NAL.pdb () - BA BA BA BA BA BA BA BA
- : 1NAR.pdb () - BA BA BA BA BA BA BA BA
- : 1ONR.pdb () - BA BA BA BA BA BA BA BA
- : 1OYA.pdb () - BA BA BA BA BA BA BA BA
- : 1PII.pdb () - BA BA BA BA BA BA BA BA
- : 1PKL.pdb () - BA BA BA BA BA BA BA BA
- : 1PSC.pdb () - BA BA BA BA BA BA BA BA
- : 1PUD.pdb () - BA BA BA BA BA BA BA BA
- : 1PYM.pdb () - BA BA BA BA BA BA BA BA
1QFE.pdb
- 2 : 1QFE.pdb (48; hodi; 252) - bbBA BA BA BA BA BA Bo BA
1QO2.pdb
- 2 : 1QO2.pdb (79; hote; 241) - ? ? ? ? ? ? ? ?
__prefix__ = ML VVPAIDLFRGKVARMIKGRKENTIFYEKDPVELVEKLIEEGFTL IHVVDLSNAIENSGENLPVLEKLSEFAEH IQIGGGIRSLDYAEKLRKLGYR RQIVSSKVLEDPSFLKSLREIDV EPVFSLDTRGGRVAFKGWLAEEEIDPVSLLKRLKEYGLE EIVHTEIEKDGTLQEHDFSLTKKIAIEAEV KVLAAGGISSENSLKTAQKVHTETNGL LKGVIVGRAFLEGILTVEVMKRYAR
- : 1QPO.pdb () - BA BA BA BA BA BA BA BA
- : 1QR7.pdb () - BA BA BA BA BA BA BA BA
- : 1QRQ.pdb () - BA BA BA BA BA BA BA BA
- : 1QTW.pdb () - BA BA BA BA BA BA BA BA
- 2 : 1RPX.pdb (9 ; hexa; 280) - BA BA BA BA BA BA BA BA
1THF.pdb
- 2 : 1THF.pdb (82; mono; 253) - BA BA BA BA BA BA BA BA
__prefix__ = MLAK RIIACLDVKDGRVVKGSNFENLRDSGDPVELGKFYSEIGID ELVFLDITASVEKRKTMLELVEKVAEQIDI PFTVGGGIHDFETASELILRGAD KVSINTAAVENPSLITQIAQTFGSQA VVVAIDAKRVDGEFMVFTYSGKKNTGILLRDWVVEVEKRGAG EILLTSIDRDGTKSGYDTEMIRFVRPLTTL PIIASGGAGKMEHFLEAFLAGAD AALAASVFHFREIDVRELKEYLKKHGVNVRLEGL
- : 1TPF.pdb () - BA BA BA BA BA BA BA BA
- : 1UOK.pdb () - BA BA BA BA BA BA BA BA
- : 1URO.pdb () - BA BA BA BA BA BA BA BA
- 1 : 1XYA.pdb (67; dime; 386) - BA BA BA BA BA BA BA BAa
2ALR.pdb
- ? : 2ALR.pdb (20; ? ; ? ) - ? ? ? ? ? ? ? ?
__prefix__ = AASCVLLHTGQKMPL IGLGTWKSEPGQVKAAVKYALSVGYR HIDCAAIYGNEPEIGEALKEDVGPGKAVPREEL FVTSKLWNTKHHPEDVEPALRKTLADLQLEYLD LYLMHWPYAFERGDNPFPKNADGTICYDSTHYKETWKALEALVAKGLVQ ALGLSNFNSRQIDDILSVASVRPA VLQVECHPYLAQNELIAHCQARGL EVTAYSPLGSSDRAWRDPDEPVLLEEPVVLALAEKYGRSPAQILLRWQVQRKV ICIPKSITPSRILQNIKVFDFTFSPEEMKQLNALNKNWRYIVP __suffix__ = MLTVDGKRVPRDAGHPLYPFNDPY
- 0 : 2AMG.pdb (3 ; mono; 551) - BA BbbA BA BA BA BA BA BAbbbbb
- 0 : 2CHR.pdb (38; octa; 370) - dBA BA BA BA BA BA BA BAb
- 1 : 2DIK.pdb (41; hodi; 873) - BA BaoA BoaA BA BA BaoA BA BA
- 0 : 2DOR.pdb (63; hodi; 311) - bbBA BbbA BA BA BA BbbbbA BA BA
- 0 : 2EBN.pdb (15; mono; 339) - Bb BboA BA BA Bo Bo BA BA
- 2 : 2EXO.pdb (25; mono; 312) - aBA BA BA BA BA BA BA BAa
- 2 : 2HVM.pdb (35; mono; 273) - BA BboA BA BA BA BA BA BA
- 0 : 2TMD.pdb (61; hodi; 729) - bbBA BA BbbboA BA BA BA BA BAm
- 2 : 2TPS.pdb (18; mono; 222) - aBA BA BA BA BA BA BA BA
- 0 : 2WSY.pdb (54; tetr; 268) - aBA BaoA BA BA BA BA BA BaA
- 0 : 4REQ.pdb (65; hedi; 637) - aaBA BA BaoA BaoAao BA BA BA BAd
- 0 : 7ENL.pdb (32; dime; 436) - bB aA BbbA BA BA BA BA BAb
- 0 : 7ODC.pdb (14; hodi; 461) - bBAb BAbo BA BA BA BA BA BAd*
1FQ0.pdb -- added from HMMrep paper
>1FQ0:A|PDBID|CHAIN|SEQUENCE
MKNWKTSAESILTTGPVVPVIVVKKLEHAVPMAKALVAGGVRVLEVTLRTECAVDAIRAIAKEVPEAIVGAGTVLNPQQL
AEVTEAGAQFAISPGLTEPLLKAATEGTIPLIPGISTVSELMLGMDYGLKEFKFFPAEANGGVKALQAIAGPFSQVRFCP
TGGISPANYRDYLALKSVLCIGGSWLVPADALEAGDYDRITKLAREAVEGAKL
Repeats 7 P-value 0.0096 Length 23 Offset 72
ID  Probab  P-value  RepScore  RepScoreNorm  Cols  Query HMM  Template HMM
A1  37.08   1.1e-02  3.88      0.30          13    35-47      82-94
A2  18.31   2.7e-01  1.47      0.11          13    50-62      73-85
A3  75.98   3.9e-05  13.03     0.57          23    72-94      72-94
A4  63.90   7.1e-04  8.98      0.43          21    95-115     74-94
A5  58.26   2.4e-03  7.01      0.47          15    116-130    75-89
A6  32.21   2.0e-01  3.35      0.16          21    137-160    72-92
A7  62.00   1.5e-02  8.85      0.40          22    161-182    72-93

A1 Mon_Aug_16_16:   35-47 +2 ----------ALVAGG...VRVLEVTlr.......
A2 Mon_Aug_16_16:   50-62 +9 -TECAVDAIRAIAK--...-------evpeaivga
A3 Mon_Aug_16_16:   72-94 +0 GTVLNPQQLAEVTEAG...AQFAISP.........
A4 Mon_Aug_16_16:  95-115 +0 --GLTEPLLKAATEGT...IPLIPGI.........
A5 Mon_Aug_16_16: 116-130 +6 ---STVSELMLGMDYG...LK-----efkffp...
A6 Mon_Aug_16_16: 137-160 +0 AEANGGVKALQAIAGPfsqVRFCP--.........
A7 Mon_Aug_16_16: 161-182 +0 TGGISPANYRDYLALK...SVLCIG-.........
__prefix__ = MKNWKTSAESILTTGP VVPVIVVKKLEHAVPMAKALVAGGVR VLEVTLRTECAVDAIRAIAKEVPEA IVGAGTVLNPQQLAEVTEAGAQF AISPGLTEPLLKAATEGTIP LIPGISTVSELMLGMDYGLK EFKFFPAEANGGVKALQAIAGPFSQV RFCPTGGISPANYRDYLALKSVLC IGGSWLVPADALEAGDYDRITKLAREAVEGAKL
20100812.020205
- The TIM barrel DATE data set is small, consisting of only ~80 proteins
- Of these, few are even compatible with the hypothesis; the rest are too divergent structurally
- Do one of two things:
- Remove the ones that are too different, and use a limited set
- Change datasets
- I'm going to try the first option -- there should be a number of structures that are just symmetrical enough.
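The first option amounts to filtering the annotated list by its feasibility code (0 = not feasible, 1 = feasible, 2 = expedite). A minimal sketch, assuming entry lines follow the "- code : NAME.pdb (serial; assembly; length) - topology" pattern used on this page; the regular expression and the name `select_feasible` are hypothetical, and entries with a blank code are simply skipped.

```python
import re

# Hypothetical pattern for the annotated entry lines on this page.
ENTRY = re.compile(r"^-\s*(?P<code>[012])\s*:\s*(?P<pdb>\w{4})\.pdb\s*"
                   r"\((?P<meta>[^)]*)\)\s*-\s*(?P<topology>.+)$")


def select_feasible(lines, minimum=1):
    """Return (pdb_id, topology) for entries whose code is >= minimum."""
    kept = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and int(match["code"]) >= minimum:
            kept.append((match["pdb"], match["topology"].strip()))
    return kept


log_lines = [
    "- 1 : 1A4M.pdb (7 ; mono; 352) - BaaaA BA BA BA BA BA BaA BAaa",
    "- 0 : 1AQM.pdb (70; mono; 669) - BA BbbA BbbbbbbA BA Bo Ba BA BbbAd",
    "- 2 : 1B54.pdb (2 ; mono; 257) - BA BA BA BA BA BA BA BA",
    "- : 1BQG.pdb () - BA BA BA BA BA BA BA BA",
]
print(select_feasible(log_lines))
```

Raising `minimum` to 2 would keep only the entries marked "expedite".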
20100810.095955
CS 798 Dan Brown Phylogeny Final Project
- The project that I proposed to Dan Brown involves analyzing a large number of TIM barrels
- For this, I will first go through my own archives and figure out what I have available
- I will extend the database with selected data -- this is OK, as it is more a proof of concept than a full-blown analysis
- In ZincSVN, there is a copy of DATE, relevant parts of SCOP and CATH
- I should look through those to see if there's anything I can use.
- There is one week left to do this project -- here's a breakdown of my intermediate objectives
- Hypothesis: "TIM Barrels consist of clusters of proteins which have unique duplication histories"
- Analyze, Select, Collect, Organize data from the archive
- Cluster similar nodes together -- give Robert Edgar's clustering software a try
- Create the duplication histories for representative sequences using Mathieu Lajoie's software.
- Requires: Breaking apart the TIM Barrels into their putative eighths (OK, Mathieu Lajoie will verify or reject the hypotheses)
- Describe the resulting clusters in the project paper
- Reject or Support hypothesis
- Each point above can take 4 to 8 hours depending on how quickly I work.