Notes Re: Illumina iDEA Challenge
From SnOwy - Ed's Wiki Notebook
iDEA Challenge
- Main: http://www.illumina.com/landing/idea/
- Judges: http://www.illumina.com/landing/idea/judges.ilmn
- Presentation (Webinar): https://illuminaevents.webex.com/ec0605lb/eventcenter/recording/recordAction.do;jsessionid=WMRyMJ0G1Jh0HnbPc2Zj5rZPQKcmvs1NHZ789l8P5pJRMs8wpy7v!-1892578192?theAction=poprecord&actname=%2Feventcenter%2Fframe%2Fg.do&actappname=ec0605lb&renewticket=0&renewticket=0&apiname=lsr.php&entappname=url0107lb&needFilter=false&&isurlact=true&rID=2513212&entactname=%2FnbrRecordingURL.do&rKey=5690ac10be4cf7bd&recordID=2513212&siteurl=illuminaevents&rnd=6198057508&SP=EC&AT=pb&format=short
- WebEx Player (Cisco): http://www.webex.com/downloadplayer.html
Presentation Notes
- dataset -- eight cancer cell lines
- highly studied model systems
- assays were run on these eight cell lines -- breast cancer lines
- "RNA-Seq"
- 50bp paired-end standard
- 100bp single-read directional
- DNA Methylation
- CNV with low pass genomic DNA sequencing
- small RNA analysis
RNA-Seq data
- information rich
- gene count, expression profile
- SNPs, mutations
- chimeric transcript discovery
- gene annotation, discovery
- two methods used ...
- Standard mRNA-Seq (1.5 to 2 years old)
- standard method
- purify poly-A mRNA, randomly fragment
- mRNA → cDNA → dscDNA
- ligate sequencing adapters
- Directional mRNA-Seq: combines aspects of the mRNA-Seq and small-RNA kits
- CIP = calf intestinal alkaline phosphatase
- PNK = polynucleotide kinase
- ligate 3'-, 5'- small RNA adapters
- → cDNA
- shotgun prep
- Directional: all of the reads are from a single strand
- Standard: reads from both strands
- both protocols: reads from introns are common (normal) -- poly-A mRNA occasionally retains them
Standard mRNA
- option to do paired-end sequencing: sequencing both ends of the mRNA
- dataset of paired fragment reads: ~90% mappable data
- variable insert sizes -- 100 to 120 bps, 120 bp on average; MB-468 has a lot of short-read contaminants
Directional mRNA data
- dataset of directional fragment reads: 100bp ~25%; 75bp ~60%
- less than half the reads are actually 100 bp length reads -- often 75 bp
- adapters -- influenced alignments -- used TopHat to create alignments
- the adapter sequence is found at the 3′ end of the read (when the insert is shorter than the read length)
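The 3′ adapter issue noted above can be sketched as a small trimming step run before alignment. This is only a minimal sketch: the adapter and insert sequences are made up for illustration, `min_overlap` is an assumed parameter, and exact matching is used where a real trimmer would tolerate sequencing errors.

```python
def trim_3prime_adapter(read, adapter, min_overlap=5):
    """Remove adapter sequence from the 3' end of a read (exact matching only)."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Case 1: the whole adapter is inside the read; cut at its first occurrence.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter, so a suffix of the
    # read equals a prefix of the adapter. Try the longest overlap first.
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter detected


# Hypothetical adapter and insert, for illustration only.
ADAPTER = "ACGTTGCAAGCT"
INSERT = "GATTACAGATTACA"

full = trim_3prime_adapter(INSERT + ADAPTER + "AA", ADAPTER)     # adapter fully read
partial = trim_3prime_adapter(INSERT + ADAPTER[:7], ADAPTER)     # read ends inside adapter
clean = trim_3prime_adapter(INSERT, ADAPTER)                     # no adapter at all
print(full, partial, clean)
```

All three calls should give back the bare insert. A short `min_overlap` trims more aggressively but also removes more real sequence by chance.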
What this data looks like
- gene expression different across the eight samples
- ER+/ER- -like scatter plot of expressed species for two cell lines (BT-20 vs MCF-7)
- ER is the estrogen receptor -- +/- refers to whether it is expressed
- gene shown is actually NRXN3
- roughly sixty-fold difference in reads -- (I think BT-20 has 948, MCF-7 has 15) -- this is paired-end data
- hierarchical clustering of mRNA-Seq -- Gene Count Data
- data shown for 18k genes
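The NRXN3 read counts jotted down above (948 vs 15, from memory) can be turned into a fold change directly. A minimal sketch, assuming raw, unnormalized counts; the `pseudocount` argument is an assumption added to avoid dividing by zero for unexpressed genes.

```python
import math


def fold_change(count_a, count_b, pseudocount=0.0):
    """Ratio of read counts between two samples (no library-size normalization)."""
    denominator = count_b + pseudocount
    if denominator == 0:
        raise ValueError("count_b is zero; pass a pseudocount > 0")
    return (count_a + pseudocount) / denominator


bt20_reads, mcf7_reads = 948, 15   # counts as recalled in the notes
fc = fold_change(bt20_reads, mcf7_reads)
log2_fc = math.log2(fc)
print(f"fold change = {fc:.1f}, log2 fold change = {log2_fc:.2f}")
```

948 / 15 is about 63, which matches the "sixty-fold" remark. Comparing raw counts like this ignores differences in sequencing depth between the two libraries.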
looking for chimeric transcripts
- Title: Chimeric transcript discovery by paired-end transcriptome sequencing
- paired-end data -- chimeric search -- genes have recombined to produce chimeric proteins on their own
- fusion transcripts
- MCF-7 fusion proteins: BCAS4 and BCAS3 (breast carcinoma amplified sequences 4 and 3)
- Velvet assembler: create contigs that contain fusion junctions
- search through these contigs to separate real fusion products from artifacts
- used BLAST with the fusion contigs to find such products
- result: discovered BCAS4_BCAS3_368_bp expressed product
- highly expressed contig with paired-end reads
- now considered a positive control for finding fusion products
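The talk's approach was Velvet contigs followed by BLAST. The sketch below is not that method. It illustrates a simpler, related idea: counting paired-end reads whose two mates map to different genes. The input format, the `min_pairs` threshold and the toy data are all assumptions.

```python
from collections import Counter


def fusion_candidates(pair_genes, min_pairs=2):
    """Count read pairs whose mates map to two different genes.

    pair_genes: iterable of (gene_of_mate1, gene_of_mate2); None = unmapped mate.
    Returns {(gene_a, gene_b): n_pairs} for gene pairs seen at least min_pairs times.
    """
    counts = Counter()
    for gene1, gene2 in pair_genes:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # unmapped or concordant pair: not evidence of a fusion
        counts[tuple(sorted((gene1, gene2)))] += 1  # order-independent key
    return {pair: n for pair, n in counts.items() if n >= min_pairs}


# Toy mate-to-gene assignments (invented for illustration).
pairs = [
    ("BCAS4", "BCAS3"), ("BCAS3", "BCAS4"), ("BCAS4", "BCAS3"),
    ("GAPDH", "GAPDH"), ("ACTB", None), ("ESR1", "TP53"),
]
candidates = fusion_candidates(pairs)
print(candidates)
```

Requiring several supporting pairs drops the single ESR1/TP53 pair, which is the kind of one-off mismapping a real pipeline would also need to filter.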
next example: alternative splicing
- Title: Alternative isoform regulation in human tissue transcriptomes
- analyzed splicing: ~95% of multi-exon human genes show alternative splicing
- provided alignments made with TopHat / other data sets
data also can be used to show SNPs
- 35% of all SNPs that overlap RefSeq genes have >10X coverage
- 10x coverage: up to 30 to 40 million 50 bp reads (yes! logarithmic) => roughly ten times the coverage of the genome.
- distinct SNPs -- most are non-coding (227k)
- 20k are non-synonymous coding (55%)!
- of the 20k, SIFT classification -- high-confidence damaging -- 24%, low-confidence damaging -- 25%, stop codon -- 2.4%
- many are novel discovered SNPs
- 47% are tolerated mutations; remaining are not scored.
Not really looked at yet
- Allele-specific expression
- Alternative splicing
- Gene discovery
- more?
From the floor
- there are 30 groups viewing the webinar
DNA Methylation
- Title: Genome-scale DNA methylation maps of pluripotent and differentiated cells
- epigenetic -- so-called "fifth base" -- 5-methyl-C
- mostly CpG sites
- uncommon, clusters near 50% of gene promoters -- CpG islands
- cancer causes odd patterns of this
- Reduced Representation Bisulfite Sequencing (RRBS)
- genomic DNA restriction digested
- size selection (150-175)bp, (175-225)bp -- insert size -- actual fragment sizes are 30-75, 75-125
- bisulfite treatment converts unmethylated C → U (read as T after PCR); 5-methyl-C is protected and stays C
- sequence before and after: a C that changes to T was unmethylated, a C that stays C was methylated
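The before/after comparison can be simulated on a toy sequence. This sketch follows the usual description of bisulfite chemistry, where unmethylated C is read as T and 5-methyl-C stays C. The sequence and the methylated positions are invented, and only one strand is modelled.

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated C is read as T; positions in `methylated` (0-based indices
    of 5-methyl-C) are protected and stay C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def infer_methylated(before, after):
    """Positions where a C in the untreated sequence is still C after treatment."""
    return {i for i, (a, b) in enumerate(zip(before, after)) if a == "C" and b == "C"}


untreated = "ACGTCCGGACGT"        # toy fragment
true_methylated = {1, 9}          # assumed 5-methyl-C positions
treated = bisulfite_convert(untreated, true_methylated)
recovered = infer_methylated(untreated, treated)
print(treated, sorted(recovered))
```

The recovered positions equal the ones that were marked methylated, which is the whole basis of the assay.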
Twin Study - multiple sclerosis present/absent
- Title: Genome, epigenome and RNA sequences of monozygotic twins discordant for multiple sclerosis
- highly correlated for methylation?!
- wait doesn't that mean that we've supported the null hypothesis?
- okay, understand now -- it's the weird outliers that aren't on the main axis
RRBS sample prep
- 2.3M restriction sites in whole genome
- we only care about 0.75M of these fragments after size selection
- MspI (heard as MSP1? MFP1?) -- restriction enzyme that cuts at CCGG, good for cutting CpG islands
- 1.2% of the genome
- with the size selection, we get 91.4% of the genome we wanted
- then where's the other 8.6%?
- pipeline:
- 50 bp read - may be shorter for shorter fragments
- C → T and G → A for fwd, rev strands due to base substitution for methylation detection
- aligned using 3 kinds of nucleotides: A, G, T.
- methylation rate given by the proportion of reads that still show C (not converted C → T) at each site
- bisulfite conversion is efficient: ~99.5% conversion at unmethylated sites
- typical methylation distribution: 10% low methylation, 20% high methylation -- remaining distribution demonstrates low probability
- think: bimodal -- likely: due to cellular machinery errors: erroneous methylation of non-meth sites / non-meth of methylation sites.
- Note: CpG => C next to a G -- 5′-Cytosine-Phosphate-Guanine-3′ -- direct adjacency
- different cell lines have different methylation patterns
- experiment: correlation with mRNA-Seq patterns!
- occasional correspondence with textbook: methylation ∝ 1/transcription (inverse relationship)
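Two steps of the pipeline above lend themselves to a short sketch: the in-silico three-letter conversion used for alignment, and the per-site methylation level. It assumes the standard convention that a methylated C is protected from conversion, so the level is the fraction of reads still showing C. The function names are illustrative and do not come from the Illumina pipeline.

```python
def to_three_letter(seq, strand="+"):
    """Collapse a sequence for bisulfite alignment.

    Forward strand: every C becomes T. Reverse strand: every G becomes A.
    """
    if strand == "+":
        return seq.replace("C", "T")
    if strand == "-":
        return seq.replace("G", "A")
    raise ValueError("strand must be '+' or '-'")


def methylation_level(c_reads, t_reads):
    """Fraction of reads showing C (unconverted) at a reference cytosine."""
    total = c_reads + t_reads
    if total == 0:
        return None  # site not covered
    return c_reads / total


print(to_three_letter("ACGTCCGG", "+"))   # forward-strand conversion
print(to_three_letter("ACGTCCGG", "-"))   # reverse-strand conversion
print(methylation_level(18, 2))           # 18 of 20 reads unconverted
```

Both the reads and the reference are converted the same way before alignment. The original read is then compared with the original reference to count C and T at each cytosine.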
Data analysis notes
- two outputs
- one for individual sites
- one for each island
- island ID, number of CpG sites in an island
- number of sites that are covered
- depth of read coverage (average)
- level of methylation
- average of all methylation percentage
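The per-island output can be derived from the per-site output roughly as follows. This is a sketch under assumptions: the record layout `(island_id, depth, percent_methylation)` is invented, and averaging over covered sites only is one reasonable choice that the actual files may not follow.

```python
from collections import defaultdict


def summarize_islands(sites, min_depth=1):
    """Aggregate per-site methylation calls into one record per CpG island.

    sites: iterable of (island_id, depth, percent_methylation).
    A site counts as covered when depth >= min_depth.
    """
    grouped = defaultdict(list)
    for island_id, depth, pct_meth in sites:
        grouped[island_id].append((depth, pct_meth))

    summary = {}
    for island_id, rows in grouped.items():
        covered = [(d, m) for d, m in rows if d >= min_depth]
        n_cov = len(covered)
        summary[island_id] = {
            "n_sites": len(rows),
            "n_covered": n_cov,
            "mean_depth": sum(d for d, _ in covered) / n_cov if n_cov else 0.0,
            "mean_methylation": sum(m for _, m in covered) / n_cov if n_cov else None,
        }
    return summary


# Toy per-site records (invented for illustration).
site_calls = [
    ("island_1", 20, 90.0),
    ("island_1", 10, 70.0),
    ("island_1", 0, 0.0),    # CpG site with no reads
    ("island_2", 12, 5.0),
]
island_summary = summarize_islands(site_calls)
print(island_summary)
```

Raising `min_depth` (for example to 10) applies a coverage threshold before any site contributes to the island average.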
CpG Island -- RRBS (reduced representation) vs WGBS (whole genome)
- lung cancer
- cAMP-mediated signaling -- cell death -- more at risk for cancer with increased methylation
- G-protein coupled receptor signalling -- increased methylation in cancer
- Axonal guidance signalling -- unrelated? pathway - increased methylation for cancer
Copy Number Variation
- low pass genomic DNA sequencing
- coverage of human genome is roughly 0.40x (40%)
- we're actually looking at specific regions of a chromosome
- allows us to describe using the proportion of hits, the likelihood of a particular copy number
- MCF7 - chromosome 17 - BCAS3 extra high copy number (30? fold, at least 25 fold)
- BT474 - chromosome 17 - ERBB2 extra high copy number (25 fold)
- using a sliding window, we estimate the copy number for each segment of a chromosome
- non-covered regions of the genome are likely centromeric regions -- looks like no coverage -- highly repetitive, so reads cannot be placed uniquely
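The sliding-window idea can be sketched with read counts in fixed-size bins along a chromosome. The talk described a likelihood for each copy number. This sketch substitutes a plain ratio to a median baseline, assumes most bins are at normal ploidy, and uses invented bin counts.

```python
from statistics import median


def copy_number_profile(bin_counts, window=3, ploidy=2):
    """Estimate copy number per sliding window from binned read counts.

    The median bin count is taken as the normal-ploidy baseline.
    """
    baseline = median(bin_counts)
    if baseline == 0:
        raise ValueError("baseline coverage is zero; use larger bins")
    profile = []
    for start in range(len(bin_counts) - window + 1):
        mean_count = sum(bin_counts[start:start + window]) / window
        profile.append(ploidy * mean_count / baseline)
    return profile


# Toy chromosome: ten bins with an amplified block in the middle (invented data).
counts = [10, 10, 10, 10, 250, 250, 250, 10, 10, 10]
profile = copy_number_profile(counts)
print(profile)
```

In the toy data the amplified block comes out at 25 times the baseline, the same order as the BCAS3 and ERBB2 amplifications noted above. Windows that straddle the edge of the block give intermediate values, which is why segmentation is usually applied afterwards.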
Small RNA
- aka microRNA, miRNA
- pri-miRNA (transcription), pre-miRNA -- 60 - 70 nucleotides
- miRNA (mature) -- 21-25 nucleotides
- derived with cleavage of pre-miRNA
Small RNA Informatics
- piRNA: piwi-interacting RNA -- 26-31 nucleotides
- siRNA: short interfering
- snoRNA: small nucleolar -- modifies rRNA, tRNA, snRNA
- snRNA: small nuclear RNA -- splices introns from mRNA
- 35 bp reads -- typical small RNA is 22 bp long
- adaptor trimming
- alignment for multiple-lengths
- multiple hits -- small sequences align different places in a genome
- lookup in library of alignments as well as raw genome
- sequence errors common as sequencing continues (into the adapter sequence?)
- library: lots of short sequences
- miRBase (collection of miRNA)
- Flicker? Protocol / Software? What's Flicker?!
- Flicker outputs -- alignments to different databases including gDNA (genomic DNA), miRBase, both.
- miRNA hairpin
- Iso-miRs -- enzyme clipping miRNAs might be off by one or two nucleotides
- Cell lines -- MDA-MB-468 vs MDA-MB-231: small RNAs occur more often in the former?
- levels of hsa-miR-205 in the eight cell lines
- data counts the number of reads in each cell line and also in the normal genome
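The iso-miR point (ends off by one or two nucleotides) can be sketched as a lookup of a trimmed read against a hairpin with a known mature miRNA position. The hairpin sequence, the coordinates and the `max_shift` tolerance are all made up, and exact matching stands in for a real aligner.

```python
def classify_read(read, hairpin, mature_start, mature_end, max_shift=2):
    """Classify a trimmed small-RNA read against one hairpin.

    mature_start / mature_end: 0-based, end-exclusive coordinates of the
    canonical mature miRNA within the hairpin.
    """
    pos = hairpin.find(read)  # exact match only
    if pos == -1:
        return "no_match"
    start_shift = pos - mature_start
    end_shift = (pos + len(read)) - mature_end
    if start_shift == 0 and end_shift == 0:
        return "canonical"
    if abs(start_shift) <= max_shift and abs(end_shift) <= max_shift:
        return "isomiR"
    return "other_fragment"


# Hypothetical hairpin: 5-nt flank + 22-nt mature sequence + 5-nt flank.
HAIRPIN = "GGGAC" + "ACGTTGCAAGCTTAGGCATCGA" + "CCTTG"
MATURE_START, MATURE_END = 5, 27

labels = [
    classify_read(HAIRPIN[5:27], HAIRPIN, MATURE_START, MATURE_END),   # exact mature
    classify_read(HAIRPIN[6:27], HAIRPIN, MATURE_START, MATURE_END),   # 5' end shifted by 1
    classify_read(HAIRPIN[5:29], HAIRPIN, MATURE_START, MATURE_END),   # 3' end extended by 2
    classify_read(HAIRPIN[0:10], HAIRPIN, MATURE_START, MATURE_END),   # flank fragment
    classify_read("TTTTTTTTTT", HAIRPIN, MATURE_START, MATURE_END),    # unrelated read
]
print(labels)
```

Counting reads per label and per cell line would give the kind of iso-miR pattern comparison suggested for the challenge.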
NGS (Next-Generation Sequencing) Data Set -- functional genomics in a breast cancer system
- 2*50bp paired end mRNA-seq
- 100bp stranded mRNA-seq
- small RNA
- DNA methylation sequencing
- CNV-seq analysis
- what can you do with this data in this competition?
- look for fusions, chimeras
- expression of mRNA vs miRNA
- methylation
- rna processing, alternative splicing
- patterns of mutations in cell lines
- iso-miRNA patterns in miRNA between cell lines
- mRNA v miRNA expression
- correlation of expression between data → what we're looking for in this challenge
- mRNA, miRNA, methylation, CNV
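Correlating two data types across the eight cell lines can start as simply as a Pearson coefficient per gene. A minimal sketch: the methylation and expression values below are invented, and with only eight samples any single correlation should be read cautiously.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for one gene across eight cell lines.
promoter_methylation_pct = [90, 80, 70, 20, 10, 5, 85, 15]
log2_expression = [1.0, 2.0, 2.5, 8.0, 9.0, 9.5, 1.5, 8.5]

r = pearson(promoter_methylation_pct, log2_expression)
print(f"r = {r:.2f}")
```

A strongly negative `r` for a gene would fit the textbook inverse relation between promoter methylation and transcription. The same function works for mRNA vs miRNA or expression vs copy number.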
Getting the data / The challenge
- sign up for the challenge
- will provide the HDD with data
- microsite just for the people in the competition
- does not include the human lung cancer methylation assay
- publication guidelines -- no guidelines yet, focus on competition; however, eventually will encourage publication later
- normalization: none done
- for the methylation -- only regions which have 10X reads pass the threshold to be included as a percentage methylation
- no other explicit normalization/thresholding ideas
- gschroth @ illumina . com
Properties
- Cytogenetics & CNV: http://www.illumina.com/applications.ilmn#cytogenetics
- Genome & CNV: http://www.illumina.com/applications.ilmn#whole_genome_genotyping_and_copy_number_variation_analysis
- Genotyping & CNV: http://www.illumina.com/applications.ilmn#custom_low_to_mid_plex_genotyping
- Epigenetics: http://www.illumina.com/applications.ilmn#gene_regulation_and_epigenetic_analysis
- Cancer: http://www.illumina.com/applications.ilmn#cancer
- Gene Regulation & Epigenetics: http://www.illumina.com/applications.ilmn#gene_regulation_and_epigenetic_analysis
- Software: http://www.illumina.com/applications/sequencing.ilmn#Software