RNA-seq Applications
- Differential expression
- Gene fusion
- Alternative splicing
- Novel transcribed regions
- Allele-specific expression
- RNA editing
- Transcriptome for non-model organisms
Model organisms are species that biologists select for scientific study in order to reveal biological phenomena that follow general rules.
RNA-seq vs. microarray
- RNA-seq can be used to characterize novel transcripts and splicing variants as well as to profile the expression levels of known transcripts (hybridization-based techniques, by contrast, are limited to detecting transcripts that correspond to known genomic sequences)
- RNA-seq has higher resolution than whole-genome tiling array analysis
- In principle, RNA-seq can achieve single-base resolution, whereas the resolution of a tiling array depends on the density of probes
- RNA-seq can apply the same experimental protocol to various purposes, whereas specialized arrays need to be designed in these cases
- Detecting single nucleotide polymorphisms (needs SNP array otherwise)
- Mapping exon junctions (needs junction array otherwise)
- Detecting gene fusions (needs gene fusion array otherwise)
- Next-generation sequencing (NGS) technologies have often replaced microarrays as the tool of choice for genome analysis
RNA-seq and microarray agree fairly well only for genes with medium levels of expression
Challenges for RNA-seq: library construction
- Unlike small RNAs (microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), short interfering RNAs (siRNAs) and many others), which can be directly sequenced after adaptor ligation, larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies.
- Common fragmentation methods include RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication)
- Each of these methods creates a different bias in the outcome
- PCR artefacts
- Many short reads that are identical to each other can be obtained from cDNA libraries that have been amplified. These could be a genuine reflection of abundant RNA species, or they could be PCR artefacts.
- Use replicates
- Whether or not to prepare strand-specific libraries
- Strand-specific libraries are valuable for transcriptome annotation, especially for regions with overlapping transcription from opposite directions
- Strand-specific libraries are currently laborious to produce because they require many steps or direct RNA-RNA ligation, which is inefficient.
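For the PCR-artefact item above, a first rough check is to measure how many reads are exact copies of another read. The following is a minimal sketch only: the function name `duplicate_fraction` and the toy reads are hypothetical, and a high value cannot by itself separate true PCR duplicates from genuinely abundant transcripts, which is why replicates are still needed.

```python
from collections import Counter


def duplicate_fraction(reads):
    """Return the fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    # For each distinct sequence, every copy beyond the first is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)


# Toy input (made up for illustration): "ACGT" appears three times.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG"]
print(duplicate_fraction(reads))
```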
Why Quality Control
- Sequence output:
- Reads + quality
- Natural questions
- Is the quality of my sequenced data OK?
- If something is wrong can I fix it?
- Problem: Huge files... How do they look?
- Files are flat text files and large: tens of GB, which makes them hard even to browse
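Because the files are too large to open by hand, quality questions are usually answered by streaming through them. The sketch below assumes the reads are stored as FASTQ with four lines per record and Phred+33 quality encoding (neither is stated in these notes); the helper names `read_fastq` and `mean_phred` are illustrative, and this is not how FastQC itself is implemented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def mean_phred(qual, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


# Tiny in-memory example; a real file would be opened with open(path).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, mean_phred(qual))
```

Reading one record at a time keeps memory use constant regardless of file size.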
FastQC
Genome assembly: Genome assembly is the process of converting short reads into a detailed set of sequences corresponding to the chromosome(s) of an organism.
Genome assembly: relevance
- Genome assembly is needed when a genome is first sequenced. We can relate reads to chromosomes.
- For the human genome, the assembly is "frozen" as a snapshot every few years. The current assembly is GRCh38.
- For most human genome work we do not need to do "de novo" (from scratch) assembly. Instead we map reads to a reference genome, that is, one that is already assembled.
Ready-To-Use reference sequences and annotations
- The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms.
- The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source.
- Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.
Sequence alignment: Sequence alignment, also called mapping, is the process of matching reads to a pre-existing reference by sequence similarity.
Bowtie: Bowtie is an ultrafast, memory-efficient short-read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million reads per hour.
Sequence Alignment/Map format (SAM) and BAM
- SAM is a common format having sequence reads and their alignment to a reference genome
- BAM is the binary form of a SAM file
- Aligned BAM files are available at repositories (Sequence Read Archive at NCBI, ENA at EMBL-EBI)
- SAMTools is a software package commonly used to analyze SAM/BAM files.
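To make the SAM layout concrete, here is a small sketch that splits one alignment line into its mandatory tab-separated fields and checks two standard FLAG bits (0x4 = unmapped, 0x10 = reverse strand). The `SamRecord` type and helper names are illustrative and not part of SAMtools; the example line is made up.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse one alignment line; optional tags after column 11 are ignored."""
    f = line.rstrip("\n").split("\t")
    # Columns 7-9 (RNEXT, PNEXT, TLEN) are skipped in this sketch.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[9], f[10])


def is_unmapped(rec):
    return bool(rec.flag & 0x4)


def is_reverse(rec):
    return bool(rec.flag & 0x10)


line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, is_reverse(rec))
```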
Integrative Genomics Viewer (IGV)
Visualize ChIP-Seq data: UCSC genome browser; IGV browser
Challenges of mapping RNA-Seq reads
- Mapping to just the transcriptome misses unknown transcribed regions
- Additionally portions of the intron region could also be included, making it harder to map the reads to just the transcriptome
- Using the entire reference genome makes it more difficult to deal with alternative splicing junctions
- Unlike DNA-Seq, when mapping RNA-Seq reads back to reference genome, we need to pay attention to exon-exon junction reads.
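In SAM output from splice-aware aligners, an exon-exon junction read typically shows up as an `N` (skipped reference region) operation in the CIGAR string. The sketch below, with the hypothetical helper `splice_junctions`, walks a CIGAR string and reports the reference interval of each skipped region; it is an illustration of the idea, not any aligner's own code.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive,
    for every N operation in a read aligned at reference position pos."""
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref += length
    return junctions


# A 100-base read split across two exons separated by a 1000-base intron.
print(splice_junctions(100, "50M1000N50M"))
```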
Mapping with TopHat
Mapping with STAR
- STAR is reportedly about 50 times faster than TopHat2 at aligning reads, with better alignment precision and sensitivity
- STAR: ultrafast universal RNA-seq aligner
- Basic STAR workflow consists of 2 steps:
- Generating genome index files
- Mapping reads to the genome
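The two steps above might look roughly like the following. This sketch only builds and prints the command lines; it does not run STAR. All file and directory names are placeholders, the thread count is arbitrary, and setting `--sjdbOverhang` to read length minus one is an assumption that should be checked against the STAR manual for your data.

```python
import shlex


def star_index_cmd(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Step 1: command for generating genome index files."""
    return ["STAR", "--runThreadN", str(threads),
            "--runMode", "genomeGenerate",
            "--genomeDir", genome_dir,
            "--genomeFastaFiles", fasta,
            "--sjdbGTFfile", gtf,
            "--sjdbOverhang", str(read_length - 1)]


def star_map_cmd(genome_dir, fastq_files, out_prefix, threads=4):
    """Step 2: command for mapping reads to the indexed genome."""
    return ["STAR", "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", *fastq_files,
            "--outFileNamePrefix", out_prefix,
            "--outSAMtype", "BAM", "SortedByCoordinate"]


# Placeholder paths, for illustration only.
index_cmd = star_index_cmd("star_index", "genome.fa", "genes.gtf")
map_cmd = star_map_cmd("star_index", ["sample_R1.fastq", "sample_R2.fastq"],
                       "sample_")
print(shlex.join(index_cmd))
print(shlex.join(map_cmd))
```

The lists can be passed to `subprocess.run` once the paths point at real files.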
Expression quantification
- FPKM / RPKM
- Cufflinks & Cuffdiff
- Count data
- Summarize mapped reads at the CDS, gene or exon level
- The number of reads is roughly proportional to
- the length of the gene
- the total number of reads in the library
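Because raw counts scale with both gene length and library size, RPKM (reads per kilobase of transcript per million mapped reads) divides by both. A minimal sketch of the standard formula follows; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = reads / (length_in_kb * total_reads_in_millions)
         = reads * 1e9 / (length_in_bp * total_reads)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene in a 10-million-read library.
value = rpkm(500, 2_000, 10_000_000)
print(value)
```

FPKM uses the same normalization but counts fragments (read pairs) instead of individual reads.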
HTSeq-count: counting reads in 'features'
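The idea of counting reads in features can be illustrated with a simplified overlap counter. This is a sketch only and not HTSeq's implementation: features are single intervals rather than exon sets, strand is ignored, and the category names `no_feature` and `ambiguous` as well as all coordinates are chosen here for illustration.

```python
from collections import Counter


def count_reads(features, reads):
    """Count reads per feature.

    features: dict name -> (chrom, start, end), 1-based inclusive
    reads:    iterable of (chrom, start, end), 1-based inclusive
    A read is assigned only if it overlaps exactly one feature.
    """
    counts = Counter({name: 0 for name in features})
    for chrom, start, end in reads:
        hits = [name for name, (f_chrom, f_start, f_end) in features.items()
                if f_chrom == chrom and start <= f_end and end >= f_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [("chr1", 110, 160),   # geneA only
         ("chr1", 190, 240),   # overlaps both genes
         ("chr1", 250, 299),   # geneB only
         ("chr1", 400, 450)]   # no gene
counts = count_reads(features, reads)
print(dict(counts))
```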
Differential Gene Expression Analysis
- Count-based methods(R packages):
- DESeq: based on the negative binomial distribution
- edgeR: uses an overdispersed Poisson model
- baySeq: uses an empirical Bayes approach
- TSPM: uses a two-stage Poisson model
- Statistical distributions: Gaussian, Poisson, negative binomial
- RNA-seq data fits a Negative Binomial (NB) distribution
- But really, that's just saying that RNA-seq looks like "counts" data with more variation than just statistical fluctuations - it also has biological variation in it.
- How do we know? Because, when you measure variance (per gene, between replicates), it's not equal to the mean, and it's not even a good linear fit.
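The mean-variance check described above can be tried on a single gene. Under a Poisson model the variance equals the mean; a common negative binomial parameterization instead has Var = mu + alpha * mu^2, where alpha is the dispersion. The sketch below uses a simple method-of-moments estimate of alpha; the function name and the replicate counts are invented for illustration, and real packages estimate dispersion with more careful, shrinkage-based methods.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2.

    Returns 0.0 when the data are no more variable than Poisson.
    """
    mu = mean(counts)
    var = variance(counts)  # sample variance across replicates
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)


# Made-up counts for one gene across five replicates.
gene_counts = [80, 100, 120, 140, 60]
print("mean:", mean(gene_counts))
print("variance:", variance(gene_counts))  # much larger than the mean
print("alpha:", dispersion_estimate(gene_counts))
```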
RPKM/FPKM-based methods
- Cufflinks & Cuffdiff
- Other differential analysis methods for microarray data
- t-test, limma, etc.
Quality Control of Experiments
- How well do the replicates correlate with each other?
- Does a PCA plot show that my samples group by genotype?
- What fraction of transcripts are expressed > 1 RPKM?
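Two of these checks are easy to sketch: the correlation between replicates and the fraction of transcripts above an expression threshold. The code below is illustrative only; the log2(x + 1) transform, the helper names, and the RPKM values are assumptions, not something prescribed in these notes.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def log_transform(values, pseudocount=1.0):
    """log2(x + pseudocount), so a few highly expressed genes do not dominate."""
    return [log2(v + pseudocount) for v in values]


def fraction_expressed(values, threshold=1.0):
    """Fraction of transcripts with expression strictly above threshold."""
    return sum(v > threshold for v in values) / len(values)


# Made-up RPKM values for five transcripts in two replicates.
rep1 = [0.0, 0.5, 3.0, 15.0, 120.0]
rep2 = [0.1, 0.4, 2.5, 18.0, 110.0]
r = pearson(log_transform(rep1), log_transform(rep2))
print("replicate correlation:", r)
print("fraction > 1 RPKM:", fraction_expressed(rep1))
```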
Hypergeometric test for overrepresentation - the basis for Gene Ontology analysis
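As a sketch of the overrepresentation test: if N genes were assayed, K of them carry a given annotation, and a list of n selected genes contains k annotated ones, the one-sided p-value is the hypergeometric probability of seeing k or more by chance. The function name and the numbers below are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N,
    of which K carry the annotation of interest."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total


# Toy numbers: 20 genes in total, 5 annotated, 4 selected, 2 of them annotated.
p = hypergeom_pvalue(N=20, K=5, n=4, k=2)
print(p)
```

In practice the test is repeated for many annotation terms, so the resulting p-values need multiple-testing correction.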
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
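The core of gene set enrichment analysis is a running-sum statistic computed over a ranked gene list. The sketch below implements only the simple unweighted form (a Kolmogorov-Smirnov-like score); the published method weights hits by their correlation with the phenotype and assesses significance by permutation, neither of which is shown here. Gene names and the function name are made up.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it; return the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # most up-regulated first
top_score = enrichment_score(ranked, {"g1", "g2"})     # set sits at the top
bottom_score = enrichment_score(ranked, {"g5", "g6"})  # set sits at the bottom
print(top_score, bottom_score)
```

A positive score indicates that the set is concentrated near the top of the ranking and a negative score near the bottom.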