"Bioinformatics: Introduction and Methods" -- Next-Generation Sequencing (NGS): Transcriptome Analysis with RNA-Seq -- Lecture Notes (17)

Latest recommended article published on 2021-12-19 22:06:27

盲人骑瞎马5555


Category: Bioinformatics

Article link: https://blog.csdn.net/wxw060709/article/details/101191568


Chapter 8 Next-Generation Sequencing (NGS): Transcriptome Analysis with RNA-Seq

8.10 Student presentation: Normalization methods for Illumina high-throughput RNA sequencing data analysis

1 Introduction

Normalization is Necessary
Hypothetical scenario: The hypothetical example above highlights the notion that the proportion of reads attributed to a given gene in a library depends on the expression properties of the whole sample rather than just the expression level of that gene.
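
To make this point concrete, here is a minimal sketch with made-up numbers (the four genes and their expression values are hypothetical and are not taken from the lecture). It ignores gene length and sampling noise and only looks at the expected share of reads per gene.

```python
import numpy as np

# Hypothetical true expression (arbitrary units) for four genes in two samples.
# Genes 0-2 are identical in both samples; only gene 3 changes.
sample_a = np.array([100.0, 200.0, 300.0, 400.0])
sample_b = np.array([100.0, 200.0, 300.0, 4000.0])


def read_proportions(expression):
    """Expected fraction of the library's reads that comes from each gene."""
    return expression / expression.sum()


prop_a = read_proportions(sample_a)
prop_b = read_proportions(sample_b)

# Gene 0 has the same expression in both samples, yet its share of the reads
# is smaller in sample B because gene 3 takes up more of the library there.
print(prop_a[0], prop_b[0])
```

Without normalization, gene 0 would look down-regulated in sample B even though its expression did not change.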

2 Biological Assumptions

Total count (TC): gene counts are divided by the total number of mapped reads (the library size) of their lane and multiplied by the mean total count across all samples in the dataset.
$TC=\frac{Y_{gk}}{N_{k}}\times \bar{N_{k}}$, where $Y_{gk}$ is the count of gene $g$ in lane $k$, $N_{k}$ is the total count of lane $k$, and $\bar{N_{k}}$ is the mean total count across all samples (for details, see the papers in the reading list).
Upper Quartile (UQ): the total counts are replaced by the upper quartile of the non-zero counts in the computation of the normalization factors.
$UQ=\frac{Y_{gk}}{N_{0.75k}}\times \bar{N_{k}}$, where $N_{0.75k}$ is the upper quartile of the non-zero counts in lane $k$.
Median (Med): the total counts are replaced by the median of the non-zero counts in the computation of the normalization factors.
$Med=\frac{Y_{gk}}{N_{0.5k}}\times \bar{N_{k}}$, where $N_{0.5k}$ is the median of the non-zero counts in lane $k$.
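
The three scaling formulas above can be sketched in a few lines of NumPy. This is an illustration that follows the formulas exactly as written in these notes (every method rescales by the mean total count), not the code used in the paper; the function name `scale_normalize` and the toy matrix are hypothetical.

```python
import numpy as np


def scale_normalize(counts, method="TC"):
    """Scale a genes x lanes count matrix Y[g, k] with TC, UQ or Med."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)  # N_k: total count of each lane
    if method == "TC":
        denom = totals
    elif method in ("UQ", "Med"):
        q = 75 if method == "UQ" else 50
        # Percentile of the non-zero counts of each lane (N_0.75k or N_0.5k).
        denom = np.array([np.percentile(col[col > 0], q) for col in counts.T])
    else:
        raise ValueError(f"unknown method: {method}")
    # Divide by the per-lane factor, then multiply by the mean total count.
    return counts / denom * totals.mean()


# Toy data: lane 1 is lane 0 sequenced twice as deeply.
toy_counts = np.array([[10, 20], [30, 60], [0, 0], [60, 120]])
tc = scale_normalize(toy_counts, "TC")
uq = scale_normalize(toy_counts, "UQ")
med = scale_normalize(toy_counts, "Med")
```

In this toy case the two lanes differ only in depth, so all three methods make the lanes identical after scaling.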
Quantile (Q): matches the distributions of gene counts across lanes. It is implemented in the Bioconductor package limma and is applied by calling the normalizeQuantiles function.
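
The idea of matching distributions can be sketched as follows. This is a bare-bones NumPy illustration of quantile normalization, not limma's implementation: it breaks ties arbitrarily and does not handle missing values. The function name `quantile_normalize` and the example matrix are hypothetical.

```python
import numpy as np


def quantile_normalize(counts):
    """Give every lane (column) the same empirical distribution."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(counts, axis=0)      # per-lane ordering of the genes
    sorted_counts = np.sort(counts, axis=0)
    # Reference distribution: mean of the i-th smallest value across lanes.
    reference = sorted_counts.mean(axis=1)
    normalized = np.empty_like(counts)
    for k in range(counts.shape[1]):
        # The gene with the i-th smallest count in lane k gets reference[i].
        normalized[order[:, k], k] = reference
    return normalized


example = np.array([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]])
qn = quantile_normalize(example)
```

After this step the sorted values of every lane are identical, while the ranking of genes within each lane is preserved.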
$RPKM=10^{9}\times \frac{C}{NL}$, Reads Per Kilobase per Million mapped reads (RPKM). $C$: the number of reads mapped onto the gene's exons; $N$: the total number of mapped reads in the experiment; $L$: the total length of the gene's exons in base pairs.
Principle: RPKM facilitates comparisons between genes within a sample and combines between-sample and within-sample normalization.
reference: Differential expression analysis for sequence count data
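
As a quick worked example of the RPKM formula above, here is a small helper. The function name `rpkm` and the input numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def rpkm(reads_on_exons, total_reads, exon_length_bp):
    """RPKM = 1e9 * C / (N * L), with C, N and L as defined in the notes."""
    if total_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total_reads and exon_length_bp must be positive")
    return 1e9 * reads_on_exons / (total_reads * exon_length_bp)


# 500 reads on a gene with 2,000 bp of exons, in a library of 20 million reads.
value = rpkm(500, 20_000_000, 2000)
print(value)  # 12.5
```

The factor $10^{9}$ is the product of $10^{3}$ (per kilobase) and $10^{6}$ (per million reads).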

3 DE Comparison

TC, UQ, Med, DESeq, TMM, Q, RPKM, RawCount
Distribution comparison
Intra-variance comparison
Housekeeping gene comparison
Clustering comparison
False-positive rate
Summary of normalization efficiency
reference: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
The figures in this reference are all well worth learning from.

8.14 Student presentation: Differential gene expression analysis

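High-throughput sequencing technology is rapidly becoming the standard method for measuring RNA expression levels (aka RNA-seq).
One of the main goals of these experiments is to identify the genes that are differentially expressed between two or more conditions.
To analyze RNA at the transcript level, the samples are first sequenced with RNA-seq to obtain a set of datasets, and differential gene expression between conditions is then derived from those datasets. There are three steps in total: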

reference: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data
This paper compares different analysis methods for RNA-seq data from several angles. The methods include Cuffdiff, edgeR, DESeq, PoissonSeq, baySeq and limma.
So this is also a review-style paper; it is worth seeing how other people present such a paper.
The paper mainly uses the following two datasets:

The first is the Sequencing Quality Control (SEQC) dataset, which includes replicated samples of the human whole body reference RNA and human brain reference RNA along with RNA spike-in controls.
The second dataset is RNA-seq data from biological replicates of three cell lines that were characterized as part of the ENCODE project.

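Indeed, if you need to compare many tools, you need a common baseline dataset: every tool is run on the same data, and only then can the results be compared.
Next come the angles from which the tools were compared, i.e. the means of their analysis:
The analysis in this paper focused on a number of measures that are most relevant for detection of differential gene expression from RNA-seq data: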

normalization of count data
sensitivity and specificity of DE detection
performance on the subset of genes that are expressed in one condition but have no detectable expression in the other condition.
the effects of reduced sequencing depth and number of replicates on the detection of differential expression.
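
Of the measures listed above, sensitivity and specificity are the easiest to show in code. The sketch below assumes a known ground truth per gene (for example from spike-in controls) and a list of DE calls from some tool; the function name and the toy labels are hypothetical and are not results from the paper.

```python
def sensitivity_specificity(truth, called):
    """truth[i]: gene i is truly DE; called[i]: a tool reported gene i as DE."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    sensitivity = tp / (tp + fn)  # share of truly DE genes that were detected
    specificity = tn / (tn + fp)  # share of non-DE genes correctly left alone
    return sensitivity, specificity


# Toy labels for eight genes: three are truly DE, five are not.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(truth, called)
```

With these toy labels the tool finds two of the three true DE genes and makes one false-positive call among the five non-DE genes.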

In most benchmarks Cuffdiff performed less favorably, with a higher number of false positives and without any increase in sensitivity (in other words, Cuffdiff is not a particularly good choice here: it produces more false positives with no gain in sensitivity).
Our results conclusively demonstrate that the addition of replicate samples provides substantially greater detection power of DE than increased sequence depth.
Hence, including more replicate samples in RNA-seq experiments is always to be preferred over increasing the number of sequenced reads.
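
One way to get an intuition for this conclusion is a back-of-the-envelope calculation. The sketch below assumes a negative binomial model for a gene's count (variance $\mu + \phi\mu^{2}$), which is an assumption of this example and is not stated in these notes; the function name and the numbers are hypothetical. Under that model the squared coefficient of variation of the mean over $n$ replicates is $(1/\mu + \phi)/n$.

```python
def squared_cv_of_mean(mean_count, dispersion, n_replicates):
    """Squared CV of the average count over n replicates (assumed NB model)."""
    # 1 / mean_count is the sampling (shot) noise term, which shrinks with depth;
    # dispersion is the biological variability, which depth cannot reduce.
    return (1.0 / mean_count + dispersion) / n_replicates


baseline = squared_cv_of_mean(mean_count=100, dispersion=0.1, n_replicates=3)
double_depth = squared_cv_of_mean(mean_count=200, dispersion=0.1, n_replicates=3)
double_reps = squared_cv_of_mean(mean_count=100, dispersion=0.1, n_replicates=6)
print(baseline, double_depth, double_reps)
```

In this toy setting doubling the depth barely reduces the noise, because the biological term dominates, while doubling the replicates halves it.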
