RNA-seq实践

1. 实验目的

通过对RNA-seq原始的fastq数据进行分析,然后通过R语言的DESeq2包对得到的计数矩阵进行统计分析。目的是为了能够熟悉并掌握RNA-seq的分析流程,对常用软件如fastqc,STAR进行练习,对常用的数据矩阵合成命令如paste,cut,head,tail等进行练习,对差异表达基因分析常用的DESeq进行练习和学习,从而能够理解RNA-seq分析的基本流程和原理。

2. 实验准备

2.1 实验平台
  1. Linux JSvr02 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  2. windows11 R4.4.1
2.2 数据简述
数据数据来源
exercise1Linux: /work/cb-data/RNA-seq
exercise2pubmed:24853205
pasilla filesR: pasilla library
  • pasilla files的读入方式:下载pasilla包,通过以下命令传入R。
matrixFile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
sampleFile <- system.file( "extdata/pasilla_sample_annotation.csv", package="pasilla")
2.3 软件配置
Softwareversion
fastqcv0.11.9
FileZilla3.67.1_win64
STARSTAR_2.5.4b
R4.4.1

在R中安装DESeq2, ggplot2, pasilla,pheatmap,RColorBrewer,limma。

3. 实验内容

从测序的原始数据(fatsq data)开始,第一部分在Linux系统上将fastq格式数据转换为计数矩阵(Read Count Matrix),第二部分对计数矩阵进行统计分析。

3.1 From Fastq data files to Read Count Matrix (Linux Bash)

在Linux系统下首先通过fastqc工具来检测fastq文件的质量。若质量合格,通过STAR软件来建库比对,通过IGV软件对输出的BAM文件做可视化;通过一些文本操作命令将比对的结果合成reads计数矩阵。

(a) Prepare the working directory

mkdir /workdir/s20223282034/
cd /workdir/s20223282034/
cp -r /work/cb-data/RNAseq/exercise1/* . # -r用于将文件夹复制过来
ls -la

在这里插入图片描述

(b) Examine the quality of the fastq data files

fastqc ERR458493.fastq.gz

The fastqc software would create a new file called “ERR458493_fastqc.html”. You can use FileZilla to download the file to your laptop, and double click the file to check the results.
在这里插入图片描述

这里所看到的是一个箱线图,横轴为read长度(51),纵轴为质量得分,Q = -10*log10(error P)。黄色柱子代表的第四分之一位数到第四分之三位数的所有得分,蓝线表示平均数,红线代表中位数。一般要求所有位置的10%小于20,即最多允许该位置10%的序列低于Q20,即90%的序列的碱基质量都大于Q20,即90%的序列碱基错误率不超过99%。

显然,我们的测序数据有良好的表现。

(c ) Run read mapping software (STAR)
use STAR to map the sequencing reads in the fastq files to the reference genome.

STAR is a fast alignment software, but it requires a computer with large memory
(30GB for the 3GB human genome).

  1. Inspect the files in the working directory to make sure you are in working directory
ls -la
  1. you can find out what are in the files or number of reads in the fastq
gunzip -c ERR458493.fastq.gz | head
gunzip -c ERR458493.fastq.gz | wc -l
less R64.fa
less R64.gtf

R64.fa (fasta file)
在这里插入图片描述
R64.gtf (gtf file)
在这里插入图片描述
GTF文件简介

  1. Map the reads to reference genome using STAR.
    (1) index the reference genome with STAR
mkdir genome
STAR --runMode genomeGenerate --runThreadN 2 --genomeDir genome \
--genomeFastaFiles R64.fa --sjdbGTFfile R64.gtf \
--sjdbOverhang 50

The parameters:
--runMode genomeGenerate: Set runMode to “genomeGenerate” to index the genome;
--runThreadN: Number of CPU cores;
--genomeDir: Output directory for indexed genome database;
--genomeFastaFiles: Reference genome file
--sjdbGTFfile: Genome annotation file and it should be GTF format.
--sjdbOverhang: Use the value (reads_length -1)
在这里插入图片描述
Output:indexed genome database
在这里插入图片描述

(2) align sequencing reads to the indexed genome

STAR --quantMode GeneCounts --genomeDir genome --runThreadN 2 \
--readFilesIn ERR458493.fastq.gz --readFilesCommand zcat \
--outFileNamePrefix wt1_ --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 \
--outSAMtype BAM SortedByCoordinate

The parameters:
--quantMode GeneCounts: Output a file with read counts per gene;
--genomeDir: Reference genome index directory;
--runThreadN:Number of CPU cores;
--readFilesIn: Sequence data file;
--readFilesCommand zcat: Input file is a decompressed .gz file;
--outFileNamePrefix: Prefix of the output file names;
--outFilterMultimapNmax 1: If a read maps to more location ,it will not be output;
--outFilterMismatchNmax 2 : Only report alignment with up to 2 mismatches per read;
--outSAMtype BAM SortedByCoordinate: Output sorted bam files.

(3)Inspect the output files:

less wt1_Log.final.out
less wt1_ReadsPerGene.out.tab

wt1_Log.final.out:
在这里插入图片描述

  • Uniquely mapped reads>80%,证明实验比较成功,可以继续进行分析。

wt1_ReadsPerGene.out.table:
在这里插入图片描述
The four columns are:

  • column 1: gene ID
  • column 2: counts for unstranded RNA-seq
  • column 3: counts for reads aligned with plus strand of RNA
  • column 4: counts for reads aligned with minus strand of RNA

对于不同的实验,我们选取不同的列进行分析,本实验我们选取第2列进行分析(unstranded)。

Use column 4 if you use stranded RNA-seq. Use column 3 if you do 3’ RNA-seq. Use column 2 if you use unstranded RNA-seq library prep kit.

(d) Visualize the BAM file with IGV

  1. Index the bam files
samtools index wt1_Aligned.sortedByCoord.out.bam
  1. Using FILEZILLA to download the “.bam” ,“.bai”, “R64.fa” and “R64.gtf”files to your laptop computer.
  2. Most commonly used genomes are already loaded in IGV. In this exercise, we will create our own genome database. Click “Genomes”->”Create .genome” file. Fill out the following fields:

Unique identifier: R64
Descript name: R64
Fasta: use the “Browse” button to find the R64.fa file
Gene file: use the “Browse” button to find the R64.gtf file

  1. From menu “File” -> “Load file”, open the “wt1_Aligned.sortedByCoord.out.bam”.
    Inspect the following regions by enter the text in the box next to “Go” and click “Go”.
    II:265,593-282,726
    在这里插入图片描述

(e) Run the commands in a shell script
对于多个样本,进行这一步骤往往耗时较长,我们可以写一个bash脚本来执行这个命令。

screen
sh runSTAR.sh >& log &

在脚本成功开始运行后,我们就可以关掉screen分屏,脚本仍在后台运行。

paste wt1_ReadsPerGene.out.tab wt2_ReadsPerGene.out.tab wt3_ReadsPerGene.out.tab 
mu1_ReadsPerGene.out.tab mu2_ReadsPerGene.out.tab mu3_ReadsPerGene.out.tab | \
cut -f1,2,6,10,14,18,22 | \
tail -n +5 > gene_count.txt
  • 此处paste命令将这6个文件按列合并,并根据我们的需求(提取每个文件的第二列合成计数矩阵),利用cut命令提取所需的列。
  • tail命令去掉前4行(从第5行开始提取,前四行是统计信息,并非gene的count信息),将结果输出到gene_count.txt.
3.2 Statistical Analysis of RNA-Seq Data (R Language)

在Windows系统,R语言环境下,对RNA-seq的计数矩阵进行统计分析,并根据例子观察交互关系(interaction)和批次效应(batch effect)对统计分析的影响。

(a) Basic Analysis: Identify differentially expressed (DE) genes

SampleGenotype
wt1wt
wt2wt
wt3wt
mu1mu
mu2mu
mu3mu
  1. Load the matrix and sample files into R, and examine their contents.
setwd("C:/Users/pppp/Desktop/CB")
matrixFile <- "gene_count.txt"
sampleFile <- "samples.txt"
cts <- as.matrix(read.csv(matrixFile, sep="\t", row.names=1, header=FALSE))
coldata <- read.csv(sampleFile, sep="\t", row.names=1, header=TRUE)
head(coldata)
head(cts) # cts没有列名
colnames(cts) <- rownames(coldata) #将coldata的行名作为cts的列名
head(cts)

在这里插入图片描述
当然,此处对于列名的顺序也是我们之前合成这个矩阵时候的顺序。

  1. Load the DESeq2 library. Create a DESeq2 object named dds from the gene read count and sample information.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts, #表达矩阵
colData = coldata,   #样本信息
design = ~ Genotype) #需要参考的实验设计参数
  1. Create a PCA plot from the DESeq2 object
library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("Genotype"), returnData=TRUE)
# attr用于获取对象(pcadata)的属性,round函数用于四舍五入
percentVar <- round(100 * attr(pcaData, "percentVar"))
#按照Genotype划分颜色
ggplot(pcaData, aes(PC1, PC2, color=Genotype)) +
#geom_point创建散点图
geom_point(size=3) +
xlim(-2.5, 2.5) +
ylim(-1, 1) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
#geom_text用于给点添加文本标签,,hjust和vjust分别调节其水平和垂直位置。
geom_text(aes(label=name),vjust=2)
ggsave("myplot.png")
  • vst是基于负二项分布的一种标准化方法
  • plotPCA函数中的intgroup:colData()中用于分组的参数向量
  • plotPCA函数中的returnData=TRUE表示返回一个绘图数据(data frame)用于绘制PCA图

在这里插入图片描述

  1. Get a list of differentially expressed genes.
dds<-DESeq(dds) #差异表达分析
resultsNames(dds) # return the names of the estimated effects (coefficents) of the model
# [1] "Intercept"         "Genotype_wt_vs_mu"
# unfiltered results
res <- results(dds, name="Genotype_wt_vs_mu")
summary(res) # print the summary on screen
# filter the results for FDR < 0.05 and absolute-value-of-Fold-Change > 2
res05 <- res[which(res$padj<0.05 & abs(res$log2FoldChange)>log2(2)), ]
# sort by padj and write to file
res05Ordered <- res05[order(res05$padj),]
write.csv(as.data.frame(res05Ordered), file="wt_vs_mu_fdr05_fc2_results.csv")
  • as.data.framedata.frame区别在于前者是直接构建一个frame,而后者是将已有类型转化为frame。
    在这里插入图片描述
  1. Shrinkage of log fold change (log fc) of genes with very low expression.
    Genes with very low expression could have imprecise fold changes. It is desirable to shrink the fold change of genes with low read counts, but not shrink the fold change of highly expressed genes.we will use “apeglm” algorithms for shrinkage.
    compare the MA plot with or without shrinkage
## no shrinkage
res <- results(dds, name="Genotype_wt_vs_mu")
pdf("res_no_shrink.pdf")
## alpha用于控制假阳性率(FDR)
plotMA(res, main = "No shrinkage", alpha=0.05, ylim=c(-4,4))
dev.off()
## shrunk by apeglm
## coef是上面提到的coefficient
res_shrink <- lfcShrink(dds, coef="Genotype_wt_vs_mu", type="apeglm")
pdf("res_shrink.pdf")
plotMA(res_shrink, main = "Shrinkage by apeglm", alpha=0.05, ylim=c(-4,4))
dev.off()

在这里插入图片描述
在这里插入图片描述
通过对比可以看出,对于count较低的gene进行了明显收缩,而对于count较高的gene没有发生明显变化,收缩较为理想。

(b) Interactions

data files:, there are 12 samples, with two variables: strains (wt vs mut), and sample
collection time (0 vs 120 minutes).You are provided with two files: fission_gene_count.csv
(count matrix) and fission_sample.csv (sample annotation)

strainminute
GSM1368273wt0
GSM1368274wt0
GSM1368275wt0
GSM1368285wt120
GSM1368286wt120
GSM1368287wt120
GSM1368291mut0
GSM1368292mut0
GSM1368293mut0
GSM1368303mut120
GSM1368304mut120
GSM1368305mut120
  1. Load data files into R. As the values of one of the variable “minute” are numeric, the minute
    variable needs to be converted to R factor object.
setwd("C:/Users/pppp/Desktop/CB/exercise2")
matrixFile <- "fission_gene_count.csv"
sampleFile <- "fission_sample.csv"
cts <- as.matrix(read.csv(matrixFile, row.names=1))
coldata <- read.csv(sampleFile, row.names=1)
coldata$minute<-as.factor(coldata$minute)
  1. Test of interactions of the strain and minute variable.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts,
       colData = coldata,
       design = ~ strain + minute + strain:minute) #多因素设计
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ strain + minute) 
#test="LRT"似然比检验   reduced简化模型(此处去除strain:minute)
resLRT <- results(ddsLRT)
summary(resLRT)

在这里插入图片描述
可以发现,去除interactions后,数据出现较大差异。

(c )Batch effects
data files: there are 7 samples from two conditions (untreated and treated), and the data are from two different batches (single- and paired-end sequencing). We will correct for the batch effect in this analysis.

fileconditiontype
treated1fbtreatedsingle-read
treated2fbtreatedpaired-end
treated3fbtreatedpaired-end
untreated1fbuntreatedsingle-read
untreated2fbuntreatedsingle-read
untreated3fbuntreatedpaired-end
untreated4fbuntreatedpaired-end
matrixFile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
sampleFile <- system.file( "extdata/pasilla_sample_annotation.csv", package="pasilla")
cts <- as.matrix(read.csv(matrixFile,sep="\t",row.names="gene_id"))
coldata <- read.csv(sampleFile, row.names=1)
coldata <- coldata[,c("condition","type")]

We need to remove the two extra characters (“fb”) in sample names of the sample annotation file
(and convert the “-” in the type to “_” to remove some warning messages). We also need to fix
the order of samples in the read count matrix, to be consistent with sample annotation file.

rownames(coldata) <- sub("fb", "", rownames(coldata))
coldata$type <- sub("-", "_", coldata$type)
coldata$type <- as.factor(coldata$type)
cts <- cts[, rownames(coldata)] #调节样本顺序,使其与注释文件一致
  1. DE test accounting for the batch effect
    We include the batch effect variable “type” in the model by using the design formula “type+condition”. Then we retrieve the results for the factor “condition”, with batch effect “type” corrected.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ type+condition)
dds <- DESeq(dds)
resultsNames(dds)
#[1] "Intercept"   "type_single_read_vs_paired_end" "condition_untreated_vs_treated"
res <- results(dds, name="condition_untreated_vs_treated")#此处只选择condition来矫正批次影响
summary(res)

在这里插入图片描述

  1. Plot PCA before and after removing batch effect.

    (1) PCA plot before removing batch effect

library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape = type)) +
geom_point(size=3) +
xlim(-12, 12) +
ylim(-10, 10) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(aes(label=name),vjust=2)
ggsave("myPCAWithBatchEffect.png")

在这里插入图片描述
去除批次效应前,paired_end类型的数据明显位于PCA图的下方,single_end的数据明显位于PCA图的上方,具有明显的批次效应

(2) PCA plot after removing the batch effect with limma::removeBatchEffect

assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$type)
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape = type)) +
geom_point(size=3) +
xlim(-12, 12) +
ylim(-10, 10) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(aes(label=name),vjust=2)
ggsave("myPCABatchEffectRemoved.png")

在这里插入图片描述
去除批次效应后,paired_end和single_end的数据在PCA的图中不在明显分开,证明已去除批次效应。
3. Heatmap of the samples

library("pheatmap")
library("RColorBrewer")
sampleDists <- dist(t(assay(vsd))) #dist计算距离,t用来转置
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pdf("heatmap.pdf", height = 4, width = 5)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
dev.off()
  • colorRampPalette: 颜色渐变函数。
  • brewer.pal(9,“Blues”): 用于生产包含9个蓝色渐变色的向量,rev用于颠倒顺序,(255)包含255个颜色。
    在这里插入图片描述
    作热图,发现single_end和paired_end的数据不在明显区分,说明批次效应已去除。

4. 实验总结

4.1 实验结论

RNA-seq是生物信息/计算生物分析中常见的一种数据,对于多组数据,要充分考虑测序方式(如,single_end /paired_end),交互关系(interaction)和批次效应(Batch effects)的影响,设计合理、正确的分析方案。

4.2 实验收获

通过本实验,掌握了RNA-seq的基本分析流程,对fastqc,STAR软件的基本用法和结果解读有了一定认识,对DESeq2和ggplot2包的基本用法有了一些心得。充分理解了交互关系(interaction)和批次效应(Batch effects)对数据分析的影响。

4.3 其它心得

生物信息分析中常用的library,software较多,所以学好英语,多看多练习多思考,才能有一个较好的效果。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值