RNA-seq实践

晚渔声

已于 2024-09-29 20:35:38 修改

阅读量862

点赞数 9

文章标签： linux

于 2024-09-27 00:22:58 首次发布

本文链接：https://blog.csdn.net/telasi2004/article/details/142501798

版权

计算生物学实验3: RNA-seq实践

1. 实验目的

通过对RNA-seq原始的fastq数据进行分析，然后通过R语言的DESeq2包对得到的计数矩阵进行统计分析。目的是为了能够熟悉并掌握RNA-seq的分析流程，对常用软件如fastqc，STAR进行练习，对常用的数据矩阵合成命令如paste，cut，head，tail等进行练习，对差异表达基因分析常用的DESeq进行练习和学习，从而能够理解RNA-seq分析的基本流程和原理。

2. 实验准备

2.1 实验平台

Linux JSvr02 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
windows11 R4.4.1

2.2 数据简述

数据	数据来源
exercise1	Linux: /work/cb-data/RNA-seq
exercise2	pubmed:24853205
pasilla files	R: pasilla library

pasilla files的读入方式：下载pasilla包，通过以下命令传入R。

matrixFile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
sampleFile <- system.file( "extdata/pasilla_sample_annotation.csv", package="pasilla")

2.3 软件配置

Software	version
fastqc	v0.11.9
FileZilla	3.67.1_win64
STAR	STAR_2.5.4b
R	4.4.1

在R中安装DESeq2, ggplot2, pasilla,pheatmap,RColorBrewer，limma。

3. 实验内容

从测序的原始数据（fatsq data）开始，第一部分在Linux系统上将fastq格式数据转换为计数矩阵（Read Count Matrix），第二部分对计数矩阵进行统计分析。

3.1 From Fastq data files to Read Count Matrix (Linux Bash)

在Linux系统下首先通过fastqc工具来检测fastq文件的质量。若质量合格，通过STAR软件来建库比对，通过IGV软件对输出的BAM文件做可视化；通过一些文本操作命令将比对的结果合成reads计数矩阵。

(a) Prepare the working directory

mkdir /workdir/s20223282034/
cd /workdir/s20223282034/
cp -r /work/cb-data/RNAseq/exercise1/* . # -r用于将文件夹复制过来
ls -la

在这里插入图片描述

(b) Examine the quality of the fastq data files

fastqc ERR458493.fastq.gz

The fastqc software would create a new file called “ERR458493_fastqc.html”. You can use FileZilla to download the file to your laptop, and double click the file to check the results.
在这里插入图片描述

这里所看到的是一个箱线图，横轴为read长度（51），纵轴为质量得分，Q = -10*log10（error P）。黄色柱子代表的第四分之一位数到第四分之三位数的所有得分，蓝线表示平均数，红线代表中位数。一般要求所有位置的10%小于20，即最多允许该位置10%的序列低于Q20，即90%的序列的碱基质量都大于Q20，即90%的序列碱基错误率不超过99%。

显然，我们的测序数据有良好的表现。

(c ) Run read mapping software (STAR)
use STAR to map the sequencing reads in the fastq files to the reference genome.

STAR is a fast alignment software, but it requires a computer with large memory
(30GB for the 3GB human genome).

Inspect the files in the working directory to make sure you are in working directory

ls -la

you can find out what are in the files or number of reads in the fastq

gunzip -c ERR458493.fastq.gz | head
gunzip -c ERR458493.fastq.gz | wc -l
less R64.fa
less R64.gtf

R64.fa (fasta file)
在这里插入图片描述
R64.gtf (gtf file)

GTF文件简介

Map the reads to reference genome using STAR.
(1) index the reference genome with STAR

mkdir genome
STAR --runMode genomeGenerate --runThreadN 2 --genomeDir genome \
--genomeFastaFiles R64.fa --sjdbGTFfile R64.gtf \
--sjdbOverhang 50

The parameters:
--runMode genomeGenerate: Set runMode to “genomeGenerate” to index the genome;
--runThreadN: Number of CPU cores;
--genomeDir: Output directory for indexed genome database;
--genomeFastaFiles: Reference genome file
--sjdbGTFfile: Genome annotation file and it should be GTF format.
--sjdbOverhang: Use the value (reads_length -1)
在这里插入图片描述
Output：indexed genome database

(2) align sequencing reads to the indexed genome

STAR --quantMode GeneCounts --genomeDir genome --runThreadN 2 \
--readFilesIn ERR458493.fastq.gz --readFilesCommand zcat \
--outFileNamePrefix wt1_ --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 \
--outSAMtype BAM SortedByCoordinate

The parameters:
--quantMode GeneCounts: Output a file with read counts per gene;
--genomeDir: Reference genome index directory;
--runThreadN:Number of CPU cores;
--readFilesIn: Sequence data file;
--readFilesCommand zcat: Input file is a decompressed .gz file;
--outFileNamePrefix: Prefix of the output file names;
--outFilterMultimapNmax 1: If a read maps to more location ,it will not be output;
--outFilterMismatchNmax 2 : Only report alignment with up to 2 mismatches per read;
--outSAMtype BAM SortedByCoordinate: Output sorted bam files.

(3)Inspect the output files:

less wt1_Log.final.out
less wt1_ReadsPerGene.out.tab

wt1_Log.final.out:
在这里插入图片描述

Uniquely mapped reads>80%,证明实验比较成功，可以继续进行分析。

wt1_ReadsPerGene.out.table:
在这里插入图片描述
The four columns are:

column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for reads aligned with plus strand of RNA
column 4: counts for reads aligned with minus strand of RNA

对于不同的实验，我们选取不同的列进行分析，本实验我们选取第2列进行分析（unstranded）。

Use column 4 if you use stranded RNA-seq. Use column 3 if you do 3’ RNA-seq. Use column 2 if you use unstranded RNA-seq library prep kit.

(d) Visualize the BAM file with IGV

Index the bam files

samtools index wt1_Aligned.sortedByCoord.out.bam

Using FILEZILLA to download the “.bam” ,“.bai”, “R64.fa” and “R64.gtf”files to your laptop computer.
Most commonly used genomes are already loaded in IGV. In this exercise, we will create our own genome database. Click “Genomes”->”Create .genome” file. Fill out the following fields:

Unique identifier: R64
Descript name: R64
Fasta: use the “Browse” button to find the R64.fa file
Gene file: use the “Browse” button to find the R64.gtf file

From menu “File” -> “Load file”, open the “wt1_Aligned.sortedByCoord.out.bam”.
Inspect the following regions by enter the text in the box next to “Go” and click “Go”.
II:265,593-282,726

(e) Run the commands in a shell script
对于多个样本，进行这一步骤往往耗时较长，我们可以写一个bash脚本来执行这个命令。

screen
sh runSTAR.sh >& log &

在脚本成功开始运行后，我们就可以关掉screen分屏，脚本仍在后台运行。

paste wt1_ReadsPerGene.out.tab wt2_ReadsPerGene.out.tab wt3_ReadsPerGene.out.tab 
mu1_ReadsPerGene.out.tab mu2_ReadsPerGene.out.tab mu3_ReadsPerGene.out.tab | \
cut -f1,2,6,10,14,18,22 | \
tail -n +5 > gene_count.txt

此处paste命令将这6个文件按列合并，并根据我们的需求（提取每个文件的第二列合成计数矩阵），利用cut命令提取所需的列。
用tail命令去掉前4行（从第5行开始提取，前四行是统计信息，并非gene的count信息），将结果输出到gene_count.txt.

3.2 Statistical Analysis of RNA-Seq Data (R Language)

在Windows系统，R语言环境下，对RNA-seq的计数矩阵进行统计分析，并根据例子观察交互关系（interaction）和批次效应（batch effect）对统计分析的影响。

(a) Basic Analysis: Identify differentially expressed (DE) genes

Sample	Genotype
wt1	wt
wt2	wt
wt3	wt
mu1	mu
mu2	mu
mu3	mu

Load the matrix and sample files into R, and examine their contents.

setwd("C:/Users/pppp/Desktop/CB")
matrixFile <- "gene_count.txt"
sampleFile <- "samples.txt"
cts <- as.matrix(read.csv(matrixFile, sep="\t", row.names=1, header=FALSE))
coldata <- read.csv(sampleFile, sep="\t", row.names=1, header=TRUE)
head(coldata)
head(cts) # cts没有列名
colnames(cts) <- rownames(coldata) #将coldata的行名作为cts的列名
head(cts)

在这里插入图片描述
当然，此处对于列名的顺序也是我们之前合成这个矩阵时候的顺序。

Load the DESeq2 library. Create a DESeq2 object named dds from the gene read count and sample information.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts, #表达矩阵
colData = coldata,   #样本信息
design = ~ Genotype) #需要参考的实验设计参数

Create a PCA plot from the DESeq2 object

library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("Genotype"), returnData=TRUE)
# attr用于获取对象（pcadata）的属性，round函数用于四舍五入
percentVar <- round(100 * attr(pcaData, "percentVar"))
#按照Genotype划分颜色
ggplot(pcaData, aes(PC1, PC2, color=Genotype)) +
#geom_point创建散点图
geom_point(size=3) +
xlim(-2.5, 2.5) +
ylim(-1, 1) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
#geom_text用于给点添加文本标签，，hjust和vjust分别调节其水平和垂直位置。
geom_text(aes(label=name),vjust=2)
ggsave("myplot.png")

vst是基于负二项分布的一种标准化方法
plotPCA函数中的intgroup：colData()中用于分组的参数向量
plotPCA函数中的returnData=TRUE表示返回一个绘图数据（data frame）用于绘制PCA图

在这里插入图片描述

Get a list of differentially expressed genes.

dds<-DESeq(dds) #差异表达分析
resultsNames(dds) # return the names of the estimated effects (coefficents) of the model
# [1] "Intercept"         "Genotype_wt_vs_mu"
# unfiltered results
res <- results(dds, name="Genotype_wt_vs_mu")
summary(res) # print the summary on screen
# filter the results for FDR < 0.05 and absolute-value-of-Fold-Change > 2
res05 <- res[which(res$padj<0.05 & abs(res$log2FoldChange)>log2(2)), ]
# sort by padj and write to file
res05Ordered <- res05[order(res05$padj),]
write.csv(as.data.frame(res05Ordered), file="wt_vs_mu_fdr05_fc2_results.csv")

as.data.frame与data.frame区别在于前者是直接构建一个frame，而后者是将已有类型转化为frame。

Shrinkage of log fold change (log fc) of genes with very low expression.
Genes with very low expression could have imprecise fold changes. It is desirable to shrink the fold change of genes with low read counts, but not shrink the fold change of highly expressed genes.we will use “apeglm” algorithms for shrinkage.
compare the MA plot with or without shrinkage

## no shrinkage
res <- results(dds, name="Genotype_wt_vs_mu")
pdf("res_no_shrink.pdf")
## alpha用于控制假阳性率(FDR)
plotMA(res, main = "No shrinkage", alpha=0.05, ylim=c(-4,4))
dev.off()
## shrunk by apeglm
## coef是上面提到的coefficient
res_shrink <- lfcShrink(dds, coef="Genotype_wt_vs_mu", type="apeglm")
pdf("res_shrink.pdf")
plotMA(res_shrink, main = "Shrinkage by apeglm", alpha=0.05, ylim=c(-4,4))
dev.off()

在这里插入图片描述

通过对比可以看出，对于count较低的gene进行了明显收缩，而对于count较高的gene没有发生明显变化，收缩较为理想。

(b) Interactions

data files：, there are 12 samples, with two variables: strains (wt vs mut), and sample
collection time (0 vs 120 minutes).You are provided with two files: fission_gene_count.csv
(count matrix) and fission_sample.csv (sample annotation)

	strain	minute
GSM1368273	wt	0
GSM1368274	wt	0
GSM1368275	wt	0
GSM1368285	wt	120
GSM1368286	wt	120
GSM1368287	wt	120
GSM1368291	mut	0
GSM1368292	mut	0
GSM1368293	mut	0
GSM1368303	mut	120
GSM1368304	mut	120
GSM1368305	mut	120

Load data files into R. As the values of one of the variable “minute” are numeric, the minute
variable needs to be converted to R factor object.

setwd("C:/Users/pppp/Desktop/CB/exercise2")
matrixFile <- "fission_gene_count.csv"
sampleFile <- "fission_sample.csv"
cts <- as.matrix(read.csv(matrixFile, row.names=1))
coldata <- read.csv(sampleFile, row.names=1)
coldata$minute<-as.factor(coldata$minute)

Test of interactions of the strain and minute variable.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts,
       colData = coldata,
       design = ~ strain + minute + strain:minute) #多因素设计
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ strain + minute) 
#test="LRT"似然比检验   reduced简化模型(此处去除strain：minute)
resLRT <- results(ddsLRT)
summary(resLRT)

在这里插入图片描述
可以发现，去除interactions后，数据出现较大差异。

（c ）Batch effects
data files: there are 7 samples from two conditions (untreated and treated), and the data are from two different batches (single- and paired-end sequencing). We will correct for the batch effect in this analysis.

file	condition	type
treated1fb	treated	single-read
treated2fb	treated	paired-end
treated3fb	treated	paired-end
untreated1fb	untreated	single-read
untreated2fb	untreated	single-read
untreated3fb	untreated	paired-end
untreated4fb	untreated	paired-end

matrixFile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
sampleFile <- system.file( "extdata/pasilla_sample_annotation.csv", package="pasilla")
cts <- as.matrix(read.csv(matrixFile,sep="\t",row.names="gene_id"))
coldata <- read.csv(sampleFile, row.names=1)
coldata <- coldata[,c("condition","type")]

We need to remove the two extra characters (“fb”) in sample names of the sample annotation file
(and convert the “-” in the type to “_” to remove some warning messages). We also need to fix
the order of samples in the read count matrix, to be consistent with sample annotation file.

rownames(coldata) <- sub("fb", "", rownames(coldata))
coldata$type <- sub("-", "_", coldata$type)
coldata$type <- as.factor(coldata$type)
cts <- cts[, rownames(coldata)] #调节样本顺序，使其与注释文件一致

DE test accounting for the batch effect
We include the batch effect variable “type” in the model by using the design formula “type+condition”. Then we retrieve the results for the factor “condition”, with batch effect “type” corrected.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ type+condition)
dds <- DESeq(dds)
resultsNames(dds)
#[1] "Intercept"   "type_single_read_vs_paired_end" "condition_untreated_vs_treated"
res <- results(dds, name="condition_untreated_vs_treated")#此处只选择condition来矫正批次影响
summary(res)

在这里插入图片描述

Plot PCA before and after removing batch effect.

(1) PCA plot before removing batch effect

library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape = type)) +
geom_point(size=3) +
xlim(-12, 12) +
ylim(-10, 10) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(aes(label=name),vjust=2)
ggsave("myPCAWithBatchEffect.png")

在这里插入图片描述
去除批次效应前，paired_end类型的数据明显位于PCA图的下方，single_end的数据明显位于PCA图的上方，具有明显的批次效应。

(2) PCA plot after removing the batch effect with limma::removeBatchEffect

assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$type)
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape = type)) +
geom_point(size=3) +
xlim(-12, 12) +
ylim(-10, 10) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(aes(label=name),vjust=2)
ggsave("myPCABatchEffectRemoved.png")

在这里插入图片描述
去除批次效应后，paired_end和single_end的数据在PCA的图中不在明显分开，证明已去除批次效应。
3. Heatmap of the samples

library("pheatmap")
library("RColorBrewer")
sampleDists <- dist(t(assay(vsd))) #dist计算距离，t用来转置
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pdf("heatmap.pdf", height = 4, width = 5)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
dev.off()

colorRampPalette: 颜色渐变函数。
brewer.pal(9,“Blues”): 用于生产包含9个蓝色渐变色的向量，rev用于颠倒顺序，（255）包含255个颜色。

作热图，发现single_end和paired_end的数据不在明显区分，说明批次效应已去除。