RNA-seq analysis (practice)

最新推荐文章于 2023-10-21 21:21:42 发布

Al_Liz_Well

最新推荐文章于 2023-10-21 21:21:42 发布

阅读量1.2k

点赞数

分类专栏： RNA-seq 文章标签： RNA-seq mapping differential express

本文链接：https://blog.csdn.net/Al_Liz_Well/article/details/51758465

版权

RNA-seq 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

For this RNA-seq analysis, we will use the data from: Cheng et al, MicroRNA silencing for cancer therapy targeted to the tumour microenvironment. Nature 2015. The raw sequencing data is at GEO with six FASTQ files ：http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61851

Normally for paired-end RNA-seq, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. This data is a little unusual in that the two reads of the same fragment are written in the same file. We are going to use the RNA-seq workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/) in BioConductor to analyze the RNA-seq data.

For RNA-seq read mapping, a popular mapper is TopHat (Trapnell et al, Bioinformatics 2009): https://ccb.jhu.edu/software/tophat/index.shtml. However, if the data is paired-end sequencing with huge read count, STAR is much faster

1.STAR and TopHat : Use STAR (Dobin et al, Bioinformatics 2012) and TopHat to map the reads to the mouse reference genome.

Build index:

STAR --runMode genomeGenerate --runThreadN 2–-genomeDir /week_5/index/ --genomeFastaFiles /mnt/Storage/data/BWA/mm9.fa --sjdbGTFfile /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf --sjdb Overhang 49

Mapping:

STAR runThreadN 2 --genomeDir RNASEQ/star/index/ --readFilesIn RNASEQ/SRR1592185_1.fastq SRR1592185_2.fastq --outFileNamePrefix SRR1592185

tophat -p 2 -G /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf /mnt/Storage/data/Bowtie/mm9 SRR1592185_1.fastq SRR1592185_2.fastq > mapping.log

#–runThreadN 30

#–runMode genomeGenerate

#–genomeDir /home/jmzeng/hoston/mouse/STAR-mouse #put builded indexes into this directory

#–genomeFastaFiles fasta file of

#–sjdbGTFfile #annotation file

#–sjdbOverhang 284 #length of reads sequenced - 1

tophat -p 2 -G /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf /mnt/Storage/data/Bowtie/mm9 GSM1515732_1.fastq GSM1515732_2.fastq > mapping.log

2. The above test will allow you to learn how to use the read mappers well.After mapping using TopHat , we can get bam files. Use HTSeq to count the reads for each gene from the BAM files. For paired-end sequencing, the reads must be sorted by name. Input files of htseq-count are sam format. So we sort sam files by name and transfrom bam files into sam files. Below are Commands:

samtools sort -n GSM1515732.bam

samtools view -h GSM1515732.bam > GSM1515732.sam

htseq-count -s no -a 10 week_5/GSM1515732.sam RNA_pipe/mm9_refSeq.gtf > GSM1515732_count.txt

3. Use DESeq2 (Love et al, Genome Biol 2014) to look at differential expression. \

source("https://bioconductor.org/biocLite.R")

biocLite("DESeq2")

#count matrix

sampleNames <- c("DOX_1","DOX_2","DOX_3","NODOX_1","NODOX_2","NODOX_3")

mydata <- read.table("counts.txt", header = TRUE, quote = '\t',skip =1)

names(mydata)[7:12] <- sampleNames

countMatrix <- as.matrix(mydata[7:12])

rownames(countMatrix) <-mydata$Geneid

table2 <- data.frame(name = c("DOX_1","DOX_2","DOX_3","NODOX_1","NODOX_2","NODOX_3"),condition = ("DOX","DOX","DOX","NODOX","NODOX","NODOX"))

rownames(table2) <- sampleNames

head(countMatrix)

#transform count matrix into format of DESeq2

dds <- DESeqDataSetFromMatrix(countMatrix, colData=table2, design= ~ condition)

#filter data of count below 0.Because they are useless

dds <- dds[ rowSums(counts(dds)) > 1, ]

#PCA

rld <- rlog(dds)

plotPCA(rld, intgroup=c("name","condition"))

#PCA - ggplot2 note : don't library DESeq before PCA analysis

library(ggplot2)

rld <- rlog(dds)

data <- plotPCA(rld, intgroup=c("condition", "name"), returnData=TRUE)

percentVar <- round(100 * attr(data, "percentVar"))

p<- ggplot(data, aes(PC1, PC2, color=condition, shape=name)) +

geom_point(size=3) +

xlab(paste0("PC1: ",percentVar[1],"% variance")) +

ylab(paste0("PC2: ",percentVar[2],"% variance"))

#differential expression

library(DESeq)

dds <- DESeq(dds)

res <- results(dds)

write.table(res,"result.csv", sep = ",", row.names = TRUE)

head(res)

#(1)rownames:gene ID (2)baseMean (3)log2FoldChange (4)pvalue (5)padj:adjusted p value

#summary　

summary(res)

#MA plot

library(geneplotter)

plotMA(res, main="DESeq2", ylim=c(-2,2))

plotDispEsts(dds)

#heatmap

sum(res$padj < 0.1, na.rm=TRUE)

library("pheatmap")

select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:1000]

nt <- normTransform(dds) # defaults to log2(x+1)

log2.norm.counts <- assay(nt)[select,]

df <- as.data.frame(colData(dds)[,c("name","condition")])

pdf('heatmap1000.pdf',width = 6, height = 7)

pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,

cluster_cols=TRUE, annotation_col=df)

dev.off()

( To be continued ... )

Al_Liz_Well

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
RNA-seq analysis (practice)

RNA-seq analysis （index building, mapping and differential expression analysis using DESeq2）
复制链接

扫一扫

专栏目录