RNA-seq analysis (practice)

    For this RNA-seq analysis, we will use the data from: Cheng et al, MicroRNA silencing for cancer therapy targeted to the tumour microenvironment. Nature 2015. The raw sequencing data is at GEO with six FASTQ files :http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61851

    Normally for paired-end RNA-seq, each sample will have two separate FASTQ files, with line-by-line correspondence to the two reads from the same fragment. This data is a little unusual in that the two reads of the same fragment are written in the same file. We are going to use the RNA-seq workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/) in BioConductor to analyze the RNA-seq data.

    For RNA-seq read mapping, a popular mapper is TopHat (Trapnell et al, Bioinformatics 2009): https://ccb.jhu.edu/software/tophat/index.shtml. However, if the data is paired-end sequencing with huge read count, STAR is much faster


1.STAR and TopHat : Use STAR (Dobin et al, Bioinformatics 2012)  and TopHat to map the reads to the mouse reference genome.

Build index:

    STAR --runMode genomeGenerate --runThreadN 2-genomeDir /week_5/index/ --genomeFastaFiles /mnt/Storage/data/BWA/mm9.fa --sjdbGTFfile /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf --sjdb Overhang 49

Mapping:

    STAR runThreadN 2 --genomeDir RNASEQ/star/index/ --readFilesIn RNASEQ/SRR1592185_1.fastq  SRR1592185_2.fastq --outFileNamePrefix SRR1592185

    tophat -p 2 -G /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf /mnt/Storage/data/Bowtie/mm9  SRR1592185_1.fastq  SRR1592185_2.fastq > mapping.log

#runThreadN 30

#runMode genomeGenerate

#genomeDir  /home/jmzeng/hoston/mouse/STAR-mouse #put builded indexes into this directory

#genomeFastaFiles  fasta file of

#sjdbGTFfile  #annotation file

#sjdbOverhang 284   #length of reads sequenced - 1

tophat -p 2 -G /mnt/Storage/home/huse/Data/RNA_pipe/mm9_refSeq.gtf /mnt/Storage/data/Bowtie/mm9 GSM1515732_1.fastq GSM1515732_2.fastq > mapping.log


 2. The above test will allow you to learn how to use the read mappers well.After mapping using TopHat , we can get bam files. Use HTSeq to count the reads for each gene from the BAM files. For paired-end sequencing, the reads must be sorted by name. Input files of htseq-count are sam format. So we sort sam files by name and transfrom bam files into sam files. Below are Commands:

samtools sort -n GSM1515732.bam

samtools view -h GSM1515732.bam > GSM1515732.sam

htseq-count -s no -a 10 week_5/GSM1515732.sam RNA_pipe/mm9_refSeq.gtf > GSM1515732_count.txt


3. Use DESeq2 (Love et al, Genome Biol 2014) to look at differential expression. \


source("https://bioconductor.org/biocLite.R")

biocLite("DESeq2")


#count matrix

sampleNames <- c("DOX_1","DOX_2","DOX_3","NODOX_1","NODOX_2","NODOX_3")

mydata <- read.table("counts.txt", header = TRUE, quote = '\t',skip =1)

names(mydata)[7:12] <- sampleNames

countMatrix <- as.matrix(mydata[7:12])

rownames(countMatrix) <-mydata$Geneid

table2 <- data.frame(name = c("DOX_1","DOX_2","DOX_3","NODOX_1","NODOX_2","NODOX_3"),condition = ("DOX","DOX","DOX","NODOX","NODOX","NODOX"))

rownames(table2) <- sampleNames

head(countMatrix)

#transform count matrix into format of DESeq2

dds <- DESeqDataSetFromMatrix(countMatrix, colData=table2, design= ~ condition)

#filter data of count below 0.Because they are useless

dds <- dds[ rowSums(counts(dds)) > 1, ]


#PCA

rld <- rlog(dds)

plotPCA(rld, intgroup=c("name","condition"))


#PCA - ggplot2  note : don't library DESeq before PCA analysis  

library(ggplot2)

rld <- rlog(dds)

data <- plotPCA(rld, intgroup=c("condition", "name"), returnData=TRUE)

percentVar <- round(100 * attr(data, "percentVar"))

p<- ggplot(data, aes(PC1, PC2, color=condition, shape=name)) +

geom_point(size=3) +

xlab(paste0("PC1: ",percentVar[1],"% variance")) +

ylab(paste0("PC2: ",percentVar[2],"% variance"))


#differential expression

library(DESeq)

dds <- DESeq(dds)

res <- results(dds)

write.table(res,"result.csv", sep = ",", row.names = TRUE)

head(res)

#(1)rownames:gene ID (2)baseMean (3)log2FoldChange (4)pvalue (5)padj:adjusted p value

#summary 

summary(res)


#MA plot

library(geneplotter)

plotMA(res, main="DESeq2", ylim=c(-2,2))

plotDispEsts(dds)


#heatmap

sum(res$padj < 0.1, na.rm=TRUE)

library("pheatmap")

select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:1000]

nt <- normTransform(dds) # defaults to log2(x+1)

log2.norm.counts <- assay(nt)[select,]

df <- as.data.frame(colData(dds)[,c("name","condition")])

pdf('heatmap1000.pdf',width = 6, height = 7)

pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,

cluster_cols=TRUE, annotation_col=df)

dev.off()

( To be continued ... )

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值