ChIP-seq: the complete workflow

yuxiang&chenxi

Article link: https://blog.csdn.net/doctor_yuxiang/article/details/127229648

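Transcriptomics mainly studies differential gene expression under different conditions and changes in RNA structure. Epigenomics instead studies the molecular mechanisms by which gene expression, regulation, and phenotype change heritably while the DNA sequence itself stays the same. In other words, the same DNA, RNA, or protein, once modified, can change an organism's traits.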

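What the analysis can deliver:

1. Determine where each type of histone or transcription factor binds across the whole genome.

2. When several histones or protein subunits are compared, examine how the gene sets they bind relate to one another, for example with a Venn diagram showing whether the bound-gene sets overlap or contain each other.

3. Examine where the genes bound by each histone sit relative to the TSS (transcription start site).

4. Examine the position relative to the TSS for each group (genes bound in common by different histones). This shows whether the genes are still expressed when a given histone is absent, which helps verify that histone's function and significance.

5. Annotate the functions (GO) and pathways (KEGG) of the genes bound by different histones.

6. Study the expression of the target genes of each histone.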

Steps:

1. Quality control with FastQC.

2. Read alignment with Bowtie2 or BWA.

3. Peak calling; MACS is recommended.

4. Peak annotation; ChIPseeker by Guangchuang Yu ("Y叔") is recommended.

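I. Create the runtime environment

# Assumes the bioconda and conda-forge channels are already configured
conda create -n epigenetic python=2 bwa
conda info --envs
source activate epigenetic
# You can look a package up with search first
conda search trim-galore
## Make sure every tool is installed inside the epigenetic environment
conda install -y sra-tools
conda install -y trim-galore samtools
conda install -y deeptools homer meme
conda install -y macs2 bowtie bowtie2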

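II. Download the data

cd project/
# Write the run accessions into SraAccList.txt (the original used vim; a heredoc gives the same file)
cat > SraAccList.txt <<'EOF'
SRR1266976
SRR1266977
SRR1266978
SRR1266979
SRR1266980
SRR1266981
EOF
source activate epigenetic
# Create the working directories under project/ (-p is needed for the nested qc/ subdirectories)
mkdir -p {sra,bedgraph,fastq,rmdup,tss,clean,align,peaks,motif,qc/{raw,trimed},annotation}
# Download every run into the sra directory
cat SraAccList.txt | while read id;
do
(nohup prefetch -O ./sra $id &) # saved under the sra directory
done
# The downloads run in the background; press Ctrl+C to get the prompt back if needed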

# The prebuilt index is about 3.2 GB; downloading the genome and building the index yourself is not recommended
mkdir reference && cd reference
wget -4 -q ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip
unzip mm10.zip
cd ..

Converting SRA files to FASTQ

fastq-dump --split-3 filename

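The --split-3 option handles both library layouts: a single-end run produces one file, *.fastq, while a paired-end run produces *_1.fastq and *_2.fastq (reads whose mate is missing, if any, go to a third *.fastq file).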
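
To make the naming rule concrete, here is a minimal sketch that infers the layout of a run from the files that fastq-dump --split-3 left behind. The function name detect_layout and its two arguments are hypothetical helpers for illustration, not part of sra-tools.

```shell
# Infer the library layout of one run from the files written by fastq-dump --split-3.
# Usage: detect_layout <fastq_dir> <run_id>   (hypothetical helper, not an sra-tools command)
detect_layout() {
    local dir="$1" id="$2"
    if [ -f "$dir/${id}_1.fastq" ] && [ -f "$dir/${id}_2.fastq" ]; then
        echo "paired"      # both mates present
    elif [ -f "$dir/${id}.fastq" ]; then
        echo "single"      # only the single-end file exists
    else
        echo "missing"     # the run has not been converted yet
    fi
}
```

Something like `detect_layout fastq SRR1266976` could then decide whether the aligner should be given -U (single-end) or -1/-2 (paired-end) input.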

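III. Quality control

## Requires fastqc and multiqc to be installed
# First generate the QC reports (this assumes gzip-compressed FASTQ files; adjust the pattern, e.g. *.fastq, otherwise)
ls *gz | while read id; do fastqc -t 4 $id; done
# Then summarize them with multiqc
multiqc *fastqc.zip --pdf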

IV. Read alignment

bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620204.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/ring1B.bam
bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620205.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/cbx7.bam
bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620206.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/suz12.bam
bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620207.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/RYBP.bam
bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620208.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/IgGold.bam
bowtie2 -p 6 -3 5 --local -x reference/mm10 -U ChIP-Seq/SRR620209.fastq | ~/miniconda3/bin/samtools sort -O bam -o ../analysis/alignment/IgG.bam

V. Calling ChIP-seq enriched regions with MACS2

macs2 callpeak -c IgGold.bam -t suz12.bam -q 0.05 -f BAM -g mm -n suz12 &
macs2 callpeak -c IgGold.bam -t cbx7.bam -q 0.05 -f BAM -g mm -n cbx7 &
macs2 callpeak -c IgGold.bam -t ring1B.bam -q 0.05 -f BAM -g mm -n ring1B &
macs2 callpeak -c IgGold.bam -t RYBP.bam -q 0.05 -f BAM -g mm -n RYBP &

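Each comparison produces four files:

NAME_peaks.xls: peak information in tabular form. Despite the .xls suffix it is plain text and opens in any text editor. It resembles BED, but its coordinates are 1-based whereas BED is 0-based, so the start coordinate must be decreased by 1 to obtain the BED coordinate (the end coordinate stays the same).
NAME_peaks.narrowPeak (NAME_peaks.broadPeak is similar but has no summit column): a BED-like file. Column 5 is an integer score for display, and columns 7 to 10 are fold-change, -log10 p-value, -log10 q-value, and the summit position relative to the peak start. The content largely matches NAME_peaks.xls, and the format is convenient for importing into R.
NAME_summits.bed: the summit of each peak, i.e. the position of its maximum. MACS recommends this file for finding motifs at binding sites.
NAME_model.r: an R script that plots the peak model built from the data you provided.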
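
The coordinate difference between NAME_peaks.xls and BED can be sketched as follows. This assumes the usual MACS2 callpeak layout (comment lines starting with #, a blank line, a header row beginning with chr, and the peak name in column 10); the function name xls_to_bed is made up for this example.

```shell
# Convert a MACS2 NAME_peaks.xls table (1-based) to 4-column BED (0-based start).
# Assumption: the peak name is in column 10, as in callpeak output with a control sample.
xls_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF == 0 || $1 == "chr" { next }   # skip comments, blank lines, header row
         { print $1, $2 - 1, $3, $10 }' "$1"       # only the start shifts by one
}
```

For example, `xls_to_bed suz12_peaks.xls > suz12_peaks.bed` would turn a peak at chr1:101-300 into the BED interval chr1 100 300.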

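VI. Assessing consistency between replicates

One way to assess how consistent peaks are between replicate samples is IDR (Irreproducible Discovery Rate). IDR compares a pair of ranked lists of regions/peaks and computes values that reflect how reproducible they are. It is widely used in the ENCODE and modENCODE projects and is part of the ChIP-seq guidelines and standards.

Advantages of IDR:

It avoids choosing an initial threshold, which removes the problem that different peak callers are not directly comparable.
It does not depend on a threshold, so all regions/peaks are taken into account.
It relies only on the ranking of regions/peaks, so the input signal does not need to be calibrated or normalized.

For a detailed description of IDR see:
GitHub - nboley/idr: IDR (https://github.com/nboley/idr)
https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/blob/master/sessionV/lessons/06_handling-replicates.md#irreproducibility-discovery-rate-idr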

Notes on using IDR:

When IDR will be used, do not set the MACS2 peak-calling parameters too strictly, so that more peaks are identified.
Before running IDR, sort each MACS2 narrowPeak file by -log10(p-value).
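
The sorting step can be sketched as below, relying on the fact that -log10(p-value) is column 8 of a narrowPeak file. The function name sort_by_pvalue and the file names in the usage line are illustrative only.

```shell
# Sort a narrowPeak file by column 8 (-log10 p-value), most significant peaks first.
# LC_ALL=C keeps the decimal point handling independent of the user's locale.
sort_by_pvalue() {
    LC_ALL=C sort -k8,8nr "$1"
}
```

A call such as `sort_by_pvalue sample_Rep1_peaks.narrowPeak > sample_Rep1_sorted_peaks.narrowPeak` would produce the kind of sorted file that is passed to idr --samples.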

idr --samples sample_Rep1_sorted_peaks.narrowPeak sample_Rep2_sorted_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank p.value \
--output-file sample-idr \
--plot \
--log-output-file sample.idr.log

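The output files are:

sample-idr: the common peaks. The format resembles the input format with a few extra columns; the first 10 columns are standard narrowPeak and describe the peaks merged across the replicates.

sample.idr.log: the log file, which reports the proportion of peaks that pass IDR < 0.05.

sample-idr.png: the diagnostic plots produced by --plot.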
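
A possible follow-up is to keep only the reproducible peaks from sample-idr. According to the idr README, column 5 holds a scaled score, min(int(-125 * log2(IDR)), 1000), so an IDR of 0.05 corresponds to a score of 540. The sketch below filters on that column; the function name filter_idr and the default cutoff argument are assumptions made for this example.

```shell
# Keep peaks whose scaled IDR score (column 5) is at least the cutoff.
# Assumption: 540 = int(-125 * log2(0.05)), i.e. IDR <= 0.05, per the idr README.
filter_idr() {
    local infile="$1" min_score="${2:-540}"
    awk -v min="$min_score" '$5 >= min' "$infile"
}
```

For instance, `filter_idr sample-idr | wc -l` would count the peaks that pass the 0.05 cutoff, which can be compared with the proportion reported in the log file.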