ChIP-seq pipeline: notes from a literature study

Article link: https://blog.csdn.net/qq_29300341/article/details/52951239

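Paper summary: identifying the super-enhancer regions associated with oesophageal cancer oncogenes.

The full text of the paper is available at: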

http://gut.bmj.com/content/early/2016/05/10/gutjnl-2016-311818.full#DC1

1. Read the ChIP-seq sections of the paper

Chromatin immunoprecipitation sequencing data analysis

Chromatin immunoprecipitation sequencing (ChIP-Seq) reads were aligned to human reference genome (build GRCh37/hg19) using Bowtie Aligner. ChIP-Seq peaks were identified using MACS (Model-Based Analysis of ChIP-seq) by considering reads mapped only once at a given locus. Wiggle files were generated using read pileups for every 50 base pair bins. These wiggle files were normalised in terms of reads per million (rpm) by dividing tag counts in each bin by the total number of reads (in millions, duplicates removed). Wiggle files were converted into bigwig files using wigToBigWig tool (http://hgdownload.cse.ucsc.edu/admin/exe/) and visualised in Integrative Genomics Viewer (http://www.broadinstitute.org/igv/home). SEs were identified using ROSE (https://bitbucket.org/youngcomputation/rose). Closely spaced peaks (except those within 2 kb of TSS) within a range of 12.5 kb were merged together, followed by the measurement of input and H3K27Ac signals. These merged peaks were ranked by H3K27Ac signal and then classified into SEs or TEs. Both SEs and TEs were assigned to the nearest Ensemble genes. The ChIP sequencing files have been deposited into Gene Expression Omnibus (GSE76861).

Gene set enrichment analysis

Gene set enrichment analysis (GSEA) was performed using GSEA standalone desktop programme. An expression matrix was created containing expression values at zero and 6 h (upon 50 nM THZ1 treatment). All SE-associated genes were used as a ‘gene set database’. GSEA was run with parameter ‘Metric for ranking genes’ set to ‘log2_Ratio_of_classes’ to calculate enrichment score for SE-associated genes.
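As a rough illustration of the ranking metric named above, the sketch below computes a log2 ratio of class means for each gene and sorts the genes by it. The gene names, sample names and expression values are invented for illustration, and the choice of the 6 h class as numerator is an assumption; the GSEA desktop programme performs this step internally.

```python
import math

def log2_ratio_of_classes(expr, class_a, class_b):
    """Return {gene: log2(mean of class_a / mean of class_b)}.

    expr maps gene -> {sample name: expression value}; values must be > 0.
    """
    scores = {}
    for gene, values in expr.items():
        mean_a = sum(values[s] for s in class_a) / len(class_a)
        mean_b = sum(values[s] for s in class_b) / len(class_b)
        scores[gene] = math.log2(mean_a / mean_b)
    return scores

# Hypothetical expression matrix: 0 h versus 6 h of THZ1 treatment.
expr = {
    "GENE_A": {"t0": 8.0, "t6": 2.0},
    "GENE_B": {"t0": 4.0, "t6": 4.0},
    "GENE_C": {"t0": 1.0, "t6": 2.0},
}
scores = log2_ratio_of_classes(expr, class_a=["t6"], class_b=["t0"])
ranked = sorted(scores, key=scores.get)  # most down-regulated genes first
print(ranked)
```

GSEA then asks whether the genes of the gene set (here, the SE-associated genes) cluster at one end of such a ranked list.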

2. Download the data referenced in the paper

The data accession is GSE76861. Search for it in the GEO DataSets database and download it.

GSM2039110	TE7_H3K27Ac
GSM2039111	TE7_Input
GSM2039112	KYSE510_H3K27Ac

GSM2039113	KYSE510_Input

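The four samples above are the raw ChIP-seq data. Download them with Aspera, using the commands below.

nohup <command> &: runs the command in the background and keeps it running after the connection drops. In ascp, -Q selects the fair transfer policy and -T disables encryption, which together give the highest throughput; -l 100M caps the transfer rate at 100 Mbit/s. Resuming an interrupted transfer is controlled by a separate option, -k.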

nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/004/SRR3101254/SRR3101254.fastq.gz . &

nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/001/SRR3101251/SRR3101251.fastq.gz . &

nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/002/SRR3101252/SRR3101252.fastq.gz . &

nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/003/SRR3101253/SRR3101253.fastq.gz . &
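The four download commands follow a regular pattern, so they can also be generated. The sketch below rebuilds the ENA path from a run accession. It assumes 10-character accessions (such as SRR3101254), for which the sub-directory is "00" plus the last digit of the accession; the function names are hypothetical helpers, not part of any tool.

```python
def ena_fastq_path(run):
    """Build the ENA fasp path of a single-end run accession.

    Assumes a 10-character accession, where the sub-directory is
    '00' followed by the last digit of the accession.
    """
    if len(run) != 10:
        raise ValueError("this sketch only handles 10-character accessions")
    return "/vol1/fastq/{}/00{}/{}/{}.fastq.gz".format(run[:6], run[-1], run, run)

def ascp_command(run, key="~/asperaweb_id_dsa.openssh"):
    """Return the ascp command line for one run as a single string."""
    return ("ascp -QT -l 100M -i {} era-fasp@fasp.sra.ebi.ac.uk:{} ."
            .format(key, ena_fastq_path(run)))

for run in ["SRR3101251", "SRR3101252", "SRR3101253", "SRR3101254"]:
    print(ascp_command(run))
```

Each printed line can be wrapped in nohup ... & exactly as in the commands above.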

Then decompress the files:

gunzip SRR3101251.fastq.gz

gunzip SRR3101252.fastq.gz

gunzip SRR3101253.fastq.gz

gunzip SRR3101254.fastq.gz

Processed data provided by the paper:

GSE76861_RAW.tar	431.0 Mb	TAR (of BW, TXT)

Related link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76861

Extract with: tar -xvf GSE76861_RAW.tar

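3. Download the human reference genome

The paper states that reads were aligned to the human reference genome (build GRCh37/hg19) using Bowtie Aligner.

The Bowtie website provides a pre-built hg19 index.

Extract with: unzip hg19.ebwt.zip

4. Quality control with FastQC

FastQC command: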

fastqc -o . -t 5 -f fastq SRR3101251.fastq &

-t 5: run with 5 threads

Trailing &: run the command in the background

(Run this once for each of the four FASTQ files.)

4. Align with Bowtie

Bowtie command:

bowtie genome/hg19 -q reads/SRR3101251.fastq -m 1 -p 4 -S 2> SRR3101251.out > SRR3101251.sam

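-q: the input is a FASTQ file

-m 1: keep only reads that align to exactly one location

-5 1: optional, and not used in the command above. FastQC showed poor quality at the 5' (left) end, so one base can be trimmed there if desired.

-p 4: number of threads

-S: write the output in SAM format

2> SRR3101251.out: redirect the alignment summary printed on screen (standard error) to SRR3101251.out

> SRR3101251.sam: redirect standard output to SRR3101251.sam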

5. Call enriched regions (peaks) with MACS

nohup macs14 -t SRR3101251.sam -c SRR3101252.sam --format SAM --name "TE7" --keep-dup 1 --wig --single-profile --space=50 --diag &

nohup macs14 -t SRR3101253.sam -c SRR3101254.sam --format SAM --name "KYSE510" --keep-dup 1 --wig --single-profile --space=50 --diag &

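-t: the ChIP-seq treatment file

-c: the control (input) file

--format SAM: the input files are in SAM format

--name: prefix added to the output files ("TE7" and "KYSE510" in the commands above)

--keep-dup 1: how MACS treats duplicate reads at the same chromosomal position. The manual says the default of 1 performs best.

--wig and --space=50: write wiggle files at the resolution described in the paper:

Wiggle files were generated using read pileups for every 50 base pair bins.

The relevant options are described in the manual as follows: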

--keep-dup=KEEPDUPLICATES
It controls the MACS behavior towards duplicate tags at the exact same location -- the same coordination and the same strand. The 'auto' option makes MACS calculate the maximum tags at the exact same location based on binomial distribution using 1e-5 as pvalue cutoff; and the 'all' option keeps every tags. If an integer is given, at most this number of tags will be kept at the same location. Default: 1. To only keep one performs the best in terms of detecting enriched regions, from our internal study.

--bw=BW
Band width. This value is only used while building the shifting model. DEFAULT: 300

-g GSIZE, --gsize=GSIZE (defaults to human, so it does not need to be set here)
Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default: hs

-w, --wig
Whether or not to save extended fragment pileup at every WIGEXTEND bps into a wiggle file. When --single-profile is on, only one file for the whole genome is saved. WARNING: this process is time/space consuming!!

-B, --bdg
Whether or not to save extended fragment pileup at every bp into a bedGraph file. When it's on, -w, --space and --call-subpeaks will be ignored. When --single-profile is on, only one file for the whole genome is saved. WARNING: this process is time/space consuming!!

-S, --single-profile
When set, a single wiggle file will be saved for treatment and input. Default: False

--space=SPACE
The resolution for saving wiggle files, by default, MACS will save the raw tag count every 10 bps. Usable only with '--wig' option.
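To make the effect of an integer --keep-dup value concrete, here is a minimal sketch of the idea: at most N tags are kept per (chromosome, position, strand). It illustrates the rule described in the manual text and is not MACS's own code; the tag tuples are invented.

```python
from collections import defaultdict

def keep_duplicates(tags, max_per_site=1):
    """Keep at most max_per_site tags per (chrom, position, strand)."""
    seen = defaultdict(int)
    kept = []
    for tag in tags:
        seen[tag] += 1
        if seen[tag] <= max_per_site:
            kept.append(tag)
    return kept

# Hypothetical tags: three share the same position and strand.
tags = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr1", 150, "+"),
    ("chr1", 100, "+"),
]
print(keep_duplicates(tags))      # one tag per site and strand
print(keep_duplicates(tags, 2))   # up to two tags per site and strand
```

Note that a tag on the opposite strand at the same coordinate counts as a different location, matching "the same coordination and the same strand" in the manual.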

MACS output files, as described in the manual:

Output files

NAME_peaks.xls is a tabular file which contains information about called peaks. You can open it in excel and sort/filter using excel functions. Information include: chromosome name, start position of peak, end position of peak, length of peak region, peak summit position related to the start position of peak region, number of tags in peak region, -10*log10(pvalue) for the peak region (e.g. pvalue =1e-10, then this value should be 100), fold enrichment for this region against random Poisson distribution with local lambda, FDR in percentage. Coordinates in XLS is 1-based which is different with BED format.
NAME_peaks.bed is BED format file which contains the peak locations. You can load it to UCSC genome browser or Affymetrix IGB software.
NAME_summits.bed is in BED format, which contains the peak summits locations for every peaks. The 5th column in this file is the summit height of fragment pileup. If you want to find the motifs at the binding sites, this file is recommended.
NAME_negative_peaks.xls is a tabular file which contains information about negative peaks. Negative peaks are called by swapping the ChIP-seq and control channel.
NAME_model.r is an R script which you can use to produce a PDF image about the model based on your data. Load it to R by the following command. Then a pdf file NAME_model.pdf will be generated in your current directory. Note, R is required to draw this figure:

$ R --vanilla < NAME_model.r

NAME_treat/control_afterfiting.wig.gz files in NAME_MACS_wiggle directory are wiggle format files which can be imported to UCSC genome browser/GMOD/Affy IGB. The .bdg.gz files are in bedGraph format which can also be imported to UCSC genome browser or be converted into even smaller bigWig files.
NAME_diag.xls is the diagnosis report. First column is for various fold_enrichment ranges; the second column is number of peaks for that fc range; after 3rd columns are the percentage of peaks covered after sampling 90%, 80%, 70% ... and 20% of the total tags.
NAME_peaks.subpeaks.bed is a text file which IS NOT in BED format. This file is generated by PeakSplitter (<http://www.ebi.ac.uk/bertone/software/PeakSplitter_Cpp_usage.txt>) when --call-subpeaks option is set
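Two details of NAME_peaks.xls are easy to trip over: the score column is -10*log10(pvalue), and the coordinates are 1-based whereas BED is 0-based. The sketch below shows both conversions. The function names are hypothetical, and treating the XLS interval as inclusive at both ends is an assumption.

```python
import math

def pvalue_to_score(pvalue):
    """Convert a p-value to the -10*log10(pvalue) score reported by MACS."""
    return -10 * math.log10(pvalue)

def score_to_pvalue(score):
    """Invert the score back to a p-value."""
    return 10 ** (-score / 10.0)

def xls_to_bed_interval(start, end):
    """Convert a 1-based, inclusive interval to BED's 0-based, half-open form."""
    return start - 1, end

print(round(pvalue_to_score(1e-10), 6))   # the manual's example: pvalue 1e-10
print(xls_to_bed_interval(1001, 1500))
```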

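6. Write a program to normalise the wig files

This was done with a Python program.

Compute RPM for the TE7_H3K27Ac and KYSE510_H3K27Ac wig files (the wig files in the treat folder generated by MACS).

RPM formula: (read count at a position ÷ total read count across all chromosomes) × 1,000,000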
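The original program is not reproduced here, so the following is only a minimal sketch of the RPM formula above, not the author's script. It assumes the wig data lines have the variableStep form "position value", passes track and declaration lines through unchanged, and leaves the total read count to the caller.

```python
def normalise_wig(lines, total_reads):
    """Yield wig lines with data values rescaled to reads per million.

    Assumes variableStep data lines of the form "<position> <value>";
    every other line (track, variableStep declarations) is passed through.
    """
    scale = 1e6 / total_reads
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            yield "{}\t{:.4f}".format(fields[0], float(fields[1]) * scale)
        else:
            yield line.rstrip("\n")

# Hypothetical input: two 50 bp bins and a library of 2,000,000 reads.
wig = [
    'track type=wiggle_0 name="TE7_treat"',
    "variableStep chrom=chr1 span=50",
    "1\t4",
    "51\t10",
]
for out_line in normalise_wig(wig, total_reads=2000000):
    print(out_line)
```

For a real file, open the wig with open(...) and write each yielded line to the output file; because the function is a generator, the whole file never has to be held in memory.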

7. Convert formats with wigToBigWig

Download the fetchChromSizes program:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/fetchChromSizes

chmod +x fetchChromSizes

Fetch the chromosome sizes for hg19, which wigToBigWig needs:

fetchChromSizes hg19 >hg19.chrom.sizes

Change the value for chrM to 16750.
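The manual edit of the chrM entry can also be scripted. A minimal sketch, assuming the file holds one whitespace-separated "name size" pair per line; set_chrom_size is a hypothetical helper and the two input lines are example values.

```python
def set_chrom_size(lines, chrom, new_size):
    """Return chrom.sizes lines with the size of one chromosome replaced."""
    updated = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == chrom:
            updated.append("{}\t{}".format(chrom, new_size))
        else:
            updated.append(line.rstrip("\n"))
    return updated

# Example lines in the chrom.sizes layout.
sizes = ["chr1\t249250621", "chrM\t16571"]
print(set_chrom_size(sizes, "chrM", 16750))
```

To apply it to the real file, read hg19.chrom.sizes into a list of lines, call the function, and write the result back with one entry per line.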

Download the wigToBigWig program:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/wigToBigWig

Convert the wiggle file to bigWig:

wigToBigWig KYSE510_control_afterfiting_all.wig hg19.chrom.sizes KYSE510_control_afterfiting_all.bw

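8. Install IGV (Integrative Genomics Viewer) to visualise the results

Download the Windows version from the IGV website (http://software.broadinstitute.org/software/igv/download) and follow the installer prompts.

Launch it by double-clicking igv.jar, or run the .bat file as administrator.

First load the hg19 genome, then load the two normalised bw files.

9. Visualise with deepTools

Installation. Requirements: Python 2.7, numpy, scipy installed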

Commands:

$ cd ~

$ export PYTHONPATH=$PYTHONPATH:~/lib/python2.7/site-packages

$ export PATH=$PATH:~/bin:~/.local/bin

On Ubuntu, pip can be installed with apt.

If pip is not already available, install it with:

$ easy_install --prefix=~ pip

Many problems can come up at this step; searching for the error message usually turns up a fix.

Install deepTools and dependencies with pip:

$ pip install --user deeptools

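10. Install ROSE to identify super-enhancers

ROSE can be downloaded from http://younglab.wi.mit.edu/super_enhancer_code.html, which also offers 2.7 GB of example data.

Data preprocessing

(1) Install samtools and convert the SAM files to BAM.

This step is needed for all four files: SRR3101251.sam, SRR3101252.sam, SRR3101253.sam and SRR3101254.sam.

Convert SAM to BAM and sort (samtools 0.1.x syntax; samtools 1.x expects the output name via -o):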

samtools view -bS SRR3101251.sam | samtools sort - SRR3101251_sorted

Index the BAM file:

samtools index SRR3101251_sorted.bam SRR3101251_sorted.bai
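Before converting the SAM files, it can be useful to check how many reads bowtie actually aligned, for example to obtain the total used in the RPM normalisation. A minimal sketch that reads the SAM text directly: bit 0x4 of the FLAG field marks an unmapped read. The example records are invented, and whether the count matches the total the paper used (duplicates removed) is not checked here.

```python
def count_sam_reads(lines):
    """Count (mapped, unmapped) reads in an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # FLAG bit 0x4: the read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped

# Hypothetical SAM content: one header line, two aligned reads, one unaligned.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr2\t500\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_sam_reads(example))
```

For a real file, pass an open file handle, e.g. count_sam_reads(open("SRR3101251.sam")).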

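(2) Prepare a gff file giving the peak positions (note: this gff file is not a gene annotation file).

NAME_peaks.bed and NAME_summits.bed are the MACS output files that store peak positions. NAME_summits.bed holds only the summit positions, so the required fields are extracted from NAME_peaks.bed.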

awk '{print $1"\t"$4"\t"".""\t"$2"\t"$3"\t"".""\t"".""\t"".""\t"$4}' KYSE510_peaks.bed>KYSE510_peaks.gff

awk '{print $1"\t"$4"\t"".""\t"$2"\t"$3"\t"".""\t"".""\t"".""\t"$4}' TE7_peaks.bed>TE7_peaks.gff

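Alternatively, pass the MACS14 outputs TE7_peaks.bed and KYSE510_peaks.bed directly in place of the gff files; ROSE converts them automatically.

Note: the ROSE manual describes the gff file as follows.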

.gff file of constituent enhancers previously identified (gff format ref: https://genome.ucsc.edu/FAQ/FAQformat.html#format3).

.gff must have the following columns:

1: chromosome (chr#)

2: unique ID for each constituent enhancer region

4: start of constituent

5: end of constituent

7: strand (+,-,.)

9: unique ID for each constituent enhancer region

NOTE: if value for column 2 and 9 differ, value in column 2 will be used
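The column layout above is what the awk one-liners produce. For readers who prefer Python, an equivalent sketch for one BED line follows; bed_to_rose_gff is a hypothetical helper, and the peak name in the example is illustrative.

```python
def bed_to_rose_gff(bed_line):
    """Convert one peaks.bed line into the 9-column gff layout ROSE expects.

    Columns: chrom, ID, '.', start, end, '.', strand ('.'), '.', ID
    (the same field order as the awk commands above).
    """
    chrom, start, end, name = bed_line.rstrip("\n").split("\t")[:4]
    return "\t".join([chrom, name, ".", start, end, ".", ".", ".", name])

print(bed_to_rose_gff("chr1\t100\t200\tMACS_peak_1\t55.3"))
```

Like the awk version, this copies the BED coordinates unchanged and uses the peak name for both ID columns (2 and 9).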

Run ROSE

From the paper: SEs were identified using ROSE (https://bitbucket.org/youngcomputation/rose). Closely spaced peaks (except those within 2 kb of TSS) within a range of 12.5 kb were merged together, followed by the measurement of input and H3K27Ac signals. These merged peaks were ranked by H3K27Ac signal and then classified into SEs or TEs. Both SEs and TEs were assigned to the nearest Ensemble genes.

nohup python ROSE_main.py -g HG19 -i TE7_peaks.gff -r SRR3101251_sorted.bam -c SRR3101252_sorted.bam -o 5-ROSE-result/TE7/ -s 12500 -t 2000 2>5-ROSE-result/TE7/log.txt &

nohup python ROSE_main.py -g HG19 -i KYSE510_peaks.gff -r SRR3101253_sorted.bam -c SRR3101254_sorted.bam -o 5-ROSE-result/KYSE510/ -s 12500 -t 2000 2>5-ROSE-result/KYSE510/log.txt &

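-g HG19: the genome build; use HG19 here

-i: the gff file

-r: the BAM file of the ChIP (treatment) sample

-c: the BAM file of the control sample

-o: the output directory

-s 12500: merge (stitch) peaks that lie within 12,500 bp of each other

-t 2000: exclude regions within 2,000 bp of a TSS (transcription start site), to account for promoter signal

From the manual: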

From within root directory:

python ROSE_main.py -g GENOME_BUILD -i INPUT_CONSTITUENT_GFF -r RANKING_BAM -o OUTPUT_DIRECTORY [optional: -s STITCHING_DISTANCE -t TSS_EXCLUSION_ZONE_SIZE -c CONTROL_BAM]

Required parameters:

GENOME_BUILD: one of hg18, hg19, mm8, mm9, or mm10 referring to the UCSC genome build used for read mapping

INPUT_CONSTITUENT_GFF: .gff file (described above) of regions that were previously calculated to be enhancers. I.e. Med1-enriched regions identified using MACS.

RANKING_BAM: .bam file to be used for ranking enhancers by density of this factor. I.e. Med1 ChIP-Seq reads.

OUTPUT_DIRECTORY: directory to be used for storing output.

Optional parameters:

STITCHING_DISTANCE: maximum distance between two regions that will be stitched together (Default: 12.5kb)

TSS_EXCLUSION_ZONE_SIZE: exclude regions contained within +/- this distance from TSS in order to account for promoter biases (Default: 0; recommended if used: 2500). If this value is 0, will not look for a gene file.

CONTROL_BAM: .bam file to be used as a control. Subtracted from the density of the RANKING_BAM. I.e. Whole cell extract reads.
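The stitching step can be pictured as a simple interval merge. The sketch below joins regions on one chromosome whose gap is at most the stitching distance. It only illustrates the idea: it ignores the TSS exclusion zone, and ROSE's own code may measure distance or handle edge cases differently. The peak coordinates are invented.

```python
def stitch_regions(regions, max_gap=12500):
    """Merge (start, end) regions on one chromosome whose gap is <= max_gap."""
    stitched = []
    for start, end in sorted(regions):
        if stitched and start - stitched[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            stitched[-1][1] = max(stitched[-1][1], end)
        else:
            stitched.append([start, end])
    return [tuple(region) for region in stitched]

# Hypothetical peaks on a single chromosome.
peaks = [(1000, 2000), (10000, 11000), (30000, 31000), (40000, 41000)]
print(stitch_regions(peaks))
```

Here the first two peaks are 8,000 bp apart and are stitched, the gap of 19,000 bp to the third peak keeps it separate, and the last two are stitched again.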

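Gene annotation

ROSE ships with a program for annotating enhancers:

-i: the file produced by ROSE_main.py that lists the enhancers and super-enhancers, i.e. AllEnhancers.table.txt

-g: the genome build name

-o: the output directory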

python ROSE_geneMapper.py -i KYSE510_peaks_AllEnhancers.table.txt -g HG19 -o 5-ROSE-result/annotation_KYSE510/

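Interpreting the ROSE results

(1) TE7_peaks_AllEnhancers.table.txt and KYSE510_peaks_AllEnhancers.table.txt

(2) TE7_peaks_Plot_points.png and KYSE510_peaks_Plot_points.png

The x-axis is the enhancer rank, taken from the enhancerRank column of AllEnhancers.table.txt; the higher an enhancer ranks, the more likely it is a super-enhancer.

The y-axis is the signal; the higher it is, the more likely the region is a super-enhancer. It is computed directly as the SRR3101251_sorted.bam column minus the SRR3101252_sorted.bam column of AllEnhancers.table.txt.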
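The plot can be reproduced in outline as follows. The sketch computes the signal as treatment minus control, then picks a cutoff at the point where a line with the slope of the plot's diagonal is tangent to the sorted signal curve. This is written in the spirit of the cutoff used by ROSE and is not its actual code, so the exact cutoff ROSE reports may differ; the enhancer names and values are invented.

```python
def tangent_cutoff(signals):
    """Return the signal at the point where a diagonal-slope line is tangent.

    Negative signals (more control than treatment) are set to zero first.
    """
    v = sorted(max(s, 0.0) for s in signals)
    slope = (v[-1] - v[0]) / len(v)
    # A line of this slope through point i has intercept v[i] - slope * i;
    # the tangent point is the one whose line lies lowest.
    intercepts = [y - slope * (i + 1) for i, y in enumerate(v)]
    return v[intercepts.index(min(intercepts))]

# Hypothetical (treatment, control) signals per stitched enhancer.
enhancers = {
    "e1": (5, 5), "e2": (3, 2), "e3": (4, 2), "e4": (6, 3),
    "e5": (8, 4), "e6": (60, 10), "e7": (110, 10),
}
signal = {name: t - c for name, (t, c) in enhancers.items()}
cutoff = tangent_cutoff(signal.values())
supers = sorted(name for name, s in signal.items() if s > cutoff)
print(cutoff, supers)
```

Enhancers whose signal exceeds the cutoff are the super-enhancer candidates; the rest correspond to typical enhancers.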

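Download and install GSEA

GSEA is available as a desktop application that runs on Java; it can be downloaded after registering.

Comparison with the paper's results

1. Comparing statistics of the normalised wig files

Overall, my values are roughly 7 to 8 times smaller than those in the paper.

2. Comparing the IGV visualisations

The peaks broadly match those in the paper.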

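PS: problems and explanations

1. GRCh37 versus hg19

Question: on the Bowtie website, the iGenomes link in the index download section lists both GRCh37 and hg19. Is there any difference between the two?

Answer: the genomic content of the two is the same, except for the mitochondrial contig.

I believe the genomic content for the two is identical, except for the mitochondrial contig. See https://www.biostars.org/p/123767/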

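2. IGV fails to load the genome

Genomes -> Load Genome From Server fails with: Warning: could not connect to the genome server

Cause: possibly a firewall blocking the connection.

Fix: the error message says it cannot connect to http://igv.broadinstitute.org/genomes/genomes.txt

Opening that URL in a browser shows the download link for hg19: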

Human hg19 http://s3.amazonaws.com/igv.broadinstitute.org/genomes/hg19.genome hg19

Download it, then load the genome from the local file.

3. Problems installing deepTools

Common errors

A:

Error: fatal error: Python.h: No such file or directory compilation terminated

Answer: Looks like you haven't properly installed the header files and static libraries for python dev. Use your package manager to install them system-wide. For apt (ubuntu, debian...):

sudo apt-get install python-dev  # for python2.x installs
sudo apt-get install python3-dev  # for python3.x installs

For yum (centos, redhat, fedora...):

 
sudo yum install python-devel
 

B:

Error: `curl-config' not found -- please install the libcurl development files

On Ubuntu: apt-get install libcurl4-openssl-dev

In short, whatever error or missing dependency comes up, search for it and install what is needed.
