背景介绍:
癌症病人基因组内有许许多多的Indel,这些Indel与发病机制有着密切关系,而通常的做法是对癌症病人进行基因组测序或者外显子测序,找出相关Indel。
而Chip-seq数据可以找出Enhancer,对Chip-seq数据进行处理,可以将Indel与Enhancer进行结合分析
方法介绍:
简单地说,就是follow最近的高分文章《
Small genomic insertions form enhancers that
misregulate oncogenes》PMID:28181482
用大牛的方法,跑自己的数据,最后如果搜集足够多的癌症病人Chip-seq数据,能找出更多新的enhancer区域内的插入缺失(Indel)
大牛的脚本在:
Chip-Seq display
先做Chip-seq display一段
ChIP-Seq display.Reads were aligned to the hg19 revision of the human reference
genome using bowtie with parameters –best –k 2 –m 2 –sam and –l set to read
length. Read pileup in 50 bp bins was determined using MACS with parameters –
w –S –space=50 –shiftsize=200 –nomodel. WIG file output from MACS was
visualized in the UCSC genome browser.
首先,下载数据
LY4_H3K27AC Chromatin immunoprecipitation against H3K27Ac
ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR155/005/SRR1554605/SRR1554605.fastq.gz .
LY4_WCE Whole cell extract input control for ChIP
ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR155/006/SRR1554606/SRR1554606.fastq.gz .
SUDHL6_H3K27AC Chromatin immunoprecipitation against H3K27Ac
ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR155/009/SRR1554599/SRR1554599.fastq.gz .
SUDHL6_WCE Whole cell extract input control for ChIP
ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR155/000/SRR1554600/SRR1554600.fastq.gz .
接着,fastqc质量检测
fastqc -o . -f fastq -t 4 SRR1554605.fastq &
fastqc -o . -f fastq -t 4 SRR1554606.fastq &
fastqc -o . -f fastq -t 4 SRR1554599.fastq &
fastqc -o . -f fastq -t 4 SRR1554600.fastq &
下载bowtie hg19索引
基因组索引从bowtie官网下载
Pre-built indexes H. sapiens, UCSC hg19 2.7 GB
bowtie序列比对
bowtie genome/hg19 -q SRR1554605.fastq -k 2 --best -m 2 --threads 4 -l 40 -S 2> SRR1554605.out > SRR1554605.sam
bowtie genome/hg19 -q SRR1554606.fastq -k 2 --best -m 2 --threads 4 -l 40 -S 2> SRR1554606.out > SRR1554606.sam
bowtie genome/hg19 -q SRR1554599.fastq -k 2 --best -m 2 --threads 4 -l 40 -S 2> SRR1554599.out > SRR1554599.sam
bowtie genome/hg19 -q SRR1554600.fastq -k 2 --best -m 2 --threads 4 -l 40 -S 2> SRR1554600.out > SRR1554600.sam
-m 2 Suppress all alignments for a particular read or pair if more than 2 reportable alignments exist for it.
--threads 4 4个线程
--best hits guaranteed best stratum; ties broken by quality
-k 2 report up to 2 good alignments per read (default: 1)
-l 40 seed length for -n (default: 28) 文献中-l set to read length
-S/--sam write hits in SAM format
macs建双峰模型
macs14 -t SRR1554605.sam -c SRR1554606.sam --format SAM --name "LY4" --wig --single-profile --space=50 --shiftsize=200 --nomodel
macs14 -t SRR1554599.sam -c SRR1554600.sam --format SAM --name "SUDHL6" --wig --single-profile --space=50 --shiftsize=200 --nomodel
-w, --wig Whether or not to save extended fragment pileup at
every WIGEXTEND bps into a wiggle file. When --single-
profile is on, only one file for the whole genome is
saved. WARNING: this process is time/space consuming!!
-S, --single-profile When set, a single wiggle file will be saved for
treatment and input.
--space=SPACE The resoluation for saving wiggle files, by default,
MACS will save the raw tag count every 10 bps. Usable
only with '--wig' option.
--shiftsize=SHIFTSIZE
The arbitrary shift size in bp. When nomodel is true,
MACS will use this value as 1/2 of fragment size.
DEFAULT: 100
--nomodel Whether or not to build the shifting model. If True,
MACS will not build model. by default it means
shifting size = 100, try to set shiftsize to change
it. DEFAULT: False
放到IGV中对ASNS基因进行可视化。
至此,文章中Chip-Seq display一段算是完成了
数据准备
接下来就是这篇文章的精髓了,从chip-seq数据里面寻找indel
首先是
数据准备
下载bowtie hg19索引和
bowtie2 hg19索引
基因组从bowtie和bowtie2官网下载
H. sapiens, UCSC hg19 3.5 GB
or: part 1 (1.5 GB), part 2 (650 MB), part 3 (1.5 GB)
H. sapiens, UCSC hg19 2.7 GB
or: part 1 - 1.7 GB, part 2 - 1.0 GB
colorspace: full, or part 1, part 2
blat程序所需的染色体序列,同样下载UCSC的
进入http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/
下载chromFa.tar.gz 20-Mar-2009 09:21 905M
寻找Indel
主要还是根据大牛的脚本走
顺序是:首先CANCERS/run_everything.sh ,然后是Scaffold_Dev/run_everything.sh
这两步,基本上把文章的精髓给做完了,把所有的INDEL都找出来了。
两者方法差不多,都是基于一个思路:先用bowtie1比对,把没比对上的reads调整参数用bowtie2再比对,然后通过perl脚本以及一系列awk,sed命令进行处理,得到Indel
而Scaffold_Dev在处理时候,将第一步没比对上的reads用Edena进行了一定的拼接,所以最后找出来的Indel会长一些(其实测序reads就40bp,也不会太长)
接下来做dbSNP和Enhancers文件夹里面的内容
这两步会将找出来的Indel与“SNP以及Chip-seq分析出来的enhancer”进行overlap,整合分析
结果
由于涉及有关科研课题,具体结果就没有啦~等着以后看PubMed吧
不过也用文章的数据,将这篇文章的结果重复出来啦,特别是Jurkat细胞系里那个GTTAGGAAACGG 12bp的插入找出来了。
脚本
最后也写了一个脚本,理论上搜集好一堆Chip-seq数据,安装配置好相关软件,就能自动运行分析。有些Indel地方因为reads coverage比较高,IGV可视化的时候也能看出来有很多reads能证明这个Indel是真实可靠的,不过最有力的证据还是要sanger测序进行相关实验,结合相关gene解释生物学问题。