IGV是本地浏览测序数据最为强大的基因组浏览器,是高通量测序分析的一个重要的可视化工具,它能直观地展示突变位点,查看有无新转录本或新的可变剪接形式,查看peak的可信度,上下游基因,区域保守性,重复原件以及蛋白结合区域等。IGV支持多种不同类型的输入格式和不同的现实方式,但IGV需要在java环境下运行,所以在安装IGV之前,需要确保你的计算机已经安装好了java。
下面是生信技能树jimmy师兄关于使用IGV制作人所有基因组的染色体坐标的笔记:
首先从gencode数据库里面可以下载所有的gtf文件
下载代码是:
mkdir -p ~/reference/gtf/gencode
cd ~/reference/gtf/gencode
## https://www.gencodegenes.org/releases/current.html
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.2wayconspseudos.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.long_noncoding_RNAs.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.polyAs.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz
## https://www.gencodegenes.org/releases/25lift37.html
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.HGNC.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.EntrezGene.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.RefSeq.gz
然后写脚本得到基因的染色体还有起始终止坐标
代码是:
zcat gencode.v25.long_noncoding_RNAs.gtf.gz |perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >lncRNA.hg38.position
zcat gencode.v25.2wayconspseudos.gtf.gz |perl -alne '{next unless $F[2] eq "transcript" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >pseudos.hg38.position
zcat gencode.v25.annotation.gtf.gz| grep protein_coding |perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >protein_coding.hg38.position
zcat gencode.v25.annotation.gtf.gz|perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >allGene.hg38.position
zcat gencode.v25lift37.annotation.gtf.gz | grep protein_coding |perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >protein_coding.hg19.position
zcat gencode.v25lift37.annotation.gtf.gz | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print "$F[0]\t$F[3]\t$F[4]\t$1" }' >allGene.hg19.position
PS:这里面有一个小问题,gencode里面的数据有着HAVANA和ENSEMBL的区别,尤其是在hg38里面,需要区别对待!
查看基因的坐标信息bed文件如
head ~/reference/gtf/gencode/protein_coding.hg19.position
chr1 69091 70008 OR4F5
chr1 367640 368634 OR4F29
chr1 621096 622034 OR4F16
chr1 859308 879961 SAMD11
chr1 879584 894689 NOC2L
chr1 895967 901095 KLHL17
chr1 901877 911245 PLEKHN1
chr1 910584 917473 PERM1
chr1 934342 935552 HES4
chr1 936518 949921 ISG15
对bed格式的基因组区间文件进行基因注释
首先得到bed文件
比如CNV文本文件,bed格式的,如下:
head Features.bed
chr1 3218610 95674710 TCGA-3C-AAAU-10A-01D-A41E-01 53225 0.0055
chr1 95676511 95676518 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.6636
chr1 95680124 167057183 TCGA-3C-AAAU-10A-01D-A41E-01 24886 0.0053
chr1 167057495 167059336 TCGA-3C-AAAU-10A-01D-A41E-01 3 -1.0999
chr1 167059760 181602002 TCGA-3C-AAAU-10A-01D-A41E-01 9213 -8e-04
chr1 181603120 181609567 TCGA-3C-AAAU-10A-01D-A41E-01 6 -1.2009
chr1 181610685 201473647 TCGA-3C-AAAU-10A-01D-A41E-01 12002 0.0055
chr1 201474400 201474544 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.4235
chr1 201475220 247813706 TCGA-3C-AAAU-10A-01D-A41E-01 29781 -4e-04
避免重复造轮子,我就用我擅长的bedtools解决这个需求吧,命令很简单,如下:
bedtools intersect -a Features.bed -b ~/reference/gtf/gencode/protein_coding.hg19.position -wa -wb \
| bedtools groupby -i - -g 1-4 -c 10 -o collapse
注释结果,我挑了几个可以看的给大家,可以看到,每个CNV片段都注释到了对应的基因,有些特别大的片段,会被注释到非常多的基因。
chr8 42584924 42783715 TCGA-5T-A9QA-01A-11D-A41E-01 CHRNB3,CHRNA6,THAP1,RNF170,HOOK3
chr8 42789728 42793594 TCGA-5T-A9QA-01A-11D-A41E-01 HOOK3
chr8 42797957 42933372 TCGA-5T-A9QA-01A-11D-A41E-01 RP11-598P20.5,FNTA,HOOK3
chr8 70952673 70964372 TCGA-5T-A9QA-01A-11D-A41E-01 PRDM14
chr10 42947970 43833200 TCGA-5T-A9QA-01A-11D-A41E-01 BMS1,RET,RASGEF1A,ZNF33B,CSGALNACT2
chr10 106384615 106473355 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 106478366 107298256 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 117457285 117457859 TCGA-5T-A9QA-01A-11D-A41E-01 ATRNL1
chr11 68990173 69277078 TCGA-5T-A9QA-01A-11D-A41E-01 MYEOV
chr11 76378708 76926535 TCGA-5T-A9QA-01A-11D-A41E-01 LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5
批量IGV截图脚本
这里我想把我的全基因组重测续数据载入到IGV并且根据所有的基因查看并且截图保存。其中IGV脚本很简单,格式如下:
goto chr1:25158588-25161609
snapshot chr1_25158688_25161509_slop100.png
goto chr1:26489713-26490238
snapshot chr1_26489813_26490138_slop100.png
参考:https://software.broadinstitute.org/software/igv/batch
就是进入基因的区域,然后截图保存即可。但是要注意IGV对bam文件的分辨率上线是37kb。但是人类的近2万个编码蛋白的基因里面有接近一半都超过了这个上线,所以这些基因都需要截多张图!
制作这个IGV脚本的代码如下:
cat ~/reference/gtf/gencode/protein_coding.hg19.position| \
perl -alne '{if (($F[2]-$F[1])<30000){print "goto $F[0]:$F[1]-$F[2]\nsnapshot $F[3].png"} \
else{$n=int(($F[2]-$F[1])/30000)+1;foreach (1..$n){$start=$F[1]+($_-1)*30000;$end=$start+30000;print "goto $F[0]:$start-$end\nsnapshot $F[3]_$_.png" }}}' \
>igv_all_gene_snapshot.txt
goto chr1:7745384-7775384
snapshot CAMTA1_31.png
goto chr1:7775384-7805384
snapshot CAMTA1_32.png
goto chr1:7805384-7835384
snapshot CAMTA1_33.png
goto chr1:7831329-7841492
snapshot VAMP3.png
goto chr1:7844380-7874380
snapshot PER3_1.png
goto chr1:7874380-7904380
snapshot PER3_2.png
goto chr1:7904380-7934380
snapshot PER3_3.png
goto chr1:7903143-7913572
snapshot UTS2.png
goto chr1:7975954-8003225
snapshot TNFRSF9.png
goto chr1:8014351-8044351
snapshot PARK7_1.png
goto chr1:8044351-8074351
snapshot PARK7_2.png
goto chr1:8064464-8086368
snapshot ERRFI1.png
可以看到CAMTA1这个基因长约1M,所以会被切割成33个片段才出图。
生信技能树论坛