这里主要是根据课题需要使用其中的部分功能
1.annovar下载安装
https://doc-openbio.readthedocs.io/projects/annovar/en/latest/
annovar官方网站的的下载需要使用邮箱注册后才可下载,这里为了使用方便我将刚下载的annovar上传到了网盘链接:https://pan.baidu.com/s/1T0PhTP1TMADsoFApTUPUOw
提取码:uw7f
2.安装
annovar解压后可以直接使用,也可以将其加入环境变量
tar -zxvf annovar.latest.tar.gz
cd annovar && ls
# annotate_variation.pl coding_change.pl convert2annovar.pl example humandb retrieve_seq_from_fasta.pl table_annovar.pl variants_reduction.pl
#加入到全局环境变量中
vi /etc/profile
#在最后一行加入export PATH="$PATH:/home/huen/packages/annovar"
source /etc/profile
3.annovar 注释
1:annovar的注释可以分为三种,基于基因的注释,基于区域的注释,基于过2:滤的注释,当然也可以一次性完成三种注释;
3:annovar 注释之前需要对文件格式进行转换,当然table_annovar.pl可以使用vcf文件作为输入文件,只需要使用-vcfinput;
4:需要注意的是,在使用convert2annovar.pl转换格式时,会有一个QC过滤的过程,所以建议使用该程序进行过滤和格式转换
下面进行格式 转换
1.看一下数据
root@huen-PC:/media/huen/Entertainment/mutect_75# ls *.vcf
1159-1.filtered.vcf 1206-1.filtered.vcf 1229-1.filtered.vcf 1283-1.filtered.vcf 1354-1.filtered.vcf 1395-1.filtered.vcf 1535-1.filtered.vcf 1787-1.filtered.vcf 1881-1.filtered.vcf
1160-1.filtered.vcf 1211-1.filtered.vcf 1230-1.filtered.vcf 1284-1.filtered.vcf 1368-1.filtered.vcf 1397-1.filtered.vcf 1564-1.filtered.vcf 1810-1.filtered.vcf
1164-1.filtered.vcf 1215-1.filtered.vcf 1232-1.filtered.vcf 1291-1.filtered.vcf 1379-1.filtered.vcf 1398-1.filtered.vcf 1569-1.filtered.vcf 1819-1.filtered.vcf
1175-1.filtered.vcf 1217-1.filtered.vcf 1241-1.filtered.vcf 1299-1.filtered.vcf 1380-1.filtered.vcf 1405-1.filtered.vcf 1607-1.filtered.vcf 1821-1.filtered.vcf
1185-1.filtered.vcf 1218-1.filtered.vcf 1250-1.filtered.vcf 1315-1.filtered.vcf 1384-1.filtered.vcf 1411-1.filtered.vcf 1684-1.filtered.vcf 1847-1.filtered.vcf
1186-1.filtered.vcf 1219-1.filtered.vcf 1252-1.filtered.vcf 1321-1.filtered.vcf 1387-1.filtered.vcf 1438-1.filtered.vcf 1713-1.filtered.vcf 1848-1.filtered.vcf
1188-1.filtered.vcf 1225-1.filtered.vcf 1263-1.filtered.vcf 1325-1.filtered.vcf 1388-1.filtered.vcf 1456-1.filtered.vcf 1759-1.filtered.vcf 1853-1.filtered.vcf
1191-1.filtered.vcf 1226-1.filtered.vcf 1273-1.filtered.vcf 1326-1.filtered.vcf 1389-1.filtered.vcf 1471-1.filtered.vcf 1765-1.filtered.vcf 1876-1.filtered.vcf
1205-1.filtered.vcf 1227-1.filtered.vcf 1278-1.filtered.vcf 1351-1.filtered.vcf 1393-1.filtered.vcf 1511-1.filtered.vcf 1773-1.filtered.vcf 1879-1.filtered.vcf
2.gatk mutect2的filter会对call到的数据进行过滤,通过过滤会被打上PASS标签,所以接下来首先提取出其中的PASS mutation
mkdir PASS
for i in $(ls *vcf)
do
grep PASS $i >"./PASS/"${i%%\.*}"PASS.vcf"
done
得到以下文件
1159-1PASS.vcf 1191-1PASS.vcf 1219-1PASS.vcf 1241-1PASS.vcf 1284-1PASS.vcf 1351-1PASS.vcf 1388-1PASS.vcf 1411-1PASS.vcf 1569-1PASS.vcf 1787-1PASS.vcf 1876-1PASS.vcf
1160-1PASS.vcf 1205-1PASS.vcf 1225-1PASS.vcf 1250-1PASS.vcf 1291-1PASS.vcf 1354-1PASS.vcf 1389-1PASS.vcf 1438-1PASS.vcf 1607-1PASS.vcf 1810-1PASS.vcf 1879-1PASS.vcf
1164-1PASS.vcf 1206-1PASS.vcf 1226-1PASS.vcf 1252-1PASS.vcf 1299-1PASS.vcf 1368-1PASS.vcf 1393-1PASS.vcf 1456-1PASS.vcf 1684-1PASS.vcf 1819-1PASS.vcf 1881-1PASS.vcf
1175-1PASS.vcf 1211-1PASS.vcf 1227-1PASS.vcf 1263-1PASS.vcf 1315-1PASS.vcf 1379-1PASS.vcf 1395-1PASS.vcf 1471-1PASS.vcf 1713-1PASS.vcf 1821-1PASS.vcf
1185-1PASS.vcf 1215-1PASS.vcf 1229-1PASS.vcf 1273-1PASS.vcf 1321-1PASS.vcf 1380-1PASS.vcf 1397-1PASS.vcf 1511-1PASS.vcf 1759-1PASS.vcf 1847-1PASS.vcf
1186-1PASS.vcf 1217-1PASS.vcf 1230-1PASS.vcf 1278-1PASS.vcf 1325-1PASS.vcf 1384-1PASS.vcf 1398-1PASS.vcf 1535-1PASS.vcf 1765-1PASS.vcf 1848-1PASS.vcf
1188-1PASS.vcf 1218-1PASS.vcf 1232-1PASS.vcf 1283-1PASS.vcf 1326-1PASS.vcf 1387-1PASS.vcf 1405-1PASS.vcf 1564-1PASS.vcf 1773-1PASS.vcf 1853-1PASS.vcf
3.格式 转换bash anno_vcf4_to_aviinput.sh
root@huen-PC:/media/huen/Entertainment/mutect_75/PASS# cat anno_vcf4_to_aviinput.sh
all_number=73
count=0
mkdir avinput_files
for i in $(ls *vcf)
do
count=$((count+1))
remain=$((73-count))
outname="./avinput_files/"${i%%\.*}
convert2annovar.pl -format vcf4 $i > ${outname}.avinput 2>convert.log
echo -en $count $remain"\r"
done
4.进行table_annoval的注释,这里使用的是软件自带的refGene,同时使用shell的有名管道进行多线程的操作
这是格式转换后的文件
1159-1PASS.avinput 1205-1PASS.avinput 1226-1PASS.avinput 1263-1PASS.avinput 1321-1PASS.avinput 1384-1PASS.avinput 1405-1PASS.avinput 1569-1PASS.avinput 1810-1PASS.avinput 1881-1PASS.avinput
1160-1PASS.avinput 1206-1PASS.avinput 1227-1PASS.avinput 1273-1PASS.avinput 1325-1PASS.avinput 1387-1PASS.avinput 1411-1PASS.avinput 1607-1PASS.avinput 1819-1PASS.avinput table_annoval.sh
1164-1PASS.avinput 1211-1PASS.avinput 1229-1PASS.avinput 1278-1PASS.avinput 1326-1PASS.avinput 1388-1PASS.avinput 1438-1PASS.avinput 1684-1PASS.avinput 1821-1PASS.avinput table_annovar_refGene
1175-1PASS.avinput 1215-1PASS.avinput 1230-1PASS.avinput 1283-1PASS.avinput 1351-1PASS.avinput 1389-1PASS.avinput 1456-1PASS.avinput 1713-1PASS.avinput 1847-1PASS.avinput
1185-1PASS.avinput 1217-1PASS.avinput 1232-1PASS.avinput 1284-1PASS.avinput 1354-1PASS.avinput 1393-1PASS.avinput 1471-1PASS.avinput 1759-1PASS.avinput 1848-1PASS.avinput
1186-1PASS.avinput 1218-1PASS.avinput 1241-1PASS.avinput 1291-1PASS.avinput 1368-1PASS.avinput 1395-1PASS.avinput 1511-1PASS.avinput 1765-1PASS.avinput 1853-1PASS.avinput
1188-1PASS.avinput 1219-1PASS.avinput 1250-1PASS.avinput 1299-1PASS.avinput 1379-1PASS.avinput 1397-1PASS.avinput 1535-1PASS.avinput 1773-1PASS.avinput 1876-1PASS.avinput
1191-1PASS.avinput 1225-1PASS.avinput 1252-1PASS.avinput 1315-1PASS.avinput 1380-1PASS.avinput 1398-1PASS.avinput 1564-1PASS.avinput 1787-1PASS.avinput 1879-1PASS.avinput
注释
mkdir table_annovar_refGene
#使用10个线程去注释
bash table_annoval.sh 10
cat table_annoval.sh
thread=$1
mkfifo "test"
exec 6<>"test"
rm -f "test"
for ((i=1;i<=${thread};i++))
do
echo >&6
done
for i in *avinput
do
read -u6
{
table_annovar.pl $i /home/huen/packages/annovar/humandb/ -buildver hg19 -out "./table_annovar_refGene/"${i%%\.*} -remove -protocol refGene -operation g -nastring .
echo " ">&6
}&
done
wait
exec 6>&-
5.将格式转换为maftools需要的格式
bash add_samplename.sh
cat add_samplename.sh
for i in *multianno.txt
do
sed '1d' $i | sed "s/$/\t${i%%\.*}/" >> all_annovar
done
wait
grep -P "\texonic\t" all_annovar >all_annovar_exonic
wait
awk -F "\t" '{if(NR==1) {print "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tGeneDetail.refGene\tExonicFunc.refGene\tAAChange.refGene\tTumor_Sample_Barcode"}} {for(i=1;i<=10;i++){printf $i"\t"}} {print $11}' all_annovar_exonic >annovar_filter
最终获得annovar_filter文件
#展示部分结果
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Tumor_Sample_Barcode
chr1 1426048 1426048 G A exonic ATAD3B . stopgain ATAD3B:NM_001317238:exon13:c.G1473A:p.W491X,ATAD3B:NM_031921:exon15:c.G1611A:p.W537X 1159-1PASS
chr1 2235904 2235904 G T exonic SKI . synonymous SNV SKI:NM_003036:exon5:c.G1647T:p.L549L 1159-1PASS
chr1 19545927 19545927 T G exonic EMC1 . nonsynonymous SNV EMC1:NM_001271429:exon22:c.A2786C:p.K929T,EMC1:NM_001271427:exon23:c.A2849C:p.K950T,EMC1:NM_00127142
8:exon23:c.A2849C:p.K950T,EMC1:NM_015047:exon23:c.A2852C:p.K951T 1159-1PASS
chr1 27876433 27876433 C G exonic AHDC1 . nonsynonymous SNV AHDC1:NM_001029882:exon6:c.G2194C:p.D732H,AHDC1:NM_001371928:exon8:c.G2194C:p.D732H 1159-1PASS
chr1 46073542 46073542 C T exonic NASP . nonsynonymous SNV NASP:NM_001195193:exon4:c.C767T:p.T256I,NASP:NM_002482:exon6:c.C959T:p.T320I 1159-1PASS
chr1 103379236 103379237 CC AA exonic COL11A1 . nonframeshift substitution COL11A1:NM_080630:exon51:c.3640_3641delinsTT:p.G1214F,COL11A1:NM_001190709:exon52:c.3871_387
2delinsTT:p.G1291F,COL11A1:NM_001854:exon53:c.3988_3989delinsTT:p.G1330F,COL11A1:NM_080629:exon53:c.4024_4025delinsTT:p.G1342F 1159-1PASS
接下来就可以用这个结果去R里面进行统计分析啦,我准备使用maftools