AGAT
AGAT是Another Gff Analysis Toolkit的缩写, 是一个用于处理GTF/GFF文件的工具。AGAT 有检查、修复、填充任何类型的 GTF 和 GFF 的缺失信息(特征/属性),以创建完整、排序和标准化的 gff3 格式。
01 安装
conda安装AGAT
conda install -c bioconda agat
包含的部分脚本如下:
其中脚本名称中的_sp_与_sq_具有特殊含义。
_sp_代表SLURP, 即脚本会一次性全部读入gff文件至内存。这样虽然消耗的内存更多,但处理效率更高,也能够修复整个文本中的一些错误。
_sq_代表SEQUENTIAL, 即脚本会从头至尾逐行读入gff文件,而不经过完整性检验。这样的脚本具有相对更高的内存效率。
02 使用
以agat_convert_sp_gxf2gxf.pl为例,它能够检测并修复gtf/gff文件中的错误与缺失信息,得到排序后符合标准的gff3文件。
example 8 - 只含有CDS信息
##gff-version 3
Tob1_contig1 Prodigal:2.60 CDS 476 670 . - 0 ID=Tob1_00001;locus_tag=Tob1_00001;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 CDS 34266 35222 . + 0 ID=Tob1_00024;locus_tag=Tob1_00024;product=hypothetical protein
Tob1_contig1 SignalP:4.1 sig_peptide 34266 34298 . + 0 inference=ab initio prediction:SignalP:4.1;note=predicted cleavage at residue 33;product=putative signal peptide
Tob1_contig1 Prodigal:2.60 CDS 35267 37444 . - 0 ID=Tob1_00025;locus_tag=Tob1_00025;
Tob1_contig1 SignalP:4.1 sig_peptide 37420 37444 . - 0 inference=ab initio prediction:SignalP:4.1;note=predicted cleavage at residue 25;product=putative signal peptide
Tob1_contig1 Prodigal:2.60 CDS 38304 39338 . - 0 ID=Tob1_00026;locus_tag=Tob1_00026;
补足第一级(gene)和第二级(mRNA)信息
agat_convert_sp_gxf2gxf.pl --gff 8_test.gff
##gff-version 3
Tob1_contig1 Prodigal:2.60 gene 476 670 . - 0 ID=nbis_NEW-gene-1;locus_tag=Tob1_00001;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 mRNA 476 670 . - 0 ID=nbis_nol2id-cds-1;Parent=nbis_NEW-gene-1;locus_tag=Tob1_00001;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 exon 476 670 . - . ID=nbis_NEW-exon-1;Parent=nbis_nol2id-cds-1;locus_tag=Tob1_00001;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 CDS 476 670 . - 0 ID=Tob1_00001;Parent=nbis_nol2id-cds-1;locus_tag=Tob1_00001;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 gene 34266 35222 . + 0 ID=nbis_NEW-gene-2;locus_tag=Tob1_00024;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 mRNA 34266 35222 . + 0 ID=nbis_nol2id-cds-2;Parent=nbis_NEW-gene-2;locus_tag=Tob1_00024;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 exon 34266 35222 . + . ID=nbis_NEW-exon-2;Parent=nbis_nol2id-cds-2;locus_tag=Tob1_00024;product=hypothetical protein
Tob1_contig1 Prodigal:2.60 CDS 34266 35222 . + 0 ID=Tob1_00024;Parent=nbis_nol2id-cds-2;locus_tag=Tob1_00024;product=hypothetical protein
Tob1_contig1 SignalP:4.1 sig_peptide 34266 34298 . + 0 ID=sig_peptide-1;Parent=nbis_nol2id-cds-2;inference=ab initio prediction:SignalP:4.1;note=predicted cleavage at residue 33;product=putative signal peptide
Tob1_contig1 Prodigal:2.60 gene 35267 37444 . - 0 ID=nbis_NEW-gene-3;locus_tag=Tob1_00025
Tob1_contig1 Prodigal:2.60 mRNA 35267 37444 . - 0 ID=nbis_nol2id-cds-3;Parent=nbis_NEW-gene-3;locus_tag=Tob1_00025
Tob1_contig1 Prodigal:2.60 exon 35267 37444 . - . ID=nbis_NEW-exon-3;Parent=nbis_nol2id-cds-3;locus_tag=Tob1_00025
Tob1_contig1 Prodigal:2.60 CDS 35267 37444 . - 0 ID=Tob1_00025;Parent=nbis_nol2id-cds-3;locus_tag=Tob1_00025
Tob1_contig1 SignalP:4.1 sig_peptide 37420 37444 . - 0 ID=sig_peptide-2;Parent=nbis_nol2id-cds-3;inference=ab initio prediction:SignalP:4.1;note=predicted cleavage at residue 25;product=putative signal peptide
Tob1_contig1 Prodigal:2.60 gene 38304 39338 . - 0 ID=nbis_NEW-gene-4;locus_tag=Tob1_00026
Tob1_contig1 Prodigal:2.60 mRNA 38304 39338 . - 0 ID=nbis_nol2id-cds-4;Parent=nbis_NEW-gene-4;locus_tag=Tob1_00026
Tob1_contig1 Prodigal:2.60 exon 38304 39338 . - . ID=nbis_NEW-exon-4;Parent=nbis_nol2id-cds-4;locus_tag=Tob1_00026
Tob1_contig1 Prodigal:2.60 CDS 38304 39338 . - 0 ID=Tob1_00026;Parent=nbis_nol2id-cds-4;locus_tag=Tob1_00026
example 9 - 第二级 (mRNA) 与第三级(UTRs)特征信息缺失
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_003070.9
NC_003070.9 RefSeq source 1 30427671 . + . organism=Arabidopsis thaliana;mol_type=genomic DNA;db_xref=taxon:3702;chromosome=1;ecotype=Columbia
NC_003070.9 RefSeq gene 3631 5899 . + . ID=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 3631 3913 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 3996 4276 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 4486 4605 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 4706 5095 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 5174 5326 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 5439 5899 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 3760 3913 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 3996 4276 . + 2 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 4486 4605 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 4706 5095 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 5174 5326 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 5439 5627 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq start_codon 3760 3762 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq stop_codon 5628 5630 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
agat_convert_sp_gxf2gxf.pl --gff 8_test.gff
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_003070.9
NC_003070.9 RefSeq source 1 30427671 . + . ID=source-1;chromosome=1;db_xref=taxon:3702;ecotype=Columbia;mol_type=genomic DNA;organism=Arabidopsis thaliana
NC_003070.9 RefSeq gene 3631 5899 . + . ID=nbis_NEW-gene-1;locus_tag=AT1G01010
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=NC_003070.9:NAC001;Parent=nbis_NEW-gene-1;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 3631 3913 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 3996 4276 . + . ID=nbis_NEW-exon-1;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 4486 4605 . + . ID=nbis_NEW-exon-2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 4706 5095 . + . ID=nbis_NEW-exon-3;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 5174 5326 . + . ID=nbis_NEW-exon-4;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq exon 5439 5899 . + . ID=nbis_NEW-exon-5;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 3760 3913 . + 0 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 3996 4276 . + 2 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 4486 4605 . + 0 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 4706 5095 . + 0 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 5174 5326 . + 0 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq CDS 5439 5627 . + 0 ID=nbis_NEW-cds-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq five_prime_UTR 3631 3759 . + . ID=nbis_NEW-five_prime_utr-1;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
NC_003070.9 RefSeq start_codon 3760 3762 . + 0 ID=nbis_NEW-start_codon-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq stop_codon 5628 5630 . + 0 ID=nbis_NEW-stop_codon-1;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010
NC_003070.9 RefSeq three_prime_UTR 5628 5899 . + . ID=nbis_NEW-three_prime_utr-1;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010
example 18 - 特定基因的相关特征信息未经排序,零散地分布在文件中
##gff-version 3
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147
agat_convert_sp_gxf2gxf.pl --gff 18_test.gff
##gff-version 3
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147
(原文参考:https://sr-c.github.io/2020/08/28/AGAT-usage/)