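GFF/GTF annotation files record the positions and structures of genes, but how do you quickly turn that positional information into a FASTA file? I strongly recommend SeqKit: one line of code solves the problem!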
seqkit
Installation
Install it directly with conda:
conda install seqkit -c bioconda
Usage
seqkit bundles many functions into one tool. Today I only cover subseq, which is used here to extract gene sequences.
Usage:
seqkit subseq [flags]
Flags:
--bed string by tab-delimited BED file
--chr strings select limited sequence with sequence IDs when using --gtf or --bed (multiple value supported, case ignored)
-d, --down-stream int down stream length
--feature strings select limited feature types (multiple value supported, case ignored, only works with GTF)
--gtf string by GTF (version 2.2) file
--gtf-tag string output this tag as sequence comment (default "gene_id")
-h, --help help for subseq
-f, --only-flank only return up/down stream sequence
-r, --region string by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
-u, --up-stream int up stream length
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
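The -r/--region examples in the help text (1:12, -12:-1, 13:-1) use 1-based, inclusive positions, with negative numbers counting from the end of the sequence. As a rough illustration of that convention only, here is a minimal Python sketch. The function name region_to_slice is hypothetical, and this is not SeqKit's actual implementation.

```python
def region_to_slice(start: int, end: int, length: int) -> slice:
    """Convert a 1-based, inclusive region (negative values count from the
    end) into a Python slice. Hypothetical helper, for illustration only."""
    if start == 0 or end == 0:
        raise ValueError("positions are 1-based; 0 is not a valid position")
    # Positive start: shift to 0-based. Negative start: count from the end.
    begin = start - 1 if start > 0 else length + start
    # Python slices exclude the stop index, so an inclusive end maps to itself.
    stop = end if end > 0 else length + end + 1
    return slice(max(begin, 0), min(stop, length))


seq = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
first_12 = seq[region_to_slice(1, 12, len(seq))]       # like -r 1:12
last_12 = seq[region_to_slice(-12, -1, len(seq))]      # like -r -12:-1
without_first_12 = seq[region_to_slice(13, -1, len(seq))]  # like -r 13:-1
print(first_12, last_12, without_first_12)
```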
# Extract genes based on a BED or GTF file
seqkit subseq --bed bedfile.bed -o gene.fa genomefile.fa
seqkit subseq --gtf gtffile.gtf -o gene.fa genomefile.fa
The speed is decent, and it is far more convenient than writing your own Perl/Python script!
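For comparison, a hand-written Python equivalent of the BED case might look roughly like the sketch below. It assumes standard BED conventions (tab-delimited, 0-based start, end-exclusive, optional name in column 4 and strand in column 6) and reverse-complements minus-strand features. The helper names read_fasta and extract_bed and the toy data are made up for illustration, and the whole genome is held in memory, so this is not how SeqKit itself works.

```python
from typing import Dict, Iterator, List, Tuple

# Translation table for complementing DNA bases (case preserved).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: sequence}.
    The ID is the first whitespace-delimited word of the header line."""
    chunks: Dict[str, List[str]] = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def extract_bed(genome: Dict[str, str], bed_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every BED record."""
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        # BED is 0-based and end-exclusive, which matches Python slicing.
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        yield name, seq


# Toy data, invented for this example.
genome_text = ">chr1 toy chromosome\nACGTACGTAC\nGGGTTTAAAC\n"
bed_text = "chr1\t2\t6\tgeneA\t0\t+\nchr1\t10\t16\tgeneB\t0\t-\n"

genes = dict(extract_bed(read_fasta(genome_text), bed_text))
for gene_name, gene_seq in genes.items():
    print(f">{gene_name}\n{gene_seq}")
```

Even this small version has to deal with header parsing, coordinate conventions, and strand handling, which is the work the single seqkit subseq command saves.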