利用bed格式文件从基因组中提取上游侧翼序列。
示例为猕猴桃红阳v3基因组
https://kiwifruitgenome.org/ftp/A_chinensis/Hongyang/v3.0/
zcat Hongyang_v3.0_update.gff3.gz |grep -w gene|awk '{print$1"\t"$4-1"\t"$5"\t"$9"\t"".""\t"$7}' >genes.bed
seqkit subseq --only-flank --threads 5 -w 60 --up-stream 1500 --bed genes.bed -o hongyang_v3_upstream1500bp.fa.gz Hongyang_genome_v3.0_update.fa.gz
- –only-flank 只提取侧翼序列,不包含bed的区间序列
- –threads 线程数
- -w 提取序列的单行宽度。0表示单行。
- –up-stream 提取上游序列区间长度
- –down-stream 提取下游序列区间长度
- –bed 指定bed文件
- -o 指定输出结果文件。当输出文件添加后缀
.gz
时,即可得到压缩的输出结果文件。 - 命令接受输入文件为压缩格式。
bed文件格式
最少3列,根据bed提取序列,一般有6列就够了。
- chrom - 染色体号
- chromStart - feature在染色体上起始位置. 最小值为0,即从0开始算,染色体上第一个碱基位置标记为0。
- chromEnd - feature在染色体上终止位置。最终为止编号不包含在所需序列中。
bed起始与终止区间为左闭右开区间[chromStart, chromEnd)
- name - 该bed区间的名字
- score - 在基因组浏览器中显示的灰度设定,值介于0-1000;
- strand - 该bed区间的正负链标记. “.” (=no strand) or “+” or “-”.
帮助文档
$ seqkit subseq --help
get subsequences by region/gtf/bed, including flanking sequences.
Attentions:
1. Use "seqkit grep" for extract subsets of sequences.
"seqtk subseq seqs.fasta id.txt" equals to
"seqkit grep -f id.txt seqs.fasta"
Recommendation:
1. Use plain FASTA file, so seqkit could utilize FASTA index.
2. The flag -U/--update-faidx is recommended to ensure the .fai file matches the FASTA file.
The definition of region is 1-based and with some custom design.
Examples:
1-based index 1 2 3 4 5 6 7 8 9 10
negative index 0-9-8-7-6-5-4-3-2-1
seq A C G T N a c g t n
1:1 A
2:4 C G T
-4:-2 c g t
-4:-1 c g t n
-1:-1 n
2:-2 C G T N a c g t
1:-1 A C G T N a c g t n
1:12 A C G T N a c g t n
-12:-1 A C G T N a c g t n
Usage:
seqkit subseq [flags]
Flags:
--bed string by tab-delimited BED file
--chr strings select limited sequence with sequence IDs when using --gtf or --bed (multiple
value supported, case ignored)
-d, --down-stream int down stream length
--feature strings select limited feature types (multiple value supported, case ignored, only
works with GTF)
--gtf string by GTF (version 2.2) file
--gtf-tag string output this tag as sequence comment (default "gene_id")
-h, --help help for subseq
-f, --only-flank only return up/down stream sequence
-r, --region string by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for
cutting first 12 bases. type "seqkit subseq -h" for more examples
-u, --up-stream int up stream length
-U, --update-faidx update the fasta index file if it exists. Use this if you are not sure whether
the fasta file changed
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)