提取基因上游1500bp序列

利用bed格式文件从基因组中提取上游侧翼序列。

示例为猕猴桃红阳v3基因组

https://kiwifruitgenome.org/ftp/A_chinensis/Hongyang/v3.0/

zcat Hongyang_v3.0_update.gff3.gz |grep -w gene|awk '{print$1"\t"$4-1"\t"$5"\t"$9"\t"".""\t"$7}'  >genes.bed

seqkit subseq --only-flank --threads 5 -w 60 --up-stream 1500  --bed genes.bed -o hongyang_v3_upstream1500bp.fa.gz  Hongyang_genome_v3.0_update.fa.gz

  • –only-flank 只提取侧翼序列,不包含bed的区间序列
  • –threads 线程数
  • -w 提取序列的单行宽度。0表示单行。
  • –up-stream 提取上游序列区间长度
  • –down-stream 提取下游序列区间长度
  • –bed 指定bed文件
  • -o 指定输出结果文件。当输出文件添加后缀.gz 时,即可得到压缩的输出结果文件。
  • 命令接受输入文件为压缩格式。

bed文件格式

最少3列,根据bed提取序列,一般有6列就够了。

  1. chrom - 染色体号
  2. chromStart - feature在染色体上起始位置. 最小值为0,即从0开始算,染色体上第一个碱基位置标记为0。
  3. chromEnd - feature在染色体上终止位置。最终为止编号不包含在所需序列中。

bed起始与终止区间为左闭右开区间[chromStart, chromEnd)

  1. name - 该bed区间的名字
  2. score - 在基因组浏览器中显示的灰度设定,值介于0-1000;
  3. strand - 该bed区间的正负链标记. “.” (=no strand) or “+” or “-”.

帮助文档

$ seqkit subseq --help
get subsequences by region/gtf/bed, including flanking sequences.

Attentions:
  1. Use "seqkit grep" for extract subsets of sequences.
     "seqtk subseq seqs.fasta id.txt" equals to
     "seqkit grep -f id.txt seqs.fasta"

Recommendation:
  1. Use plain FASTA file, so seqkit could utilize FASTA index.
  2. The flag -U/--update-faidx is recommended to ensure the .fai file matches the FASTA file.

The definition of region is 1-based and with some custom design.

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Usage:
  seqkit subseq [flags] 

Flags:
      --bed string        by tab-delimited BED file
      --chr strings       select limited sequence with sequence IDs when using --gtf or --bed (multiple
                          value supported, case ignored)
  -d, --down-stream int   down stream length
      --feature strings   select limited feature types (multiple value supported, case ignored, only
                          works with GTF)
      --gtf string        by GTF (version 2.2) file
      --gtf-tag string    output this tag as sequence comment (default "gene_id")
  -h, --help              help for subseq
  -f, --only-flank        only return up/down stream sequence
  -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for
                          cutting first 12 bases. type "seqkit subseq -h" for more examples
  -u, --up-stream int     up stream length
  -U, --update-faidx      update the fasta index file if it exists. Use this if you are not sure whether
                          the fasta file changed

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值