提取基因上游1500bp序列

最新推荐文章于 2024-10-02 10:53:34 发布

学生信的大叔

最新推荐文章于 2024-10-02 10:53:34 发布

阅读量540

点赞数 7

分类专栏：生信软件基因家族文章标签：笔记

本文链接：https://blog.csdn.net/qq_33408922/article/details/134968028

版权

生信软件同时被 2 个专栏收录

10 篇文章 0 订阅

订阅专栏

基因家族

1 篇文章 0 订阅

订阅专栏

利用bed格式文件从基因组中提取上游侧翼序列。

示例为猕猴桃红阳v3基因组

https://kiwifruitgenome.org/ftp/A_chinensis/Hongyang/v3.0/

zcat Hongyang_v3.0_update.gff3.gz |grep -w gene|awk '{print$1"\t"$4-1"\t"$5"\t"$9"\t"".""\t"$7}'  >genes.bed

seqkit subseq --only-flank --threads 5 -w 60 --up-stream 1500  --bed genes.bed -o hongyang_v3_upstream1500bp.fa.gz  Hongyang_genome_v3.0_update.fa.gz

–only-flank 只提取侧翼序列，不包含bed的区间序列
–threads 线程数
-w 提取序列的单行宽度。0表示单行。
–up-stream 提取上游序列区间长度
–down-stream 提取下游序列区间长度
–bed 指定bed文件
-o 指定输出结果文件。当输出文件添加后缀.gz 时，即可得到压缩的输出结果文件。
命令接受输入文件为压缩格式。

bed文件格式

最少3列，根据bed提取序列，一般有6列就够了。

chrom - 染色体号
chromStart - feature在染色体上起始位置. 最小值为0，即从0开始算，染色体上第一个碱基位置标记为0。
chromEnd - feature在染色体上终止位置。最终为止编号不包含在所需序列中。

bed起始与终止区间为左闭右开区间[chromStart, chromEnd)

name - 该bed区间的名字
score - 在基因组浏览器中显示的灰度设定，值介于0-1000；
strand - 该bed区间的正负链标记. “.” (=no strand) or “+” or “-”.

帮助文档

$ seqkit subseq --help
get subsequences by region/gtf/bed, including flanking sequences.

Attentions:
  1. Use "seqkit grep" for extract subsets of sequences.
     "seqtk subseq seqs.fasta id.txt" equals to
     "seqkit grep -f id.txt seqs.fasta"

Recommendation:
  1. Use plain FASTA file, so seqkit could utilize FASTA index.
  2. The flag -U/--update-faidx is recommended to ensure the .fai file matches the FASTA file.

The definition of region is 1-based and with some custom design.

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Usage:
  seqkit subseq [flags] 

Flags:
      --bed string        by tab-delimited BED file
      --chr strings       select limited sequence with sequence IDs when using --gtf or --bed (multiple
                          value supported, case ignored)
  -d, --down-stream int   down stream length
      --feature strings   select limited feature types (multiple value supported, case ignored, only
                          works with GTF)
      --gtf string        by GTF (version 2.2) file
      --gtf-tag string    output this tag as sequence comment (default "gene_id")
  -h, --help              help for subseq
  -f, --only-flank        only return up/down stream sequence
  -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for
                          cutting first 12 bases. type "seqkit subseq -h" for more examples
  -u, --up-stream int     up stream length
  -U, --update-faidx      update the fasta index file if it exists. Use this if you are not sure whether
                          the fasta file changed

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)