Extract User Defined Region From An Chromosome Fasta File

最新推荐文章于 2023-06-21 09:55:03 发布

msw521sg

最新推荐文章于 2023-06-21 09:55:03 发布

阅读量520

点赞数

分类专栏：生物信息 python 文章标签：染色体-region 区间 fasta

本文链接：https://blog.csdn.net/msw521sg/article/details/53939635

版权

生物信息同时被 2 个专栏收录

54 篇文章 14 订阅

订阅专栏

python

27 篇文章 0 订阅

订阅专栏

Extract User Defined Region From An Chromosome Fasta File

小麦基因组一条染色体序列至少有500M，我们想要获得染色体某一段的序列还是比较难办的。除了samtool可以做到，今天发现一个用python写的脚本pyfaidx，使用起来也比较方便。

首先安装

pip install --user pyfaidx  #安装在自己账户的 $HOME/.local/bin/faidx
sudo pip install pyfaidx #需要权限，安装在/usr/local/bin/faidx

输入命令 faidx 默认返回使用帮助

faidx
$ faidx 
usage: faidx [-h] [-b BED] [-o OUT]
             [-i {bed,chromsizes,nucleotide,transposed}] [-c] [-r]
             [-a SIZE_RANGE] [-n | -f] [-x] [-l] [-s DEFAULT_SEQ]
             [-d DELIMITER] [-g REGEX | -v] [-m | -M] [--no-rebuild]
             [--version]
             fasta [regions [regions ...]]

Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  -b BED, --bed BED     bed file of regions
  -o OUT, --out OUT     output file name (default: stdout)
  -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed}
                        transform the requested regions into another format.
                        default: None
  -c, --complement      complement the sequence. default: False #
  -r, --reverse         reverse the sequence. default: False
  -a SIZE_RANGE, --size-range SIZE_RANGE
                        selected sequences are in the size range [low, high].
                        example: 1,1000 default: None
  -n, --no-names        omit sequence names from output. default: False
  -f, --full-names      output full names including description. default:
                        False
  -x, --split-files     write each region to a separate file (names are
                        derived from regions)
  -l, --lazy            fill in --default-seq for missing ranges. default:
                        False
  -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
                        default base for missing positions and masking.
                        default: N
  -d DELIMITER, --delimiter DELIMITER
                        delimiter for splitting names to multiple values
                        (duplicate names will be discarded). default: None
  -g REGEX, --regex REGEX
                        selected sequences are those matching regular
                        expression. default: .*
  -v, --invert-match    selected sequences are those not matching 'regions'
                        argument. default: False
  -m, --mask-with-default-seq
                        mask the FASTA file using --default-seq default: False
  -M, --mask-by-case    mask the FASTA file by changing to lowercase. default:
                        False
  --no-rebuild          do not rebuild the .fai index even if it is out of
                        date. default: False
  --version             print pyfaidx version number

Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient
"pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196
https://dx.doi.org/10.7287/peerj.preprints.970v1

从上边的帮助信息了解到除了获取指定的序列外还有诸如倒序反转等等功能。

举个最简单的例子，想要获取chr5B_part2上75800000-76000000的序列，则命令如下

faidx wheat_genome.fasta chr5B_part2:75800000-76000000     # 默认返回序列屏幕

>chr5B_part2:75800000-76000000
GTCGTCTGCGAAGCCCCTACCAGAAATCCGCAACCTCCATGACATCCACCACTACACCAG
GCCTCCACGTCTCTTGTCCTCTGCCGCCAATCCACACCTCTTGTCCGCCCACCATGTGGT
ACCACATCCGCCCCTATAGAAGTCACCACGCCATGCAAGTGTTCGGTGAAATGCTCATGC
TAATTTCCTTGTCCCTTCTTTATACAGTAATGGATTCAGACATGGAGAACATATATGAGC
GCTATGTTGAGTCGTCCGAGGAGCAAGACTACATGGATGAAATGATAATGATGCAGGTGG
TCCTTGCAGACGCGGAGCAAGCAGAAGAGCATGTTCTCAATTTCAAGGGATCCATCAA.... #剩下的序列未列出

msw521sg

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Extract User Defined Region From An Chromosome Fasta File

Extract User Defined Region From An Chromosome Fasta File小麦基因组一条染色体序列至少有500M，我们想要获得染色体某一段的序列还是比较难办的。除了samtool可以做到，今天发现一个用python写的脚本pyfaidx，使用起来也比较方便。首先安装pip install --user pyfaidx #安装在自己账户的 $HOME/.loc
复制链接

扫一扫