Extract User Defined Region From An Chromosome Fasta File

Extract User Defined Region From An Chromosome Fasta File

小麦基因组一条染色体序列至少有500M,我们想要获得染色体某一段的序列还是比较难办的。除了samtool可以做到,今天发现一个用python写的脚本pyfaidx,使用起来也比较方便。

首先安装

pip install --user pyfaidx  #安装在自己账户的 $HOME/.local/bin/faidx
sudo pip install pyfaidx #需要权限,安装在/usr/local/bin/faidx

输入命令 faidx 默认返回使用帮助

faidx
$ faidx 
usage: faidx [-h] [-b BED] [-o OUT]
             [-i {bed,chromsizes,nucleotide,transposed}] [-c] [-r]
             [-a SIZE_RANGE] [-n | -f] [-x] [-l] [-s DEFAULT_SEQ]
             [-d DELIMITER] [-g REGEX | -v] [-m | -M] [--no-rebuild]
             [--version]
             fasta [regions [regions ...]]

Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  -b BED, --bed BED     bed file of regions
  -o OUT, --out OUT     output file name (default: stdout)
  -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed}
                        transform the requested regions into another format.
                        default: None
  -c, --complement      complement the sequence. default: False #
  -r, --reverse         reverse the sequence. default: False
  -a SIZE_RANGE, --size-range SIZE_RANGE
                        selected sequences are in the size range [low, high].
                        example: 1,1000 default: None
  -n, --no-names        omit sequence names from output. default: False
  -f, --full-names      output full names including description. default:
                        False
  -x, --split-files     write each region to a separate file (names are
                        derived from regions)
  -l, --lazy            fill in --default-seq for missing ranges. default:
                        False
  -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
                        default base for missing positions and masking.
                        default: N
  -d DELIMITER, --delimiter DELIMITER
                        delimiter for splitting names to multiple values
                        (duplicate names will be discarded). default: None
  -g REGEX, --regex REGEX
                        selected sequences are those matching regular
                        expression. default: .*
  -v, --invert-match    selected sequences are those not matching 'regions'
                        argument. default: False
  -m, --mask-with-default-seq
                        mask the FASTA file using --default-seq default: False
  -M, --mask-by-case    mask the FASTA file by changing to lowercase. default:
                        False
  --no-rebuild          do not rebuild the .fai index even if it is out of
                        date. default: False
  --version             print pyfaidx version number

Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient
"pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196
https://dx.doi.org/10.7287/peerj.preprints.970v1

从上边的帮助信息了解到除了获取指定的序列外还有诸如倒序反转等等功能。

举个最简单的例子,想要获取chr5B_part2上75800000-76000000的序列,则命令如下

faidx wheat_genome.fasta chr5B_part2:75800000-76000000     # 默认返回序列屏幕

>chr5B_part2:75800000-76000000
GTCGTCTGCGAAGCCCCTACCAGAAATCCGCAACCTCCATGACATCCACCACTACACCAG
GCCTCCACGTCTCTTGTCCTCTGCCGCCAATCCACACCTCTTGTCCGCCCACCATGTGGT
ACCACATCCGCCCCTATAGAAGTCACCACGCCATGCAAGTGTTCGGTGAAATGCTCATGC
TAATTTCCTTGTCCCTTCTTTATACAGTAATGGATTCAGACATGGAGAACATATATGAGC
GCTATGTTGAGTCGTCCGAGGAGCAAGACTACATGGATGAAATGATAATGATGCAGGTGG
TCCTTGCAGACGCGGAGCAAGCAGAAGAGCATGTTCTCAATTTCAAGGGATCCATCAA.... #剩下的序列未列出
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值