Extract User Defined Region From An Chromosome Fasta File
小麦基因组一条染色体序列至少有500M,我们想要获得染色体某一段的序列还是比较难办的。除了samtool可以做到,今天发现一个用python写的脚本pyfaidx,使用起来也比较方便。
首先安装
pip install --user pyfaidx #安装在自己账户的 $HOME/.local/bin/faidx
sudo pip install pyfaidx #需要权限,安装在/usr/local/bin/faidx
输入命令 faidx 默认返回使用帮助
faidx
$ faidx
usage: faidx [-h] [-b BED] [-o OUT]
[-i {bed,chromsizes,nucleotide,transposed}] [-c] [-r]
[-a SIZE_RANGE] [-n | -f] [-x] [-l] [-s DEFAULT_SEQ]
[-d DELIMITER] [-g REGEX | -v] [-m | -M] [--no-rebuild]
[--version]
fasta [regions [regions ...]]
Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.
positional arguments:
fasta FASTA file
regions space separated regions of sequence to fetch e.g.
chr1:1-1000
optional arguments:
-h, --help show this help message and exit
-b BED, --bed BED bed file of regions
-o OUT, --out OUT output file name (default: stdout)
-i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed}
transform the requested regions into another format.
default: None
-c, --complement complement the sequence. default: False #
-r, --reverse reverse the sequence. default: False
-a SIZE_RANGE, --size-range SIZE_RANGE
selected sequences are in the size range [low, high].
example: 1,1000 default: None
-n, --no-names omit sequence names from output. default: False
-f, --full-names output full names including description. default:
False
-x, --split-files write each region to a separate file (names are
derived from regions)
-l, --lazy fill in --default-seq for missing ranges. default:
False
-s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
default base for missing positions and masking.
default: N
-d DELIMITER, --delimiter DELIMITER
delimiter for splitting names to multiple values
(duplicate names will be discarded). default: None
-g REGEX, --regex REGEX
selected sequences are those matching regular
expression. default: .*
-v, --invert-match selected sequences are those not matching 'regions'
argument. default: False
-m, --mask-with-default-seq
mask the FASTA file using --default-seq default: False
-M, --mask-by-case mask the FASTA file by changing to lowercase. default:
False
--no-rebuild do not rebuild the .fai index even if it is out of
date. default: False
--version print pyfaidx version number
Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient
"pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196
https://dx.doi.org/10.7287/peerj.preprints.970v1
从上边的帮助信息了解到除了获取指定的序列外还有诸如倒序反转等等功能。
举个最简单的例子,想要获取chr5B_part2上75800000-76000000的序列,则命令如下
faidx wheat_genome.fasta chr5B_part2:75800000-76000000 # 默认返回序列屏幕
>chr5B_part2:75800000-76000000
GTCGTCTGCGAAGCCCCTACCAGAAATCCGCAACCTCCATGACATCCACCACTACACCAG
GCCTCCACGTCTCTTGTCCTCTGCCGCCAATCCACACCTCTTGTCCGCCCACCATGTGGT
ACCACATCCGCCCCTATAGAAGTCACCACGCCATGCAAGTGTTCGGTGAAATGCTCATGC
TAATTTCCTTGTCCCTTCTTTATACAGTAATGGATTCAGACATGGAGAACATATATGAGC
GCTATGTTGAGTCGTCCGAGGAGCAAGACTACATGGATGAAATGATAATGATGCAGGTGG
TCCTTGCAGACGCGGAGCAAGCAGAAGAGCATGTTCTCAATTTCAAGGGATCCATCAA.... #剩下的序列未列出