用python提取fasta序列,如何使用Python随机提取FASTA序列?

I have the following sequences which is in a fasta format with sequence header and its nucleotides. How can I randomly extract the sequences. For example I would like to randomly select 2 sequences out of the total sequences. There are tools provided to do so is to extract according to percentage but not the number of sequences. Can anyone help me?

A.fasta

>chr1:1310706-1310726

GACGGTTTCCGGTTAGTGGAA

>chr1:901959-901979

GAGGGCTTTCTGGAGAAGGAG

>chr1:983001-983021

GTCCGCTTGCGGGACCTGGGG

>chr1:984333-984353

CTGGAATTCCGGGCGCTGGAG

>chr1:1154147-1154167

GAGATCGTCCGGGACCTGGGT

Expected Output

>chr1:1154147-1154167

GAGATCGTCCGGGACCTGGGT

>chr1:901959-901979

GAGGGCTTTCTGGAGAAGGAG

解决方案

If you are working with fasta files use BioPython, to get n sequences use random.sample:

from Bio import SeqIO

from random import sample

with open("foo.fasta") as f:

seqs = SeqIO.parse(f,"fasta")

print(sample(list(seqs), 2))

Output:

[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]

You can extract the strings if necessary:

print([(seq.name,str(seq.seq)) for seq in sample(list(seqs),2)])

[('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]

If the lines were always in pairs and you skipped the metadata at the top you could zip:

from random import sample

with open("foo.fasta") as f:

print(sample(list(zip(f, f)), 2))

Which will give you pairs of lines in tuples:

[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]

To get the lines ready to be written:

from Bio import SeqIO

from random import sample

with open("foo.fasta") as f:

seqs = SeqIO.parse(f, "fasta")

samps = ((seq.name, seq.seq) for seq in sample(list(seqs),2))

for samp in samps:

print(">{}\n{}".format(*samp))

Output:

>chr1:1310706-1310726

GACGGTTTCCGGTTAGTGGAA

>chr1:983001-983021

GTCCGCTTGCGGGACCTGGGG

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值