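If you are working with FASTA files, use BioPython. To pick n sequences at random, use random.sample:

    from Bio import SeqIO
    from random import sample

    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f, "fasta")
        # SeqIO.parse returns an iterator, so build a list before sampling
        print(sample(list(seqs), 2))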
This prints a list of two randomly selected SeqRecord objects.
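As a side note on how random.sample behaves, here is a minimal sketch using only the standard library. The list of names and the seed are made up for illustration. It shows that sampling is done without replacement and that asking for more items than exist raises ValueError.

```python
from random import Random

# Hypothetical record names, standing in for parsed sequences.
names = ["chr1:983001-983021", "chr1:984333-984353", "chr1:1310706-1310726"]

rng = Random(0)  # seeded only so the demo is repeatable
picked = rng.sample(names, 2)  # two distinct items, no repeats
print(picked)

try:
    rng.sample(names, 5)  # more than the population holds
except ValueError as err:
    print("cannot sample:", err)
```

So if the file might hold fewer than n records, it is probably worth checking the length of the list before sampling.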
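If needed, you can extract the names and sequences as plain strings by replacing the print call inside the with block:

    print([(seq.name, str(seq.seq)) for seq in sample(list(seqs), 2)])

Example output:

    [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]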
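If the lines always come in pairs (a header line followed by one sequence line) and any metadata at the top has been skipped, you can zip the file with itself:

    from random import sample

    with open("foo.fasta") as f:
        # zip(f, f) pulls two consecutive lines from the same file iterator
        print(sample(list(zip(f, f)), 2))

This gives you a list of tuples, each pairing a header line with its sequence line:

    [('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]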
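The zip approach assumes the metadata has already been skipped. One possible way to do that, sketched here with the standard library only, is to drop every line before the first ">" header. The helper name sample_pairs and the in-memory demo data are hypothetical, and the sketch assumes each record is exactly two lines.

```python
import io
from itertools import dropwhile
from random import sample

def sample_pairs(handle, n):
    """Return n random (header, sequence) line pairs from a two-line-per-record FASTA handle."""
    # Skip leading lines until the first '>' header (assumed to be metadata).
    lines = dropwhile(lambda line: not line.startswith(">"), handle)
    # Zipping the same iterator with itself pairs each header with the next line.
    return sample(list(zip(lines, lines)), n)

# In-memory stand-in for a file, with one made-up metadata line on top.
demo = io.StringIO(
    "# hypothetical metadata line\n"
    ">chr1:983001-983021\nGTCCGCTTGCGGGACCTGGGG\n"
    ">chr1:984333-984353\nCTGGAATTCCGGGCGCTGGAG\n"
    ">chr1:1310706-1310726\nGACGGTTTCCGGTTAGTGGAA\n"
)
pairs = sample_pairs(demo, 2)
for header, seq in pairs:
    print(header, seq, sep="", end="")
```

If sequences can wrap over several lines, this pairing breaks, and parsing with BioPython is the safer choice.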
To get the lines ready to write:

    from Bio import SeqIO
    from random import sample

    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f, "fasta")
        # (name, sequence) pairs for the sampled records
        samps = ((seq.name, seq.seq) for seq in sample(list(seqs), 2))
        for samp in samps:
            print(">{}\n{}".format(*samp))
Output:

    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
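To actually write those lines somewhere instead of printing them, something along these lines should work. This is only a sketch: write_fasta is a hypothetical helper, the tuples are copied from the output above, and io.StringIO stands in for a real output file.

```python
import io

def write_fasta(records, handle):
    """Write (name, sequence) tuples to handle, two lines per record."""
    for name, seq in records:
        handle.write(">{}\n{}\n".format(name, seq))

# Sampled (name, sequence) pairs, taken from the example output.
samps = [
    ("chr1:1310706-1310726", "GACGGTTTCCGGTTAGTGGAA"),
    ("chr1:983001-983021", "GTCCGCTTGCGGGACCTGGGG"),
]

out = io.StringIO()  # swap in open("sampled.fasta", "w") for a real file
write_fasta(samps, out)
print(out.getvalue(), end="")
```

If you keep the sampled SeqRecord objects rather than tuples, BioPython's SeqIO.write(records, handle, "fasta") does the same job without a custom helper.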