如何将超大fasta文件分割且不影响序列完整度

最新推荐文章于 2023-02-07 17:40:30 发布

zxx11542

最新推荐文章于 2023-02-07 17:40:30 发布

阅读量2.9k

点赞数 4

文章标签： python

本文链接：https://blog.csdn.net/zxx11542/article/details/104177262

版权

fasta文件大于16G的我本人，心情苦痛的写下这篇惨痛的记忆
python的biopython需要建立数据库，不然会爆内存

biopython相关——5.4 序列文件作为字典：[http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc61]

Bio.SeqIO.to_dict()是最灵活但也是最需要内存的选项（请参见5.4.1节）。这基本上是一个帮助程序函数，用于构建普通的Python，dictionary并将每个条目作为SeqRecord对象存储在内存中，从而允许您修改记录。
Bio.SeqIO.index()是有用的中间立场，就像只读字典一样，并SeqRecord按需将序列解析为对象（请参见5.4.2节）。
Bio.SeqIO.index_db()它也像只读字典一样工作，但是将标识符和文件偏移量存储在磁盘上的文件中（作为SQLite3数据库），这意味着它对内存的要求非常低（请参阅第5.4.3节
），但是会慢一点

简单来说，Bio.SeqIO.to_dict()适合小型文件，占用内存多，因为它可以将所有内容保存在内存中。
Bio.SeqIO.index()适用于比较大的文件，但是也就那么大吧，可能4.500mb？或者几个gb？。。。
‘如果要处理数百万条记录，多个文件或重复分析，请查看 Bio.SeqIO.index_db()’

但是Bio.SeqIO.index_db()没有明确的利用fasta序列的方法，像我这种笨蛋是学不会的（至少现在）。

所以笨方法是拆分这个文件。

这位外国友人的方法非常nice，他吧序列拆分成500mb一个的文件

import os, sys
from os import listdir
from stat import *
from Bio import SeqIO

#checks to make sure input file is in the folder
try:
    name = input("\nEnter file name with fasta sequences you want to split: ")
    handle = open(name, 'r')
except:
    print ("File does not exist in folder! Check file name and extension.")
    quit()

#output file(s)
n = 1
outmid = os.path.splitext(name)[0]+" split "+str(n)+".fasta"
output_handle = open(outmid, "a")
print ("\n")

#main script
for line in handle:
    if line.startswith(">"):
        size = os.stat(outmid).st_size
        if size > 500000000:
            #user feedback，当然你如果不想要500mb，可以增大值。
            #但是这个大小适合notepad打开
            print(str(outmid) + "\tfile size = " + str(size))
            n += 1
            output_handle.close()
            outmid = os.path.splitext(name)[0]+" split "+str(n)+".fasta"
            output_handle = open(outmid, "a")
    output_handle.write(str(line))

命令行输出如下：

C:\Users\yemao\Desktop\contig.fa>split.py

Enter file name with fasta sequences you want to split: contig.fa
#此处是个交互代码？，输入你的文件名称

contig split 1.fasta    file size = 500000859
contig split 2.fasta    file size = 500003580
contig split 3.fasta    file size = 500008488
...#省略不写