Reading a file with two delimiters in Python: reading a file block by block using a specified delimiter

I have an input_file.fa file like this (FASTA format):

    > header1 description
    data data
    data
    >header2 description
    more data
    data
    data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

    > header1 description
    data data
    data

Of course I could just read in the file like this and split:

    with open("1.fa") as f:
        for block in f.read().split(">"):
            pass

But I want to avoid reading the whole file into memory, because the files are often large.

I can read in the file line by line of course:

    with open("input_file.fa") as f:
        for line in f:
            pass

But ideally what I want is something like this:

    with open("input_file.fa", newline=">") as f:
        for block in f:
            pass

But I get an error:

ValueError: illegal newline value: >

I've also tried using the csv module, but with no success.

I did find this post from 3 years ago, which provides a generator-based solution to this issue, but it doesn't seem that compact. Is this really the only/best solution? It would be neat if it were possible to create the generator with a single line rather than a separate function, something like this pseudocode:

    with open("input_file.fa") as f:
        blocks = magic_generator_split_by_>
        for block in blocks:
            pass

If this is impossible, then I guess you could consider my question a duplicate of the other post, but if that is so, I hope people can explain to me why the other solution is the only one. Many thanks.

Solution

A general solution here is to write a generator function that yields one group at a time. This way you will be storing only one group at a time in memory.

    def get_groups(seq, group_by):
        data = []
        for line in seq:
            # Here the `startswith()` logic can be replaced with other
            # condition(s) depending on the requirement.
            if line.startswith(group_by):
                if data:
                    yield data
                    data = []
            data.append(line)

        if data:
            yield data

    with open('input.txt') as f:
        for i, group in enumerate(get_groups(f, ">"), start=1):
            print("Group #{}".format(i))
            print("".join(group))

Output:

    Group #1
    > header1 description
    data data
    data

    Group #2
    >header2 description
    more data
    data
    data
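On the question of compactness: the same line-based grouping can be written a little more tersely with `itertools.groupby`. The following is only a sketch, not a replacement for the function above. The names `iter_blocks` and `block_index` are made up for illustration, and the approach assumes that `groupby` evaluates the key function once per line, in order, which is how CPython behaves. Any lines that appear before the first header come out as a block of their own.

```python
import io
from itertools import groupby


def iter_blocks(lines, prefix=">"):
    """Yield each block as one string; a block starts at a line beginning with prefix."""
    block_no = 0

    def block_index(line):
        # Bump the counter on every header line, so all lines of one block
        # share the same key and groupby() keeps them together.
        nonlocal block_no
        if line.startswith(prefix):
            block_no += 1
        return block_no

    for _, block in groupby(lines, key=block_index):
        # Each group is consumed before moving on, so only one block
        # is held in memory at a time.
        yield "".join(block)


# Demo on an in-memory file; a file object returned by open() works the same way.
sample = io.StringIO(
    "> header1 description\ndata data\ndata\n"
    ">header2 description\nmore data\ndata\ndata\n"
)
for number, block in enumerate(iter_blocks(sample), start=1):
    print("Group #{}".format(number))
    print(block)
```

Like `get_groups`, this reads the file lazily, line by line. The stateful key function is arguably less readable than the explicit generator, so the gain is mostly in line count.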

For FASTA files in general, I would recommend using the Biopython package.
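Outside FASTA, the delimiter may not sit at the start of a line, which is the case the `newline=">"` attempt in the question was reaching for. A minimal sketch of that, assuming a text-mode file object: read fixed-size chunks and split on the delimiter, carrying the unfinished tail over to the next chunk. The function name `read_blocks` and the `chunk_size` default are illustrative choices, not part of any library. Blocks are returned without the delimiter, as `str.split` would return them, and empty blocks are skipped.

```python
import io


def read_blocks(handle, delimiter=">", chunk_size=64 * 1024):
    """Yield delimiter-separated blocks from a text file object, chunk by chunk."""
    buffer = ""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece except the last is complete; the last piece may continue
        # in the next chunk, so it stays in the buffer.
        *complete, buffer = buffer.split(delimiter)
        for block in complete:
            if block:
                yield block
    if buffer:
        # Whatever remains after the final delimiter is the last block.
        yield buffer


# Demo on an in-memory file with a tiny chunk size to exercise the carry-over.
sample = io.StringIO(
    "> header1 description\ndata data\ndata\n"
    ">header2 description\nmore data\ndata\ndata\n"
)
for block in read_blocks(sample, ">", chunk_size=8):
    print(repr(block))
```

Memory use is bounded by the largest block plus one chunk. Unlike the line-based `get_groups`, this splits on every occurrence of the delimiter, including a `>` in the middle of a line, so for FASTA data the line-based version is the safer choice.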
