Here's how one file looks:
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

data I
wish to
extract
END_DB
I'd like to be able to parse an infinite stream of these files all cat'd together, which precludes reading everything into memory first and doing something like re.findall('something useful', '\n'.join(sys.stdin), re.M).
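For contrast, here is a hypothetical sketch of that in-memory approach (the pattern is invented purely for illustration); it only works once the whole input has been read, which an unbounded stream never allows:
import re
import sys

# Hypothetical in-memory version: read everything, then pull out each data block.
text = sys.stdin.read()
tables = [
    [row.split() for row in m.group(1).splitlines()]
    for m in re.finditer(r'BEGIN_DB\n.*?\n\n(.*?)\nEND_DB', text, re.S)
]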
Below is my attempt, but I have to force the generator returned from get_raw_table() (by wrapping it in list()), so it doesn't quite fit the requirements. Without the forcing there's no way to test whether the returned generator is empty, and therefore no way to tell when sys.stdin has been exhausted (one way to peek without forcing is sketched after the code).
import sys

def get_raw_table(it):
    """Yield the rows of the next DB block on the iterator, split into columns."""
    state = 'begin'
    for line in it:
        if line.startswith('BEGIN_DB'):
            state = 'discard'          # inside a DB block, still in the header
        elif line.startswith('END_DB'):
            return                     # end of this block: stop the generator
        elif state == 'discard' and not line.strip():
            state = 'take'             # blank line ends the header; data follows
        elif state == 'take' and line:
            yield line.strip().strip('#').split()
# raw_tables is a list (per file) of lists (per row) of lists (per column)
raw_tables = []
while True:
    result = list(get_raw_table(sys.stdin))   # forcing the generator here
    if result:
        raw_tables.append(result)
    else:
        break
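A way to test whether the generator produced anything without forcing it all into a list would be to peek at the first row with next() and a default, then chain it back in front of the rest; a rough sketch (print() stands in for real per-row processing):
import itertools

while True:
    gen = get_raw_table(sys.stdin)
    first = next(gen, None)                      # peek: None means nothing was yielded
    if first is None:
        break                                    # stdin is exhausted
    for row in itertools.chain([first], gen):    # re-attach the peeked row
        print(row)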
Solution
Something like this might work:
import itertools

def chunks(it):
    while True:
        # Skip ahead to the next BEGIN_DB line, then past the non-blank header lines.
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        it = itertools.dropwhile(lambda x: x.strip(), it)
        try:
            next(it)            # consume the blank separator line
        except StopIteration:
            return              # input exhausted: no more DB blocks
        # Everything up to END_DB is the data to extract.
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)
For example:
src = """
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

1data I
1wish to
1extract
END_DB
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

2data I
2wish to
2extract
END_DB
"""
src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
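Tying this back to the question, each data row can be split into columns the same way the question's get_raw_table() does; a rough sketch, assuming each chunk is fully consumed before the next one is requested (all chunks share one underlying iterator):
import sys

for chunk in chunks(sys.stdin):
    # Build one table per DB block; print() stands in for real per-table processing.
    table = [line.strip().strip('#').split() for line in chunk]
    print(table)
Note that takewhile() only sees the END_DB line when the chunk is actually iterated, which is why draining each chunk before asking for the next one matters.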