Python popen.stdout pipe: peeking into a Popen pipe stream in Python

Background:

Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline.

I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I want to gunzip it to a stream (i.e. using gunzip FILENAME -c). If the first character of the stream (file) is "@", the entire stream should be routed into a filtering program that takes its input on standard input; otherwise it should just be piped directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote storage (a single pass through the file should be possible?).

Contents of an example input file, first four lines corresponding to one record in FASTQ format:

    @I328_1_FC30MD2AAXX:8:1:1719:1113/1
    GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
    +I328_1_FC30MD2AAXX:8:1:1719:1113/1
    hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc

Files that should not be piped into the filtering program contain records that look like this (first two lines corresponding to one record in FASTA format):

    >I328_1_FC30MD2AAXX:8:1:1719:1113/1
    GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG

Some made up semi-pseudo code effort to visualize what I want to do (I know this isn't possible the way I've written it). I hope it makes some sense:

    if gzipped:
        gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
        if gunzip.stdout.peek(1) == "@":  # This isn't possible
            fastq = True
        else:
            fastq = False
    if fastq:
        filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
    else:
        # Send the gunzipped stream to another file

Disregard the fact that the code won't run as written here and that it has no error handling etc.; all that is already in my other code. I just want help with peeking into the stream or finding a way around that. It would be great if you could call gunzip.stdout.peek(1), but I realize that's not possible.

What I've tried so far:

I figured subprocess.Popen might help me achieve this, and I've tried a lot of different ideas, among them using some kind of io.BufferedRandom() object to write the stream to, but I can't figure out how that would work. I know streams are non-seekable, but a workaround might be to read the first character of the gunzip stream and then create a new stream into which you first write a "@" or ">" depending on the file contents, and then stuff the rest of the gunzip.stdout stream into the new stream. This new stream would then be fed into filter's Popen stdin.

Note that the file sizes might be several times larger than available memory. I do not want to perform more than one single read of the source file from remote storage and no unnecessary file accessing.

Any ideas are welcome! Please ask me questions so I can clarify if I didn't make it clear enough.

Solution

Here is an implementation of your proposal to "first input a '@' or '>' depending on file contents and then stuff the rest of the gunzip.stdout-stream into the new stream". I only tested the local-file branch of the test, but it should be enough to demonstrate the concept.

    from subprocess import Popen, PIPE

    # `gzipped` is assumed to have been determined elsewhere.
    if gzipped:
        source = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    else:
        source = Popen(["cat", "remotestorage/file"], stdout=PIPE)

    # Consume exactly one byte of the stream to decide where it goes.
    firstchar = source.stdout.read(1)

    # "unread" the char we've just read
    source = Popen([r"(printf '\x%02x' && cat)" % ord(firstchar)],
                   shell=True, stdin=source.stdout, stdout=PIPE)

    # Now feed the output to a filter or to a local file.
    flocal = None
    try:
        if firstchar == "@":
            filter = Popen(["filter", "localstorage/outputfile.fastq"],
                           stdin=source.stdout)
        else:
            flocal = open('localstorage/outputfile.stream', 'w')
            filter = Popen(["cat"], stdin=source.stdout, stdout=flocal)
        filter.communicate()
    finally:
        if flocal is not None:
            flocal.close()

The idea is to read a single character from the source command's output, and then recreate the original output using (printf '\xhh' && cat), effectively implementing the peek. The replacement stream passes shell=True to Popen, leaving it to the shell and cat to do the heavy lifting. The data remains in the pipeline at all times and is never read into memory in its entirety. Note that the shell is only used for the single call to Popen that implements unreading the peeked byte, not for the calls that involve user-supplied file names. Even there, the byte is escaped to hex to make sure that the shell does not mangle it when invoking printf.
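To make the escaping step concrete, here is a small sketch of the command string that the unread call hands to the shell. The helper name unread_command is hypothetical and only wraps the format expression already used above. One caveat, stated as an assumption about your environment: POSIX only specifies octal escapes for printf, so a /bin/sh whose printf does not understand \x escapes may print the sequence literally; an octal escape such as '\100' would be the more portable choice there.

```python
def unread_command(firstchar):
    """Build the shell command that replays one byte, then the rest of stdin.

    Hypothetical helper: it only wraps the format expression from the answer.
    """
    return r"(printf '\x%02x' && cat)" % ord(firstchar)


# The peeked byte never appears literally in the command line, so characters
# that are special to the shell cannot be misinterpreted.
print(unread_command("@"))  # (printf '\x40' && cat)
print(unread_command(">"))  # (printf '\x3e' && cat)
print(unread_command("$"))  # (printf '\x24' && cat)
```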

The code could be further cleaned up to implement an actual function named peek that returns the peeked contents and a replacement new_source.
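A minimal sketch of what such a peek function might look like follows. It is not tested against the real pipeline, and it makes a few assumptions that differ from the code above: it uses an octal escape instead of hex (POSIX printf guarantees octal escapes in the format string), it handles an empty stream, and it expects the source to have been created with bufsize=0. That last point matters on Python 3, where Popen pipes are buffered by default and read(1) could pull more than one byte into Python's own buffer, out of reach of the child process. On Python 2 the default is already unbuffered.

```python
from subprocess import Popen, PIPE


def peek(source):
    """Read one byte from source.stdout and return (first_byte, new_source).

    new_source is a Popen whose stdout replays the peeked byte followed by
    the rest of the original stream. Assumes source was started with
    stdout=PIPE and bufsize=0.
    """
    first = source.stdout.read(1)
    if first:
        # Octal escape of the peeked byte, e.g. printf '\100' for "@".
        command = "printf '\\%03o' && cat" % ord(first)
    else:
        # Empty stream: nothing to replay, just pass stdin through.
        command = "cat"
    new_source = Popen(command, shell=True, stdin=source.stdout, stdout=PIPE)
    # The child holds its own copy of the pipe now; close ours so the
    # original process sees a broken pipe if the consumer exits early.
    source.stdout.close()
    return first, new_source
```

With this in place, the routing step would reduce to something like first, source = peek(source), followed by a comparison of first against b"@" and passing source.stdout as stdin to either the filter or the cat that writes the local file.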
