Reading several pickled Python objects at a time: buffering and newlines?

To give you context:

I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects, generated by running

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were newline-delimited, one could use readlines(), but I am not sure that is true.

Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file-reading scheme. Is my fear unfounded?
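For reference, a minimal sketch of that write scheme (the path and sample objects are placeholders). The fear is in fact founded: protocol-0 pickles are line-oriented and already contain '\n' between opcodes, so a single record can span several "lines" when read back with readlines().

import cPickle

# Hypothetical write side of the dumps()-plus-newline scheme;
# objs and the file path are stand-ins.
objs = [{'a': 1}, [1, 2, 3], 'hello']
with open('/tmp/pickle', 'wb') as f:
    for obj in objs:
        # A protocol-0 pickle such as cPickle.dumps([1, 2]) contains
        # internal '\n' characters, so one record != one line.
        f.write(cPickle.dumps(obj) + '\n')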

One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multithreading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?

EDIT:

I can also read raw bytes into a buffer and invoke loads() on that, but I need to know how many bytes of the buffer were consumed by loads() so that I can throw the head away.
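One way to get that byte count, sketched here for the same Python 2 setup: wrap the raw buffer in a file-like object and let tell() report how far load() read (loads_with_consumed is a hypothetical helper name, not a library function).

import cPickle
from cStringIO import StringIO

def loads_with_consumed(buf):
    # Unpickle the first object in buf and report how many bytes
    # cPickle.load() consumed, via the wrapper's file position.
    f = StringIO(buf)
    obj = cPickle.load(f)
    return obj, f.tell()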

Solution

file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

import pickle

infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    # A protocol-0 pickle ends with '.'; with a newline written after
    # each record, the boundary shows up as a line ending in '.\n'.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []

If you have any control over the program that generates the pickles, I'd pick one of:

Use the shelve module.

Write the length (in bytes) of each pickle before the pickle itself, so that you know exactly how many bytes to read each time (see the sketch after this list).

Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.

Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
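As a rough illustration of the length-prefix options above (the struct format and helper names are mine, not a standard API), each record could be a fixed-size length header followed by the pickle bytes:

import cPickle
import struct

def dump_framed(obj, f):
    # Write a 4-byte big-endian length header, then the pickle itself.
    data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
    f.write(struct.pack('>I', len(data)))
    f.write(data)

def load_framed(f):
    # Read the header first, so exactly the right number of bytes
    # can be read for the pickle; returns None at end of file.
    header = f.read(4)
    if len(header) < 4:
        return None
    (size,) = struct.unpack('>I', header)
    return cPickle.loads(f.read(size))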

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.
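To lean on that buffering, something like the following should suffice (the 1 MB buffer size is an arbitrary choice); load() reads exactly one pickle per call and raises EOFError at the end of the file:

import cPickle

infile = open('/tmp/pickle', 'rb', 1024 * 1024)  # third arg: buffer size
while True:
    try:
        obj = cPickle.load(infile)
    except EOFError:
        break
    # ... process obj here ...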

If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle packing in blocks at a time?

#!/usr/bin/env python
import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

start = 0
while True:
    # Each record ends with the '.' pickle terminator plus the newline
    # separator; find() returns -1 (so end == 1) when none remain.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end
