Sorting large files in Python: how do I sort a huge file with Python?

I found this promising code on activestate.com for sorting huge files. I'm trying to run it on the default Python 2.6.5 interpreter on Ubuntu 10.04. When I run it on a small test file, I get the error trace below. I asked for help on activestate.com, but that thread has been silent for over 18 months. Does anyone here see an obvious solution?

Thanks.

## {{{ http://code.activestate.com/recipes/576755/ (r3)
# based on Recipe 466302: Sorting big files the Python 2.4 way
# by Nicolas Lehuen

import os
from tempfile import gettempdir
from itertools import islice, cycle
from collections import namedtuple
import heapq

Keyed = namedtuple("Keyed", ["key", "obj"])

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d
    if key is None:
        keyed_iterables = iterables
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
    for element in heapq.merge(*keyed_iterables):
        yield element.obj

def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    if tempdirs is None:
        tempdirs = []
    if not tempdirs:
        tempdirs.append(gettempdir())

    chunks = []
    try:
        with open(input,'rb',64*1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator,buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        with open(output,'wb',64*1024) as output_file:
            output_file.writelines(merge(key, *chunks))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass

Error trace:

Traceback (most recent call last):
  File "./batch_sort.py", line 108, in <module>
    batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs)
  File "./batch_sort.py", line 54, in batch_sort
    output_file.writelines(merge(key, *chunks))
  File "./batch_sort.py", line 30, in merge
    yield element.obj
AttributeError: 'str' object has no attribute 'obj'

Solution

The code for merge is incorrect.

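If you don't provide a key, keyed_iterables is just the raw iterables, so each element coming out of heapq.merge is a plain string instead of a Keyed tuple, and element.obj raises the AttributeError.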
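The failure can be reproduced in isolation. This is only a minimal sketch: the in-memory lists are hypothetical stand-ins for the sorted temporary chunk files, and the loop body mimics what the original merge does when key is None.

```python
import heapq

# Hypothetical stand-ins for the sorted chunk files (lists of lines).
chunks = [["apple\n", "cherry\n"], ["banana\n"]]

message = None
try:
    # With key=None the original merge passes the raw iterables straight
    # to heapq.merge, so every element is a str, not a Keyed tuple.
    for element in heapq.merge(*chunks):
        element.obj  # same attribute access as "yield element.obj"
except AttributeError as exc:
    message = str(exc)

print(message)
```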

Try this instead:

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d
    if key is None:
        for element in heapq.merge(*iterables):
            yield element
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
        for element in heapq.merge(*keyed_iterables):
            yield element.obj
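To check the corrected merge without touching the file system, here is a small self-contained sketch that exercises both branches. The chunk lists and the choice of len as a key are assumptions made for illustration; in batch_sort the iterables would be the temporary chunk files, each already sorted with the same key.

```python
import heapq
from collections import namedtuple

Keyed = namedtuple("Keyed", ["key", "obj"])

def merge(key=None, *iterables):
    if key is None:
        # No key: elements are the original objects, yield them unchanged.
        for element in heapq.merge(*iterables):
            yield element
    else:
        # Key given: wrap each object so heapq.merge compares by key first,
        # then unwrap before yielding.
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
        for element in heapq.merge(*keyed_iterables):
            yield element.obj

# Hypothetical stand-ins for the sorted chunk files.
chunk_a = ["apple\n", "cherry\n"]
chunk_b = ["banana\n", "date\n"]

# Branch 1: plain lexicographic merge.
unkeyed = list(merge(None, chunk_a, chunk_b))

# Branch 2: merge by line length; each chunk must be sorted by that key first.
keyed = list(merge(len, sorted(chunk_a, key=len), sorted(chunk_b, key=len)))

print(unkeyed)
print(keyed)
```

Note that ties on the key fall back to comparing the objects themselves, because Keyed is an ordinary tuple. On Python 3.5 and later, heapq.merge accepts a key argument directly, so the wrapper is only needed on older interpreters such as the 2.6.5 used in the question.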
