Python: how to read a huge text file into memory

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file

$ ls -l links.csv; file links.csv; tail links.csv
-rw-r--r-- 1 user user 469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

So each line in the file consists of a tuple of two comma-separated integer values.

I want to read in the whole file and sort it according to the second column. I know that I could do the sorting without reading the whole file into memory, but I thought that for a 500MB file I should still be able to do it in memory, since I have 1GB available.

However, when I try to read in the file, Python seems to allocate a lot more memory than the file needs on disk. So even with 1GB of RAM I'm not able to read the 500MB file into memory.

My Python code for reading the file and printing some information about the memory consumption is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile=open("links.csv", "r")
edges=[]
count=0

# count the total number of lines in the file
for line in infile:
    count=count+1
total=count
print "Total number of lines: ",total

infile.seek(0)
count=0
for line in infile:
    edge=tuple(map(int,line.strip().split(",")))
    edges.append(edge)
    count=count+1
    # for every million lines print memory consumption
    if count%1000000==0:
        print "Position: ", edge
        print "Read ",float(count)/float(total)*100,"%."
        mem=sys.getsizeof(edges)
        for edge in edges:
            mem=mem+sys.getsizeof(edge)
            for node in edge:
                mem=mem+sys.getsizeof(node)
        print "Memory (Bytes): ", mem

The output I got was:

Total number of lines: 30609720
Position: (9745, 2994)
Read 3.26693612356 %.
Memory (Bytes): 64348736
Position: (38857, 103574)
Read 6.53387224712 %.
Memory (Bytes): 128816320
Position: (83609, 63498)
Read 9.80080837067 %.
Memory (Bytes): 192553000
Position: (139692, 1078610)
Read 13.0677444942 %.
Memory (Bytes): 257873392
Position: (205067, 153705)
Read 16.3346806178 %.
Memory (Bytes): 320107588
Position: (283371, 253064)
Read 19.6016167413 %.
Memory (Bytes): 385448716
Position: (354601, 377328)
Read 22.8685528649 %.
Memory (Bytes): 448629828
Position: (441109, 3024112)
Read 26.1354889885 %.
Memory (Bytes): 512208580

Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seems that storing the content of the file as a list of tuples of ints is not very memory efficient.

Is there a better way to do it, so that I can read my 500MB file into my 1GB of memory?

Solution

There is a recipe for sorting files larger than RAM on this page, though you'd have to adapt it for your case involving CSV-format data. There are also links to additional resources there.
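The recipe itself is not reproduced here, but the general idea is to sort the data in chunks that fit in memory, write each sorted chunk to a temporary file, and then merge the sorted chunks. A rough sketch adapted to this two-column CSV data (the chunk size and helper names are illustrative, not taken from the linked recipe) could look like this:

import heapq, itertools, tempfile

def sort_csv_by_second_column(infile_name, outfile_name, chunk_size=1000000):
    # sort manageable chunks and spill each one to a temporary file
    chunks = []
    with open(infile_name) as infile:
        while True:
            lines = list(itertools.islice(infile, chunk_size))
            if not lines:
                break
            pairs = [tuple(map(int, line.split(','))) for line in lines]
            pairs.sort(key=lambda p: p[1])           # sort chunk by second column
            tmp = tempfile.TemporaryFile()
            tmp.writelines('%d,%d\n' % p for p in pairs)
            tmp.seek(0)
            chunks.append(tmp)
    # merge the sorted chunks; yield (key, value) so heapq.merge orders by column 2
    def keyed(f):
        for line in f:
            a, b = map(int, line.split(','))
            yield b, a
    with open(outfile_name, 'w') as out:
        for b, a in heapq.merge(*[keyed(c) for c in chunks]):
            out.write('%d,%d\n' % (a, b))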

Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than available RAM. For one thing, your own program doesn't get the entire 1GB (OS overhead etc). For another, even if you stored this in the most compact form for pure Python (two lists of integers, assuming 32-bit machine etc), you'd be using 934MB for those 30M pairs of integers.
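As a rough back-of-the-envelope check of that 934MB figure (assuming 32-bit CPython 2, where a small int object takes 12 bytes and each list slot holds a 4-byte pointer; the exact sizes are platform-dependent):

# two plain Python lists of ints, ignoring list over-allocation
pairs = 30609720                   # rows in links.csv
bytes_per_pair = 2 * 12 + 2 * 4    # two int objects + two list pointers
print "%.0f MiB" % (pairs * bytes_per_pair / 1024.0 ** 2)   # ~934 MiB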

Using numpy, you can also do the job, using only about 250MB. It isn't particularly fast to load this way, since you have to count the lines and pre-allocate the array, but it may be the fastest actual sort given that it's in memory:

import time
import numpy as np
import csv

start = time.time()

def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')

def count(f):
    while 1:
        block = f.read(65536)
        if not block:
            break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])

f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()

# sort in-place
m.sort(order='b')
print '%.3fs: sorted' % elapsed()

Output on my machine with a sample file similar to what you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted
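For reference, numpy can also parse the CSV directly into the same structured dtype with np.loadtxt (a sketch; it is simpler, but it skips the pre-allocation, so its peak memory while parsing is higher, which matters with only 1GB of RAM):

# let numpy parse the file into a structured array in one call
m = np.loadtxt('links.csv', delimiter=',',
               dtype=[('a', np.uint32), ('b', np.uint32)])
m.sort(order='b')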

The default sort in numpy is quicksort. The ndarray.sort() routine (which sorts in place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears that neither of these is capable of sorting on a record array. Incidentally, I used a record array because it was the only way I could see to sort the two columns together, as opposed to the default, which would sort them independently (totally messing up your data).
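If a different sort kind were needed, one workaround (a sketch, not part of the original answer) is to argsort on the 'b' column alone, which is a plain uint32 array, and then reorder the whole record array with the resulting index; unlike m.sort(order='b'), this produces a reordered copy rather than sorting in place:

# sort via an index on column 'b', then reorder both columns together
order = np.argsort(m['b'], kind='mergesort')   # any kind works on a plain column
m = m[order]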
