Python MemoryError: cannot allocate array memory

I've got a 250 MB CSV file I need to read, with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255).

I started with a simple np.loadtxt("data/training_nohead.csv", delimiter=",") but this gave me a memory error. I thought this was strange, since I'm running 64-bit Python with 8 GB of memory installed, and it died after using only around 512 MB.
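
A quick back-of-the-envelope check, using only the question's own numbers, shows why ~512 MB is suspicious: the final float64 array by itself is roughly the size at which the process died, before counting any of np.loadtxt's intermediate Python lists.

rows, cols = 7000, 9000            # dimensions from the question
bytes_needed = rows * cols * 8     # np.loadtxt defaults to float64, 8 bytes each
print(bytes_needed / 1024.0 ** 2)  # ~481 MB for the target array alone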

I've since tried SEVERAL other tactics, including:

- importing fileinput and reading one line at a time, appending each to an array (a rough reconstruction is sketched just after this list)

- np.fromstring after reading in the entire file

- np.genfromtxt

- manual parsing of the file (since all the data is integers, this was fairly easy to code)
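
For concreteness, the first tactic looked roughly like this; the question doesn't show the exact code, so treat the parsing details here as a hypothetical reconstruction:

import numpy as np

rows = []
for line in open("data/training_nohead.csv"):
    # each row becomes a list of Python ints, which carries far more
    # per-value overhead than a packed NumPy array
    rows.append([int(x) for x in line.split(",")])
data = np.array(rows)  # the final conversion briefly needs both copies at once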

Every method gave me the same result: a MemoryError at around 512 MB. Wondering if there was something special about 512 MB, I created a simple test program which filled up memory until Python crashed:

str = " " * 511000000 # Start at 511 MB

while 1:

str = str + " " * 1000 # Add 1 KB at a time

Doing this didn't crash until around 1 GB. Just for fun, I also tried str = " " * 2048000000 (fill 2 GB): this ran without a hitch, filled the RAM, and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...

I googled around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)

I copied the code from the answer exactly:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # remember the width of the last row so the caller can reshape
        iter_loadtxt.rowlength = len(line)

    # np.fromiter fills the array one value at a time, so no giant
    # intermediate list of Python objects is ever built
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:

MemoryError: cannot allocate array memory

Googling this error turned up only one loosely related post, and as I'm running Python 2.7, it did not apply to my situation. Any help would be appreciated.

Solution

With some help from @J.F. Sebastian I developed the following answer:

import numpy as np

train = np.empty([7049, 9246])  # shape known in advance: 7049 rows, 9246 columns
row = 0
for line in open("data/training_nohead.csv"):
    train[row] = np.fromstring(line, sep=",")  # parse one row straight into the array
    row += 1
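
For what it's worth, the same loop reads a little more cleanly with enumerate and a with block. This is just a stylistic variant under the same assumed file and shape, not part of the original answer:

import numpy as np

train = np.empty((7049, 9246))
with open("data/training_nohead.csv") as f:
    for row, line in enumerate(f):   # enumerate tracks the row index for us
        train[row] = np.fromstring(line, sep=",")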

Of course, this answer assumes prior knowledge of the number of rows and columns. Should you not have this information beforehand, the number of rows will always take a while to calculate, as you have to read the entire file and count the \n characters. Something like this will suffice:

num_rows = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
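
The same count can also be written as a one-liner with a generator expression, which never holds more than one line in memory at a time:

num_rows = sum(1 for line in open("data/training_nohead.csv"))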

For the number of columns: if every row has the same number of columns, then you can just count the first row; otherwise you need to keep track of the maximum.

num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
    tmp = line.split(",")
    if len(tmp) > max_cols:
        max_cols = len(tmp)

This solution works best for numerical data, as a string containing a comma could really complicate things.
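
Putting the pieces together, here is a minimal end-to-end sketch: one pass to size the array, one pass to fill it. Zero-padding rows shorter than max_cols is my assumption here, not something the original answer specifies:

import numpy as np

num_rows, max_cols = 0, 0
with open("data/training_nohead.csv") as f:
    for line in f:
        num_rows += 1
        max_cols = max(max_cols, line.count(",") + 1)  # assumes no quoted commas

train = np.zeros((num_rows, max_cols))  # zeros, so short rows end up zero-padded
with open("data/training_nohead.csv") as f:
    for row, line in enumerate(f):
        vals = np.fromstring(line, sep=",")
        train[row, :len(vals)] = vals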
