Reading a CSV file with numpy: major memory problems

I grabbed the KDD track1 dataset from Kaggle and decided to load a ~2.5GB, 3-column CSV file into memory on my 16GB high-memory EC2 instance:

import numpy as np

data = np.loadtxt('rec_log_train.txt')

The Python session ate up all my memory (100%) and was then killed.

I then read the same file with R (via read.table); it used less than 5GB of RAM, which collapsed to less than 2GB after I called the garbage collector.

My question is: why did this fail under numpy, and what is the proper way of reading a file into memory? Yes, I could use generators and avoid the problem, but that is not the goal.
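(For context: np.loadtxt builds intermediate Python objects for every token before assembling the final array, which is why memory usage balloons to several times the array's size. A minimal sketch of a lower-memory alternative, not part of the original question, is pandas' C parser, which streams the file directly into a float64 ndarray; the file name and tab delimiter below follow the question.)

    import numpy as np
    import pandas as pd

    # pandas' C parser avoids loadtxt's per-token Python objects,
    # so peak memory stays close to the final array's size.
    df = pd.read_csv('rec_log_train.txt', sep='\t', header=None,
                     dtype=np.float64)
    data = df.values  # 2-D float64 array, same shape loadtxt would return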

Solution

import pandas, re, numpy as np

def load_file(filename, num_cols, delimiter='\t'):
    data = None
    try:
        # Reuse the serialized binary copy if a previous run saved one
        data = np.load(filename + '.npy')
    except IOError:
        splitter = re.compile(delimiter)

        def items(infile):
            # Lazily yield one field at a time; nothing is buffered
            for line in infile:
                for item in splitter.split(line):
                    yield item

        with open(filename, 'r') as infile:
            # fromiter consumes the generator straight into a 1-D array
            data = np.fromiter(items(infile), np.float64, -1)
        data = data.reshape((-1, num_cols))
        np.save(filename, data)  # cached as filename + '.npy' for next time
    return pandas.DataFrame(data)

This reads in the 2.5GB file and serializes the output matrix. The input file is read "lazily", so no intermediate data structures are built and minimal memory is used. The initial load takes a long time, but each subsequent load (of the serialized .npy file) is fast. Please let me know if you have any tips!
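A quick usage sketch (num_cols=3 matches the 3-column file in the question, and the file name follows the question as well):

    # First call parses the text file and writes rec_log_train.txt.npy;
    # later calls hit the cached binary and return almost immediately.
    df = load_file('rec_log_train.txt', num_cols=3)
    print(df.shape)                           # (n_rows, 3)
    print(df.memory_usage(deep=True).sum())   # rough in-memory footprint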
