Creating millions of files in Python: slow read & write for millions of small files

Conclusion:

It seems that HDF5 is the way to go for my purposes. Basically, "HDF5 is a data model, library, and file format for storing and managing data," and it is designed to handle incredible amounts of data. It has a Python binding called PyTables (the link is in the answer below).

HDF5 does the job 1000% better at saving tons and tons of data. Reading/modifying the data across 200 million rows is a pain though, so that's the next problem to tackle.
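As a rough illustration of what that next step could look like (a sketch of my own, assuming the scores end up in a hypothetical scores.h5 file with a /scores table holding Int64 "file_id" and "score" columns, as sketched at the end of the answer below), an indexed in-kernel query lets PyTables find and update matching rows without scanning everything in Python:

import tables

# Assumed layout (hypothetical): scores.h5 contains a table /scores
# with Int64 columns "file_id" and "score".
h5file = tables.open_file("scores.h5", mode="a")
table = h5file.root.scores

# A column index lets where() locate matching rows without a full scan.
if not table.cols.file_id.is_indexed:
    table.cols.file_id.create_index()

# In-kernel query: the condition is evaluated by numexpr, not by Python.
for row in table.where("file_id == 123456"):
    row["score"] += 10
    row.update()

table.flush()
h5file.close()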

I am building a directory tree which has tons of subdirectories and files. There are about 10 million files spread across a hundred thousand directories. Each file is nested under 32 subdirectories.

I have a Python script that builds this filesystem and reads & writes those files. The problem is that once I reach more than a million files, the read and write methods become extremely slow.

Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.

import shutil

def addInFile(path, scoreToAdd):
    num = scoreToAdd
    try:
        # Work on a temporary copy instead of the file itself.
        shutil.copyfile(path, '/tmp/tmp.txt')
        fp = open('/tmp/tmp.txt', 'r')
        # The file holds a single integer string; add it to the increment.
        num += int(fp.readlines()[0])
        fp.close()
    except (IOError, IndexError, ValueError):
        # File missing or empty: keep just the increment.
        pass
    # Write the new total to the temp file, then copy it back over the original.
    fp = open('/tmp/tmp.txt', 'w')
    fp.write(str(num))
    fp.close()
    shutil.copyfile('/tmp/tmp.txt', path)

Relational databases seem too slow for accessing these data, so I opted for a filesystem approach.

I previously tried doing this with Linux console commands, but it was way slower.

I copy the file to a temporary file first, access/modify it, and then copy it back, because I found this to be faster than accessing the file directly.
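For comparison, a direct read-modify-write without the temporary copy would look roughly like the sketch below (my addition, not part of the original script; whether the temp-file detour actually helps depends on the filesystem and the page cache):

import os

def addInFileDirect(path, scoreToAdd):
    # Read the current value (treat a missing or empty file as 0),
    # add the increment, and write the result back in place.
    try:
        with open(path, 'r') as fp:
            num = int(fp.read().strip() or 0)
    except (IOError, ValueError):
        num = 0
    num += scoreToAdd
    with open(path, 'w') as fp:
        fp.write(str(num))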

Putting all the files into one directory (on a ReiserFS filesystem) caused too much slowdown when accessing the files.

I think the cause of the slowdown is the sheer number of files. Performing this function 1000 times used to clock in at less than a second, but now it takes close to a minute.

How do you suggest I fix this? Should I change my directory tree structure?

All I need is to quickly access each file in this very large pool of files.

Solution

I know this isn't a direct answer to your question, but it is a direct solution to your problem.

You need to research using something like HDF5. It is designed for exactly this type of hierarchical data with millions of individual data points.

You are REALLY in luck because there are awesome Python bindings for HDF5 called pytables.

I have used it in a very similar way and had tremendous success.
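A minimal sketch of how the data could be loaded into PyTables, assuming the per-file counters are folded into a single table (the file name, table name, and column names here are illustrative, not from the original post):

import tables

# Describe one row per former file: an identifier plus its score.
class Score(tables.IsDescription):
    file_id = tables.Int64Col()   # hypothetical key derived from the old path
    score = tables.Int64Col()

h5file = tables.open_file("scores.h5", mode="w", title="File scores")
table = h5file.create_table("/", "scores", Score, "one row per former file")

# Bulk-append rows via table.row; this is far cheaper than creating
# and touching millions of separate files on disk.
row = table.row
for i in range(1000000):
    row["file_id"] = i
    row["score"] = 0
    row.append()
table.flush()
h5file.close()

The design point is that a single append-only table replaces millions of per-file open/write/close cycles, which is where the filesystem approach spends most of its time.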
