Conclusion:
It seems that HDF5 is the way to go for my purposes. Basically "HDF5 is a data model, library, and file format for storing and managing data." and is designed to handle incredible amounts of data. It has a Python module called python-tables. (The link is in the answer below)
HDF5 does the job done 1000% better in saving tons and tons of data. Reading/modifying the data from 200 million rows is a pain though, so that's the next problem to tackle.
I am building directory tree which has tons of subdirectories and files. There are about 10 million files spread around a hundred thousand directories. Each file is under 32 subdirectories.
I have a python script that builds this filesystem and reads & writes those files. The problem is that when I reach more than a million files, the read and write methods become extremely slow.
Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.
def addInFile(path, scoreToAdd):
num = scoreToAdd
try:
shutil.copyfile(path, '/tmp/tmp.txt')
fp = open('/tmp/tmp.txt', 'r')
num += int(fp.readlines()[0])
fp.close()
except:
pass
fp = open('/tmp/tmp.txt', 'w')
fp.write(str(num))
fp.close()
shutil.copyfile('/tmp/tmp.txt', path)
Relational databases seem too slow for accessing these data, so I opted for a filesystem approach.
I previously tried performing linux console commands for these but it was way slower.
I copy the file to a temporary file first then access/modify it then copy it back because i found this was faster than directly accessing the file.
Putting all the files into 1 directory (in reiserfs format) caused too much slowdown when accessing the files.
I think the cause of the slowdown is because there're tons of files. Performing this function 1000 times clocked at less than a second.. but now it's reaching 1 minute.
How do you suggest I fix this? Do I change my directory tree structure?
All I need is to quickly access each file in this very huge pool of files*
解决方案
I know this isn't a direct answer to your question, but it is a direct solution to your problem.
You need to research using something like HDF5. It is designed for just the type of hierarchical data with millions of individual data points.
You are REALLY in luck because there are awesome Python bindings for HDF5 called pytables.
I have used it in a very similar way and had tremendous success.