I am currently pulling .txt files from the list of paths in FileNameList, which works. My main problem is that it is too slow when there are too many files.
I am using this code to print the list of txt files:
    import os

    # FileNameList is my list of folders to search
    for filefolder in FileNameList:
        for file in os.listdir(filefolder):
            # check the extension; '"txt" in file' would also match names like "mytxt.doc"
            if file.endswith(".txt"):
                filename = os.path.join(filefolder, file)
                print(filename)
Any help or suggestion on using threads or multiprocessing to make the reading faster is welcome. Thanks in advance.
Solution
So you mean there is no way to speed this up? My scenario is to read a bunch of files, then read each line of them and store it in the database.
The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.
The second rule is that before you do anything else, measure where the problem lies:
Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database.
Run that program under a profiler to see where the program is spending most of its time.
Only then do you know which part of the program needs speeding up.
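The measuring steps above could be sketched roughly like this. It is only a sketch under assumptions: it targets Python 3, it uses the standard-library sqlite3 module as a stand-in for whatever database you actually have, and the table name lines and the functions load_files and profile_load are made up for illustration.

```python
import cProfile
import io
import pstats
import sqlite3


def load_files(paths, connection):
    """Sequentially read each file and store its lines in the database."""
    # Hypothetical table; replace with your real schema.
    connection.execute("CREATE TABLE IF NOT EXISTS lines (path TEXT, line TEXT)")
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                connection.execute(
                    "INSERT INTO lines VALUES (?, ?)", (path, line.rstrip("\n"))
                )
    connection.commit()


def profile_load(paths, connection, top=10):
    """Run load_files under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    load_files(paths, connection)
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return report.getvalue()
```

Calling print(profile_load(paths, sqlite3.connect("lines.db"))) with your own list of paths should show whether the time goes into opening and reading files or into the database calls.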
Here are some pointers nevertheless.
Speeding up the reading of files can be done using mmap.
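As a rough illustration of the mmap idea, here is a minimal sketch for Python 3. The function name read_lines_mmap is made up, and whether this actually beats a plain open() depends on the operating system and the file sizes, so it is worth timing both.

```python
import mmap
import os


def read_lines_mmap(path):
    """Return the lines of a file as bytes objects, read through a memory map."""
    # An empty file cannot be memory-mapped, so handle it separately.
    if os.path.getsize(path) == 0:
        return []
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; ACCESS_READ makes the map read-only.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            lines = []
            # readline() returns b"" once the end of the map is reached.
            for line in iter(mapped.readline, b""):
                lines.append(line.rstrip(b"\r\n"))
            return lines
```

Note that the lines come back as bytes, so they would still need decoding before being stored as text.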
You could use multiprocessing.Pool to spread out the reading of multiple files over different cores. But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
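A minimal sketch of the multiprocessing.Pool approach, assuming Python 3; the function names read_lines and read_all are made up for illustration. The comment in the worker marks where the IPC cost mentioned above comes from.

```python
import multiprocessing


def read_lines(path):
    """Worker: runs in a child process and reads one file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # The returned list is pickled and sent back to the parent process;
        # this is the IPC cost, and it grows with the size of the file.
        return path, [line.rstrip("\n") for line in handle]


def read_all(paths, processes=None):
    """Read many files in parallel; processes=None lets Pool pick the worker count."""
    with multiprocessing.Pool(processes) as pool:
        # imap_unordered yields each result as soon as its worker finishes.
        return dict(pool.imap_unordered(read_lines, paths))
```

On Windows, read_all has to be called from under an if __name__ == "__main__": guard, because the child processes import the main module when they start.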
In the CPython implementation of Python, only one thread at a time can be executing Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable if threads would offer improvement.
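If you want to check that doubt for your own files rather than guess, a thread-based variant is short enough to time against the sequential version. This is a sketch assuming Python 3; read_lines and read_all_threaded are made-up names and max_workers=4 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def read_lines(path):
    """Read one file and return its lines without the trailing newline."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]


def read_all_threaded(paths, max_workers=4):
    """Read files in a thread pool and return a dict of path -> lines."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as paths.
        return dict(zip(paths, executor.map(read_lines, paths)))
```

The file reads can overlap, but the list comprehension that strips each line is Python bytecode and runs one thread at a time, which is the limitation described above.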
Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database. Is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
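One common way to reduce the database cost, if your database supports it, is to insert all lines of a file in one batch and one transaction instead of committing line by line. The sketch below assumes Python 3 and uses the standard-library sqlite3 module purely as a stand-in; the table name lines and the function store_lines are hypothetical.

```python
import sqlite3


def store_lines(connection, path, lines):
    """Insert all lines of one file in a single transaction."""
    # Hypothetical table; replace with your real schema.
    connection.execute("CREATE TABLE IF NOT EXISTS lines (path TEXT, line TEXT)")
    # Using the connection as a context manager commits on success
    # and rolls back if an exception is raised.
    with connection:
        connection.executemany(
            "INSERT INTO lines VALUES (?, ?)", ((path, line) for line in lines)
        )
```

Whether this helps, and by how much, is again something to confirm with the profiler on your own data.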