Python: reading fifty million files with multiprocessing, using threads/multiprocessing to read multiple files

I am currently pulling .txt files from the list of paths in FileNameList, which works. My main problem is that it is too slow when there are too many files.

I am using this code to print the list of txt files:

import os

# FileNameList is my set of folders from my path
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        # Test the extension: '"txt" in file' would also match names such as "txt_notes.csv"
        if file.endswith(".txt"):
            filename = os.path.join(filefolder, file)
            print(filename)

Any help or suggestion for using threads/multiprocessing to make the reading faster is welcome. Thanks in advance.

Solution

So you mean there is no way to speed this up? I ask because my scenario is to read a bunch of files, then read each line of them and store it in the database.

The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, measure where the problem lies:

Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database.

Run that program under a profiler to see where the program is spending most of its time.

Only then do you know which part of the program needs speeding up.
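
The measuring steps above can be sketched roughly as follows. This is only a minimal illustration, not a finished program: the function names are made up for this example, the files are assumed to be UTF-8 text, and the database step is left as a comment because the question does not say which database is used.

```python
import cProfile
import io
import pstats


def read_lines(path):
    """Read one text file and return its lines without trailing newlines."""
    # Assumption: files are UTF-8 text; undecodable bytes are replaced.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


def process_sequentially(paths):
    """Baseline: read every file in turn and count the lines seen."""
    total = 0
    for path in paths:
        lines = read_lines(path)
        # A real program would insert the lines into the database here.
        total += len(lines)
    return total


def profile_baseline(paths, top=10):
    """Run the baseline under cProfile; return (line count, report text)."""
    profiler = cProfile.Profile()
    total = profiler.runcall(process_sequentially, paths)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return total, report.getvalue()
```

Printing the report returned by profile_baseline shows which calls take the most cumulative time, which is the information needed before deciding what to speed up.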

Here are some pointers nevertheless.

Speeding up the reading of files can be done using mmap.
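
A small sketch of what reading through mmap could look like. The function name count_lines_mmap is hypothetical, and whether this is actually faster than a plain read depends on the operating system and the file sizes, so it should be measured rather than assumed.

```python
import mmap


def count_lines_mmap(path):
    """Count the lines of a file by iterating over a read-only memory map."""
    with open(path, "rb") as handle:
        try:
            # Length 0 maps the whole file.
            mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            # Empty files cannot be memory-mapped.
            return 0
        with mapped:
            count = 0
            # readline() returns b"" once the end of the map is reached.
            for _ in iter(mapped.readline, b""):
                count += 1
            return count
```

The same loop could hand each line to further processing instead of only counting it.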

You could use multiprocessing.Pool to spread out the reading of multiple files over different cores. But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
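
As a rough sketch of that approach, the worker below returns only a small summary (the path and a line count) so that little data has to travel back to the parent over IPC. The names count_lines and count_lines_parallel and the chunksize value are assumptions made for this illustration.

```python
import multiprocessing


def count_lines(path):
    """Worker: read one file and return only a small (path, count) summary."""
    with open(path, "rb") as handle:
        return path, sum(1 for _ in handle)


def count_lines_parallel(paths, processes=None, chunksize=64):
    """Spread the files over worker processes; return {path: line count}."""
    # processes=None lets the pool use one worker per available core.
    with multiprocessing.Pool(processes) as pool:
        # imap_unordered yields results as they finish, in arbitrary order.
        return dict(pool.imap_unordered(count_lines, paths, chunksize))
```

On platforms that start worker processes by spawning (Windows, for instance), the call to count_lines_parallel has to sit under an if __name__ == "__main__": guard. If the workers returned the full file contents instead of a count, the IPC cost mentioned above would grow accordingly.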

In the CPython implementation of Python, only one thread at a time can be executing Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable if threads would offer improvement.
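
For comparison, a thread-based version might look like the sketch below. The helper names and the default of 8 workers are assumptions for illustration. Only the blocking reads can overlap here; any processing of the contents in Python still runs one thread at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    """Read one file completely and return its raw bytes."""
    with open(path, "rb") as handle:
        return handle.read()


def read_all_threaded(paths, max_workers=8):
    """Read the files using a thread pool; return {path: contents}."""
    paths = list(paths)  # materialize so the paths can be paired with results
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input paths.
        return dict(zip(paths, executor.map(read_bytes, paths)))
```

Timing this against a plain sequential loop on the real data is the only way to tell whether the threads help.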

Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database. Is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
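
The question does not say which database is involved, so the following is only an illustration using sqlite3 from the standard library as a stand-in. The table name, its columns, and the helper names are hypothetical. It inserts all lines of one file in a single transaction, which is one common way to reduce per-row overhead; other databases will behave differently.

```python
import sqlite3


def open_database(location=":memory:"):
    """Open (or create) a database with a hypothetical 'lines' table."""
    connection = sqlite3.connect(location)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS lines "
        "(path TEXT, line_no INTEGER, content TEXT)"
    )
    return connection


def store_lines(connection, path, lines):
    """Insert all lines of one file in a single transaction."""
    rows = ((path, number, text) for number, text in enumerate(lines, start=1))
    # Used as a context manager, the connection commits on success
    # and rolls back if an exception is raised.
    with connection:
        connection.executemany(
            "INSERT INTO lines (path, line_no, content) VALUES (?, ?, ?)",
            rows,
        )
```

Whether the default in-memory location or a file on disk is used changes the cost considerably, which is the point made above.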
