Python running slower and slower while processing data: garbage collection issue?

weixin_39843093

Article link: https://blog.csdn.net/weixin_39843093/article/details/113965552

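So I have code that grabs a list of files from a directory that initially held more than 14 million files. This is a hexa-core machine with 20 GB of RAM running Ubuntu 14.04 desktop, and just grabbing the list of files takes hours. I haven't actually timed it.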
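To illustrate why collecting the list is itself expensive, here is a minimal sketch of walking a directory lazily instead of building the whole list first. It assumes Python 3.6 or later (the code in this post is Python 2), and `iter_file_paths` is a hypothetical helper name, not part of the original program. Entries that are moved away while the scan is in progress may or may not still be reported, so an existence check like the one in step 1 stays useful.

```python
import os
import tempfile


def iter_file_paths(directory):
    """Yield the path of each regular file in `directory`, one at a time."""
    # os.scandir returns an iterator, so no multi-million-element list is built.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                yield entry.path


if __name__ == "__main__":
    # Small self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        for i in range(5):
            with open(os.path.join(demo_dir, "visit_{}.raw".format(i)), "w") as fh:
                fh.write("{}")
        for path in iter_file_paths(demo_dir):
            print(path)
```

With a generator like this, processing can start on the first file immediately, and memory use no longer depends on how many files the directory holds.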

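Over the last week or so I have been running code that does nothing more than gather this list of files, open each file to determine when it was created, and move it into a directory based on the month and year of creation. (The files have all been scp'd and rsync'd, so the timestamps provided by the OS are meaningless at this point, hence opening each file.)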

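When I first started running this loop, it moved 1000 files in 90 seconds. After several hours, that 90 seconds became 2.5 minutes, then 4, then 5, then 9, and eventually 15 minutes. So I shut it down and started over.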

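I noticed today that, once it had finished gathering a list of more than 9 million files, moving 1000 files took 15 minutes right from the start. I have just shut the process down again and rebooted the machine, because the time to move 1000 files had climbed to more than 90 minutes.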

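I had hoped to find some way of using a while + list.pop() style strategy to free memory as the loop progresses. I then found a couple of posts saying it could be done with for i in list: ... list.remove(...), but that this is a terrible idea.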
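The two patterns mentioned here can be sketched as follows. This is only an illustration in Python 3; `drain` and `buggy_remove_loop` are hypothetical names. Popping from the end of a list is cheap, whereas removing items from a list while iterating over it both costs a linear search per removal and silently skips elements. Whether popping actually returns memory to the operating system depends on the interpreter and allocator, so treat that part as an assumption to measure.

```python
def drain(paths, handle):
    """Process every path, dropping each one from `paths` once it is handled."""
    processed = 0
    while paths:
        path = paths.pop()  # removes from the END of the list, which is O(1)
        handle(path)
        processed += 1
    return processed


def buggy_remove_loop(items):
    """Show why `for x in items: items.remove(x)` is a bad idea."""
    for item in items:
        # Each remove() is a linear search plus a shift, and it moves the
        # remaining elements under the running iterator, so some are skipped.
        items.remove(item)
    return items


if __name__ == "__main__":
    files = ["a.raw", "b.raw", "c.raw", "d.raw"]
    seen = []
    print(drain(files, seen.append), files)         # prints: 4 []
    print(buggy_remove_loop(["a", "b", "c", "d"]))  # prints: ['b', 'd']
```

Note that `drain` visits the files in reverse order; if order matters, `collections.deque` with `popleft()` gives the same constant-time behaviour from the front.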

Here is the code:

    from basicconfig.startup_config import *

    # Standard-library names used below, listed explicitly in case the
    # star import above does not already provide them.
    import json
    import os
    import time
    from datetime import datetime

    arc_dir = '/var/www/data/visits/'


    def step1_move_files_to_archive_dirs(files):
        """
        Move each raw visit file into a per-month archive directory.

        :return:
        """
        cntr = 0
        for f in files:
            cntr += 1
            if not php_basic_files.file_exists(f):
                continue
            try:
                visit = json.loads(php_basic_files.file_get_contents(f))
            except Exception:
                # Unreadable or malformed file: skip it.
                continue
            fname = php_basic_files.basename(f)
            try:
                dt = datetime.fromtimestamp(visit['Entrance Time'])
            except KeyError:
                continue
            mYr = dt.strftime("%B_%Y")

            # Move the lead to Monthly archive
            arc_path = arc_dir + mYr + '/'
            if not os.path.exists(arc_path):
                os.makedirs(arc_path, 0o777)
            if not os.path.exists(arc_path):
                print("Directory: {} was not created".format(arc_path))
            else:
                # Move the file to the archive
                newFile = arc_path + fname
                #print("File moved to {}".format(newFile))
                os.rename(f, newFile)

            # Compare with ==; "is 0" tests object identity, not value.
            if cntr % 1000 == 0:
                print("{} files moved ({})".format(cntr, datetime.fromtimestamp(time.time()).isoformat()))


    def step2_combine_visits_into_1_file():
        """
        Merge the raw visit files of each archive directory into one .arc file.

        :return:
        """
        file_dirs = php_basic_files.glob(arc_dir + '*')
        for fd in file_dirs:
            arc_files = php_basic_files.glob(fd + '*.raw')
            arc_fname = arc_dir + php_basic_str.str_replace('/', '', php_basic_str.str_replace(arc_dir, '', fd)) + '.arc'
            try:
                # Decode the existing archive so it is a dict, not a raw string.
                arc_file_data = json.loads(php_basic_files.file_get_contents(arc_fname))
            except Exception:
                arc_file_data = {}
            for f in arc_files:
                uniqID = moduleName = php_adv_str.fetchBefore('.', php_basic_files.basename(f))
                if uniqID not in arc_file_data:
                    visit = json.loads(php_basic_files.file_get_contents(f))
                    arc_file_data[uniqID] = visit
            php_basic_files.file_put_contents(arc_fname, json.dumps(arc_file_data))


    def main():
        """
        :return:
        """
        files = php_basic_files.glob('/var/www/html/ver1/php/VisitorTracking/data/raw/*')
        print("Num of Files: {}".format(len(files)))
        step1_move_files_to_archive_dirs(files)
        step2_combine_visits_into_1_file()

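Notes:

basicconfig is essentially a set of constants I keep for the environment, plus a few commonly used libraries such as all of the php_basic_* libraries. (I used PHP for years before picking up Python, so I built a library that emulates the more common functions I used, to get up and running with Python faster.)

The step1 def is as far as the program gets so far. The step2 def could, and most likely should, be run in parallel. However, I figure I/O is the bottleneck, and doing even more I/O in parallel would probably slow all of the functions down a lot more. (I have been tempted to rsync the archive directories to another machine for aggregation, getting the parallel speed without the I/O bottleneck, but I figure the rsync would also be pretty slow.)

The files themselves are all about 3 KB each, so not very large.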

----- Final thoughts -----

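As I said, the loop does not, at least as far as I can tell, keep anything in memory beyond the file it currently has open, so memory shouldn't be an issue. However, I notice that only 1.2 GB of RAM is in use right now, compared with more than 12 GB before. A large part of that 12 GB could be the 14 million file names and paths. I have only just restarted the processing, so for the next several hours Python will be gathering the list of files, and that list is not in memory yet.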
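A rough way to check how much of that memory the file list alone could account for is to extrapolate from the size of a sample path string. The sketch below is an estimate under stated assumptions: a 64-bit build with 8 bytes per list slot, a hypothetical file name (only the directory comes from the post), and no allowance for allocator overhead or for temporary copies made while the list is built.

```python
import sys


def estimate_list_bytes(sample_paths, total_count):
    """Extrapolate the memory used by a list of `total_count` path strings."""
    # Average size of one string object, measured on the sample.
    avg_str = sum(sys.getsizeof(p) for p in sample_paths) / float(len(sample_paths))
    pointer = 8  # bytes per list slot on a 64-bit build (assumption)
    return int(total_count * (avg_str + pointer))


if __name__ == "__main__":
    # Hypothetical file name; only the directory comes from the post.
    sample = ["/var/www/html/ver1/php/VisitorTracking/data/raw/0123456789abcdef.raw"]
    total = estimate_list_bytes(sample, 14000000)
    print("about {:.1f} GB".format(total / 1e9))
```

Comparing the printed figure with the observed 12 GB gives a first indication of whether the file list explains the memory use or whether something else is holding on to memory.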

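So I am wondering whether this is a garbage collection issue or whether I am missing something else. Why does it slow down as it works through the loop?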
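One way to probe the garbage collection hypothesis is to time equal-sized batches of the same workload with the cyclic collector enabled and then disabled. The following is a sketch in Python 3, not a diagnosis: `time_batches` and `compare_gc` are hypothetical helpers, and the demo uses a stand-in workload (parsing a small JSON record) because the real loop moves files and cannot simply be run twice.

```python
import gc
import json
import time


def time_batches(items, work, batch_size=1000):
    """Run `work` on every item and return the seconds spent on each full batch."""
    timings = []
    start = time.time()
    for count, item in enumerate(items, 1):
        work(item)
        if count % batch_size == 0:
            now = time.time()
            timings.append(now - start)
            start = now
    return timings


def compare_gc(items, work, batch_size=1000):
    """Time the same workload with the cyclic collector on, then off."""
    results = {}
    for label, enabled in (("gc on", True), ("gc off", False)):
        if enabled:
            gc.enable()
        else:
            gc.disable()
        try:
            results[label] = time_batches(items, work, batch_size)
        finally:
            gc.enable()  # always restore the default state
    return results


if __name__ == "__main__":
    # Stand-in workload: parse a small JSON record, as step 1 does per file.
    records = ['{"Entrance Time": 1425168000}'] * 5000
    for label, timings in compare_gc(records, json.loads).items():
        print(label, ["{:.4f}".format(t) for t in timings])
```

If batch times grow with the collector on but stay flat with it off, garbage collection is a plausible culprit; if they grow in both cases, the cause is more likely elsewhere, for example in the file system or in I/O.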