In Python, a directory tree is usually traversed with code like the following (Python 2 syntax):
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print root, "consumes",
    print sum([getsize(join(root, name)) for name in files]),
    print "bytes in", len(files), "non-directory files"
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
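Here root is the path of the directory currently being visited (the outermost folder on the first iteration), dirs is the list of subfolder names directly under root, and files is the list of file names directly under root.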
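To make the three values concrete, here is a minimal sketch in Python 3 (the snippet above is Python 2). It builds a small throwaway tree and records what each iteration of os.walk yields. The helper names summarize_tree and build_demo_tree, and the file names, are made up for illustration.

```python
import os
import tempfile


def summarize_tree(top):
    """Return {relative_dir: (sorted subdir names, sorted file names, total bytes)}."""
    summary = {}
    for root, dirs, files in os.walk(top):
        total = sum(os.path.getsize(os.path.join(root, name)) for name in files)
        rel = os.path.relpath(root, top)
        summary[rel] = (sorted(dirs), sorted(files), total)
    return summary


def build_demo_tree(top):
    """Create top/a.txt (5 bytes) and top/sub/b.txt (3 bytes); hypothetical names."""
    os.mkdir(os.path.join(top, "sub"))
    with open(os.path.join(top, "a.txt"), "w") as f:
        f.write("hello")
    with open(os.path.join(top, "sub", "b.txt"), "w") as f:
        f.write("abc")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    demo_summary = summarize_tree(tmp)
    for rel, (dirs, files, total) in sorted(demo_summary.items()):
        print(rel, dirs, files, total)
```

Each key of the summary is one root, so the top-level folder and every subfolder each get exactly one entry.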
While reading the source today I got a bit lost, because it combines a generator (yield) with recursion:
def walk(top, topdown=True, onerror=None, followlinks=False):
    import pdb  # I added these two lines myself to enable single-step debugging.
    pdb.set_trace()  # n: next step, p xxx: print xxx, l: list the current source lines
    islink, join, isdir = path.islink, path.join, path.isdir
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        names = listdir(top)
    except error, err:
        if onerror is not None:
            onerror(err)
        return
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in walk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs
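For readers on Python 3, the same recursive-generator shape can be sketched as below. This is my own simplified stand-in, not the standard library's current implementation: simple_walk and normalized are hypothetical names, error handling is reduced to skipping unreadable directories, and symlinked directories are never followed. It only aims to show how the recursion and the yields fit together, and it checks itself against os.walk on a small temporary tree.

```python
import os
import tempfile


def simple_walk(top, topdown=True):
    """Simplified stand-in for os.walk: no onerror callback, never follows symlinks."""
    try:
        names = os.listdir(top)
    except OSError:
        return  # unreadable directory: skip it silently
    dirs, nondirs = [], []
    for name in names:
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = os.path.join(top, name)
        if not os.path.islink(new_path):
            # "yield from" replaces the Python 2 idiom "for x in walk(...): yield x"
            yield from simple_walk(new_path, topdown)
    if not topdown:
        yield top, dirs, nondirs


def normalized(triples):
    """Sort entries so results compare equal regardless of listing order."""
    return sorted((root, sorted(dirs), sorted(files)) for root, dirs, files in triples)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "b", "d"))
    for rel in ("a.jnt", os.path.join("b", "c.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    ours = normalized(simple_walk(tmp))
    stdlib = normalized(os.walk(tmp))
    print(ours == stdlib)
```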
My main question was about the topdown parameter:
When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to prune the
search, or to impose a specific order of visiting.
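According to the docs, when topdown=True you can modify the list of subfolders in place, and walk will then recurse only into the folders that are still in the list, which can cut down the amount of searching??
OK, that left me scratching my head, so the only option was to single-step through it.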
try:
    # Note that listdir and error are globals in this module due
    # to earlier import-*.
    names = listdir(top)
except error, err:
    if onerror is not None:
        onerror(err)
    return
dirs, nondirs = [], []
for name in names:
    if isdir(join(top, name)):
        dirs.append(name)
    else:
        nondirs.append(name)
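This part lists the contents of the folder top and sorts the names into subfolders (dirs) and files (nondirs). Nothing difficult here.
My directory layout is: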
E:\projects\myApp_emits\myApp
E:\projects\myApp_emits\myApp\a.jnt
E:\projects\myApp_emits\myApp\b
E:\projects\myApp_emits\myApp\b\c.txt
My calling code is:
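import os
des_folder = 'e:/projects/myApp_emits/myApp'
a = os.walk(des_folder, topdown=True)  # a is a generator; nothing is scanned yet
parent, dirs, files = a.next()  # first yield: the top-level folder
print parent, dirs, files
parent, dirs, files = a.next()  # second yield: the subfolder b
print parent, dirs, files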
Let's look at the topdown=True case first. The remaining source simplifies to:
if topdown:
    yield top, dirs, nondirs
for name in dirs:
    new_path = join(top, name)
    if followlinks or not islink(new_path):
        for x in walk(new_path, topdown, onerror, followlinks):
            yield x
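A quick explanation:
Because os.walk contains yield, it is no longer an ordinary function but a generator function. Calling it returns a generator, call it a, and each call to a.next() returns the values that follow the next yield.
For example, with yield top, dirs, nondirs, each a.next() returns top, dirs, nondirs and the generator is then suspended until the next next() call. At that point execution resumes from the statement right after that yield and runs until it reaches another yield; if there is none, the generator finishes and next() raises StopIteration. (Funny, how did I end up giving a yield tutorial...)
So with our code, the result is: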
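The suspend-and-resume behaviour is easier to see on a toy generator than inside os.walk. The sketch below is my own illustration (countdown and events are made-up names) and uses the built-in next(gen), which works on Python 2.6+ and Python 3; the a.next() method form is Python 2 only.

```python
events = []


def countdown(n):
    """Generator function: the body runs only while a caller asks for values."""
    events.append("started")
    while n > 0:
        yield n  # suspend here and hand n to the caller
        events.append("resumed after %d" % n)
        n -= 1
    events.append("finished")


gen = countdown(2)               # nothing runs yet; we only get a generator object
before_first_next = list(events)
first = next(gen)                # runs the body up to the first yield
second = next(gen)               # resumes after the first yield, stops at the second
try:
    next(gen)                    # no more yields: the body ends, StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True
print(first, second, exhausted, events)
```

Note that "started" is only recorded on the first next() call, not when countdown(2) is called.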
e:/projects/myApp_emits/myApp ['b'] ['a.jnt']
e:/projects/myApp_emits/myApp\b [] ['c.txt']
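Now the topdown=False case, where the source simplifies to:
for name in dirs:
    new_path = join(top, name)
    if followlinks or not islink(new_path):
        for x in walk(new_path, topdown, onerror, followlinks):
            yield x
if not topdown:
    yield top, dirs, nondirs
With topdown=False in the same calling code, the result is: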
e:/projects/myApp_emits/myApp\b [] ['c.txt']
e:/projects/myApp_emits/myApp ['b'] ['a.jnt']
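Comparing the two results, the topdown parameter turns out to be simple: with True the top-level directory is yielded first, and with False the walk starts from the subdirectories and yields the top-level directory last.
That also answers my question about the docs. With topdown=True, the dirs list is handed to the caller before the for name in dirs loop runs, so removing names from it in place changes which subfolders walk recurses into. With topdown=False the recursion has already happened by the time the list is yielded, so modifying it has no effect.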
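Both points, the visiting order and the in-place pruning, can be checked with a small experiment. This is a hedged sketch in Python 3 on a temporary tree that mirrors my layout (a.jnt, b/c.txt) plus a hypothetical CVS folder; visited_dirs is a helper name I made up for the demo.

```python
import os
import tempfile


def visited_dirs(top, topdown=True, skip=()):
    """Return directory paths (relative to top) in the order os.walk yields them.

    Names in skip are removed from dirs in place after each yield.
    """
    order = []
    for root, dirs, files in os.walk(top, topdown=topdown):
        # Slice assignment mutates the same list object that walk handed to us.
        dirs[:] = [d for d in dirs if d not in skip]
        order.append(os.path.relpath(root, top))
    return order


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "b"))
    os.makedirs(os.path.join(tmp, "CVS", "inner"))
    for rel in ("a.jnt", os.path.join("b", "c.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")

    top_first = visited_dirs(tmp, topdown=True)
    top_last = visited_dirs(tmp, topdown=False)
    pruned = visited_dirs(tmp, topdown=True, skip=("CVS",))
    not_pruned = visited_dirs(tmp, topdown=False, skip=("CVS",))
    print(top_first, top_last, pruned, not_pruned)
```

With topdown=True the CVS folder and everything under it are never visited; with topdown=False the same in-place edit comes too late and the whole tree is still walked.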
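P.S.:
Take it slowly when single-stepping through recursion, or you will get dizzy. This example is a fairly gentle one; try tornado's yield-plus-decorator code and you will be lost within minutes.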