Recursion exercise: parsing a doc file's table of contents, implemented in Python

嘤鸣求友

Category: python  Tags: recursion, Python

Article link: https://blog.csdn.net/weiran2009/article/details/86723566

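Recently I have been working with some very large Word documents whose tables of contents (TOC) alone run to five or six pages. Since the TOC uses a standard format, I decided to write a TOC parsing tool that shows how many lowest-level entries each TOC level contains.
A sample TOC is shown below (the trailing number is the page number):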

4.1 数学习题      14
4.1.1 选择题      14
4.1.1.1 矩阵分析      14
4.1.1.1.1 矩阵分析习题1      14
4.1.1.1.2 矩阵分析习题2      15
4.1.1.2 微积分      16
4.1.1.2.1 微积分习题1 [已完成]      16
4.1.1.2.2 微积分习题2 [已完成]      16
4.1.1.2.3 微积分习题3 [已完成]      17
4.1.1.2.4 微积分习题4      17
4.1.1.3 概率论      18
4.1.1.3.1 概率论习题1      18
4.1.2 填空题      19
4.1.2.1 矩阵分析      19
4.1.2.1.1 矩阵分析练习题1      19
4.2 英语
4.2.1 历年真题      19
4.2.1.1 完形填空      19
4.2.1.1.1 完形填空2016年真题      19
4.2.1.1.2 完形填空2017年真题 [已完成]      19
4.2.1.1.3 完形填空2018年真题 [已完成]      19
4.2.1.2 阅读理解      19
4.2.1.2.1 阅读理解2016年真题 [已完成]      19
4.2.1.2.2 阅读理解2017年真题 [已完成]      19
4.2.1.2.3 阅读理解2018年真题 [已完成]      19

In this example the lowest-level entries are level-5 entries, so under our rule the statistics to output are:
4.1 数学习题      8
4.1.1 选择题      7
4.1.1.1 矩阵分析      2

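Approach:

1. Build a suitable data structure to store the information for each TOC entry;
2. Determine the level (depth) of a TOC line, so we know which level the current entry belongs to;
3. Use recursion to build the TOC data structure;
4. Use recursion to output the TOC data structure.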

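Solution process:

The hardest part of this exercise is clearly step 3, using recursion to build the TOC data structure. Let's first solve a few simpler problems.
1. Build a suitable data structure to store the information for each TOC entry.
I chose Python's built-in dict type. After a good deal of trial and error, the following structure turned out to fit well: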

{'this_content': this_line, 'sub_content_lst': [], 'count': 0}

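The 'this_content' field stores the text of the current line, which makes output easy.
The 'sub_content_lst' field stores the children of the current TOC line. It is a list whose items are the next-level entries, each a dict with this same structure (the list is empty if the current entry is at the lowest level).
The 'count' field is a number: how many lowest-level entries the current TOC line contains (note the difference between lowest-level entries and next-level entries).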
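
To make the structure concrete, here is a small hand-built tree for the `4.1.1.1` branch of the sample. The helper `check_counts` is hypothetical (it is not part of the original code); it only illustrates the invariant that a parent's 'count' should equal the sum of its children's counts, assuming each lowest-level entry carries a count of 1.

```python
# Hand-built example of the node structure described above.
leaf_1 = {'this_content': '4.1.1.1.1 矩阵分析习题1', 'sub_content_lst': [], 'count': 1}
leaf_2 = {'this_content': '4.1.1.1.2 矩阵分析习题2', 'sub_content_lst': [], 'count': 1}
parent = {'this_content': '4.1.1.1 矩阵分析', 'sub_content_lst': [leaf_1, leaf_2], 'count': 2}


def check_counts(node):
    """Hypothetical helper: True if every parent's count equals the sum of its children's counts."""
    children = node['sub_content_lst']
    if not children:
        return True  # nothing to verify below a node without children
    if node['count'] != sum(child['count'] for child in children):
        return False
    return all(check_counts(child) for child in children)


print(check_counts(parent))
```

A check like this can be handy while debugging the recursive builder, since a wrong count usually points at a child that was dropped or attached to the wrong parent.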

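2. Determine the level (depth) of a TOC line, so we know which level the current entry belongs to.
This one is fairly simple; regular expression matching is enough: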

import re


def find_content_deep(line):
    # One pattern per TOC level; every dot is escaped so it matches only a literal '.'
    pat_deep_1 = r'^\d+ '
    pat_deep_2 = r'^\d+\.\d+ '
    pat_deep_3 = r'^\d+\.\d+\.\d+ '
    pat_deep_4 = r'^\d+\.\d+\.\d+\.\d+ '
    pat_deep_5 = r'^\d+\.\d+\.\d+\.\d+\.\d+ '
    pat_lst = [pat_deep_1, pat_deep_2, pat_deep_3, pat_deep_4, pat_deep_5]
    for index, pat_deep in enumerate(pat_lst):
        if re.search(pat_deep, line):
            return index + 1
    raise ValueError('unknown TOC heading format')

Of course, you can adapt the regular expressions to your own TOC format.
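
If the numbering is always dot-separated digits, one possible adaptation is to match the whole numbering once and count its parts instead of keeping one pattern per level. The sketch below is an assumption about how that could look; `find_content_depth_generic` and `NUMBERING_RE` are hypothetical names, and unlike `find_content_deep` this variant does not cap the depth at 5 and accepts any whitespace after the numbering.

```python
import re

# One or more dot-separated numbers followed by a whitespace character, e.g. "4.1.1 ".
NUMBERING_RE = re.compile(r'^(\d+(?:\.\d+)*)\s')


def find_content_depth_generic(line):
    """Hypothetical variant: depth = number of dot-separated parts in the leading numbering."""
    match = NUMBERING_RE.match(line)
    if match is None:
        raise ValueError(f'unrecognized TOC heading: {line!r}')
    return match.group(1).count('.') + 1


print(find_content_depth_generic('4.1 数学习题'))                   # 2
print(find_content_depth_generic('4.1.1.2.3 微积分习题3 [已完成]'))  # 5
```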

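3. Use recursion to build the TOC data structure.
I won't jump straight to the answer here. If you are interested, try it yourself first; below are the steps I went through while writing my own solution.

This is the first version I kept that actually ran: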

def iter_calc(this_line, f, clac_max_deep):
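    # this_line = f.readline().strip().split('\t')[0]
    # Get the depth of the current TOC line
    cur_deep = find_content_deep(this_line)
    # Build the node for the current TOC line
    cur_content_info_dct = {'this_content': this_line, 'sub_content_lst': [], 'count': 0}
    # An entry at the maximum depth gets a count of 1 and returns immediately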
    if cur_deep == clac_max_deep:
        cur_content_info_dct['count'] = 1
        return cur_content_info_dct, None
    while True:
        next_line = f.readline().strip().split('\t')[0]
        next_deep = find_content_deep(next_line)
        if cur_deep < next_deep:
            sub_content_info_dct, return_line = iter_calc(next_line, f, clac_max_deep)
            if return_line is None:
                cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
                cur_content_info_dct['count'] += sub_content_info_dct['count']
            else:
                cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
                cur_content_info_dct['count'] += sub_content_info_dct['count']
                return sub_content_info_dct, return_line
        elif cur_deep == next_deep:
            return cur_content_info_dct, next_line
        elif cur_deep > next_deep:
            return cur_content_info_dct, next_line

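The problem with this version is that it only achieves a single level of recursion: at each non-leaf level only one entry is kept.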

Below is the second running version I kept. Its problem is that the recursion is incomplete:

def iter_calc_2(this_line, f, clac_max_deep):
    cur_deep = find_content_deep(this_line)
    cur_content_info_dct = {'this_content': this_line, 'sub_content_lst': [], 'count': 0}
    next_line = f.readline().strip().split('\t')[0]
    next_deep = find_content_deep(next_line)
    while True:
        if cur_deep == clac_max_deep:
            cur_content_info_dct['count'] = 1
            return cur_content_info_dct, next_line
        elif cur_deep >= next_deep:
            return cur_content_info_dct, next_line
        elif cur_deep < next_deep:
            sub_content_info_dct, next_line = iter_calc_2(next_line, f, clac_max_deep)
            next_deep = find_content_deep(next_line)
            cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            cur_content_info_dct['count'] += sub_content_info_dct['count']
            print(cur_content_info_dct)

Below is a version that does the job, but the code is too tangled:


def iter_calc_3(this_line, f, clac_max_deep):
    cur_deep = find_content_deep(this_line)
    cur_content_info_dct = {"this_content": this_line, "sub_content_lst": [], "count": 0}
    next_line = f.readline()
    if next_line:
        next_line = next_line.strip().split('\t')[0]
        next_deep = find_content_deep(next_line)
    else:
        if cur_deep == clac_max_deep:
            cur_content_info_dct['count'] = 1
            return cur_content_info_dct, next_line
        else:
            return cur_content_info_dct, next_line
    while True:
        if cur_deep == clac_max_deep:
            cur_content_info_dct['count'] = 1
            return cur_content_info_dct, next_line
        elif cur_deep >= next_deep:
            return cur_content_info_dct, next_line
        elif cur_deep < next_deep:
            sub_content_info_dct, next_line = iter_calc_3(next_line, f, clac_max_deep)
            if not next_line:
                cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
                cur_content_info_dct['count'] += sub_content_info_dct['count']
                return cur_content_info_dct, next_line
            next_deep = find_content_deep(next_line)
            cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            cur_content_info_dct['count'] += sub_content_info_dct['count']
            # print(cur_content_info_dct)

After some cleanup, here is a tidier version:


def iter_calc_4(this_line, f, clac_max_deep):
    # Get the depth of the current TOC line
    cur_deep = find_content_deep(this_line)
    # Build the node for the current TOC line
    cur_content_info_dct = {"this_content": this_line, "sub_content_lst": [], "count": 0}
    # An entry at the maximum depth is a lowest-level entry and counts as 1
    if cur_deep == clac_max_deep:
        cur_content_info_dct['count'] = 1
    # Read the next TOC line
    next_line = f.readline()
    # At end of file there is nothing left to attach, so return
    if not next_line:
        return cur_content_info_dct, next_line
    # Strip the page number and get the depth of the next line
    next_line = next_line.strip().split('\t')[0]
    next_deep = find_content_deep(next_line)
    while True:
        # If the next line is at the same or a higher level, return directly
        if cur_deep >= next_deep:
            return cur_content_info_dct, next_line
        # Otherwise the next line is deeper: recurse, fill "sub_content_lst" and accumulate the count
        sub_content_info_dct, next_line = iter_calc_4(next_line, f, clac_max_deep)
        cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
        cur_content_info_dct['count'] += sub_content_info_dct['count']
        # next_line was produced by the recursive call, so check for end of file again
        if not next_line:
            return cur_content_info_dct, next_line
        next_deep = find_content_deep(next_line)
        # print(cur_content_info_dct)
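
For experimenting without a file, the same lookahead idea can be sketched on an in-memory list, returning the index of the first unconsumed line instead of the line itself. This is only an illustration under stated assumptions: `depth_of`, `build_node` and `build_tree` are hypothetical names, and the depth is taken from the number of dot-separated parts in the numbering rather than from `find_content_deep`.

```python
import re

NUMBERING_RE = re.compile(r'^(\d+(?:\.\d+)*)\s')


def depth_of(line):
    """Hypothetical depth helper: number of dot-separated parts in the leading numbering."""
    match = NUMBERING_RE.match(line)
    if match is None:
        raise ValueError(f'unrecognized TOC heading: {line!r}')
    return match.group(1).count('.') + 1


def build_node(lines, index, max_depth):
    """Build the node for lines[index]; return (node, index of the first line not consumed)."""
    line = lines[index]
    depth = depth_of(line)
    node = {'this_content': line, 'sub_content_lst': [], 'count': 0}
    if depth == max_depth:
        node['count'] = 1  # a lowest-level entry counts as 1
    index += 1
    # Every following line that is deeper than this one belongs to this subtree
    while index < len(lines) and depth_of(lines[index]) > depth:
        child, index = build_node(lines, index, max_depth)
        node['sub_content_lst'].append(child)
        node['count'] += child['count']
    return node, index


def build_tree(lines, max_depth):
    """Wrap all top-level entries in a '[content]' root node."""
    root = {'this_content': '[content]', 'sub_content_lst': [], 'count': 0}
    index = 0
    while index < len(lines):
        child, index = build_node(lines, index, max_depth)
        root['sub_content_lst'].append(child)
        root['count'] += child['count']
    return root


sample = [
    '4.1 数学习题', '4.1.1 选择题', '4.1.1.1 矩阵分析',
    '4.1.1.1.1 矩阵分析习题1', '4.1.1.1.2 矩阵分析习题2',
    '4.1.2 填空题', '4.1.2.1 矩阵分析', '4.1.2.1.1 矩阵分析练习题1',
]
tree = build_tree(sample, 5)
print(tree['count'])  # 3
```

Passing an index keeps the "who consumes the next line" question explicit, which is the same question the file-based version answers by returning `next_line`.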

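Note that so far we have not handled the super level (the top-level TOC entries also form a list). To keep the structure uniform we reuse the same node type from the recursion, but fill in its fields by hand: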

import json


def anaylse_doc_content_2():
    p_input_content = './content_file.txt'
    p_output_content = './output_content_file.txt'
    clac_max_deep = 5   # maximum TOC depth to count
    super_content_info_dct = {"this_content": '[content]', "sub_content_lst": [], "count": 0}
    with open(p_input_content, encoding='utf-8') as f:
        # Read the first line by hand; iter_calc_4 reads every line after it
        next_line = f.readline()
        if next_line:
            next_line = next_line.strip().split('\t')[0]
        else:
            raise Exception('empty table of contents')
        while True:
            sub_content_info_dct, next_line = iter_calc_4(next_line, f, clac_max_deep)
            super_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            super_content_info_dct['count'] += sub_content_info_dct['count']
            if not next_line:
                break
    print(super_content_info_dct)
    super_content_info_json = json.dumps(super_content_info_dct, ensure_ascii=False)
    print(super_content_info_json)



if __name__ == '__main__':
    anaylse_doc_content_2()
    print('all done, weiran 2019年2月10日')

Sample output:

{'this_content': '[content]', 'sub_content_lst': [{'this_content': '4.1 数学习题 \xa0 \xa0 \xa014', 'sub_content_lst': [{'this_content': '4.1.1 选择题 \xa0 \xa0 \xa014', 'sub_content_lst': [{'this_content': '4.1.1.1 矩阵分析 \xa0 \xa0 \xa014', 'sub_content_lst': [{'this_content': '4.1.1.1.1 矩阵分析习题1 \xa0 \xa0 \xa014', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.1.1.1.2 矩阵分析习题2 \xa0 \xa0 \xa015', 'sub_content_lst': [], 'count': 1}], 'count': 2}, {'this_content': '4.1.1.2 微积分 \xa0 \xa0 \xa016', 'sub_content_lst': [{'this_content': '4.1.1.2.1 微积分习题1 [已完成] \xa0 \xa0 \xa016', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.1.1.2.2 微积分习题2 [已完成] \xa0 \xa0 \xa016', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.1.1.2.3 微积分习题3 [已完成] \xa0 \xa0 \xa017', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.1.1.2.4 微积分习题4 \xa0 \xa0 \xa017', 'sub_content_lst': [], 'count': 1}], 'count': 4}, {'this_content': '4.1.1.3 概率论 \xa0 \xa0 \xa018', 'sub_content_lst': [{'this_content': '4.1.1.3.1 概率论习题1 \xa0 \xa0 \xa018', 'sub_content_lst': [], 'count': 1}], 'count': 1}], 'count': 7}, {'this_content': '4.1.2 填空题 \xa0 \xa0 \xa019', 'sub_content_lst': [{'this_content': '4.1.2.1 矩阵分析 \xa0 \xa0 \xa019', 'sub_content_lst': [{'this_content': '4.1.2.1.1 矩阵分析练习题1 \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}], 'count': 1}], 'count': 1}], 'count': 8}, {'this_content': '4.2 英语', 'sub_content_lst': [{'this_content': '4.2.1 历年真题 \xa0 \xa0 \xa019', 'sub_content_lst': [{'this_content': '4.2.1.1 完形填空 \xa0 \xa0 \xa019', 'sub_content_lst': [{'this_content': '4.2.1.1.1 完形填空2016年真题 \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.2.1.1.2 完形填空2017年真题 [已完成] \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.2.1.1.3 完形填空2018年真题 [已完成] \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}], 'count': 3}, {'this_content': '4.2.1.2 阅读理解 \xa0 \xa0 \xa019', 'sub_content_lst': [{'this_content': '4.2.1.2.1 阅读理解2016年真题 [已完成] \xa0 \xa0 \xa019', 
'sub_content_lst': [], 'count': 1}, {'this_content': '4.2.1.2.2 阅读理解2017年真题 [已完成] \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}, {'this_content': '4.2.1.2.3 阅读理解2018年真题 [已完成] \xa0 \xa0 \xa019', 'sub_content_lst': [], 'count': 1}], 'count': 3}], 'count': 6}], 'count': 6}], 'count': 14}
{"this_content": "[content]", "sub_content_lst": [{"this_content": "4.1 数学习题      14", "sub_content_lst": [{"this_content": "4.1.1 选择题      14", "sub_content_lst": [{"this_content": "4.1.1.1 矩阵分析      14", "sub_content_lst": [{"this_content": "4.1.1.1.1 矩阵分析习题1      14", "sub_content_lst": [], "count": 1}, {"this_content": "4.1.1.1.2 矩阵分析习题2      15", "sub_content_lst": [], "count": 1}], "count": 2}, {"this_content": "4.1.1.2 微积分      16", "sub_content_lst": [{"this_content": "4.1.1.2.1 微积分习题1 [已完成]      16", "sub_content_lst": [], "count": 1}, {"this_content": "4.1.1.2.2 微积分习题2 [已完成]      16", "sub_content_lst": [], "count": 1}, {"this_content": "4.1.1.2.3 微积分习题3 [已完成]      17", "sub_content_lst": [], "count": 1}, {"this_content": "4.1.1.2.4 微积分习题4      17", "sub_content_lst": [], "count": 1}], "count": 4}, {"this_content": "4.1.1.3 概率论      18", "sub_content_lst": [{"this_content": "4.1.1.3.1 概率论习题1      18", "sub_content_lst": [], "count": 1}], "count": 1}], "count": 7}, {"this_content": "4.1.2 填空题      19", "sub_content_lst": [{"this_content": "4.1.2.1 矩阵分析      19", "sub_content_lst": [{"this_content": "4.1.2.1.1 矩阵分析练习题1      19", "sub_content_lst": [], "count": 1}], "count": 1}], "count": 1}], "count": 8}, {"this_content": "4.2 英语", "sub_content_lst": [{"this_content": "4.2.1 历年真题      19", "sub_content_lst": [{"this_content": "4.2.1.1 完形填空      19", "sub_content_lst": [{"this_content": "4.2.1.1.1 完形填空2016年真题      19", "sub_content_lst": [], "count": 1}, {"this_content": "4.2.1.1.2 完形填空2017年真题 [已完成]      19", "sub_content_lst": [], "count": 1}, {"this_content": "4.2.1.1.3 完形填空2018年真题 [已完成]      19", "sub_content_lst": [], "count": 1}], "count": 3}, {"this_content": "4.2.1.2 阅读理解      19", "sub_content_lst": [{"this_content": "4.2.1.2.1 阅读理解2016年真题 [已完成]      19", "sub_content_lst": [], "count": 1}, {"this_content": "4.2.1.2.2 阅读理解2017年真题 [已完成]      19", "sub_content_lst": [], "count": 1}, {"this_content": "4.2.1.2.3 阅读理解2018年真题 [已完成]      19", 
"sub_content_lst": [], "count": 1}], "count": 3}], "count": 6}], "count": 6}], "count": 14}
all done, weiran 2019年2月10日

Process finished with exit code 0

Viewing the result in a JSON formatting tool also confirms that the output matches what we need:

{
	"this_content": "[content]",
	"sub_content_lst": [{
		"this_content": "4.1 数学习题",
		"sub_content_lst": [{
			"this_content": "4.1.1 选择题",
			"sub_content_lst": [{
				"this_content": "4.1.1.1 矩阵分析",
				"sub_content_lst": [{
					"this_content": "4.1.1.1.1 矩阵分析习题1",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.1.1.1.2 矩阵分析习题2",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 2
			}, {
				"this_content": "4.1.1.2 微积分",
				"sub_content_lst": [{
					"this_content": "4.1.1.2.1 微积分习题1 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.1.1.2.2 微积分习题2 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.1.1.2.3 微积分习题3 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.1.1.2.4 微积分习题4",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 4
			}, {
				"this_content": "4.1.1.3 概率论",
				"sub_content_lst": [{
					"this_content": "4.1.1.3.1 概率论习题1",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 1
			}],
			"count": 7
		}, {
			"this_content": "4.1.2 填空题",
			"sub_content_lst": [{
				"this_content": "4.1.2.1 矩阵分析",
				"sub_content_lst": [{
					"this_content": "4.1.2.1.1 矩阵分析练习题1",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 1
			}],
			"count": 1
		}],
		"count": 8
	}, {
		"this_content": "4.2 英语",
		"sub_content_lst": [{
			"this_content": "4.2.1 历年真题",
			"sub_content_lst": [{
				"this_content": "4.2.1.1 完形填空",
				"sub_content_lst": [{
					"this_content": "4.2.1.1.1 完形填空2016年真题",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.2.1.1.2 完形填空2017年真题 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.2.1.1.3 完形填空2018年真题 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 3
			}, {
				"this_content": "4.2.1.2 阅读理解",
				"sub_content_lst": [{
					"this_content": "4.2.1.2.1 阅读理解2016年真题 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.2.1.2.2 阅读理解2017年真题 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}, {
					"this_content": "4.2.1.2.3 阅读理解2018年真题 [已完成]",
					"sub_content_lst": [],
					"count": 1
				}],
				"count": 3
			}],
			"count": 6
		}],
		"count": 6
	}],
	"count": 14
}
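
A similar indented view can also be produced directly from Python by passing `indent` to `json.dumps`, without an external tool. The node below is a small made-up fragment of the tree, used only for illustration.

```python
import json

# A small fragment of the tree, written by hand for illustration.
node = {
    'this_content': '4.1.1.1 矩阵分析',
    'sub_content_lst': [
        {'this_content': '4.1.1.1.1 矩阵分析习题1', 'sub_content_lst': [], 'count': 1},
    ],
    'count': 1,
}

# ensure_ascii=False keeps the Chinese text readable; indent=4 pretty-prints the nesting.
pretty = json.dumps(node, ensure_ascii=False, indent=4)
print(pretty)
```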

Finally, all that remains is the output recursion, which is quite simple to implement:

4. Use recursion to output the TOC data structure.

import json


def anaylse_doc_content_2():
    p_input_content = './content_file.txt'
    p_output_content = './output_content_file.txt'
    clac_max_deep = 5   # maximum TOC depth to count
    super_content_info_dct = {"this_content": '[content]', "sub_content_lst": [], "count": 0}
    with open(p_input_content, encoding='utf-8') as f:
        # cur_content_info_dct = {"this_content": this_line, "sub_content_lst": [], "count": 0}
        next_line = f.readline()
        if next_line:
            next_line = next_line.strip().split('      ')[0]
        else:
            raise Exception('empty table of contents')
        while True:
            sub_content_info_dct, next_line = iter_calc_4(next_line, f, clac_max_deep)
            super_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            super_content_info_dct['count'] += sub_content_info_dct['count']
            if not next_line:
                break
    print(super_content_info_dct)
    super_content_info_json= json.dumps(super_content_info_dct, ensure_ascii=False)
    print(super_content_info_json)
    with open(p_output_content, 'w', encoding='utf-8') as f_output:
        formatting_output_content(super_content_info_dct, f_output)
    pass


# cur_content_info_dct = {'this_content': this_line, 'sub_content_lst': [], 'count': 0}
def formatting_output_content(dct_content, f_output):
    print('-' * 50)
    # print(dct_content)
    # print(dct_content['this_content'])
    # print(dct_content['count'])
    f_output.write(f"{dct_content['this_content']} :: {dct_content['count']}\n")
    for sub_dct in dct_content['sub_content_lst']:
        formatting_output_content(sub_dct, f_output)


if __name__ == '__main__':
    anaylse_doc_content_2()
    print('all done, weiran 2019年2月10日')

The output is then:

[content] :: 14
4.1 数学习题 :: 8
4.1.1 选择题 :: 7
4.1.1.1 矩阵分析 :: 2
4.1.1.1.1 矩阵分析习题1 :: 1
4.1.1.1.2 矩阵分析习题2 :: 1
4.1.1.2 微积分 :: 4
4.1.1.2.1 微积分习题1 [已完成] :: 1
4.1.1.2.2 微积分习题2 [已完成] :: 1
4.1.1.2.3 微积分习题3 [已完成] :: 1
4.1.1.2.4 微积分习题4 :: 1
4.1.1.3 概率论 :: 1
4.1.1.3.1 概率论习题1 :: 1
4.1.2 填空题 :: 1
4.1.2.1 矩阵分析 :: 1
4.1.2.1.1 矩阵分析练习题1 :: 1
4.2 英语 :: 6
4.2.1 历年真题 :: 6
4.2.1.1 完形填空 :: 3
4.2.1.1.1 完形填空2016年真题 :: 1
4.2.1.1.2 完形填空2017年真题 [已完成] :: 1
4.2.1.1.3 完形填空2018年真题 [已完成] :: 1
4.2.1.2 阅读理解 :: 3
4.2.1.2.1 阅读理解2016年真题 [已完成] :: 1
4.2.1.2.2 阅读理解2017年真题 [已完成] :: 1
4.2.1.2.3 阅读理解2018年真题 [已完成] :: 1

That meets our requirements!
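
If the flat listing is hard to scan, one optional variation is to pass the nesting level down the recursion and indent each line accordingly. This is a sketch only; `write_indented` is a hypothetical name, and the four-space indent is an arbitrary choice.

```python
import io


def write_indented(node, out, level=0):
    """Hypothetical variant of the output recursion that indents each line by its nesting level."""
    out.write(f"{'    ' * level}{node['this_content']} :: {node['count']}\n")
    for child in node['sub_content_lst']:
        write_indented(child, out, level + 1)


# Hand-built fragment of the tree, for illustration.
tree = {'this_content': '[content]', 'count': 2, 'sub_content_lst': [
    {'this_content': '4.1.1.1 矩阵分析', 'count': 2, 'sub_content_lst': [
        {'this_content': '4.1.1.1.1 矩阵分析习题1', 'count': 1, 'sub_content_lst': []},
        {'this_content': '4.1.1.1.2 矩阵分析习题2', 'count': 1, 'sub_content_lst': []},
    ]},
]}

buffer = io.StringIO()  # any object with a write() method works, including an open file
write_indented(tree, buffer)
print(buffer.getvalue(), end='')
```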

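Follow-up question: what if we need more status information about the TOC entries?

For example, in the TOC above I deliberately added the marker [已完成] ("completed") to some entries. Now we want to output not only the entry counts but also the completed / not completed / total statistics. How do we handle that?

This is not hard either: a small change to the dict structure is enough (it took me about 10 minutes to write at the time). The main change is to the count field, which itself becomes a dict. The code fragment is: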

def iter_calc_4(this_line, f, clac_max_deep):
    cur_deep = find_content_deep(this_line)
    cur_content_info_dct = {"this_content": this_line, "sub_content_lst": [], "count_detail": {'finished_count': 0, 'unfinished_count': 0, 'all_count': 0}}
    next_line = f.readline()
    if next_line:
        # next_line = next_line.strip().split('\t')[0]
        next_line = next_line.strip().split('      ')[0]
        next_deep = find_content_deep(next_line)
    else:
        if cur_deep == clac_max_deep:
            cur_content_info_dct['count_detail']['all_count'] = 1
            if '[已完成]' in this_line:
                cur_content_info_dct['count_detail']['finished_count'] = 1
            else:
                cur_content_info_dct['count_detail']['unfinished_count'] = 1
            return cur_content_info_dct, next_line
        else:
            return cur_content_info_dct, next_line
    while True:
        if cur_deep == clac_max_deep:
            cur_content_info_dct['count_detail']['all_count'] = 1
            if '[已完成]' in this_line:
                cur_content_info_dct['count_detail']['finished_count'] = 1
            else:
                cur_content_info_dct['count_detail']['unfinished_count'] = 1
            return cur_content_info_dct, next_line
        elif cur_deep >= next_deep:
            return cur_content_info_dct, next_line
        elif cur_deep < next_deep:
            sub_content_info_dct, next_line = iter_calc_4(next_line, f, clac_max_deep)
            cur_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            cur_content_info_dct['count_detail']['all_count'] += sub_content_info_dct['count_detail']['all_count']
            cur_content_info_dct['count_detail']['finished_count'] += sub_content_info_dct['count_detail']['finished_count']
            cur_content_info_dct['count_detail']['unfinished_count'] += sub_content_info_dct['count_detail']['unfinished_count']
            if not next_line:
                return cur_content_info_dct, next_line
            next_deep = find_content_deep(next_line)
            # print(cur_content_info_dct)


# [task_study] Handling a multi-level TOC with regular expressions
def anaylse_doc_content_2():
    p_input_content = './input_content.txt'
    p_output_content = './output_content_file.txt'
    clac_max_deep = 5
    super_content_info_dct = {"this_content": '[content]', "sub_content_lst": [], "count_detail": {'finished_count': 0, 'unfinished_count': 0, 'all_count': 0}}
    with open(p_input_content, encoding='utf-8') as f:
        # cur_content_info_dct = {"this_content": this_line, "sub_content_lst": [], "count": 0}
        next_line = f.readline()
        if next_line:
            # next_line = next_line.strip().split('\t')[0]
            next_line = next_line.strip().split(r'      ')[0]
        else:
            raise Exception('empty table of contents')
        while True:
            sub_content_info_dct, next_line = iter_calc_4(next_line, f, clac_max_deep)
            super_content_info_dct['sub_content_lst'].append(sub_content_info_dct)
            super_content_info_dct['count_detail']['all_count'] += sub_content_info_dct['count_detail']['all_count']
            super_content_info_dct['count_detail']['finished_count'] += sub_content_info_dct['count_detail']['finished_count']
            super_content_info_dct['count_detail']['unfinished_count'] += sub_content_info_dct['count_detail']['unfinished_count']
            if not next_line:
                break
    print(super_content_info_dct)
    super_content_info_json= json.dumps(super_content_info_dct, ensure_ascii=False)
    print(super_content_info_json)
    # Write as UTF-8 explicitly so the Chinese text does not depend on the platform default encoding
    with open(p_output_content, 'w', encoding='utf-8') as f_output:
        formatting_output_content(super_content_info_dct, f_output)


# cur_content_info_dct = {'this_content': this_line, 'sub_content_lst': [], 'count': 0}
def formatting_output_content(dct_content, f_output):
    print('-' * 50)
    # print(dct_content)
    # print(dct_content['this_content'])
    # print(dct_content['count'])
    print(f"{dct_content['this_content']}     合计：{dct_content['count_detail']['all_count']}    已完成：{dct_content['count_detail']['finished_count']}    未完成：{dct_content['count_detail']['unfinished_count']}\n")
    f_output.write(f"{dct_content['this_content']}     合计：{dct_content['count_detail']['all_count']}    已完成：{dct_content['count_detail']['finished_count']}    未完成：{dct_content['count_detail']['unfinished_count']}\n")
    for sub_dct in dct_content['sub_content_lst']:
        formatting_output_content(sub_dct, f_output)


if __name__ == '__main__':
    anaylse_doc_content_2()
    print('all done. weiran 20190206')

Final output (合计 = total, 已完成 = completed, 未完成 = not completed):


[content]     合计：14    已完成：8    未完成：6
4.1 数学习题     合计：8    已完成：3    未完成：5
4.1.1 选择题     合计：7    已完成：3    未完成：4
4.1.1.1 矩阵分析     合计：2    已完成：0    未完成：2
4.1.1.1.1 矩阵分析习题1     合计：1    已完成：0    未完成：1
4.1.1.1.2 矩阵分析习题2     合计：1    已完成：0    未完成：1
4.1.1.2 微积分     合计：4    已完成：3    未完成：1
4.1.1.2.1 微积分习题1 [已完成]     合计：1    已完成：1    未完成：0
4.1.1.2.2 微积分习题2 [已完成]     合计：1    已完成：1    未完成：0
4.1.1.2.3 微积分习题3 [已完成]     合计：1    已完成：1    未完成：0
4.1.1.2.4 微积分习题4     合计：1    已完成：0    未完成：1
4.1.1.3 概率论     合计：1    已完成：0    未完成：1
4.1.1.3.1 概率论习题1     合计：1    已完成：0    未完成：1
4.1.2 填空题     合计：1    已完成：0    未完成：1
4.1.2.1 矩阵分析     合计：1    已完成：0    未完成：1
4.1.2.1.1 矩阵分析练习题1     合计：1    已完成：0    未完成：1
4.2 英语     合计：6    已完成：5    未完成：1
4.2.1 历年真题     合计：6    已完成：5    未完成：1
4.2.1.1 完形填空     合计：3    已完成：2    未完成：1
4.2.1.1.1 完形填空2016年真题     合计：1    已完成：0    未完成：1
4.2.1.1.2 完形填空2017年真题 [已完成]     合计：1    已完成：1    未完成：0
4.2.1.1.3 完形填空2018年真题 [已完成]     合计：1    已完成：1    未完成：0
4.2.1.2 阅读理解     合计：3    已完成：3    未完成：0
4.2.1.2.1 阅读理解2016年真题 [已完成]     合计：1    已完成：1    未完成：0
4.2.1.2.2 阅读理解2017年真题 [已完成]     合计：1    已完成：1    未完成：0
4.2.1.2.3 阅读理解2018年真题 [已完成]     合计：1    已完成：1    未完成：0
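
The three accumulation lines for 'all_count', 'finished_count' and 'unfinished_count' appear in several places in the code above. One way to reduce that repetition, sketched here with hypothetical helper names (`new_detail`, `leaf_detail`, `add_detail`) that are not part of the original code, is to build and merge the detail dicts through small functions:

```python
def new_detail():
    """Return an empty statistics dict with the same keys as 'count_detail'."""
    return {'finished_count': 0, 'unfinished_count': 0, 'all_count': 0}


def leaf_detail(line, marker='[已完成]'):
    """Statistics for one lowest-level entry: finished if the marker appears in the line."""
    detail = new_detail()
    detail['all_count'] = 1
    if marker in line:
        detail['finished_count'] = 1
    else:
        detail['unfinished_count'] = 1
    return detail


def add_detail(total, part):
    """Add every counter of part into total, in place."""
    for key in total:
        total[key] += part[key]


total = new_detail()
for line in ('4.1.1.2.3 微积分习题3 [已完成]', '4.1.1.2.4 微积分习题4'):
    add_detail(total, leaf_detail(line))
print(total)  # {'finished_count': 1, 'unfinished_count': 1, 'all_count': 2}
```

With helpers like these, adding another status later would mostly mean touching `new_detail` and `leaf_detail` rather than every place that sums the counters.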

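If you want to use this code, keep the following points in mind when adapting it:

1. Depth detection: different TOC numbering formats need different regular expressions, so modify
find_content_deep(line)
2. To output different information, modify the per-entry attribute dict.

This solution surely still has many rough edges; corrections and criticism are very welcome ^^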

Full code and sample files:

https://gitee.com/YingMingQiuYou/doc_content_analysis
