Finding how specific keywords are used in a codebase

御风之

Last modified 2024-07-09 13:52:38


Category: Data Analysis. Tags: python3, data analysis

First published 2024-07-08 20:34:04

Article link: https://blog.csdn.net/qq994327432/article/details/140276773


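This script searches a given root directory for occurrences of specific keywords and reports the counts grouped by directory. It is aimed at codebase analysis: it helps developers see how particular keywords (namespaces, function names, and so on) are used across a project. To use it, you specify the root directory to search, a list of keywords, a list of specific subdirectory paths to track separately, and the file extensions to include in the search.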

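Usage

Set the parameters:

Set default_base_path (the root directory to search), default_specific_paths (the list of subdirectory paths to track separately), default_keywords (the list of keywords), and optionally valid_extensions (a tuple of file extensions, defaulting to ('.hpp',)).

Run the function:

Call count_keyword_usage with the parameters above to start the search.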

Key parts of the script explained

Initializing the counters:

counts = {keyword: {path: 0 for path in specific_paths + ['other']} for keyword in keywords}

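This creates a zeroed counter for every combination of keyword and path. The extra bucket named 'other' collects matches from files that are not under any entry in specific_paths.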
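To make the shape of that nested dictionary concrete, here is a minimal sketch. The helper name `init_counts` and the sample keywords and paths are made up for illustration and do not appear in the original script.

```python
def init_counts(keywords, specific_paths):
    """Build counts[keyword][bucket] -> int, with every counter starting at 0."""
    # One bucket per tracked path, plus the catch-all "other" bucket.
    buckets = list(specific_paths) + ["other"]
    return {keyword: {bucket: 0 for bucket in buckets} for keyword in keywords}


# Illustrative inputs only.
counts = init_counts(["namespace A", "namespace B"], ["src/core", "src/ui"])
counts["namespace A"]["src/core"] += 3

print(counts["namespace A"])
print(counts["namespace B"])
```

Each keyword gets its own inner dictionary, so updating one keyword's counters never affects another's.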

Collecting the target files:

all_files = [os.path.join(root, file) for root, dirs, files in os.walk(base_path) for file in files if file.endswith(valid_extensions)]

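os.walk visits every directory under base_path, and each file name is filtered by extension to build the list of files to process. Because the filter uses str.endswith, valid_extensions must be a string or a tuple of strings; passing a list raises a TypeError.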
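The collection step can be tried in isolation on a throwaway directory. The sketch below is one possible way to wrap it; `collect_files` is a hypothetical helper, and normalizing the extensions to a tuple is an addition that the original script does not make.

```python
import os
import tempfile


def collect_files(base_path, valid_extensions=(".hpp",)):
    """Return paths under base_path whose names end with one of valid_extensions."""
    # str.endswith accepts a str or a tuple of str, not a list,
    # so normalize whatever the caller passed.
    if isinstance(valid_extensions, str):
        extensions = (valid_extensions,)
    else:
        extensions = tuple(valid_extensions)
    return [
        os.path.join(root, name)
        for root, _dirs, names in os.walk(base_path)
        for name in names
        if name.endswith(extensions)
    ]


# Build a tiny temporary tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.hpp", "b.cpp", os.path.join("sub", "c.hpp")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("// sample\n")
    found = sorted(os.path.relpath(p, tmp) for p in collect_files(tmp, [".hpp"]))

print(found)
```

Only the two .hpp files are returned; b.cpp is filtered out.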

Reading files and counting:

with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

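Each file is opened read-only as UTF-8 with errors='ignore', so bytes that cannot be decoded are dropped instead of raising an error. The whole file is then read into memory as a single string.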
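The effect of errors='ignore' is easiest to see on raw bytes, since open() applies the same error handler when decoding. The sample string below is invented for illustration.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (0xE9 is "é" in Latin-1).
raw = "héllo namespace A".encode("latin-1")

# A strict decode fails on the stray byte ...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while errors="ignore" silently drops it and keeps the rest.
text = raw.decode("utf-8", errors="ignore")

print(strict_ok, repr(text), text.count("namespace A"))
```

The undecodable byte disappears from the text, but ASCII keywords elsewhere in the file are still found. A keyword that itself contains non-ASCII characters could be missed in a file saved with a different encoding.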

occurrences = content.count(keyword)

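For each keyword, the string method count returns the number of non-overlapping occurrences in the file content. This is a plain substring match, so a keyword is also counted when it appears inside a longer identifier or in a comment.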
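If substring matches are too loose for a given keyword, a regular expression with word-boundary lookarounds is one possible alternative. The sketch below is not part of the original script; `count_whole_word` is a hypothetical helper and the sample text is invented.

```python
import re

content = "namespace foo { } namespace foobar { } // foo"

# str.count counts non-overlapping substring matches, so "foo" is also
# found inside "foobar".
substring_hits = content.count("foo")


def count_whole_word(text, keyword):
    """Count keyword only where it is not part of a longer word."""
    # re.escape keeps characters such as "." or ":" in the keyword literal.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text))


whole_word_hits = count_whole_word(content, "foo")
print(substring_hits, whole_word_hits)
```

The lookarounds are used instead of \b so that keywords beginning or ending with punctuation (for example "std::") still behave sensibly.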

Updating the counters:

if path in file_path:
    counts[keyword][path] += occurrences

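If the file path contains one of the specific subdirectory paths, the keyword's counter for that path is updated; otherwise the count goes to the 'other' bucket. Note that `path in file_path` is a plain substring test, so a tracked path such as C:\a\1 also matches files under C:\a\10.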
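One way to avoid that false match is to compare normalized path prefixes and require a separator after the prefix. This is a sketch of an alternative, not what the original script does; `classify` is a hypothetical helper and the paths are invented.

```python
import os


def classify(file_path, specific_paths):
    """Return the first entry of specific_paths that contains file_path, else 'other'."""
    normalized_file = os.path.normcase(os.path.normpath(file_path))
    for path in specific_paths:
        prefix = os.path.normcase(os.path.normpath(path))
        # Require a separator after the prefix so that ".../1" does not
        # swallow files that live under ".../10".
        if normalized_file.startswith(prefix + os.sep):
            return path
    return "other"


base = os.path.join("repo", "src")
tracked = [os.path.join(base, "1"), os.path.join(base, "2")]

inside = os.path.join(base, "1", "a.hpp")
lookalike = os.path.join(base, "10", "a.hpp")

print(classify(inside, tracked))
print(classify(lookalike, tracked))
print(tracked[0] in lookalike)  # what the plain substring test would say
```

The last line shows that the substring test would have attributed the file under "10" to the tracked directory "1".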

Progress output:

sys.stdout.write('\x1b[{}A'.format(len(keywords)))
sys.stdout.write('\x1b[0J')

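ANSI escape sequences control the terminal cursor so that the progress lines are updated in place. The first sequence moves the cursor up by as many lines as there are keywords, and the second clears everything from the cursor to the end of the screen, leaving room for the new progress lines.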
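The redraw trick can be isolated into a small function. The sketch below writes to an in-memory buffer so the exact characters sent to the terminal are visible; `redraw` and the sample lines are hypothetical, and the `out` parameter exists only to make the demonstration possible.

```python
import io
import sys


def redraw(lines, first=False, out=None):
    """Write lines, overwriting the block written by the previous call."""
    out = sys.stdout if out is None else out
    if not first:
        out.write("\x1b[{}A".format(len(lines)))  # CSI n A: cursor up n lines
        out.write("\x1b[0J")  # CSI 0 J: erase from cursor to end of screen
    for line in lines:
        out.write(line + "\n")
    out.flush()


# Capture two consecutive refreshes to inspect the raw output.
buf = io.StringIO()
redraw(["files:0/2 A: 0", "files:0/2 B: 0"], first=True, out=buf)
redraw(["files:1/2 A: 3", "files:1/2 B: 1"], out=buf)
print(repr(buf.getvalue()))
```

Calling redraw without the out argument writes to the real terminal. On consoles that do not interpret ANSI sequences (some older Windows consoles, or output redirected to a file), the escape codes show up as literal characters instead.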

Exception handling:

except Exception as e:
    continue

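If an exception is raised while reading or processing a file, the script skips that file and moves on to the next one, so a single unreadable file does not abort the whole run. The skipped file is ignored silently: it is not counted as processed and no error is reported.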
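If silently dropping files is a concern, one option is to catch only I/O errors and remember which files were skipped. This is a sketch of that variation, not the original script's behavior; `read_all` is a hypothetical helper.

```python
import os
import tempfile


def read_all(paths):
    """Read each path as text; return (contents, skipped) instead of failing."""
    contents, skipped = {}, []
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                contents[path] = f.read()
        except OSError as exc:
            # Missing file, permission problem, path is a directory, and so on.
            skipped.append((path, exc))
    return contents, skipped


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.hpp")
    missing = os.path.join(tmp, "missing.hpp")
    with open(good, "w", encoding="utf-8") as f:
        f.write("namespace A {}\n")
    contents, skipped = read_all([good, missing])

print(len(contents), "read,", len(skipped), "skipped")
```

Reporting the skipped list once at the end, rather than printing inside the loop, keeps it from interfering with the in-place progress display.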

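With these steps the script can scan a large codebase for keyword usage and report per-directory counts, which helps developers with code analysis and cleanup.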

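Sample output:

The script prints a progress display, so on a large codebase you can watch how far it has got instead of waiting blind (it does not actually make the search any faster). Each refresh rewrites the same lines in place, so only the counts change.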

files:13807/20919 (66.00%) namespace A Total occurrences: 1087 | path1: 0 | path2: 1087 | path3: 0 | path4: 0 | other directories: 0
files:13807/20919 (66.00%) namespace B Total occurrences: 225 | path1: 179 | path2: 46  |  path3: 0 | path4: 0 | other directories: 0
files:13807/20919 (66.00%) namespace C Total occurrences: 356 | path1: 0   | path2: 356 | path3: 0 | path4: 0 | other directories: 0

Full code

import os
import sys

def count_keyword_usage(base_path, keywords, specific_paths, valid_extensions=('.hpp',)):
    """
    Count occurrences of keywords in files within specific paths and other directories.

    Parameters:
    - base_path: The root directory to search in.
    - keywords: A list of keywords to search for.
    - specific_paths: A list of directories to specifically track occurrences in.
    - valid_extensions: A tuple of file extensions to include in the search. Defaults to ('.hpp',).
    """
    counts = {keyword: {path: 0 for path in specific_paths + ['other']} for keyword in keywords}

    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(base_path)
                 for file in files if file.endswith(valid_extensions)]
    total_files = len(all_files)
    processed_files = 0

    for keyword in keywords:
        print(f"files:{0}/{total_files} ({0:.2f}%) {keyword} Total occurrences: {0} | " + " | ".join([f"{os.path.basename(path) if path != 'other' else 'other directories'}: {counts[keyword][path]}" for path in specific_paths + ['other']]))

    for file_path in all_files:
        try:
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            for keyword in keywords:
                occurrences = content.count(keyword)
                found = False
                for path in specific_paths:
                    if path in file_path:
                        counts[keyword][path] += occurrences
                        found = True
                        break
                if not found:
                    counts[keyword]['other'] += occurrences

            processed_files += 1
            percentage_done = (processed_files / total_files) * 100

            sys.stdout.write('\x1b[{}A'.format(len(keywords)))
            sys.stdout.write('\x1b[0J')

            for keyword in keywords:
                total_occurrences = sum(counts[keyword].values())
                print(f"files:{processed_files}/{total_files} ({percentage_done:.2f}%) {keyword} Total occurrences: {total_occurrences} | " + " | ".join([f"{os.path.basename(path) if path != 'other' else 'other directories'}: {counts[keyword][path]}" for path in specific_paths + ['other']]))

            sys.stdout.flush()
        except Exception:
            # Skip files that cannot be opened or read and move on to the next one.
            continue


# Default parameters
default_base_path = r'C:\a'
default_specific_paths = [
    r'C:\a\1',
    r'C:\a\2',
    r'C:\a\3',
    r'C:\a\4',
]
default_keywords = ['key_word_1', 'key_word_2', 'key_word_3']

# Run the search
count_keyword_usage(default_base_path, default_keywords, default_specific_paths)
