Finding repeated strings in file contents and packaging the tool as an exe

Article link: https://blog.csdn.net/zsprb1/article/details/133141145

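I recently ran into a thorny problem: among several thousand label files, some label names may have been mistyped and swapped, for example a Tag1 entered as Tag2, so the total count of Tag1 is lower than expected and the count of Tag2 is higher. The wrong labels hurt model training performance. Going back into Labelimg/Labelme/roLabelimg to check the files one by one is not realistic, so I wrote a search tool.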

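Functionality: walk through all documents in a folder (including .txt, .xml, and similar formats), find the files in which any element of a list of strings occurs exactly 2 times, and print those file names. As an upgrade, the count becomes a user-defined value x.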
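The search described above can be sketched without any GUI, which makes the logic easier to test on its own. This is only a minimal sketch: the function name `find_files_with_count` and its `extensions` parameter are hypothetical names chosen for illustration, and `errors="ignore"` is an assumption about how undecodable bytes should be handled.

```python
import os
import tempfile


def find_files_with_count(folder, strings, x, extensions=(".txt", ".xml")):
    """Return paths of files in which any of `strings` occurs exactly `x` times."""
    matches = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if not name.endswith(extensions):
                continue
            path = os.path.join(root, name)
            # Assumption: skip undecodable bytes rather than stop on them
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()
            # Empty strings are skipped; one match is enough to report the file
            if any(content.count(s) == x for s in strings if s):
                matches.append(path)
    return matches


if __name__ == "__main__":
    # Small demo on a temporary folder holding one label-like file
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "a.xml"), "w", encoding="utf-8") as f:
            f.write("<name>Tag1</name><name>Tag1</name>")
        print(find_files_with_count(folder, ["Tag1", "Tag2"], 2))
```

Keeping the search in a plain function like this means the GUI only has to collect the inputs and display the returned list.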

Then package it with auto-py-to-exe:

pip install auto-py-to-exe

I gave the requirements to ChatGPT to help work out the implementation, then created the following code.py:

import os
import tkinter as tk
from tkinter import filedialog

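def find_files_with_duplicate_strings():
    """Ask for a folder; return (matching file paths, x), or (None, None) on bad input."""
    result_text.delete('1.0', tk.END)  # clear the result text box
    # Ignore empty entries: str.count("") returns len(content) + 1, not a real match count
    string_list = [s for s in (entry1.get(), entry2.get(), entry3.get()) if s]
    try:
        x = int(entry4.get())  # convert the entry text to an integer
    except ValueError:
        result_text.insert("end", "The occurrence count x must be an integer.\n")
        return None, None
    folder_path = filedialog.askdirectory()
    if not folder_path:  # the folder dialog was cancelled
        return None, None
    files_with_duplicates = []
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            # .doc/.docx are binary formats and cannot be searched as plain text
            if file.endswith((".txt", ".xml")):
                file_path = os.path.join(root, file)
                # errors="ignore" skips undecodable bytes instead of raising UnicodeDecodeError
                with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                    content = f.read()
                for string in string_list:
                    if content.count(string) == x:
                        files_with_duplicates.append(file_path)
                        break  # one matching string is enough for this file

    return files_with_duplicates, x


def show_result():
    result, x = find_files_with_duplicate_strings()
    if result is None:  # invalid input or cancelled dialog
        return
    if result:
        result_text.insert("end", f"Files containing a string exactly {x} times:\n")
        for file in result:
            result_text.insert("end", file + "\n")
    else:
        result_text.insert("end", f"No files contain a string exactly {x} times.")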


app = tk.Tk()
app.title("Find files containing repeated strings")
app.geometry("300x600")

label1 = tk.Label(app, text="String 1:")
label1.pack()
entry1 = tk.Entry(app)
entry1.pack()

label2 = tk.Label(app, text="String 2:")
label2.pack()
entry2 = tk.Entry(app)
entry2.pack()

label3 = tk.Label(app, text="String 3:")
label3.pack()
entry3 = tk.Entry(app)
entry3.pack()

label4 = tk.Label(app, text="Occurrence count x:")
label4.pack()
entry4 = tk.Entry(app)
entry4.pack()

button = tk.Button(app, text="Start search", command=show_result)  # run show_result on click
button.pack(pady=20)
# Create the result text box
result_text = tk.Text(app, height=100, width=80)
result_text.pack()

app.mainloop()
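One caveat of `content.count(string)` in the code above is that it counts substrings, so searching for Tag1 would also count a longer label such as Tag10. If the label names overlap like that, a whole-word count may be closer to what is wanted. The sketch below is only one possible approach: `count_whole_word` is a hypothetical helper, and Tag10 is a made-up label used for illustration.

```python
import re


def count_whole_word(content, word):
    """Count occurrences of `word` that are not part of a longer word."""
    # (?<!\w) and (?!\w) reject matches glued to letters, digits, or underscores
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return len(re.findall(pattern, content))


text = "<name>Tag1</name><name>Tag10</name><name>Tag1</name>"
print(text.count("Tag1"))              # substring count: 3
print(count_whole_word(text, "Tag1"))  # whole-word count: 2
```

In the search loop this helper could stand in for `content.count(string)`, at the cost of being somewhat slower on large files.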

In a CLI environment, enter: