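I recently ran into a tricky problem: among several thousand label files, some label names had probably been entered wrong and mixed up. An object that was clearly Tag1 got labeled Tag2, so the total count of Tag1 was lower than expected and the count of Tag2 was higher. Wrong labels hurt model training performance. Going back into Labelimg/Labelme/roLabelimg to check every file one by one is not realistic, so I built a small search tool.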
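Functionality: walk through all documents in a folder (including .txt, .xml and similar formats), find the files in which any element of a string list occurs exactly 2 times, and print those file names. As an upgrade, the count can be set to a custom value x.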
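The core of that search can be sketched without any GUI. This is a minimal sketch, not the tool itself: the function name `find_files_with_count` and the `exts` parameter are made up for illustration. Note that `str.count` matches substrings, so searching for "Tag1" also counts occurrences inside "Tag10".

```python
import os


def find_files_with_count(folder, strings, x, exts=(".txt", ".xml")):
    """Return paths under `folder` where any of `strings` occurs exactly x times.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Drop empty strings: "abc".count("") is len("abc") + 1, which would
    # produce accidental matches.
    needles = [s for s in strings if s]
    matches = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if not name.endswith(exts):
                continue
            path = os.path.join(root, name)
            # errors="ignore" skips bytes that are not valid UTF-8
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()
            if any(content.count(s) == x for s in needles):
                matches.append(path)
    return matches
```

For example, `find_files_with_count("labels", ["Tag1", "Tag2"], 2)` would list every .txt or .xml file under a (hypothetical) `labels` folder in which Tag1 or Tag2 appears exactly twice.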
The script is then packaged with auto-py-to-exe, which is installed with:
pip install auto-py-to-exe
I gave the requirements to ChatGPT to help work out the code, then created the following code.py:
import os
import tkinter as tk
from tkinter import filedialog


def find_files_with_duplicate_strings():
    """Return (matching file paths, x) for the folder chosen in the dialog."""
    folder_path = filedialog.askdirectory()
    # Skip empty entries: content.count("") returns len(content) + 1, not 0
    string_list = [s for s in (entry1.get(), entry2.get(), entry3.get()) if s]
    x = int(entry4.get())  # convert the entry text to an integer
    files_with_duplicates = []
    result_text.delete('1.0', tk.END)  # clear the result text box
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith((".txt", ".xml", ".doc", ".docx")):
                file_path = os.path.join(root, file)
                # .doc/.docx are binary formats; errors="ignore" avoids a
                # UnicodeDecodeError, but their text is not reliably searchable
                with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                    content = f.read()
                for string in string_list:
                    count = content.count(string)
                    if count == x:
                        files_with_duplicates.append(file_path)
                        break  # one matching string is enough for this file
    return files_with_duplicates, x

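
def show_result():
    try:
        result, x = find_files_with_duplicate_strings()
    except ValueError:  # the count entry did not hold an integer
        result_text.delete("1.0", tk.END)
        result_text.insert("end", "Please enter an integer for the count x.")
        return
    if result:
        result_text.insert("end", f"Files containing a string exactly {x} times:\n")
        for file in result:
            result_text.insert("end", file + "\n")
    else:
        result_text.insert("end", f"No files contain a string exactly {x} times.")
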

app = tk.Tk()
app.title("Find files containing repeated strings")
app.geometry("300x600")

label1 = tk.Label(app, text="String 1:")
label1.pack()
entry1 = tk.Entry(app)
entry1.pack()

label2 = tk.Label(app, text="String 2:")
label2.pack()
entry2 = tk.Entry(app)
entry2.pack()

label3 = tk.Label(app, text="String 3:")
label3.pack()
entry3 = tk.Entry(app)
entry3.pack()

label4 = tk.Label(app, text="Occurrence count x:")
label4.pack()
entry4 = tk.Entry(app)
entry4.pack()

button = tk.Button(app, text="Start search", command=show_result)
button.pack(pady=20)

# Create the result text box
result_text = tk.Text(app, height=100, width=80)
result_text.pack()

app.mainloop()

In a command-line environment, enter:
auto-py-to-exe
Enter the path to code.py, select One File and Window Based, then run the conversion.
The final packaged file is about 9 MB.
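auto-py-to-exe is a graphical front end for PyInstaller, so the same build can probably be scripted. The sketch below assumes that One File corresponds to PyInstaller's `--onefile` option and Window Based to `--windowed`. The helper names `build_pyinstaller_args` and `package` are made up for illustration, and PyInstaller must be installed for `package` to work.

```python
def build_pyinstaller_args(script_path):
    """Build PyInstaller arguments for a single-file, no-console build.

    Hypothetical helper; assumes One File -> --onefile and
    Window Based -> --windowed.
    """
    return [script_path, "--onefile", "--windowed"]


def package(script_path):
    """Run PyInstaller from Python with the arguments above."""
    # Imported lazily so that this module loads even without PyInstaller
    import PyInstaller.__main__

    PyInstaller.__main__.run(build_pyinstaller_args(script_path))
```

Calling `package("code.py")` should leave the executable in PyInstaller's default `dist` folder.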
Once these buggy label files have been found, checking them one by one is easy.
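To check whether the totals for each tag match expectations before and after fixing the files, a simple tally across the folder can help. This is only a sketch: `count_tags` is a hypothetical helper, and like the tool above it counts plain substrings, so "Tag1" is also counted inside "Tag10".

```python
import os
from collections import Counter


def count_tags(folder, tags, exts=(".txt", ".xml")):
    """Return a Counter with the total occurrences of each tag under `folder`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    totals = Counter({tag: 0 for tag in tags})
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()
            for tag in tags:
                totals[tag] += content.count(tag)
    return totals
```

Comparing the result of `count_tags(folder, ["Tag1", "Tag2"])` against the expected number of each object shows which tags are under- or over-represented.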