python: batch-extracting plain text from srt files

冷漩

Article link: https://blog.csdn.net/lengxuan001/article/details/130951206

1. Features

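For everyday convenience, I packaged the program that batch-extracts plain text from srt files into an exe file, so there is no need to install a Python environment or any libraries.
The current version lets you select multiple srt files under a chosen path and extracts the text from all of them in one run.
Each output txt file is written to the same directory as its source srt file.

It is worth noting that srt files come in different encodings. The current version supports three: utf-8, utf-16 and gbk. Files in any other encoding are not in the encoding list; the program reports an encoding error for them and does nothing else. This does not affect text extraction from the files whose encoding is supported.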

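2. Executable

If you need the executable, you can download the file: 批量提取srt文件中的纯文本 (batch-extract plain text from srt files)

3. Python source code

The complete source code is below: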

import tkinter as tk
from tkinter import filedialog
import os

root = tk.Tk()  # create the program's main window
root.withdraw()  # hide that window

file_types = [('Text Files', '*.srt')]  # file format filter for the dialog
files = filedialog.askopenfilenames(filetypes=file_types)  # open the file selection dialog

progress_window = tk.Toplevel(root)  # new top-level window, a child of root
progress_window.title("文件提取进度")
progress_window.grab_set()  # make the window modal so other windows cannot be used


def center_window(window):
    window_width = 600  # width of the popup window
    window_height = 400  # height of the popup window
    screen_width = window.winfo_screenwidth()  # screen width
    screen_height = window.winfo_screenheight()  # screen height
    x = (screen_width - window_width) // 2  # horizontal position of the window
    y = (screen_height - window_height) // 2  # vertical position of the window
    window.geometry(f"{window_width}x{window_height}+{x}+{y}")  # set window size and position


center_window(progress_window)  # place the window at the center of the screen

file_list_label = tk.Label(progress_window, text="已处理的文件：", font=("TkDefaultFont", 13, "bold"))
file_list_label.pack()

file_list_text = tk.Text(progress_window, height=20, width=60, font=("TkDefaultFont", 13))
file_list_text.tag_config("error", foreground="red")
file_list_text.pack()
file_error_num = 0  # number of files that could not be processed
encodings = ["utf-8", "utf-16", "gbk"]

for file in files:
    coding_flag = False
    coding_error_num = 0
    texts = []

    for encoding in encodings:
        try:
            with open(file, encoding=encoding) as f:
                for line in f.readlines():
                    # Keep a line only if it does not start with a newline or a digit and
                    # does not end with a digit; this skips blank, index and timestamp lines.
                    # Note: subtitle text that itself ends in a digit is skipped as well.
                    if (line[0] not in {'\n', *'0123456789'}
                            and line.rstrip('\n')[-1:] not in {*'0123456789'}):
                        texts.append(line)
                coding_flag = True  # the file decoded correctly with this encoding
            if coding_flag:
                break  # leave the encoding loop
        except UnicodeError:
            coding_error_num += 1

    if coding_error_num == len(encodings):    # none of the listed encodings worked
        file_error_num += 1
        # show the error in the popup list; the "error" tag colors the line red
        file_list_text.insert(tk.END, f"{os.path.basename(file)}, file encoding error!" + "\n", "error")
        continue
    texts = [text.strip() for text in texts]
    with open(file.rsplit(".", 1)[0] + ".txt", 'w', encoding='utf-8') as f:
        for line in texts:
            f.write(line + '\n')
        print(f'文件：{f.name} 写入完成！')
    file_list_text.insert(tk.END, os.path.basename(file) + ",successful!\n")

file_list_text.insert(tk.END, "\n" + f"处理完成，成功 {len(files) - file_error_num} 个，失败 {file_error_num}个！" + "\n")

progress_window.wait_window()  # wait until the user closes the progress window

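4. Source code walkthrough

import tkinter as tk
from tkinter import filedialog
import os
root = tk.Tk()  # create the program's main window
root.withdraw()  # hide that window

This part imports the tkinter module, the filedialog submodule and os, creates the program's main window root, and then hides it with the withdraw() method.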

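file_types = [('Text Files', '*.srt')]  # file format filter for the dialog
files = filedialog.askopenfilenames(filetypes=file_types)  # open the file selection dialog

This defines the file type filter file_types, which restricts the dialog to text files with the .srt extension. askopenfilenames() then opens the file selection dialog so the user can pick the files to process; the selected file paths are stored in the files variable.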

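progress_window = tk.Toplevel(root)  # new top-level window, a child of root
progress_window.title("文件提取进度")
progress_window.grab_set()  # make the window modal so other windows cannot be used

This creates a new top-level window, progress_window, as a child of the root window. Its title is set to "文件提取进度" ("File extraction progress"), and grab_set() makes it modal, which prevents the user from interacting with the program's other windows.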

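def center_window(window):
    window_width = 600  # width of the popup window
    window_height = 400  # height of the popup window
    screen_width = window.winfo_screenwidth()  # screen width
    screen_height = window.winfo_screenheight()  # screen height
    x = (screen_width - window_width) // 2  # horizontal position of the window
    y = (screen_height - window_height) // 2  # vertical position of the window
    window.geometry(f"{window_width}x{window_height}+{x}+{y}")  # set window size and position

This defines the function center_window(window), which centers a window on the screen. It computes the window's horizontal and vertical position from the screen width and height and the fixed 600x400 window size, then sets the window's size and position with the geometry() method.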
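The centering arithmetic can be checked without opening a window. The sketch below is only an illustration: centered_geometry is a hypothetical helper that is not part of the original program, and it takes the screen size as plain arguments instead of querying tkinter.

```python
def centered_geometry(screen_width, screen_height, window_width=600, window_height=400):
    """Build a Tk geometry string ("WxH+X+Y") that centers a window on the screen."""
    x = (screen_width - window_width) // 2   # left offset: half of the free horizontal space
    y = (screen_height - window_height) // 2  # top offset: half of the free vertical space
    return f"{window_width}x{window_height}+{x}+{y}"


# On a 1920x1080 screen the 600x400 window gets the offsets (660, 340).
print(centered_geometry(1920, 1080))
```

The resulting string has the same form as the one passed to window.geometry() in center_window().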

center_window(progress_window)  # place the window at the center of the screen

This calls center_window() to place the progress_window window at the center of the screen.

file_list_label = tk.Label(progress_window, text="已处理的文件：", font=("TkDefaultFont", 13, "bold"))
file_list_label.pack()

This creates a label widget, file_list_label, that shows the text "已处理的文件：" ("Processed files:") in the TkDefaultFont font, size 13, bold. The pack() method then adds the label to the window layout.

file_list_text = tk.Text(progress_window, height=20, width=60, font=("TkDefaultFont", 13))
file_list_text.tag_config("error", foreground="red")
file_list_text.pack()

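This creates a text widget, file_list_text, 20 lines high and 60 characters wide, in the TkDefaultFont font, size 13. tag_config() configures a tag named "error" whose foreground color is red. The pack() method then adds the text widget to the window layout.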

file_error_num = 0  # number of files that could not be processed
encodings = ["utf-8", "utf-16", "gbk"]

for file in files:
    coding_flag = False
    coding_error_num = 0
    texts = []

    for encoding in encodings:
        try:
            with open(file, encoding=encoding) as f:
                for line in f.readlines():
                    # Keep a line only if it does not start with a newline or a digit and
                    # does not end with a digit; this skips blank, index and timestamp lines.
                    # Note: subtitle text that itself ends in a digit is skipped as well.
                    if (line[0] not in {'\n', *'0123456789'}
                            and line.rstrip('\n')[-1:] not in {*'0123456789'}):
                        texts.append(line)
                coding_flag = True  # the file decoded correctly with this encoding
            if coding_flag:
                break  # leave the encoding loop
        except UnicodeError:
            coding_error_num += 1

    if coding_error_num == len(encodings):    # none of the listed encodings worked
        file_error_num += 1
        # show the error in the popup list; the "error" tag colors the line red
        file_list_text.insert(tk.END, f"{os.path.basename(file)}, file encoding error!" + "\n", "error")
        continue
    texts = [text.strip() for text in texts]
    with open(file.rsplit(".", 1)[0] + ".txt", 'w', encoding='utf-8') as f:
        for line in texts:
            f.write(line + '\n')
        print(f'文件：{f.name} 写入完成！')
    file_list_text.insert(tk.END, os.path.basename(file) + ",successful!\n")

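The loop processes each file in files in turn. For every file it first initializes a few variables: coding_flag records whether an encoding has worked, coding_error_num counts how many encodings have failed for this file, and texts holds the extracted text.

A nested loop then walks through each encoding in the encodings list. It tries to open the file with the current encoding and reads it line by line. Every line that meets the extraction condition is appended to the texts list. Once the whole file has been read without a decoding error, coding_flag is set to True to mark the encoding as correct, and the encoding loop exits.

For the extraction condition, see my earlier post: python提取字幕文件中的纯文字 (extracting plain text from subtitle files with Python)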
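As a rough illustration of that condition, the sketch below applies the intended rule (skip blank lines and lines that start or end with a digit) to an in-memory srt sample. The helper name extract_text_lines and the sample cues are made up for this example and do not come from the original program; like the rule itself, the sketch would also drop subtitle text that happens to begin or end with a digit.

```python
DIGITS = set("0123456789")


def extract_text_lines(srt_text):
    """Return the subtitle text lines of an srt document as a list of strings."""
    texts = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between cues
        if stripped[0] in DIGITS or stripped[-1] in DIGITS:
            continue  # cue index ("1") or timestamp ("00:00:01,000 --> 00:00:04,000")
        texts.append(stripped)
    return texts


sample = """1
00:00:01,000 --> 00:00:04,000
Hello, world!

2
00:00:05,000 --> 00:00:08,000
This is a subtitle.
"""
print(extract_text_lines(sample))
```

Checking the last character as well as the first means an index line that is preceded by a byte order mark (for example "\ufeff1") is still skipped, since its first character is not a digit but its last one is.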

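Each time the file cannot be decoded with the current encoding (a UnicodeError exception is caught), coding_error_num is increased by 1 and the next encoding is tried.

If coding_error_num equals the length of the encoding list, the file matches none of the listed encodings. In that case file_error_num is increased by 1, and the file name and an error message are inserted into the text widget with the "error" tag so that the line is shown in red. continue then skips the rest of the processing for this file.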
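The try-each-encoding idea can be sketched on raw bytes, independent of the GUI. This is a minimal sketch rather than the program's own code: decode_with_fallback is a hypothetical helper, and it returns None where the program would report an encoding error.

```python
def decode_with_fallback(data, encodings=("utf-8", "utf-16", "gbk")):
    """Return (text, encoding) for the first encoding that decodes data, or None."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeError:
            continue  # this encoding does not fit; try the next one
    return None  # no listed encoding could decode the bytes


print(decode_with_fallback("你好".encode("utf-16")))  # falls through utf-8, decodes as utf-16
print(decode_with_fallback("你好\n".encode("gbk")))   # falls through utf-8 and utf-16
print(decode_with_fallback(b"\x81"))                   # not valid in any listed encoding
```

The first encoding that decodes without an error wins, even if it is not the encoding the file was actually written in, so the order of the list matters.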

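If the file was decoded successfully, the lines in texts are cleaned up (leading and trailing whitespace is stripped) and written, encoded as utf-8, to a new file with the .txt extension. The new file has the same name as the original apart from the extension. A message saying that the file has been written is also printed to the console.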
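The output name is built with file.rsplit(".", 1)[0] + ".txt". A possible alternative, shown here only as a sketch with a hypothetical helper name txt_path_for and made-up paths, is os.path.splitext, which looks for the extension only in the final path component.

```python
import os


def txt_path_for(srt_path):
    """Return the path of the .txt file that sits next to srt_path."""
    base, _ext = os.path.splitext(srt_path)  # split off only the final extension
    return base + ".txt"


# Both approaches agree for an ordinary srt path.
print(txt_path_for("subs/movie.en.srt"))
print("subs/movie.en.srt".rsplit(".", 1)[0] + ".txt")

# They differ when the file has no extension but a directory name contains a dot.
print(txt_path_for("dir.v1/file"))
print("dir.v1/file".rsplit(".", 1)[0] + ".txt")
```

Since the dialog filters for *.srt files, the selected paths normally do have an extension, so the difference rarely matters in this program.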

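Finally, the current file name and the message "successful!" are inserted into the text widget.

file_list_text.insert(tk.END, "\n" + f"处理完成，成功 {len(files) - file_error_num} 个，失败 {file_error_num}个！" + "\n")
progress_window.wait_window()  # wait until the user closes the progress window

This inserts the final summary into the text widget (in English: "Processing finished, N succeeded, M failed!"), showing how many files were processed successfully and how many failed.

Lastly, wait_window() makes the program wait until the user closes the progress window.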

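5. Summary

This program uses the tkinter library to build a small graphical interface that lets the user select multiple srt files and then extracts their plain text, trying each supported encoding in turn. It uses tk.Tk() to create the main window, withdraw() to hide it, and Toplevel(root) to create a new top-level window. Widgets such as tk.Label() and tk.Text() display the list of processed files in that window; the tag_config() method sets the text color, and the insert() method adds new content to the text widget. The center_window function places the window at the center of the screen. The flag coding_flag is used to leave the encoding loop once an encoding works, coding_error_num counts the encodings that failed for the current file, and file_error_num counts the files that could not be processed. The expression file.rsplit(".", 1)[0] + ".txt" gives the new .txt file the same name as the original .srt file. All in all, this was a worthwhile practice exercise.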
