Extracting Subtitle Text with Regular Expressions in Python

哈斯卡:no pain no gain

Last modified: 2025-02-05 23:11:48

Tags: python

First published: 2023-05-04 13:53:14

Article link: https://blog.csdn.net/sfwwdd/article/details/130484896

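This article shows how to use regular expressions in Python to extract the text from subtitle files such as srt and ass. After analyzing the layout of the subtitle file, it uses findall() to match the index, the timeline, and the subtitle text, and it also handles bilingual Chinese-English subtitles, producing a list that pairs each Chinese subtitle with its English counterpart.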

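First, download a subtitle file from a subtitle library or subtitle site. Common formats are srt, ass, WebVTT, and STL. Change the file extension to .txt so the file can be read as plain text in Python (this works for text-based formats such as srt and ass).
Open the subtitle file and look at its layout: usually an index, a timeline, the subtitle text, and a blank line. Use the findall() function from the re module with the greedy pattern .* to capture the subtitle text and discard the index and timeline, which are not needed.
For example, an English-only subtitle file looks like this: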

1
00:00:48,422 --> 00:00:53,177
Once upon a time there was a lovely princess.

2
00:00:53,427 --> 00:01:02,019
But she had an enchantment upon her of a fearful sort,
which could only be broken by Love's first kiss.

3
00:01:02,270 --> 00:01:08,609
She was locked away in a castle guarded by a terrible
fire breathing dragon.

4
00:01:08,860 --> 00:01:15,992
Many brave knights had attempted to free her from
this dreadful prison, but none prevailed.

5
00:01:16,242 --> 00:01:23,040
She waited in the dragon's keep in the highest
room of the tallest tower for her true love

6
00:01:23,291 --> 00:01:27,253
and true love's first kiss.

7
00:01:27,503 --> 00:01:29,714
Like that's ever going to happen.

8
00:01:29,964 --> 00:01:34,552
What a loony.

9
00:01:51,485 --> 00:01:57,658
Shrek

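\d+\n matches the index on the first line. .*?-->.*?,\d{3} matches the timeline on the second line, which ends with a comma followed by three millisecond digits.
The subtitle text is sometimes one line and sometimes two, so it is captured with (.*[\n]?.*). The ? means zero or one occurrence, which lets the group cover either a one-line or a two-line subtitle.
The fourth line is a blank line, matched by \s*\n.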
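To see what the pattern described above captures, here is a minimal sketch that runs it on an inline two-cue sample instead of a file. The variable names sample and cues are only for illustration; the pattern itself is the one used in this article.

```python
import re

# Two-cue sample in SRT layout: index, timeline, one or two text lines, blank line.
sample = (
    "1\n"
    "00:00:48,422 --> 00:00:53,177\n"
    "Once upon a time there was a lovely princess.\n"
    "\n"
    "2\n"
    "00:00:53,427 --> 00:01:02,019\n"
    "But she had an enchantment upon her of a fearful sort,\n"
    "which could only be broken by Love's first kiss.\n"
    "\n"
)

pattern = re.compile(r'\d+\n.*?-->.*?,\d{3}\n(.*[\n]?.*)\s*\n')

# findall() returns only the captured group, i.e. the subtitle text of each cue.
# strip() removes the trailing newline that a one-line cue carries in its match.
cues = [match.strip() for match in pattern.findall(sample)]
for cue in cues:
    print(repr(cue))
```

The first cue comes back as a single line and the second keeps its internal line break, which is why the matches should be stripped before they are joined.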

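import re

input_path = 'Shrek.txt'
output_path = r'E:\work\pycharm\python projects\subtitles.txt'

# Index line, timeline line, then one or two lines of subtitle text (captured).
pattern = re.compile(r'\d+\n.*?-->.*?,\d{3}\n(.*[\n]?.*)\s*\n')

# Read the whole file in one call; the with-blocks close both files automatically.
# The encoding is an assumption: adjust it to match your subtitle file.
with open(input_path, 'r', encoding='utf-8') as f_in:
    text = f_in.read()

content = pattern.findall(text)
# Strip each match and join with a newline so consecutive subtitles do not run together.
subtitles = '\n'.join(item.strip() for item in content)
# subtitles = re.sub(r'\n+', ' ', subtitles)  # optionally remove all line breaks

with open(output_path, 'w', encoding='utf-8') as f_out:
    f_out.write(subtitles)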

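Method 2: subtitle files generally follow a uniform, standard layout. The first line is the index, the second line is the timeline, and everything from the third line on is subtitle text.
index\n
time --> time\n
text line 1\n
text line 2\n
(blank line)\n
index + 1\n
time --> time\n
text line 1\n
text line 2\n
(blank line)\n
…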

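import re

input_path = 'shrek2.txt'
output_path = r'E:\work\pycharm\python projects\subtitles.txt'

# The encoding is an assumption: adjust it to match your subtitle file.
with open(input_path, 'r', encoding='utf-8') as f_in:
    # Split into subtitle blocks on blank lines (also tolerates whitespace-only lines)
    blocks = re.split(r'\n\s*\n', f_in.read())

subtitles = []
for block in blocks:
    lines = [line.strip() for line in block.split('\n') if line.strip()]

    # Check for a valid block structure (index + timeline + text)
    if len(lines) >= 3 and re.match(r'\d+', lines[0]) and '-->' in lines[1]:
        # Merge a multi-line subtitle into a single line of text
        text = ' '.join(lines[2:])  # to keep separate lines, use text = '\n'.join(lines[2:])
        subtitles.append(text)

# Merge all subtitles into one continuous text
final_text = ' '.join(subtitles)

# Write the result (the with-block closes the file automatically)
with open(output_path, 'w', encoding='utf-8') as f_out:
    f_out.write(final_text)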
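The block-splitting approach can also be wrapped in a function that works on a string, which makes it easier to try without touching the file system. This is a sketch under a few assumptions: the name extract_srt_text and its joiner parameter are hypothetical, and stripping a leading byte order mark and normalizing line endings are additions that the script above does not need when the file is opened in text mode.

```python
import re

def extract_srt_text(srt_text, joiner=' '):
    """Return the text of each cue in an SRT string, one list item per cue."""
    # Drop a leading BOM and normalize line endings so the blank-line split works.
    srt_text = srt_text.lstrip('\ufeff').replace('\r\n', '\n').replace('\r', '\n')
    cues = []
    for block in re.split(r'\n\s*\n', srt_text.strip()):
        lines = [line.strip() for line in block.split('\n') if line.strip()]
        # A valid block is: numeric index, timeline containing '-->', then text.
        if len(lines) >= 3 and lines[0].isdigit() and '-->' in lines[1]:
            cues.append(joiner.join(lines[2:]))
    return cues

sample = (
    "1\r\n00:01:27,503 --> 00:01:29,714\r\nLike that's ever going to happen.\r\n"
    "\r\n"
    "2\r\n00:01:29,964 --> 00:01:34,552\r\nWhat a loony.\r\n"
)
print(extract_srt_text(sample))
```

Returning a list instead of one joined string leaves the caller free to join the cues with spaces, newlines, or anything else.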

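If the file contains bilingual Chinese-English subtitles, match both languages together and then pair them up. The pattern has two groups: the first captures the Chinese subtitle and the second captures the English subtitle, so findall() returns a list of tuples in the following format:
[('Chinese subtitle 1', 'English subtitle 1'), ('Chinese subtitle 2', 'English subtitle 2'), ('Chinese subtitle 3', 'English subtitle 3'), …]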
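As a small illustration of how findall() behaves when the pattern has two groups, the sketch below uses simplified input lines that contain only the two languages separated by \N. The sample string and the simplified pattern are assumptions made for the demonstration; real ass lines carry extra fields and style tags.

```python
import re

# Simplified bilingual lines: Chinese text, the ASS line break \N, English text.
sample = (
    "有一天\\NAnd one day...\n"
    "再也没有回来\\Nand never returned.\n"
)

# Two groups: the text before \N and the text after it, on the same line.
pattern = re.compile(r'(.*)\\N(.*)')
pairs = pattern.findall(sample)
print(pairs)
```

Each tuple holds one match, with the groups in the order they appear in the pattern.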

Dialogue: 0,0:01:09.76,0:01:12.03,Default,,0000,0000,0000,,有时候 我觉得自己受到了诅咒\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}Sometimes, I think I'm cursed.{\r}
Dialogue: 0,0:01:12.15,0:01:15.12,Default,,0000,0000,0000,,因为在我出生前发生了一些事\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}'Cause of something that happened before I was even born.{\r}
Dialogue: 0,0:01:15.69,0:01:19.38,Default,,0000,0000,0000,,是这样 很久以前 我太奶奶的家里\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}See, a long time ago, there was this family.{\r}
Dialogue: 0,0:01:20.08,0:01:22.46,Default,,0000,0000,0000,,他爸爸 是一位音乐家\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}The Papá, he was a musician.{\r}
Dialogue: 0,0:01:22.82,0:01:25.99,Default,,0000,0000,0000,,一家人都喜欢唱歌跳舞\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}He and his family would sing and dance{\r}
Dialogue: 0,0:01:26.01,0:01:27.40,Default,,0000,0000,0000,,过的很幸福\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and count their blessings.{\r}
Dialogue: 0,0:01:28.07,0:01:29.73,Default,,0000,0000,0000,,可是爸爸有个梦想\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}But he also had a dream.{\r}
Dialogue: 0,0:01:30.02,0:01:32.02,Default,,0000,0000,0000,,让自己的歌声传遍整个世界\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}To play for the world.{\r}
Dialogue: 0,0:01:34.52,0:01:36.15,Default,,0000,0000,0000,,有一天\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}And one day...{\r}
Dialogue: 0,0:01:36.80,0:01:38.80,Default,,0000,0000,0000,,他背着吉他走了\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}he left with his guitar...{\r}
Dialogue: 0,0:01:39.62,0:01:41.65,Default,,0000,0000,0000,,再也没有回来\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and never returned.{\r}
Dialogue: 0,0:01:48.99,0:01:50.58,Default,,0000,0000,0000,,我太奶奶\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}And my Mamá...{\r}
Dialogue: 0,0:01:50.83,0:01:53.87,Default,,0000,0000,0000,,可没心情为离家出走的爸爸难过\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She didn't have time to cry over that walk-away musician.{\r}
Dialogue: 0,0:01:55.80,0:01:59.05,Default,,0000,0000,0000,,她把音乐彻底赶出了生活\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}After banishing all music from her life...{\r}
Dialogue: 0,0:02:00.05,0:02:02.36,Default,,0000,0000,0000,,她要想办法赚钱养活女儿\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}she found a way to provide for her daughter.{\r}
Dialogue: 0,0:02:06.52,0:02:08.18,Default,,0000,0000,0000,,她卷起袖子\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She rolled up her sleeves...{\r}
Dialogue: 0,0:02:08.35,0:02:10.22,Default,,0000,0000,0000,,学会了做鞋\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and she learned to make shoes.{\r}
Dialogue: 0,0:02:15.31,0:02:16.82,Default,,0000,0000,0000,,她明明可以做糖果\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She could have made candy.{\r}
Dialogue: 0,0:02:16.84,0:02:18.67,Default,,0000,0000,0000,,或者做烟花\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}O-o-or fireworks.{\r}
Dialogue: 0,0:02:19.34,0:02:21.56,Default,,0000,0000,0000,,或者做亮闪闪的摔跤服\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}Or sparkly underwear for wrestlers.{\r}

Select the printed output with Ctrl+A, copy it, and paste it wherever you need it.

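import re

# Coco.txt holds ass Dialogue lines like the ones shown above.
with open("Coco.txt", 'r', encoding='utf-8') as f:
    text_line = f.read()

# Group 1: Chinese text before \N; group 2: English text between the style tag and {\r}.
pattern = re.compile(r'Dialogue:.*?0000,,(.*)\\N\{\\fnCronos.*?H3CF1F3&\}(.*)\{\\r\}')
content = pattern.findall(text_line)

# Build one tab-separated Chinese/English pair per line.
subtitles = ''
for i in content:
    subtitles += i[0] + "\t" + i[1] + "\n"
print(subtitles)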

The result looks like this:

有时候 我觉得自己受到了诅咒	Sometimes, I think I'm cursed.
因为在我出生前发生了一些事	'Cause of something that happened before I was even born.
是这样 很久以前 我太奶奶的家里	See, a long time ago, there was this family.
他爸爸 是一位音乐家	The Papá, he was a musician.
一家人都喜欢唱歌跳舞	He and his family would sing and dance
过的很幸福	and count their blessings.
可是爸爸有个梦想	But he also had a dream.
让自己的歌声传遍整个世界	To play for the world.
有一天	And one day...
他背着吉他走了	he left with his guitar...
再也没有回来	and never returned.
我太奶奶	And my Mamá...
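
The pattern above is tied to the font name and color code in this particular file. A less file-specific variant, sketched here, splits each Dialogue line on its first nine commas and strips every {...} override block. It assumes the common ass field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), which a file's own Format line may change; parse_dialogue and OVERRIDE_TAG are hypothetical names.

```python
import re

# Matches ASS override blocks such as {\fnCronos Pro Subhead\fs14} or {\r}.
OVERRIDE_TAG = re.compile(r'\{[^}]*\}')

def parse_dialogue(line):
    """Return (first_language, second_language) for a Dialogue line, or None."""
    if not line.startswith('Dialogue:'):
        return None
    # Assumed field order: Layer, Start, End, Style, Name, MarginL, MarginR,
    # MarginV, Effect, Text. Text is last, so it may itself contain commas.
    fields = line.split(',', 9)
    if len(fields) < 10:
        return None
    text = OVERRIDE_TAG.sub('', fields[9]).strip()
    # \N is the hard line break that separates the two languages in this file.
    first, _, second = text.partition('\\N')
    return first.strip(), second.strip()

sample = r"Dialogue: 0,0:01:09.76,0:01:12.03,Default,,0000,0000,0000,,有时候 我觉得自己受到了诅咒\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}Sometimes, I think I'm cursed.{\r}"
print(parse_dialogue(sample))
```

Limiting the split to nine commas keeps commas inside the subtitle text intact, as in "Sometimes, I think I'm cursed."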

References:
https://blog.csdn.net/xbb224007/article/details/94590683
https://www.jb51.net/article/221385.htm