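First, download the subtitle file from a subtitle library or subtitle website. The format is usually srt, ass, WebVTT, STL, and so on. Change the extension to .txt so the file can be read in Python (the rename is optional, since Python reads any plain-text file regardless of its extension).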
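A minimal sketch of the reading step, assuming the subtitle file is UTF-8 encoded. The file name sample.srt and the helper read_subtitle are made up for illustration. It opens the .srt file directly without renaming it, and the utf-8-sig codec also drops a byte-order mark if the file starts with one.

```python
import os
import tempfile


def read_subtitle(path):
    """Return the whole subtitle file as one string (hypothetical helper)."""
    # utf-8-sig decodes plain UTF-8 and silently removes a leading BOM.
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()


# Build a tiny throwaway file so the sketch runs on its own.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'sample.srt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('1\n'
            '00:00:48,422 --> 00:00:53,177\n'
            'Once upon a time there was a lovely princess.\n'
            '\n')

raw_text = read_subtitle(demo_path)
print(raw_text.splitlines()[2])
```

If the file uses another encoding (GBK is common for Chinese subtitles), pass that encoding name instead.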
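Open the subtitle file and look at its layout. Each entry is usually an index, a timeline, the subtitle text, and then a blank line. Use re.findall() with a regular expression whose greedy .* captures all of the subtitle text, which discards the timelines and indexes we do not need.
For example, an English-only subtitle file looks like this: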
1
00:00:48,422 --> 00:00:53,177
Once upon a time there was a lovely princess.

2
00:00:53,427 --> 00:01:02,019
But she had an enchantment upon her of a fearful sort,
which could only be broken by Love's first kiss.

3
00:01:02,270 --> 00:01:08,609
She was locked away in a castle guarded by a terrible
fire breathing dragon.

4
00:01:08,860 --> 00:01:15,992
Many brave knights had attempted to free her from
this dreadful prison, but none prevailed.

5
00:01:16,242 --> 00:01:23,040
She waited in the dragon's keep in the highest
room of the tallest tower for her true love

6
00:01:23,291 --> 00:01:27,253
and true love's first kiss.

7
00:01:27,503 --> 00:01:29,714
Like that's ever going to happen.

8
00:01:29,964 --> 00:01:34,552
What a loony.

9
00:01:51,485 --> 00:01:57,658
Shrek
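\d+\n matches the index on the first line. .*?-->.*?,\d{3} matches the timeline on the second line; the trailing ,\d{3} is the comma followed by the three millisecond digits that end the timeline.
The subtitle text is sometimes one line and sometimes two, so it is captured with (.*[\n]?.*). The ? means zero or one occurrence: no newline for a one-line subtitle, one newline between the lines of a two-line subtitle. Because this is the only capture group in the pattern, findall() returns just the captured text. For a one-line subtitle the optional newline can still match the line's own line break, so a captured item may end with a newline.
The fourth line is the blank line: \s*\n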
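To see which part of the pattern consumes which line, here is a small sketch that runs it on two cues held in memory. The extra capture groups around the index and the timeline are added only for this illustration; the pattern described above captures the text alone.

```python
import re

# Same pieces as described above, with the index and timeline also
# captured (an addition for this demo) so that each part can be inspected.
pattern = re.compile(
    r'(\d+)\n'              # line 1: index
    r'(.*?-->.*?,\d{3})\n'  # line 2: timeline ending in ,mmm
    r'(.*[\n]?.*)'          # line 3 and optional line 4: subtitle text
    r'\s*\n'                # blank separator line
)

two_line_cue = (
    "2\n"
    "00:00:53,427 --> 00:01:02,019\n"
    "But she had an enchantment upon her of a fearful sort,\n"
    "which could only be broken by Love's first kiss.\n"
    "\n"
)
one_line_cue = (
    "8\n"
    "00:01:29,964 --> 00:01:34,552\n"
    "What a loony.\n"
    "\n"
)

index, timeline, text = pattern.search(two_line_cue).groups()
print(index)
print(timeline)
print(repr(text))

# For a one-line cue the optional [\n]? swallows the line break,
# so the captured text keeps a trailing newline until it is stripped.
single_text = pattern.search(one_line_cue).group(3)
print(repr(single_text))
```

The trailing newline on the one-line cue is the reason to call strip() on each captured item before joining them.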
import re

# Index line, timeline line, one or two lines of text (captured), blank line
pattern = re.compile(r'\d+\n.*?-->.*?,\d{3}\n(.*[\n]?.*)\s*\n')

# read() returns the whole file at once, so no loop is needed
with open('Shrek.txt', 'r') as file:
    text = file.read()

content = re.findall(pattern, text)
# Strip each captured item, then put one subtitle entry per line
subtitles = "\n".join(item.strip() for item in content)
# subtitles = re.sub(r"\n+", " ", subtitles)  # optional: replace every line break with a space

path = r'E:\work\pycharm\python projects\subtitles.txt'
with open(path, 'w') as file2:
    file2.write(subtitles)
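One edge case with this pattern: it requires a line break after the text, so a final cue with no newline after it is silently skipped. The sketch below shows the effect on an in-memory sample and one possible workaround, padding the input with blank lines before matching. The helper name extract_text is made up for illustration.

```python
import re

PATTERN = re.compile(r'\d+\n.*?-->.*?,\d{3}\n(.*[\n]?.*)\s*\n')


def extract_text(srt_text):
    """Return the text of every cue, one list item per cue (hypothetical helper)."""
    # Pad the end so the last cue is followed by a blank line
    # even when the file stops right after its text.
    padded = srt_text + '\n\n'
    return [item.strip() for item in PATTERN.findall(padded)]


# Note that the sample ends right after "Shrek", without a newline.
sample = (
    "7\n00:01:27,503 --> 00:01:29,714\nLike that's ever going to happen.\n\n"
    "8\n00:01:29,964 --> 00:01:34,552\nWhat a loony.\n\n"
    "9\n00:01:51,485 --> 00:01:57,658\nShrek"
)

unpadded = PATTERN.findall(sample)
print(len(unpadded))          # the last cue is not matched
print(extract_text(sample))   # all three cues are returned
```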
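Method 2: subtitle files generally follow one standard layout. The first line of each block is the index, the second line is the timeline, and everything from the third line on is subtitle text. Blocks are separated by a line that is empty or holds only whitespace.
index\n
time --> time\n
text line 1\n
text line 2\n
\s\s\s\s\s\s\s\s\s\s\s\s\s\s\n
index + 1\n
time --> time\n
text line 1\n
text line 2\n
\s\s\s\s\s\s\s\s\s\s\s\s\s\s\n
…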
import re

input_path = 'shrek2.txt'
output_path = r'E:\work\pycharm\python projects\subtitles.txt'

with open(input_path, 'r') as f_in:
    # Split into cue blocks on blank lines (\s* also tolerates stray whitespace or \r on the separator line)
    blocks = re.split(r'\n\s*\n', f_in.read())

subtitles = []
for block in blocks:
    lines = [line.strip() for line in block.split('\n') if line.strip()]
    # Accept only well-formed blocks (index + timeline + text)
    if len(lines) >= 3 and re.match(r'\d+', lines[0]) and '-->' in lines[1]:
        # Merge multi-line text into a single line
        text = ' '.join(lines[2:])  # to keep the line breaks, use text = '\n'.join(lines[2:])
        subtitles.append(text)

# Merge all subtitle text into one continuous string
final_text = ' '.join(subtitles)

# Write the result (the with block closes the file automatically)
with open(output_path, 'w', encoding='utf-8') as f_out:
    f_out.write(final_text)
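The same block-splitting idea can be tried without touching the disk. The sketch below wraps it in a function that takes the subtitle text as a string; the name srt_to_lines is made up for illustration, and the index check uses str.isdigit() instead of re.match, a slightly stricter variant. The sample includes a whitespace-only separator line and a final cue with no trailing newline.

```python
import re


def srt_to_lines(srt_text):
    """Return one string per cue, with multi-line text merged by spaces."""
    cues = []
    for block in re.split(r'\n\s*\n', srt_text.strip()):
        lines = [ln.strip() for ln in block.split('\n') if ln.strip()]
        # A valid block is: index line, timeline line, at least one text line.
        if len(lines) >= 3 and lines[0].isdigit() and '-->' in lines[1]:
            cues.append(' '.join(lines[2:]))
    return cues


sample = (
    "1\n"
    "00:00:48,422 --> 00:00:53,177\n"
    "Once upon a time there was a lovely princess.\n"
    "\n"
    "2\n"
    "00:00:53,427 --> 00:01:02,019\n"
    "But she had an enchantment upon her of a fearful sort,\n"
    "which could only be broken by Love's first kiss.\n"
    "   \n"
    "9\n"
    "00:01:51,485 --> 00:01:57,658\n"
    "Shrek"
)

cues = srt_to_lines(sample)
for cue in cues:
    print(cue)
```

Unlike the single regular expression in the first method, this version does not care how many text lines a cue has.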
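If the file holds bilingual Chinese-English subtitles, match both languages in one pass and then stitch them together. The pattern has two capture groups: the first captures the Chinese subtitle and the second captures the English subtitle. findall() therefore returns a list of tuples in this form:
[('Chinese subtitle 1', 'English subtitle 1'), ('Chinese subtitle 2', 'English subtitle 2'), ('Chinese subtitle 3', 'English subtitle 3'), ...]
The sample below comes from an ASS file (saved here as Coco.txt):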
Dialogue: 0,0:01:09.76,0:01:12.03,Default,,0000,0000,0000,,有时候 我觉得自己受到了诅咒\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}Sometimes, I think I'm cursed.{\r}
Dialogue: 0,0:01:12.15,0:01:15.12,Default,,0000,0000,0000,,因为在我出生前发生了一些事\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}'Cause of something that happened before I was even born.{\r}
Dialogue: 0,0:01:15.69,0:01:19.38,Default,,0000,0000,0000,,是这样 很久以前 我太奶奶的家里\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}See, a long time ago, there was this family.{\r}
Dialogue: 0,0:01:20.08,0:01:22.46,Default,,0000,0000,0000,,他爸爸 是一位音乐家\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}The Papá, he was a musician.{\r}
Dialogue: 0,0:01:22.82,0:01:25.99,Default,,0000,0000,0000,,一家人都喜欢唱歌跳舞\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}He and his family would sing and dance{\r}
Dialogue: 0,0:01:26.01,0:01:27.40,Default,,0000,0000,0000,,过的很幸福\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and count their blessings.{\r}
Dialogue: 0,0:01:28.07,0:01:29.73,Default,,0000,0000,0000,,可是爸爸有个梦想\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}But he also had a dream.{\r}
Dialogue: 0,0:01:30.02,0:01:32.02,Default,,0000,0000,0000,,让自己的歌声传遍整个世界\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}To play for the world.{\r}
Dialogue: 0,0:01:34.52,0:01:36.15,Default,,0000,0000,0000,,有一天\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}And one day...{\r}
Dialogue: 0,0:01:36.80,0:01:38.80,Default,,0000,0000,0000,,他背着吉他走了\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}he left with his guitar...{\r}
Dialogue: 0,0:01:39.62,0:01:41.65,Default,,0000,0000,0000,,再也没有回来\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and never returned.{\r}
Dialogue: 0,0:01:48.99,0:01:50.58,Default,,0000,0000,0000,,我太奶奶\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}And my Mamá...{\r}
Dialogue: 0,0:01:50.83,0:01:53.87,Default,,0000,0000,0000,,可没心情为离家出走的爸爸难过\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She didn't have time to cry over that walk-away musician.{\r}
Dialogue: 0,0:01:55.80,0:01:59.05,Default,,0000,0000,0000,,她把音乐彻底赶出了生活\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}After banishing all music from her life...{\r}
Dialogue: 0,0:02:00.05,0:02:02.36,Default,,0000,0000,0000,,她要想办法赚钱养活女儿\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}she found a way to provide for her daughter.{\r}
Dialogue: 0,0:02:06.52,0:02:08.18,Default,,0000,0000,0000,,她卷起袖子\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She rolled up her sleeves...{\r}
Dialogue: 0,0:02:08.35,0:02:10.22,Default,,0000,0000,0000,,学会了做鞋\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}and she learned to make shoes.{\r}
Dialogue: 0,0:02:15.31,0:02:16.82,Default,,0000,0000,0000,,她明明可以做糖果\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}She could have made candy.{\r}
Dialogue: 0,0:02:16.84,0:02:18.67,Default,,0000,0000,0000,,或者做烟花\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}O-o-or fireworks.{\r}
Dialogue: 0,0:02:19.34,0:02:21.56,Default,,0000,0000,0000,,或者做亮闪闪的摔跤服\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}Or sparkly underwear for wrestlers.{\r}
The script below prints the result. Select all of the printed output with Ctrl+A, copy it, and paste it wherever you need it.
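import re

with open("Coco.txt", 'r', encoding='utf-8') as f:
    text_line = f.read()

# Group 1: Chinese text before \N; group 2: English text between the style tag and {\r}
pattern = re.compile(r'Dialogue:.*?0000,,(.*)\\N\{\\fnCronos.*?H3CF1F3&\}(.*)\{\\r\}')
content = re.findall(pattern, text_line)

subtitles = ''
for i in content:
    # One line per entry: Chinese, a tab, then English
    subtitles += i[0] + "\t" + i[1] + "\n"
print(subtitles)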
The output looks like this:
有时候 我觉得自己受到了诅咒 Sometimes, I think I'm cursed.
因为在我出生前发生了一些事 'Cause of something that happened before I was even born.
是这样 很久以前 我太奶奶的家里 See, a long time ago, there was this family.
他爸爸 是一位音乐家 The Papá, he was a musician.
一家人都喜欢唱歌跳舞 He and his family would sing and dance
过的很幸福 and count their blessings.
可是爸爸有个梦想 But he also had a dream.
让自己的歌声传遍整个世界 To play for the world.
有一天 And one day...
他背着吉他走了 he left with his guitar...
再也没有回来 and never returned.
我太奶奶 And my Mamá...
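The pattern above is tied to the font name and color code of this particular file. As a more general sketch, and not the method used above, a Dialogue line can be split on its commas and cleaned of every {...} override block. This assumes the usual ten-field event layout (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text) and that \N separates the two languages. The helper name split_bilingual is made up for illustration.

```python
import re

# Any ASS override block, for example {\fs14} or {\r}
OVERRIDE_TAG = re.compile(r'\{[^}]*\}')


def split_bilingual(dialogue_line):
    """Return (first_language, second_language) or None for other lines."""
    if not dialogue_line.startswith('Dialogue:'):
        return None
    # The text is the tenth field and may itself contain commas,
    # so split on the first nine commas only.
    fields = dialogue_line.split(',', 9)
    if len(fields) < 10:
        return None
    text = OVERRIDE_TAG.sub('', fields[9])
    # \N is the ASS hard line break; here it separates the two languages.
    parts = text.split(r'\N', 1)
    first = parts[0].strip()
    second = parts[1].strip() if len(parts) > 1 else ''
    return first, second


sample = (r"Dialogue: 0,0:01:09.76,0:01:12.03,Default,,0000,0000,0000,,"
          r"有时候 我觉得自己受到了诅咒\N{\fnCronos Pro Subhead\fs14\1c&H3CF1F3&}"
          r"Sometimes, I think I'm cursed.{\r}")

pair = split_bilingual(sample)
print(pair[0] + "\t" + pair[1])
```

Files that declare a different Format line in their Events section would need a different field count.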
References:
https://blog.csdn.net/xbb224007/article/details/94590683
https://www.jb51.net/article/221385.htm