Project Overview
The free speech-to-text/subtitle websites I used to rely on recently started charging. As an individual user I really did not want to pay for the compute, and I could not find a suitable open-source speech-recognition project either. Then I came across 孙亖's CSDN blog and, following his approach, put this project together.
The general idea: split the audio into short segments, feed each segment to 讯飞输入法 (the iFlytek input method) through its voice-input key, copy out the recognized text, and use it to build an SRT subtitle file.
Tools Used
1. MUMU Android emulator
2. 讯飞输入法 (Android version)
3. 大象笔记本 (downloaded from 应用宝 inside the Android emulator)
4. pyautogui
5. pydub
6. Python 3.6
Results
The Python script takes a specified audio file and produces an SRT subtitle file.
Demo: automatically generating subtitles with the script
Implementation Steps
1. Splitting the audio
pydub is used to split the audio. For setup details, follow the installation section of the pydub documentation, including installing audio playback support and configuring ffmpeg; after setting the environment variables you also need to restart the computer once.
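If pydub still complains that it cannot find ffmpeg after the environment variable is set, one workaround (not part of the original write-up; the install path and file name below are placeholders) is to point pydub at the binary explicitly and load a short clip as a sanity check:

# Workaround sketch: tell pydub where ffmpeg lives instead of relying on PATH.
# The install path and "test.mp3" are placeholders for your own setup.
from pydub import AudioSegment

AudioSegment.converter = r"C:\ffmpeg\bin\ffmpeg.exe"
clip = AudioSegment.from_file("test.mp3", "mp3")
print("loaded", clip.duration_seconds, "seconds of audio")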
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import time, datetime
import os

# convert milliseconds to an SRT-style timestamp such as '00:00:39,770'
def format_time(ms):
    hours, rem = divmod(int(ms), 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)

# audio file
file_path = "4. 雅思模考卷1S4.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:", file_path, "suffix:", file_suffix)
sound = AudioSegment.from_file(file_path, file_suffix)
time.sleep(0.5)
print("start")

# segmentation parameters
idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, min_silence_len, sound.dBFS * 1.3, 10)

for i in range(len(timestamp_list)):
    a = timestamp_list[i][0]   # segment start (ms)
    b = timestamp_list[i][1]   # segment end (ms)
    d = b - a                  # segment duration (ms)
    idx += 1
    # timestamp line for this segment, printed for inspection
    index_time = '{1} --> {2}'.format(idx, format_time(a), format_time(b))
    print(index_time, "duration is:", d, 'ms')
    # soften the cut: extend the segment slightly into the surrounding silence
    start = max(0, a - min_silence_len / 2, previous_end)
    if i == len(timestamp_list) - 1:
        end = min(len(sound), b + min_silence_len)
    else:
        end = min(timestamp_list[i + 1][0], b + min_silence_len)
    play(sound[start:end])
    time.sleep(2)
    previous_end = b

print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(
    round(sound.dBFS, 2), round(sound.max_dBFS, 2),
    sound.duration_seconds, len(timestamp_list)))
print('audio time:', str(datetime.timedelta(milliseconds=len(sound))))
The key function is:
detect_nonsilent(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1)
This is the audio-splitting function. It returns a list of [start, end] timestamps (in milliseconds) for the non-silent sections rather than the audio itself. audio_segment is the sound to process; min_silence_len is the length, in ms, of the analysis window, i.e. the shortest stretch of quiet audio that counts as silence; silence_thresh is the loudness threshold in dBFS (a negative number) below which a window is treated as silent; seek_step is the step, in ms, between successive windows.
The function computes the RMS loudness of a min_silence_len-long window and compares it with silence_thresh; if it falls below the threshold, that window is marked as silence. It then slides the window forward by seek_step and repeats. Once all the silent stretches have been located, the audio in between falls out as the non-silent segments.
The smaller min_silence_len is, the more pieces the audio is cut into; and the higher (closer to zero) silence_thresh is, the more segments you get.
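The best values depend on the recording, so it can help to sweep the two parameters first and see how many segments each combination produces. A minimal sketch (the file name "sample.mp3" is a placeholder), reusing the sound.dBFS-based threshold from the script above:

# Parameter-sweep sketch: print how many non-silent segments each
# (min_silence_len, threshold) combination yields. "sample.mp3" is a placeholder.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

sound = AudioSegment.from_file("sample.mp3", "mp3")
for msl in (300, 500, 1000):          # candidate min_silence_len values, in ms
    for factor in (1.1, 1.3, 1.5):    # threshold as a multiple of the average loudness
        ranges = detect_nonsilent(sound, msl, sound.dBFS * factor, 10)
        print('min_silence_len={0} thresh={1:.1f} segments={2}'.format(
            msl, sound.dBFS * factor, len(ranges)))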
2. Automating the mouse and keyboard
pyautogui is used to simulate mouse clicks, in particular to press the voice-input key of 讯飞输入法.
import pyautogui
import time
import pyperclip

# screen coordinates of the voice-input key (emulator window at the top-left)
space_loc = (239, 927)

time.sleep(2)
print("start")
print("current location:", pyautogui.position())

# move to the voice-input key and click it
pyautogui.moveTo(space_loc[0], space_loc[1])
time.sleep(1)
pyautogui.click()
pyautogui.press('a')

# round-trip a test string through the clipboard to confirm pyperclip works
pyperclip.copy('要输入的汉字')
print(pyperclip.paste())

# simulate a long press, i.e. hold the voice key down for a few seconds
pyautogui.mouseDown()
time.sleep(4)
pyautogui.mouseUp()
Of course, you first need
print("current location:", pyautogui.position())
to print out the on-screen position of the voice-input key.
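A small helper (not from the original post) makes this easier: print the mouse position once a second while you hover over each key in the emulator, and note down the coordinates for space_loc, edit_loc and the other buttons.

# Coordinate-finder sketch: hover over a key in the emulator and read off its
# position from the console output.
import time
import pyautogui

for _ in range(10):
    print("current location:", pyautogui.position())
    time.sleep(1)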
Note that PyCharm has to be started with administrator rights for the simulated mouse clicks to work. It is also best not to run word-capture tools such as Lingoes while the script is running, to avoid clipboard conflicts.
3. Putting it together: generating the SRT file
'''
function: convert speech audio to text and write an SRT subtitle file
notes:
    1. start PyCharm with administrator rights
    2. put the MUMU window at the top-left of the screen
    3. set the Windows input (recording) device to stereo mix
    4. open the iFlytek input method inside MUMU in advance
requirements:
    1. MUMU Android emulator
    2. install yinxiang and the iFlytek input method inside MUMU
    3. install pydub, see https://github.com/jiaaro/pydub
    4. pip install pyautogui
limitations:
    1. audio length should be less than 24 hours
    2. Lingoes and other clipboard/word-capture apps can cause typing errors
'''
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import pyautogui
import pyperclip
import time, datetime
import os

# convert milliseconds to an SRT timestamp such as '00:00:39,770'
def format_time(ms):
    hours, rem = divmod(int(ms), 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)

# key coordinates when the emulator window sits at the top-left of the screen
space_loc = (239, 927)        # voice-input (space) key
enter_loc = (484, 936)        # enter key
edit_loc = (310, 607)         # note editing area
select_all_loc = (150, 918)   # "select all" button
cut_loc = (374, 919)          # "cut" button
back_loc = (479, 922)         # "back" button

# audio file
file_path = "2. 雅思模考卷1S2.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:", file_path, "suffix:", file_suffix)

# output SRT file
srt_file = os.path.splitext(file_path)[0] + '.srt'
f = open(file=srt_file, mode="w", encoding='utf8')

sound = AudioSegment.from_file(file_path, file_suffix)
start_time = time.localtime()
print("start", time.strftime('%H:%M:%S', start_time))

idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, min_silence_len, sound.dBFS * 1.3, 10)

for i in range(len(timestamp_list)):
    a = timestamp_list[i][0]   # segment start (ms)
    b = timestamp_list[i][1]   # segment end (ms)
    d = b - a                  # segment duration (ms)
    # SRT entry index
    idx += 1
    # soften the cut: extend the segment slightly into the surrounding silence
    start = max(0, a - min_silence_len / 2, previous_end)
    if i == len(timestamp_list) - 1:
        end = min(len(sound), b + min_silence_len)
    else:
        end = min(timestamp_list[i + 1][0], b + min_silence_len)
    previous_end = b
    # entry index and timestamp line
    index_time = '{0}\n{1} --> {2}\n'.format(idx, format_time(start), format_time(end))
    # hold down the voice-input key while the segment plays
    pyautogui.moveTo(space_loc[0], space_loc[1])
    pyautogui.mouseDown()
    time.sleep(0.05)
    play(sound[start:end])
    time.sleep(0.05)
    pyautogui.mouseUp()
    time.sleep(0.5)
    # select all recognized text, cut it to the clipboard, then go back
    delay_time = 1  # seconds
    pyautogui.click(edit_loc[0], edit_loc[1])
    time.sleep(delay_time)
    pyautogui.click(select_all_loc[0], select_all_loc[1])
    time.sleep(delay_time)
    pyautogui.click(cut_loc[0], cut_loc[1])
    time.sleep(delay_time)
    pyautogui.click(back_loc[0], back_loc[1])
    time.sleep(delay_time)
    # read the recognized text from the clipboard and write one SRT entry
    text = pyperclip.paste()
    f.write(index_time + text + '\n\n')   # a blank line separates SRT entries
    print("Section is:", timestamp_list[i], "duration is:", d, 'text:', text)

f.close()

# report processing time versus audio length
end_time = time.localtime()
print('end', time.strftime('%H:%M:%S', end_time), 'processing time:',
      format_time(1000 * (time.mktime(end_time) - time.mktime(start_time))),
      'audio time:', str(datetime.timedelta(milliseconds=len(sound))))
Conclusion
After all that furious-looking automation, the conversion runs at about 150% of real time: the total processing time is roughly 50% longer than the audio itself, which is the first drawback. The segmentation is not perfect either; adjacent subtitles sometimes overlap, which could be tweaked later (see the sketch below).
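One possible tweak (a sketch, not from the original script): compute the padded [start, end] pairs up front and clamp each entry so it can never overlap its neighbours. Here timestamp_list is the output of detect_nonsilent, total_len is len(sound), and pad is the extra context in milliseconds.

# Non-overlapping padding sketch: each padded segment may start a little early
# and end a little late, but never before the previous entry ends or after the
# next detected segment begins.
def pad_segments(timestamp_list, total_len, pad=250):
    padded = []
    for i, (a, b) in enumerate(timestamp_list):
        start = max(0, a - pad)
        if padded:
            start = max(start, padded[-1][1])   # never overlap the previous subtitle
        next_start = timestamp_list[i + 1][0] if i + 1 < len(timestamp_list) else total_len
        end = min(b + pad, next_start)          # never run into the next segment
        padded.append([start, end])
    return padded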
This may look like an off-label use of the 讯飞 input method, but the throughput is actually very low, and as an individual user I only have a small amount of audio to process, so the impact on the input-method service should be negligible.