Project Overview
The free speech-to-text/subtitle websites I used to rely on recently started charging. As an individual user I really did not want to pay for the compute, and I could not find a suitable open-source speech-recognition project either. Then I came across 孙亖's CSDN blog and, following his approach, put this project together.
The general idea: split the audio into short segments, feed each segment to 讯飞输入法 (the iFlytek input method) through its voice-input key, copy out the recognized text, and use it to build an SRT subtitle file.
Tools Used
1. MUMU Android emulator
2. 讯飞输入法 (Android version)
3. 大象笔记本 (downloaded from 应用宝 inside the Android emulator)
4. pyautogui
5. pydub
6. Python 3.6
Results
The Python script takes a specified audio file and produces an SRT subtitle file.
Demo: automatically generating subtitles with the script
Implementation Steps
1. Splitting the audio
pydub is used to split the audio. For setup details, follow the installation section of the pydub documentation, including installing audio playback support and configuring ffmpeg; after setting the environment variables you also need to restart the computer once.
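If pydub still complains that it cannot find ffmpeg after the environment variable is set, one workaround (not part of the original write-up; the install path and file name below are placeholders) is to point pydub at the binary explicitly and load a short clip as a sanity check:

# Workaround sketch: tell pydub where ffmpeg lives instead of relying on PATH.
# The install path and "test.mp3" are placeholders for your own setup.
from pydub import AudioSegment

AudioSegment.converter = r"C:\ffmpeg\bin\ffmpeg.exe"
clip = AudioSegment.from_file("test.mp3", "mp3")
print("loaded", clip.duration_seconds, "seconds of audio")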
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import time, datetime
import os

# convert milliseconds to an SRT-style timestamp such as '00:00:39,770'
def format_time(ms):
    hours, rem = divmod(int(ms), 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)

# audio file
file_path = "4. 雅思模考卷1S4.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:", file_path, "suffix:", file_suffix)
sound = AudioSegment.from_file(file_path, file_suffix)
time.sleep(0.5)
print("start")

# segmentation parameters
idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, min_silence_len, sound.dBFS * 1.3, 10)

for i in range(len(timestamp_list)):
    a = timestamp_list[i][0]   # segment start (ms)
    b = timestamp_list[i][1]   # segment end (ms)
    d = b - a                  # segment duration (ms)
    idx += 1
    # timestamp line for this segment, printed for inspection
    index_time = '{1} --> {2}'.format(idx, format_time(a), format_time(b))
    print(index_time, "duration is:", d, 'ms')
    # soften the cut: extend the segment slightly into the surrounding silence
    start = max(0, a - min_silence_len / 2, previous_end)
    if i == len(timestamp_list) - 1:
        end = min(len(sound), b + min_silence_len)
    else:
        end = min(timestamp_list[i + 1][0], b + min_silence_len)
    play(sound[start:end])
    time.sleep(2)
    previous_end = b

print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(
    round(sound.dBFS, 2), round(sound.max_dBFS, 2),
    sound.duration_seconds, len(timestamp_list)))
print('audio time:', str(datetime.timedelta(milliseconds=len(sound))))
The key function is:
detect_nonsilent(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1)
This is the audio-splitting function. It returns a list of [start, end] timestamps (in milliseconds) for the non-silent sections rather than the audio itself. audio_segment is the sound to process; min_silence_len is the length, in ms, of the analysis window, i.e. the shortest stretch of quiet audio that counts as silence; silence_thresh is the loudness threshold in dBFS (a negative number) below which a window is treated as silent; seek_step is the step, in ms, between successive windows.
The function computes the RMS loudness of a min_silence_len-long window and compares it with silence_thresh; if it falls below the threshold, that window is marked as silence. It then slides the window forward by seek_step and repeats. Once all the silent stretches have been located, the audio in between falls out as the non-silent segments.
The smaller min_silence_len is, the more pieces the audio is cut into; and the higher (closer to zero) silence_thresh is, the more segments you get.
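The best values depend on the recording, so it can help to sweep the two parameters first and see how many segments each combination produces. A minimal sketch (the file name "sample.mp3" is a placeholder), reusing the sound.dBFS-based threshold from the script above:

# Parameter-sweep sketch: print how many non-silent segments each
# (min_silence_len, threshold) combination yields. "sample.mp3" is a placeholder.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

sound = AudioSegment.from_file("sample.mp3", "mp3")
for msl in (300, 500, 1000):          # candidate min_silence_len values, in ms
    for factor in (1.1, 1.3, 1.5):    # threshold as a multiple of the average loudness
        ranges = detect_nonsilent(sound, msl, sound.dBFS * factor, 10)
        print('min_silence_len={0} thresh={1:.1f} segments={2}'.format(
            msl, sound.dBFS * factor, len(ranges)))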
2. Automating the mouse and keyboard
pyautogui is used to simulate mouse clicks, in particular to press the voice-input key of 讯飞输入法.
import pyautogui
import time
import pyperclip

# screen coordinates of the voice-input key (emulator window at the top-left)
space_loc = (239, 927)

time.sleep(2)
print("start")
print("current location:", pyautogui.position())

# move to the voice-input key and click it
pyautogui.moveTo(space_loc[0], space_loc[1])
time.sleep(1)
pyautogui.click()
pyautogui.press('a')

# round-trip a test string through the clipboard to confirm pyperclip works
pyperclip.copy('要输入的汉字')
print(pyperclip.paste())

# simulate a long press, i.e. hold the voice key down for a few seconds
pyautogui.mouseDown()
time.sleep(4)
pyautogui.mouseUp()
Of course, you first need
print("current location:", pyautogui.position())
to print out the on-screen position of the voice-input key.
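A small helper (not from the original post) makes this easier: print the mouse position once a second while you hover over each key in the emulator, and note down the coordinates for space_loc, edit_loc and the other buttons.

# Coordinate-finder sketch: hover over a key in the emulator and read off its
# position from the console output.
import time
import pyautogui

for _ in range(10):
    print("current location:", pyautogui.position())
    time.sleep(1)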
Note that PyCharm has to be started with administrator rights for the simulated mouse clicks to work. It is also best not to run word-capture tools such as Lingoes while the script is running, to avoid clipboard conflicts.
3. Putting it together: generating the SRT file
'''
function: convert speech audio to text and write an SRT subtitle file
notes:
    1. start PyCharm with administrator rights
    2. put the MUMU window at the top-left of the screen
    3. set the Windows input (recording) device to stereo mix
    4. open the iFlytek input method inside MUMU in advance
requirements:
    1. MUMU Android emulator
    2. install yinxiang and the iFlytek input method inside MUMU
    3. install pydub, see https://github.com/jiaaro/pydub
    4. pip install pyautogui
limitations:
    1. audio length should be less than 24 hours
    2. Lingoes and other clipboard/word-capture apps can cause typing errors
'''
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import pyautogui
import pyperclip
import time, datetime
import os

# convert milliseconds to an SRT timestamp such as '00:00:39,770'
def format_time(ms):
    hours, rem = divmod(int(ms), 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)

# key coordinates when the emulator window sits at the top-left of the screen
space_loc = (239, 927)        # voice-input (space) key
enter_loc = (484, 936)        # enter key
edit_loc = (310, 607)         # note editing area
select_all_loc = (150, 918)   # "select all" button
cut_loc = (374, 919)          # "cut" button
back_loc = (479, 922)         # "back" button

# audio file
file_path = "2. 雅思模考卷1S2.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:", file_path, "suffix:", file_suffix)

# output SRT file
srt_file = os.path.splitext(file_path)[0] + '.srt'
f = open(file=srt_file, mode="w", encoding='utf8')

sound = AudioSegment.from_file(file_path, file_suffix)
start_time = time.localtime()
print("start", time.strftime('%H:%M:%S', start_time))

idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, min_silence_len, sound.dBFS * 1.3, 10)

for i in range(len(timestamp_list)):
    a = timestamp_list[i][0]   # segment start (ms)
    b = timestamp_list[i][1]   # segment end (ms)
    d = b - a                  # segment duration (ms)
    # SRT entry index
    idx += 1
    # soften the cut: extend the segment slightly into the surrounding silence
    start = max(0, a - min_silence_len / 2, previous_end)
    if i == len(timestamp_list) - 1:
        end = min(len(sound), b + min_silence_len)
    else:
        end = min(timestamp_list[i + 1][0], b + min_silence_len)
    previous_end = b
    # entry index and timestamp line
    index_time = '{0}\n{1} --> {2}\n'.format(idx, format_time(start), format_time(end))
    # hold down the voice-input key while the segment plays
    pyautogui.moveTo(space_loc[0], space_loc[1])
    pyautogui.mouseDown()
    time.sleep(0.05)
    play(sound[start:end])
    time.sleep(0.05)
    pyautogui.mouseUp()
    time.sleep(0.5)
    # select all recognized text, cut it to the clipboard, then go back
    delay_time = 1  # seconds
    pyautogui.click(edit_loc[0], edit_loc[1])
    time.sleep(delay_time)
    pyautogui.click(select_all_loc[0], select_all_loc[1])
    time.sleep(delay_time)
    pyautogui.click(cut_loc[0], cut_loc[1])
    time.sleep(delay_time)
    pyautogui.click(back_loc[0], back_loc[1])
    time.sleep(delay_time)
    # read the recognized text from the clipboard and write one SRT entry
    text = pyperclip.paste()
    f.write(index_time + text + '\n\n')   # a blank line separates SRT entries
    print("Section is:", timestamp_list[i], "duration is:", d, 'text:', text)

f.close()

# report processing time versus audio length
end_time = time.localtime()
print('end', time.strftime('%H:%M:%S', end_time), 'processing time:',
      format_time(1000 * (time.mktime(end_time) - time.mktime(start_time))),
      'audio time:', str(datetime.timedelta(milliseconds=len(sound))))
Conclusion
After all that furious-looking automation, the conversion runs at about 150% of real time: the total processing time is roughly 50% longer than the audio itself, which is the first drawback. The segmentation is not perfect either; adjacent subtitles sometimes overlap, which could be tweaked later (see the sketch below).
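One possible tweak (a sketch, not from the original script): compute the padded [start, end] pairs up front and clamp each entry so it can never overlap its neighbours. Here timestamp_list is the output of detect_nonsilent, total_len is len(sound), and pad is the extra context in milliseconds.

# Non-overlapping padding sketch: each padded segment may start a little early
# and end a little late, but never before the previous entry ends or after the
# next detected segment begins.
def pad_segments(timestamp_list, total_len, pad=250):
    padded = []
    for i, (a, b) in enumerate(timestamp_list):
        start = max(0, a - pad)
        if padded:
            start = max(start, padded[-1][1])   # never overlap the previous subtitle
        next_start = timestamp_list[i + 1][0] if i + 1 < len(timestamp_list) else total_len
        end = min(b + pad, next_start)          # never run into the next segment
        padded.append([start, end])
    return padded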
This may look like an off-label use of the 讯飞 input method, but the throughput is actually very low, and as an individual user I only have a small amount of audio to process, so the impact on the input-method service should be negligible.