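In my second week of individual work I surveyed Microsoft's TTS model: given a voice, a language, and the text to speak, it generates natural, realistic human speech. Microsoft's speech models are widely regarded as among the best available. The generated speech is fluent and clearly pronounced, with attention to intonation and continuity, which fits the role of a speaking practice assistant well.
Building on that work, this week's task is to wrap the TTS call in a function and chain it with the large language model and the digital human model into one complete pipeline, giving the English speaking assistant a visual form.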
import azure.cognitiveservices.speech as speechsdk
def tts(voicename="en-US-AvaMultilingualNeural", path_to_save='../files/audio.wav', text='Hello, I am SpeakSpark! Nice to meet you!'):
    # Fill in the Azure Speech resource key and region before running.
    speech_key, service_region = "", ""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_synthesis_voice_name = voicename
    # Write the synthesized audio to a file instead of the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename=path_to_save)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    # Block until synthesis has finished.
    speech_synthesizer.speak_text_async(text).get()
def test_tts():
    speech_key, service_region = "", ""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_synthesis_language = "en-US"
    # Set the voice name, refer to https://aka.ms/speech/voices/neural for full list.
    speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"
    # Create a speech synthesizer that uses the default speaker as audio output.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    # Read the text to speak from console input.
    print("Type some text that you want to speak...")
    text = input()
    # Synthesize the received text to speech.
    # The synthesized speech is expected to be heard on the speaker when this line runs.
    result = speech_synthesizer.speak_text_async(text).get()
    print(result)
    # Checks result.
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized to speaker for text [{}]".format(text))
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
        print("Did you update the subscription info?")
if __name__ == '__main__':
    tts()
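Since `tts()` does not inspect the synthesis result, a quick check on the output file can catch a failed run before the audio is handed to the next stage. The following is a minimal sketch using only the standard library; the helper name `wav_duration_seconds` is hypothetical, and it assumes the file written by `tts()` is uncompressed PCM WAV, which is the only format Python's `wave` module reads.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds.

    Hypothetical helper, not part of the project code. Raises wave.Error
    (or EOFError for an empty file) if the file is not a readable WAV.
    """
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()
        frame_rate = wav_file.getframerate()
    return frame_count / float(frame_rate)


# Example use after calling tts() (path taken from the tts() default):
# duration = wav_duration_seconds('../files/audio.wav')
# if duration == 0:
#     raise RuntimeError('TTS produced an empty audio file')
```

A zero duration or an exception here usually points to a cancelled synthesis, for example because the key and region were left empty.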
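Since each module is already wrapped, the next step is to chain them together in a single function.
Concretely: the user request (text and role) is passed to the large language model for inference; the reply is passed to the TTS model; the generated speech is fed to the SadTalker model; and the resulting digital human video is returned: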
from flask import request, send_from_directory

# `app` is the Flask application; `tts`, `pipe`, `prompt`, `talker` and the
# SadTalker settings are module-level objects defined elsewhere in this file.
@app.route('/full', methods=['GET', 'POST'])
def full():
    # Read the user's text and the chosen avatar role from the JSON body.
    data = request.get_json()
    query = data.get('query', 'Hello, I am SpeakSpark! Nice to meet you!')
    role = data.get('role', 'girl')
    voicename = 'en-US-AvaMultilingualNeural' if role == 'girl' else 'en-US-BrianMultilingualNeural'
    # Step 1: the large language model generates the reply.
    response = pipe(prompt.format(query=query))
    # Step 2: TTS turns the reply into speech at ./files/audio.wav.
    tts(voicename=voicename, path_to_save='./files/audio.wav', text=response.text)
    # Step 3: SadTalker renders the talking-head video from the audio.
    video = talker.test(
        pic_path=pic_path.format(role=role),
        crop_pic_path=crop_pic_path.format(role=role),
        first_coeff_path=first_coeff_path.format(role=role),
        crop_info=crop_info,
        driven_audio=audio,
        preprocess=preprocess_type,
        still_mode=is_still_mode,
        use_enhancer=enhancer,
        batch_size=batch_size,
        size=size_of_image,
        pose_style=pose_style,
        facerender=facerender,
        exp_scale=exp_weight,
        use_idle_mode=use_idle_mode,
        length_of_audio=length_of_audio,
        use_blink=blink_every,
        fps=20
    )
    print(response.text)
    return send_from_directory(directory='./files', path=f'{role}_audio.mp4')
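The data flow of the handler above can be exercised without loading any model by passing the three stages in as plain functions. This is only an illustrative sketch: `run_pipeline` and the `fake_*` stand-ins are hypothetical names, not part of the project, and the `'boy'` role is an assumption (the source only names `'girl'`).

```python
def run_pipeline(query, role, generate_reply, synthesize, render_video):
    """Chain the three stages: LLM reply -> speech audio -> talking-head video."""
    # Pick the voice from the role, mirroring the handler above.
    if role == 'girl':
        voicename = 'en-US-AvaMultilingualNeural'
    else:
        voicename = 'en-US-BrianMultilingualNeural'
    reply_text = generate_reply(query)
    audio_path = synthesize(voicename, reply_text)
    video_path = render_video(role, audio_path)
    return reply_text, video_path


# Stand-in stages (hypothetical) so the flow runs without any model loaded.
def fake_llm(query):
    return f"You said: {query}"


def fake_tts(voicename, text):
    return './files/audio.wav'


def fake_talker(role, audio_path):
    return f'./files/{role}_audio.mp4'


reply, video = run_pipeline('Hi there', 'boy', fake_llm, fake_tts, fake_talker)
print(reply, video)
```

Keeping the stages injectable like this makes it possible to test the routing logic separately from the slow model calls.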
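The configuration and prompt for the large language model and the other models are shown below. This covers configuring the TurboMind engine, designing a detailed prompt that guides the model toward natural conversation, and setting the parameters needed for video generation, including image size, preprocessing type, whether to use the enhancer, and still mode: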
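import random
from lmdeploy import pipeline, TurbomindEngineConfig

# Fraction of GPU memory reserved for the k/v cache.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('/root/assis/quant-4bit', backend_config=backend_config, temperature=0.8, model_name='internlm2-chat-7b')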
prompt = '''
You are an English speaking practice assistant named SpeakSpark. Your task is to engage in natural, conversational English and correct any non-native expressions. Here are your guidelines:
1. When a user asks a question or makes a statement, respond with a natural and conversational reply.
2. If the user's expression needs improvement, give a natural reply first. Then, provide three specific suggestions to enhance their expression, explaining why each change makes the expression more natural.
### Example1:
User: How can I improve my English speaking skills?
Assistant: One great way to improve your English speaking skills is to practice with native speakers as much as possible. You can also try watching English movies or TV shows and repeating the lines to practice pronunciation.
### Example2:
User: I got so black after my vacation.
Assistant: It sounds like you had a lot of sun on your vacation! You must have spent a lot of time outdoors.
Here are some suggestions to improve your expression:
1. Use "tanned" instead of "black" to describe skin darkening from sun exposure. "Tanned" is the appropriate term in English.
2. Replace "so" with "really" or "quite" to make the sentence sound more natural: "I got really tanned after my vacation."
3. Specify "vacation" to make the sentence clear. You could also say "holiday" if you prefer British English: "I got really tanned after my holiday."
### User Input:
{query}
### Assistant:
'''
# testQuery = 'She is good in playing piano.'
# response = pipe(prompt.format(query=testQuery))
# print(response)
blink_every = True
size_of_image = 256
preprocess_type = 'crop'
facerender = 'facevid2vid'
enhancer = False
is_still_mode = False
pic_path = './inputs/{role}.png'
crop_pic_path = './inputs/first_frame_dir_{role}/{role}.png'
first_coeff_path = './inputs/first_frame_dir_{role}/{role}.mat'
crop_info = ((403, 403), (19, 30, 502, 513), [40.05956541381802, 40.17324339233366, 443.7892505041507, 443.9029284826663])
exp_weight = 1
batch_size = 10
pose_style = random.randint(0, 45)
use_ref_video = False
ref_video = None
ref_info = 'pose'
use_idle_mode = False
length_of_audio = 5
audio = './files/audio.wav'
talker = SadTalker()
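Because `role` comes from the request and is substituted into the path templates above, it may be worth checking it against a fixed set before formatting. The sketch below is an assumption about how that could look, not the project's code: `ALLOWED_ROLES`, `resolve_role_paths`, and the `'boy'` role name are hypothetical, while the three templates are copied from the configuration.

```python
# Path templates copied from the configuration above.
PIC_PATH = './inputs/{role}.png'
CROP_PIC_PATH = './inputs/first_frame_dir_{role}/{role}.png'
FIRST_COEFF_PATH = './inputs/first_frame_dir_{role}/{role}.mat'

# Hypothetical whitelist; 'boy' is an assumed name for the second avatar.
ALLOWED_ROLES = {'girl', 'boy'}


def resolve_role_paths(role, allowed_roles=ALLOWED_ROLES):
    """Return the per-role input paths, rejecting roles that are not whitelisted."""
    if role not in allowed_roles:
        raise ValueError(f'unknown role: {role!r}')
    return {
        'pic_path': PIC_PATH.format(role=role),
        'crop_pic_path': CROP_PIC_PATH.format(role=role),
        'first_coeff_path': FIRST_COEFF_PATH.format(role=role),
    }


paths = resolve_role_paths('girl')
print(paths['crop_pic_path'])
```

Rejecting unknown roles up front keeps a value such as `'../secret'` from ever reaching the file paths or the returned video name.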