# Video-to-Audio End-to-End Smart Chat Project (Paraformer + Qwen2-VL + CosyVoice)
## Project Overview
This is a locally deployed, end-to-end smart chat project that takes a video as input and produces spoken audio as output, with a selectable reply voice (for example, a timbre such as "丁老爷"). For simplicity, a local video file is used as the input; in a real project this could be replaced with video recorded live from a camera, enabling truly real-time AI chat (see the sketch below).
PS: I'm a complete beginner, so please forgive any mistakes 😣😣
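As a rough illustration of that camera-based variant (a sketch, not part of this project's code): the snippet below records a short clip from the default webcam with OpenCV (`opencv-python` assumed installed). Note that it captures video frames only; microphone audio would need a separate library such as `sounddevice`.

```python
# Minimal sketch: record ~5 seconds from the default webcam instead of
# reading a local file. Video frames only -- no audio is captured here.
import cv2

cap = cv2.VideoCapture(0)  # default camera
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("camera_clip.mp4", fourcc, 20.0, (640, 480))

for _ in range(100):  # 100 frames at 20 fps ≈ 5 seconds
    ok, frame = cap.read()
    if not ok:
        break
    # VideoWriter requires frames that match its declared size
    writer.write(cv2.resize(frame, (640, 480)))

cap.release()
writer.release()
```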
## Model Selection
| Function | Model |
| --- | --- |
| Speech-to-text (ASR) | Paraformer-large |
| Multimodal reasoning | Qwen2-VL-2B-Instruct-GPTQ-Int4 |
| Text-to-speech (TTS) | CosyVoice-300M |
## Implementation Steps
### 1. Speech-to-Text
- Load and initialize the Paraformer-large model:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def funasr_initiation():
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        # ASR model
        model='speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision="v2.0.4",
        # Voice activity detection model for segmenting long audio
        vad_model='speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",
        # Punctuation restoration model
        punc_model='punc_ct-transformer_zh-cn-common-vocab272727-pytorch', punc_model_revision="v2.0.4",
        # Optional speaker model (disabled here)
        # spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
        # spk_model_revision="v2.0.2",
    )
    return inference_pipeline
```
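In the snippets below, `funasr_model` is assumed to be the pipeline returned by this function, i.e. `funasr_model = funasr_initiation()`.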
- Use the moviepy library to extract the audio track from the input video, save it to the project folder, and feed the extracted audio to the model to obtain the transcription:
```python
from moviepy.editor import VideoFileClip

video = VideoFileClip(video_path)
# Extract the audio track
audio = video.audio
# Save it as a wav file
audio.write_audiofile("extracted_audio.wav")

# Run speech recognition on the extracted audio
rec_result = funasr_model(input='extracted_audio.wav')
# Pull the recognized text string out of the result
response = rec_result[0]['text']
```
### 2. Multimodal Reasoning
- Load and initialize the multimodal model Qwen2-VL-2B-Instruct-GPTQ-Int4:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

def qwenvl_initiation():
    model_dir = 'model_path'
    # Default: load the model on the available device(s)
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_dir, torch_dtype="auto", device_map="auto"
    )
    # Default processor
    processor = AutoProcessor.from_pretrained(model_dir)
    return model, processor
```
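Once the model and processor are loaded, inference follows the standard Qwen2-VL chat pattern. Below is a minimal sketch of video-based question answering based on the official Qwen2-VL usage example, not this project's exact code; the video path, prompt text, and `max_new_tokens` value are placeholders, and `process_vision_info` comes from the `qwen-vl-utils` package.

```python
# Minimal inference sketch following the official Qwen2-VL usage pattern;
# the video path and prompt text below are placeholders.
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model, processor = qwenvl_initiation()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/input_video.mp4"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
answer = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(answer)
```

The decoded `answer` string would then be handed to the TTS stage (CosyVoice) to synthesize the spoken reply.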