Real-time speech transcription with faster-whisper + silero-vad

tol692

Last modified 2024-05-13 15:36:37

Tags: python, whisper, speech recognition, real-time audio/video

First published 2024-05-13 15:10:49

Article link: https://blog.csdn.net/weixin_59401092/article/details/138801644

Environment setup

CUDA is required.

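Run nvidia-smi.exe in a cmd console to check the GPU driver version and the CUDA version it supports.

Go to the NVIDIA CUDA website and download the CUDA version that matches your system.
Taking CUDA 11.7 as an example, pick the installer options that fit your system and needs (local Windows users typically select Windows, x86_64, the OS version, and exe (local), in that order).
After installation, run nvcc -V in the cmd console. If it prints the CUDA compiler version information, the installation succeeded.
Then look up the PyTorch build that matches your CUDA version on the PyTorch website. The command for CUDA 11.7 is: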

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117

Check that PyTorch can use the GPU. The last command should print True.

python
# press Enter
import torch
# press Enter
print(torch.cuda.is_available())
# press Enter
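
The same check can also be run as a script instead of being typed into the interpreter. This is a minimal sketch: the helper name `cuda_report` is made up for illustration, and it simply reports that PyTorch is missing if the import fails.

```python
def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False while `built_for_cuda` is set, the usual suspects are the driver version or a mismatch between the installed wheel and the local CUDA setup.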

Install faster-whisper

pip install faster-whisper

Download the model (the code below loads large-v2 from a local directory)
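
If a local copy is not required, faster-whisper can also fetch a converted model by size name the first time it is loaded. The sketch below assumes a CUDA GPU. `join_segments` and `transcribe_file` are helper names invented here, and `sample.wav` in the usage comment is a placeholder file name.

```python
def join_segments(segments, sep=", "):
    """Concatenate the .text of each segment, trimming stray whitespace."""
    return sep.join(seg.text.strip() for seg in segments)


def transcribe_file(path, model_name="large-v2"):
    """Transcribe one audio file and return the joined text."""
    # Imported lazily so join_segments can be used without the package.
    from faster_whisper import WhisperModel

    # A size name such as "large-v2" is downloaded on first use;
    # a path to a local model directory works as well.
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, info = model.transcribe(path, beam_size=5, language="zh")
    # `segments` is a generator: decoding only happens while iterating.
    return join_segments(segments)


# Usage (needs a GPU and the model files):
#     print(transcribe_file("sample.wav"))
```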

silero-vad

Download the model (the code below loads it from a local copy of the silero-vad repository)
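
silero-vad is loaded through `torch.hub`, either straight from GitHub (`snakers4/silero-vad`) or from a local clone, which is what this post does later. A minimal sketch of both options follows; `vad_hub_args` and `load_vad` are illustrative names, not part of either library.

```python
def vad_hub_args(local_repo=None):
    """Build keyword arguments for torch.hub.load.

    local_repo: path to a cloned silero-vad repository, or None to
    let torch.hub fetch the repository from GitHub.
    """
    if local_repo is None:
        return {"repo_or_dir": "snakers4/silero-vad", "model": "silero_vad"}
    return {"repo_or_dir": local_repo, "model": "silero_vad", "source": "local"}


def load_vad(local_repo=None):
    """Return (model, utils) as provided by the silero-vad hub entry point."""
    import torch  # lazy import: only needed when the model is really loaded

    model, utils = torch.hub.load(**vad_hub_args(local_repo))
    return model, utils


# Usage with a local clone, as in this post:
#     model, utils = load_vad("../../silero-vad")
```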

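Implementation

The idea: record continuously with pyaudio and let silero-vad decide whether someone is speaking. Once speech is detected, the audio is collected until the speaker pauses, then saved and transcribed.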
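
The buffering logic behind this idea can be looked at separately from the audio and model code. The sketch below uses only the standard library and takes the speech probability as an input instead of calling silero-vad. The class name `UtteranceBuffer` and its parameters are hypothetical; the defaults (threshold 0.5, three silent chunks) mirror the values used in this post.

```python
class UtteranceBuffer:
    """Group audio chunks into utterances based on per-chunk speech scores."""

    def __init__(self, threshold=0.5, max_silent_chunks=3):
        self.threshold = threshold
        self.max_silent_chunks = max_silent_chunks
        self._chunks = []   # chunks of the utterance in progress
        self._silent = 0    # consecutive non-speech chunks seen so far

    def feed(self, chunk, speech_prob):
        """Add one chunk; return the finished utterance bytes or None."""
        if speech_prob > self.threshold:
            self._chunks.append(chunk)
            self._silent = 0
            return None
        if not self._chunks:
            return None     # silence before any speech: ignore it
        self._silent += 1
        if self._silent < self.max_silent_chunks:
            self._chunks.append(chunk)  # keep short pauses in the utterance
            return None
        utterance = b"".join(self._chunks)
        self._chunks, self._silent = [], 0
        return utterance


# Demo with fake one-byte chunks and hand-written probabilities.
buf = UtteranceBuffer()
stream = [(b"a", 0.1), (b"b", 0.9), (b"c", 0.8),
          (b"d", 0.2), (b"e", 0.1), (b"f", 0.1)]
done = [u for u in (buf.feed(c, p) for c, p in stream) if u is not None]
print(done)  # [b'bcde']
```

Leading silence is dropped, two trailing silent chunks are kept as padding, and the third silent chunk closes the utterance.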

import threading
import wave
import numpy as np
import pyaudio
from faster_whisper import WhisperModel
import torch


def int2float(sound):
    """Convert int16 PCM samples to float32 in the range [-1, 1)."""
    sound = sound.astype('float32')
    sound *= 1 / 32768
    return sound.squeeze()


def save_audio(audio):
    with wave.open('output.wav', 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio)


def audio2Text(audio):
    """Transcribe float32 audio samples and print the joined text."""
    segments, info = whisperModel.transcribe(audio, beam_size=5, language="zh")
    # segments is a generator; transcription runs while it is iterated.
    result = ", ".join(segment.text for segment in segments)
    print(result)


if __name__ == '__main__':
    model, utils = torch.hub.load(
        repo_or_dir='../../silero-vad',
        model='silero_vad',
        trust_repo=None,
        source='local',
    )
    whisperModel = WhisperModel("../../large-v2", device="cuda", compute_type="float16")
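    # Only the VAD model is needed here. Unpacking utils as in the silero-vad
    # examples would rebind save_audio and shadow the function defined above,
    # so the later save_audio(audio) call would fail.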
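    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    SAMPLE_RATE = 16000
    # Samples per chunk (about 0.5 s at 16 kHz). Note: newer silero-vad
    # releases reportedly accept only 512-sample chunks at 16 kHz, so this
    # value assumes an older release that allows larger chunks.
    num_samples = 8192
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT,
                     channels=CHANNELS,
                     rate=SAMPLE_RATE,
                     input=True,
                     frames_per_buffer=num_samples)
    print("Started Recording")
    audio = None    # bytes of the utterance being recorded, or None
    countSize = 0   # consecutive chunks without speech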
    while True:
        audio_chunk = stream.read(num_samples)
        audio_int16 = np.frombuffer(audio_chunk, np.int16)
        audio_float32 = int2float(audio_int16)
        new_confidence = model(torch.from_numpy(audio_float32), 16000).item()
        if new_confidence > 0.5:
            if audio is None:
                audio = audio_chunk
                countSize = 0
            else:
                audio = audio + audio_chunk
                countSize = 0
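        else:
            countSize = countSize + 1
            if audio is not None and countSize < 3:
                # Keep short pauses inside the utterance.
                audio = audio + audio_chunk
            elif audio is not None:
                # Three silent chunks in a row: the utterance is over.
                save_audio(audio)
                # Pass the function and its arguments separately; writing
                # target=audio2Text(...) would run it in the main thread.
                samples = int2float(np.frombuffer(audio, np.int16))
                t = threading.Thread(target=audio2Text, args=(samples,), name='LoopThread')
                t.start()
                audio = None
                countSize = 0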
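
Starting one thread per utterance is simple, but results can come out of order when a long utterance is followed by a short one. One alternative, sketched here with only the standard library, is a single worker thread fed by a `queue.Queue`. The `transcribe` argument stands in for `audio2Text`, and every name in this block is illustrative.

```python
import queue
import threading


def start_worker(transcribe, results):
    """Start a daemon thread that transcribes queued utterances in order."""
    jobs = queue.Queue()

    def run():
        while True:
            audio = jobs.get()
            if audio is None:       # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results.append(transcribe(audio))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, name="TranscribeWorker", daemon=True)
    thread.start()
    return jobs, thread


# Demo with a stand-in for the real transcription function.
results = []
jobs, worker = start_worker(lambda audio: audio.decode().upper(), results)
for utterance in (b"first", b"second"):
    jobs.put(utterance)             # the recording loop would do this
jobs.put(None)                      # ask the worker to stop
worker.join()
print(results)  # ['FIRST', 'SECOND']
```

In the recording loop, `jobs.put(samples)` would replace the `threading.Thread(...)` call, so the loop never waits for the model and the output order matches the speaking order.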