Practical Neural Network Tools (Just for Fun) Series: Labeling Dialogue in Audio with silero-vad

_寒潭雁影

Last modified: 2023-06-14 22:36:03


First published: 2023-06-02 00:35:27

Article link: https://blog.csdn.net/weixinhum/article/details/130998559


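When we use a neural network for speech-to-text, we usually need to pick out the audio segments that actually contain speech first, and only then send them to the speech-to-text network for transcription. Otherwise all sorts of odd problems show up.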

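In this article we look at a fairly common approach: using the silero-vad voice activity detection (VAD) network, loaded through PyTorch Hub, to mark the parts of the audio that contain spoken dialogue.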

Turtle-speed version

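Without further ado, here is the code. It is simple: it runs voice activity detection on an audio file named 1.mp3 and saves each detected speech segment into a folder named 1, with the segment's start timestamp in the file name. If your working path has no 1.mp3 file or no folder named 1, adjust the code accordingly: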

# Speech endpoint detection with silero-vad
import time
from pprint import pprint

import torch

# Timing covers model loading as well as detection
stime = time.time()
SAMPLING_RATE = 16000
# Single-threaded inference
torch.set_num_threads(1)
USE_ONNX = False
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', onnx=USE_ONNX)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('1.mp3', sampling_rate=SAMPLING_RATE)
# Get speech timestamps for the full audio file
# min_silence_duration_ms: minimum silence (ms) needed to split two speech segments
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE, min_silence_duration_ms=1100, threshold=0.2)
# Print the number of detected segments, then the segments themselves
print(len(speech_timestamps))
pprint(speech_timestamps)
# Merge all speech chunks into one audio file
# save_audio('only_speech.wav', collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE)
# Save each speech chunk to a separate file
for i, chunk in enumerate(speech_timestamps):
    # Segment start in milliseconds (timestamps are in samples)
    start_ms = int(round(chunk['start'] * 1000 / SAMPLING_RATE))
    i_formatted = "{:0>4}".format(i)
    save_audio(f'1/{i_formatted}_{start_ms}.wav', collect_chunks([chunk], wav), sampling_rate=SAMPLING_RATE)

etime = time.time()
print("Elapsed time: %.2f s" % (etime - stime))
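
The file-name logic in the loop above can be isolated into a small helper. This is only a sketch: it assumes the timestamps are sample indices at 16 kHz, as in the listing, and the helper name segment_filename is made up for illustration.

```python
SAMPLING_RATE = 16000  # must match the rate passed to read_audio


def segment_filename(index, start_sample, sampling_rate=SAMPLING_RATE):
    """Build '<4-digit index>_<start in ms>.wav' from a start offset given in samples."""
    start_ms = int(round(start_sample * 1000 / sampling_rate))
    return f"{index:04d}_{start_ms}.wav"


# A segment that starts 1.5 s into the audio (sample 24000 at 16 kHz)
print(segment_filename(3, 24000))  # 0003_1500.wav
```

Keeping the millisecond offset in the name makes it possible to map each clip back to its position in the original recording later.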

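Running the code above on one roughly 24-minute episode of "Cells at Work!" takes 43.66 seconds for speech endpoint detection (my machine has a 5900X CPU and a 4090 GPU), which is really slow.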

ONNX acceleration

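To escape the awkward situation where endpoint detection is slower than the speech recognition itself, we can use the official ONNX model to speed things up. It is straightforward. First install the ONNX runtime: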

pip install onnxruntime

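Then change USE_ONNX on line 12 of the code above to True and that is it. This change cuts the processing time to 21.04 seconds, a bit more than twice as fast.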
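
If the script has to run on machines where onnxruntime may not be installed, the flag can be derived instead of hard-coded. A minimal sketch, assuming that falling back to the default PyTorch model is acceptable; the helper name onnxruntime_available is hypothetical.

```python
import importlib.util


def onnxruntime_available():
    """Return True if the onnxruntime package can be found in this environment."""
    return importlib.util.find_spec("onnxruntime") is not None


# Use the ONNX model only when its runtime is installed
USE_ONNX = onnxruntime_available()
print("USE_ONNX =", USE_ONNX)
```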

C++ version

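The Python version pulls in quite a few dependencies and is generally slower than a C++ version, so we switch to C++ to push the speed further and to make the program more portable as well.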

It is fairly simple. First download the ONNX Runtime build for Windows from: https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0

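After downloading, create a VC++ console project in VS2022 (2019 or earlier should also work) and configure the library. The setup is much like configuring OpenCV: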

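Project → Properties → C/C++ → Additional Include Directories → "unzip the downloaded archive and put the path of its include folder here"
Project → Properties → Linker → Additional Library Directories → "put the path of the archive's lib folder here"
Project → Properties → Linker → Input → Additional Dependencies → "onnxruntime.lib;onnxruntime_providers_shared.lib;%(AdditionalDependencies)"
Project → Properties → C/C++ → SDL checks → No (/sdl-)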

Then copy the DLLs from the lib folder into the project directory.

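With ONNX configured, take the two files in the examples/cpp folder of the silero-vad project as a reference. After some tweaking they give the same functionality as the Python version. The code is as follows: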

The code of wav.h is copied verbatim from the file of the same name in examples/cpp, so it is not reproduced here.

sileroVad.cpp

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <windows.h>

#include "onnxruntime_cxx_api.h"
#include "wav.h"
class VadIterator
{
    // OnnxRuntime resources
    Ort::Env env;
    Ort::SessionOptions session_options;
    std::shared_ptr<Ort::Session> session = nullptr;
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU);

public:
    void init_engine_threads(int inter_threads, int intra_threads)
    {
        // The method should be called in each thread/proc in multi-thread/proc work
        session_options.SetIntraOpNumThreads(intra_threads);
        session_options.SetInterOpNumThreads(inter_threads);
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    }

    void init_onnx_model(const std::string& model_path)
    {
        // Use a single inter-op thread and a single intra-op thread
        init_engine_threads(1, 1);
        // Load model
        std::wstring w_model_path = std::wstring(model_path.begin(), model_path.end());
        session = std::make_shared<Ort::Session>(env, w_model_path.c_str(), session_options);
    }

    void reset_states()
    {
        // Call reset before each audio start
        std::memset(_h.data(), 0, _h.size() * sizeof(float));
        std::memset(_c.data(), 0, _c.size() * sizeof(float));
        triggerd = false;
        temp_end = 0;
        current_sample = 0;
    }
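    // Call it in the predict function if you prefer raw 16-bit PCM bytes as input.
    void bytes_to_float_tensor(const char* pcm_bytes)
    {
        // Copy into an int16_t buffer first: copying the bytes straight into the
        // float vector would reinterpret the PCM bits as floats.
        std::vector<int16_t> pcm(window_size_samples);
        std::memcpy(pcm.data(), pcm_bytes, window_size_samples * sizeof(int16_t));
        for (int i = 0; i < window_size_samples; i++)
        {
            input[i] = static_cast<float>(pcm[i]) / 32768; // int16_t normalized to float
        }
    }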

    void predict(const std::vector<float>& data)
    {
        // bytes_to_float_tensor(data); 
        // Infer
        // Create ort tensors
        input.assign(data.begin(), data.end());
        Ort::Value input_ort = Ort::Value::CreateTensor<float>(
            memory_info, input.data(), input.size(), input_node_dims, 2);
        Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
            memory_info, sr.data(), sr.size(), sr_node_dims, 1);
        Ort::Value h_ort = Ort::Value::CreateTensor<float>(
            memory_info, _h.data(), _h.size(), hc_node_dims, 3);
        Ort::Value c_ort = Ort::Value::CreateTensor<float>(
            memory_info, _c.data(), _c.size(), hc_node_dims, 3);
        // Clear and add inputs
        ort_inputs.clear();
        ort_inputs.emplace_back(std::move(input_ort));
        ort_inputs.emplace_back(std::move(sr_ort));
        ort_inputs.emplace_back(std::move(h_ort));
        ort_inputs.emplace_back(std::move(c_ort));
        // Infer
        ort_outputs = session->Run(
            Ort::RunOptions{ nullptr },
            input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
            output_node_names.data(), output_node_names.size());
        // Output probability & update h,c recursively
        float output = ort_outputs[0].GetTensorMutableData<float>()[0];
        float* hn = ort_outputs[1].GetTensorMutableData<float>();
        std::memcpy(_h.data(), hn, size_hc * sizeof(float));
        float* cn = ort_outputs[2].GetTensorMutableData<float>();
        std::memcpy(_c.data(), cn, size_hc * sizeof(float));
        // Push forward sample index
        current_sample += window_size_samples;
        // Reset temp_end when > threshold 
        if ((output >= threshold) && (temp_end != 0))
        {
            temp_end = 0;
        }
        // 1) Silence
        if ((output < threshold) && (triggerd == false))
        {
            //printf("{ silence: %.3f s }\n", 1.0 * current_sample / sample_rate);
        }
        // 2) Speaking 
        if ((output >= (threshold - 0.15)) && (triggerd == true))
        {
            //printf("{ speaking_2: %.3f s }\n", 1.0 * current_sample / sample_rate);
        }
        // 3) Start
        if ((output >= threshold) && (triggerd == false))
        {
            triggerd = true;
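            // Step back by one window plus the padding to get the start point, clamping at 0
            // so that speech at the very beginning of the file cannot underflow the unsigned index.
            const unsigned int lead = static_cast<unsigned int>(window_size_samples) + speech_pad_samples;
            speech_start = current_sample > lead ? current_sample - lead : 0;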
            printf("{ start: %.3f s }\n", 1.0 * speech_start / sample_rate);

        }
        // 4) End 
        if ((output < (threshold - 0.15)) && (triggerd == true))
        {
            if (temp_end == 0)
            {
                temp_end = current_sample;
            }
            // a. silence < min_silence_samples, continue speaking
            if ((current_sample - temp_end) < min_silence_samples)
            {
                // printf("{ speaking_4: %.3f s }\n", 1.0 * current_sample / sample_rate);
                // printf("");
            }
            // b. silence >= min_silence_samples, end speaking
            else
            {
                speech_end = temp_end ? temp_end + speech_pad_samples : current_sample + speech_pad_samples;
                temp_end = 0;
                triggerd = false;
                printf("{ end: %.3f s }\n", 1.0 * speech_end / sample_rate);

                if (speech_start < speech_end)
                {
                    speech_start_list.push_back(speech_start);
                    speech_end_list.push_back(speech_end);
                }
            }
        }
    }

private:
    // model config
    int64_t window_size_samples;  // Assign when init, support 256 512 768 for 8k; 512 1024 1536 for 16k.
    int sample_rate;
    int sr_per_ms;  // Assign when init, support 8 or 16
    float threshold;
    int min_silence_samples; // sr_per_ms * #ms
    int speech_pad_samples; // usually a 
    // model states
    bool triggerd = false;
    unsigned int speech_start = 0;
    unsigned int speech_end = 0;
    unsigned int temp_end = 0;
    unsigned int current_sample = 0;
    // MAX 4294967295 samples / 8sample per ms / 1000 / 60 = 8947 minutes  
    float output;
    // Onnx model
    // Inputs
    std::vector<Ort::Value> ort_inputs;
    std::vector<const char*> input_node_names = { "input", "sr", "h", "c" };
    std::vector<float> input;
    std::vector<int64_t> sr;

    unsigned int size_hc = 2 * 1 * 64; // It's FIXED.
    std::vector<float> _h;
    std::vector<float> _c;
    int64_t input_node_dims[2] = {};
    const int64_t sr_node_dims[1] = { 1 };
    const int64_t hc_node_dims[3] = { 2, 1, 64 };
    // Outputs
    std::vector<Ort::Value> ort_outputs;
    std::vector<const char*> output_node_names = { "output", "hn", "cn" };
public:
    std::vector<unsigned int> speech_start_list; // start positions of the speech segments (in samples)
    std::vector<unsigned int> speech_end_list;   // end positions of the speech segments (in samples)
    // Construction
    VadIterator(const std::string ModelPath, int Sample_rate, int frame_size,
        float Threshold, int min_silence_duration_ms, int speech_pad_ms)
    {
        init_onnx_model(ModelPath);
        sample_rate = Sample_rate;
        sr_per_ms = sample_rate / 1000;
        threshold = Threshold;
        min_silence_samples = sr_per_ms * min_silence_duration_ms;
        speech_pad_samples = sr_per_ms * speech_pad_ms;
        window_size_samples = frame_size * sr_per_ms;


        input.resize(window_size_samples);
        input_node_dims[0] = 1;
        input_node_dims[1] = window_size_samples;
        // std::cout << "== Input size" << input.size() << std::endl;
        _h.resize(size_hc);
        _c.resize(size_hc);
        sr.resize(1);
        sr[0] = sample_rate;
    }
};

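int main(int argc, char** argv) // argv[1]: path of the audio file to analyze
{
    if (argc < 2)
    {
        printf("Usage: %s <16 kHz 16-bit wav file>\n", argv[0]);
        return 1;
    }
    printf("Audio file: %s\n", argv[1]);
    // Derive the output folder from the file path by dropping the 4-character extension (".wav")
    std::string wavfilesavepath = argv[1];
    wavfilesavepath = wavfilesavepath.substr(0, wavfilesavepath.length() - 4);
    wavfilesavepath = wavfilesavepath + "\\";
    // Create the output folder
    std::wstring w_wavfilesavepath = std::wstring(wavfilesavepath.begin(), wavfilesavepath.end());
    CreateDirectory(w_wavfilesavepath.c_str(), NULL);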

    std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
    // Read wav
    wav::WavReader wav_reader(argv[1]);
    std::vector<int16_t> data(wav_reader.num_samples());
    std::vector<float> input_wav(wav_reader.num_samples());

    for (int i = 0; i < wav_reader.num_samples(); i++)
    {
        data[i] = static_cast<int16_t>(*(wav_reader.data() + i));
    }

    for (int i = 0; i < wav_reader.num_samples(); i++)
    {
        input_wav[i] = static_cast<float>(data[i]) / 32768;
    }
    // ===== Test configs =====
    std::string path = "silero_vad.onnx";
    int test_sr = 16000;
    int test_frame_ms = 32;
    float test_threshold = 0.2f;
    int test_min_silence_duration_ms = 1100;
    int test_speech_pad_ms = 30;
    //int test_window_samples = test_frame_ms * (int(test_sr / 1000.0)-1);
    int test_window_samples = test_frame_ms * ((test_sr / 1000.0));

    VadIterator vad(path, test_sr, test_frame_ms, test_threshold, test_min_silence_duration_ms, test_speech_pad_ms);

    for (int j = 0; j < wav_reader.num_samples(); j += test_window_samples)
    {
        // std::cout << "== 4" << std::endl;
       
        if (j + test_window_samples < wav_reader.num_samples())
        {
            std::vector<float> r{ &input_wav[0] + j, &input_wav[0] + j + test_window_samples };
            // auto start = std::chrono::high_resolution_clock::now();
            // Predict and print throughout process time
            vad.predict(r);
        }
        // auto end = std::chrono::high_resolution_clock::now();
        // auto elapsed_time = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
        // std::cout << "== Elapsed time: " << 1.0*elapsed_time.count()/1000000 << "ms" << " ==" <<std::endl;
    }

    printf("Number of speech segments: %zu\n", vad.speech_end_list.size());

    FILE* F;
    fopen_s(&F, "run.bat", "w");
    for (size_t i = 0; i < vad.speech_end_list.size(); i++)
    {
        // Write one ffmpeg command per segment to cut it out of the source audio.
        // Writing straight to the file avoids overflowing a fixed-size buffer on long paths.
        fprintf(F, "ffmpeg.exe -i %s -ss %f -to %f -c:a copy %s%04d_%d.wav\n", argv[1], 1.0 * vad.speech_start_list[i] / test_sr, 1.0 * vad.speech_end_list[i] / test_sr, wavfilesavepath.c_str(), (int)i, (int)(vad.speech_start_list[i] * 1000.0 / (float)test_sr));

        printf("start:%f s -> end:%f s\n", 1.0 * vad.speech_start_list[i] / test_sr, 1.0 * vad.speech_end_list[i] / test_sr);
    }
    fclose(F);
    SHELLEXECUTEINFO commend; // command object
    memset(&commend, 0, sizeof(SHELLEXECUTEINFO));
    commend.cbSize = sizeof(SHELLEXECUTEINFO);
    commend.fMask = SEE_MASK_NOCLOSEPROCESS;
    commend.lpVerb = L"";
    commend.lpFile = L"run.bat"; // the batch file to execute
    commend.nShow = SW_SHOWDEFAULT;
    ShellExecuteEx(&commend); // run the command
    WaitForSingleObject(commend.hProcess, INFINITE); // wait for it to finish
    CloseHandle(commend.hProcess); // release the process handle

    std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
    std::cout << "Took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;

}
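
The start/end rules inside predict() are easier to follow without the ONNX plumbing. The sketch below restates them in plain Python under a few assumptions: the per-window probabilities are supplied by the caller (the example values are made up, not model output), the function name vad_segments is hypothetical, and the defaults mirror the test configuration in main (512-sample windows, 1100 ms of silence, 30 ms of padding at 16 kHz).

```python
def vad_segments(probs, window=512, threshold=0.2, min_silence=17600, pad=480):
    """Turn per-window speech probabilities into (start, end) sample ranges.

    Speech starts when the probability reaches `threshold` and ends once it
    has stayed below `threshold - 0.15` for at least `min_silence` samples.
    """
    segments = []
    triggered = False
    temp_end = 0   # sample index where the current candidate silence began, 0 if none
    start = 0
    current = 0    # number of samples consumed so far
    for p in probs:
        current += window
        if p >= threshold and temp_end:
            temp_end = 0  # speech resumed, so cancel the candidate end
        if p >= threshold and not triggered:
            triggered = True
            # Step back one window plus the padding, clamped so it never goes negative
            start = max(current - window - pad, 0)
        elif p < threshold - 0.15 and triggered:
            if not temp_end:
                temp_end = current
            if current - temp_end >= min_silence:
                segments.append((start, temp_end + pad))
                triggered = False
                temp_end = 0
    return segments


# Made-up probabilities: two speech windows, three silent ones, then speech again
probs = [0.0, 0.9, 0.9, 0.0, 0.0, 0.0, 0.9]
print(vad_segments(probs, window=512, min_silence=1024, pad=0))  # [(512, 2048)]
```

As in the C++ loop, a segment that is still open when the input runs out is not emitted, which is why the last window in the example produces no second range.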

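One thing needs special attention here: the official ONNX model, as used in this program, expects input audio sampled at 16000 Hz (16 kHz) with a 16-bit sample width. If your audio is in a different format, convert it first with ffmpeg or some other tool, otherwise the results will be wrong. At first my detections kept coming out wrong, and I only found this out after asking GPT...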
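
A quick format check before running the detector can save some debugging. The sketch below uses Python's standard wave module; the function name check_vad_wav is hypothetical, and the mono requirement is my assumption rather than something stated above.

```python
import wave


def check_vad_wav(path, expected_rate=16000, expected_width_bytes=2):
    """Return a list of format problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {expected_rate}")
        if wf.getsampwidth() != expected_width_bytes:
            problems.append(
                f"sample width is {8 * wf.getsampwidth()} bit, expected {8 * expected_width_bytes}")
        if wf.getnchannels() != 1:
            # Mono is an assumption made for this sketch
            problems.append(f"{wf.getnchannels()} channels, expected 1")
    return problems


if __name__ == "__main__":
    import sys
    for wav_path in sys.argv[1:]:
        print(wav_path, check_vad_wav(wav_path) or "OK")
```

If the check reports problems, a conversion along the lines of `ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav` (file names are placeholders) should produce a file in the expected format.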

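In testing, the C++ implementation of the same voice activity detection as the Python version above takes only 8.6 seconds, more than twice as fast as the ONNX-accelerated Python run (21.04 seconds).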
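
To put the three timings side by side, the arithmetic can be written out as below. The timings are the ones reported in this article; the audio length is only "about 24 minutes", so the real-time factors are approximate.

```python
AUDIO_SECONDS = 24 * 60  # approximate length of the test episode

timings = {  # wall-clock seconds reported in this article
    "Python (PyTorch)": 43.66,
    "Python (ONNX)": 21.04,
    "C++ (ONNX)": 8.6,
}

baseline = timings["Python (PyTorch)"]
for name, seconds in timings.items():
    speedup = baseline / seconds          # relative to the first Python run
    realtime = AUDIO_SECONDS / seconds    # seconds of audio processed per second
    print(f"{name}: {seconds:.2f} s, {speedup:.2f}x vs baseline, ~{realtime:.0f}x real time")
```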

References

https://github.com/snakers4/silero-vad
