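When we use a neural network for speech-to-text, we usually need to pick out the audio segments that actually contain speech before feeding them to the speech-to-text network for recognition; otherwise all sorts of odd problems show up.
In this post we look at a fairly common approach: using the silero-vad voice activity detection (VAD) network, loaded through PyTorch Hub, to mark the parts of the audio that contain dialogue.
The slow version
Without further ado, here is the code. What it does is simple: it runs voice activity detection on an audio file named 1.mp3 and saves each detected speech segment into a folder named 1, with the segment's start timestamp (in milliseconds) in the file name. If 1.mp3 or the folder 1 does not exist at your path, adjust the code accordingly: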
# Speech endpoint detection with silero-vad
import time
from pprint import pprint
import torch
stime = time.time()
SAMPLING_RATE = 16000
torch.set_num_threads(1)
USE_ONNX = False
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', onnx=USE_ONNX)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
wav = read_audio('1.mp3', sampling_rate=SAMPLING_RATE)
# get speech timestamps from full audio file
# min_silence_duration_ms is the minimum length of silence that separates two speech segments
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE, min_silence_duration_ms=1100, threshold=0.2)
# print the number of detected speech segments, then the segments themselves
print(len(speech_timestamps))
pprint(speech_timestamps)
# merge all speech chunks to one audio
# save_audio('only_speech.wav', collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE)
# save all speech chunks to separate files
for i, chunk in enumerate(speech_timestamps):
    # start of the segment in milliseconds (the timestamps are sample offsets)
    startime = int(round(chunk['start'] * 1000 / SAMPLING_RATE))
    i_formatted = "{:0>4}".format(i)
    save_audio(f'1/{i_formatted}_{startime}.wav', collect_chunks([chunk], wav), sampling_rate=SAMPLING_RATE)
etime = time.time()
print("Elapsed time: %.2f s" % (etime - stime))
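Running the code above on one roughly 24-minute episode of Cells at Work! (《工作细胞》) takes 43.66 seconds for voice activity detection (my machine has a 5900X CPU and a 4090 GPU), which is very slow.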
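The file-naming step in the loop above is easy to get wrong, because the timestamps are sample offsets rather than times. Here is a minimal sketch of that conversion on its own. The helper name `segment_filename` and the example timestamps are made up for illustration; only the naming scheme (4-digit index, underscore, start time in milliseconds) comes from the code above.

```python
def segment_filename(index, start_sample, sampling_rate=16000):
    """Build '<4-digit index>_<start in ms>.wav' from a start offset in samples."""
    start_ms = int(round(start_sample * 1000 / sampling_rate))
    return f"{index:04d}_{start_ms}.wav"

# Timestamps in the same shape as the script's speech_timestamps list
# (dicts of sample offsets); the values are invented for this example.
example_timestamps = [{"start": 8000, "end": 24000}, {"start": 48000, "end": 56000}]
names = [segment_filename(i, ts["start"]) for i, ts in enumerate(example_timestamps)]
print(names)  # ['0000_500.wav', '0001_3000.wav']
```

At 16000 samples per second, sample 8000 is 500 ms into the audio, which is why the first file is named 0000_500.wav.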
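ONNX acceleration
To get out of the awkward situation where voice activity detection is slower than the speech recognition itself, we can use the officially provided ONNX model to speed things up. It is simple. First install the ONNX runtime: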
pip install onnxruntime
Then set USE_ONNX to True in the code above and that is it. This change cuts the processing time to 21.04 seconds, roughly twice as fast.
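If the same script has to run on machines with and without the ONNX runtime, the flag can be derived instead of hard-coded. This is only a sketch of one possible approach, not something the silero-vad project prescribes; `onnx_available` is a hypothetical helper name.

```python
import importlib.util

def onnx_available(module_name="onnxruntime"):
    """Return True if the named module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Fall back to the plain PyTorch model when onnxruntime is not installed.
USE_ONNX = onnx_available()
print("USE_ONNX =", USE_ONNX)
```

The resulting USE_ONNX value can then be passed to torch.hub.load exactly as in the script above.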
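C++ version
The Python version pulls in quite a few dependencies and is generally slower than a C++ implementation, so we switch to C++ to push the speed further and to make the program easier to deploy.
It is fairly simple. First download the ONNX Runtime for Windows from here: https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0
After downloading, create a VC++ console project in VS2022 (2019 or earlier should work too) and configure the library. The configuration is much like OpenCV's:
Project → Properties → C/C++ → Additional Include Directories → "unzip the downloaded zip file and put the path of its include folder here"
Project → Properties → Linker → Additional Library Directories → "put the path of the lib folder from the archive here"
Project → Properties → Linker → Input → Additional Dependencies → "onnxruntime.lib;onnxruntime_providers_shared.lib;%(AdditionalDependencies)"
Project → Properties → C/C++ → SDL checks → No (/sdl-)
Then copy the DLLs from the lib folder into the project directory.
With ONNX Runtime configured, take the two files in the silero-vad project's examples/cpp folder as a reference; after some tweaking they do the same job as the Python version. The code is as follows:
The code of wav.h is copied as-is from the file of the same name in examples/cpp, so it is not reproduced here.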
sileroVad.cpp
#include <iostream>
#include <vector>
#include <string>
#include <memory>
#include <sstream>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <chrono>
#include <windows.h>
#include "onnxruntime_cxx_api.h"
#include "wav.h"
#include <io.h>
class VadIterator
{
    // OnnxRuntime resources
    Ort::Env env;
    Ort::SessionOptions session_options;
    std::shared_ptr<Ort::Session> session = nullptr;
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU);
public:
    void init_engine_threads(int inter_threads, int intra_threads)
    {
        // The method should be called in each thread/proc in multi-thread/proc work
        session_options.SetIntraOpNumThreads(intra_threads);
        session_options.SetInterOpNumThreads(inter_threads);
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    }
    void init_onnx_model(const std::string& model_path)
    {
        // Use one inter-op thread and one intra-op thread
        init_engine_threads(1, 1);
        // Load model (this char-by-char widening is only correct for ASCII paths)
        std::wstring w_model_path = std::wstring(model_path.begin(), model_path.end());
        session = std::make_shared<Ort::Session>(env, w_model_path.c_str(), session_options);
    }
    void reset_states()
    {
        // Call reset before each audio start
        std::memset(_h.data(), 0, _h.size() * sizeof(float));
        std::memset(_c.data(), 0, _c.size() * sizeof(float));
        triggerd = false;
        temp_end = 0;
        current_sample = 0;
    }
    // Call it in predict func. if you prefer raw bytes input.
    void bytes_to_float_tensor(const char* pcm_bytes)
    {
        // Read each 16-bit sample from the byte buffer and normalize it to a float
        for (int i = 0; i < window_size_samples; i++)
        {
            int16_t sample;
            std::memcpy(&sample, pcm_bytes + i * sizeof(int16_t), sizeof(int16_t));
            input[i] = static_cast<float>(sample) / 32768;
        }
    }
    void predict(const std::vector<float>& data)
    {
        // bytes_to_float_tensor(data);
        // Create ort tensors
        input.assign(data.begin(), data.end());
        Ort::Value input_ort = Ort::Value::CreateTensor<float>(
            memory_info, input.data(), input.size(), input_node_dims, 2);
        Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
            memory_info, sr.data(), sr.size(), sr_node_dims, 1);
        Ort::Value h_ort = Ort::Value::CreateTensor<float>(
            memory_info, _h.data(), _h.size(), hc_node_dims, 3);
        Ort::Value c_ort = Ort::Value::CreateTensor<float>(
            memory_info, _c.data(), _c.size(), hc_node_dims, 3);
        // Clear and add inputs
        ort_inputs.clear();
        ort_inputs.emplace_back(std::move(input_ort));
        ort_inputs.emplace_back(std::move(sr_ort));
        ort_inputs.emplace_back(std::move(h_ort));
        ort_inputs.emplace_back(std::move(c_ort));
        // Infer
        ort_outputs = session->Run(
            Ort::RunOptions{ nullptr },
            input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
            output_node_names.data(), output_node_names.size());
        // Output probability & update h,c recursively
        float output = ort_outputs[0].GetTensorMutableData<float>()[0];
        float* hn = ort_outputs[1].GetTensorMutableData<float>();
        std::memcpy(_h.data(), hn, size_hc * sizeof(float));
        float* cn = ort_outputs[2].GetTensorMutableData<float>();
        std::memcpy(_c.data(), cn, size_hc * sizeof(float));
        // Push forward sample index
        current_sample += window_size_samples;
        // Reset temp_end when > threshold
        if ((output >= threshold) && (temp_end != 0))
        {
            temp_end = 0;
        }
        // 1) Silence
        if ((output < threshold) && (triggerd == false))
        {
            //printf("{ silence: %.3f s }\n", 1.0 * current_sample / sample_rate);
        }
        // 2) Speaking
        if ((output >= (threshold - 0.15)) && (triggerd == true))
        {
            //printf("{ speaking_2: %.3f s }\n", 1.0 * current_sample / sample_rate);
        }
        // 3) Start
        if ((output >= threshold) && (triggerd == false))
        {
            triggerd = true;
            // Minus window_size_samples to get the precise start point; clamp at 0 so that
            // speech in the very first window does not wrap the unsigned start index.
            const int64_t start = static_cast<int64_t>(current_sample) - window_size_samples - speech_pad_samples;
            speech_start = start > 0 ? static_cast<unsigned int>(start) : 0;
            printf("{ start: %.3f s }\n", 1.0 * speech_start / sample_rate);
        }
        // 4) End
        if ((output < (threshold - 0.15)) && (triggerd == true))
        {
            if (temp_end == 0)
            {
                temp_end = current_sample;
            }
            // a. silence < min_silence_samples, continue speaking
            if ((current_sample - temp_end) < min_silence_samples)
            {
                // printf("{ speaking_4: %.3f s }\n", 1.0 * current_sample / sample_rate);
            }
            // b. silence >= min_silence_samples, end speaking
            else
            {
                speech_end = temp_end ? temp_end + speech_pad_samples : current_sample + speech_pad_samples;
                temp_end = 0;
                triggerd = false;
                printf("{ end: %.3f s }\n", 1.0 * speech_end / sample_rate);
                if (speech_start < speech_end)
                {
                    speech_start_list.push_back(speech_start);
                    speech_end_list.push_back(speech_end);
                }
            }
        }
    }
private:
    // model config
    int64_t window_size_samples; // Assign when init, support 256 512 768 for 8k; 512 1024 1536 for 16k.
    int sample_rate;
    int sr_per_ms; // Assign when init, support 8 or 16
    float threshold;
    int min_silence_samples; // sr_per_ms * #ms
    int speech_pad_samples; // padding around each segment, sr_per_ms * #ms
    // model states
    bool triggerd = false;
    unsigned int speech_start = 0;
    unsigned int speech_end = 0;
    unsigned int temp_end = 0;
    unsigned int current_sample = 0;
    // MAX 4294967295 samples / 8sample per ms / 1000 / 60 = 8947 minutes
    float output;
    // Onnx model
    // Inputs
    std::vector<Ort::Value> ort_inputs;
    std::vector<const char*> input_node_names = { "input", "sr", "h", "c" };
    std::vector<float> input;
    std::vector<int64_t> sr;
    unsigned int size_hc = 2 * 1 * 64; // It's FIXED.
    std::vector<float> _h;
    std::vector<float> _c;
    int64_t input_node_dims[2] = {};
    const int64_t sr_node_dims[1] = { 1 };
    const int64_t hc_node_dims[3] = { 2, 1, 64 };
    // Outputs
    std::vector<Ort::Value> ort_outputs;
    std::vector<const char*> output_node_names = { "output", "hn", "cn" };
public:
    std::vector<unsigned int> speech_start_list; // start of each speech segment, in samples
    std::vector<unsigned int> speech_end_list;   // end of each speech segment, in samples
    // Construction
    VadIterator(const std::string ModelPath, int Sample_rate, int frame_size,
        float Threshold, int min_silence_duration_ms, int speech_pad_ms)
    {
        init_onnx_model(ModelPath);
        sample_rate = Sample_rate;
        sr_per_ms = sample_rate / 1000;
        threshold = Threshold;
        min_silence_samples = sr_per_ms * min_silence_duration_ms;
        speech_pad_samples = sr_per_ms * speech_pad_ms;
        window_size_samples = frame_size * sr_per_ms;
        input.resize(window_size_samples);
        input_node_dims[0] = 1;
        input_node_dims[1] = window_size_samples;
        // std::cout << "== Input size" << input.size() << std::endl;
        _h.resize(size_hc);
        _c.resize(size_hc);
        sr.resize(1);
        sr[0] = sample_rate;
    }
};
int main(int argc, char** argv) // argv[1]: path of the audio file to run detection on
{
    if (argc < 2)
    {
        printf("Usage: %s <audio.wav>\n", argv[0]);
        return 1;
    }
    printf("Input file: %s\n", argv[1]);
    // Derive the output folder from the input path by dropping the 4-character extension (e.g. ".wav")
    std::string wavfilesavepath = argv[1];
    wavfilesavepath = wavfilesavepath.substr(0, wavfilesavepath.length() - 4);
    wavfilesavepath = wavfilesavepath + "\\";
    // Create the output folder
    std::wstring w_wavfilesavepath = std::wstring(wavfilesavepath.begin(), wavfilesavepath.end());
    CreateDirectory(w_wavfilesavepath.c_str(), NULL);
    std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
    // Read wav and normalize the 16-bit samples to floats
    wav::WavReader wav_reader(argv[1]);
    std::vector<float> input_wav(wav_reader.num_samples());
    for (int i = 0; i < wav_reader.num_samples(); i++)
    {
        input_wav[i] = static_cast<float>(static_cast<int16_t>(*(wav_reader.data() + i))) / 32768;
    }
    // ===== Test configs =====
    std::string path = "silero_vad.onnx";
    int test_sr = 16000;
    int test_frame_ms = 32;
    float test_threshold = 0.2f;
    int test_min_silence_duration_ms = 1100;
    int test_speech_pad_ms = 30;
    int test_window_samples = test_frame_ms * (test_sr / 1000);
    VadIterator vad(path, test_sr, test_frame_ms, test_threshold, test_min_silence_duration_ms, test_speech_pad_ms);
    // Feed the audio to the model one window at a time (a trailing partial window is skipped)
    for (int j = 0; j < wav_reader.num_samples(); j += test_window_samples)
    {
        if (j + test_window_samples < wav_reader.num_samples())
        {
            std::vector<float> r{ &input_wav[0] + j, &input_wav[0] + j + test_window_samples };
            vad.predict(r);
        }
    }
    printf("Number of speech segments: %zu\n", vad.speech_end_list.size());
    FILE* F;
    fopen_s(&F, "run.bat", "w");
    for (int i = 0; i < static_cast<int>(vad.speech_end_list.size()); i++)
    {
        // Write one ffmpeg command per segment that cuts it out of the source audio
        char savewavpath[1024];
        sprintf_s(savewavpath, sizeof(savewavpath), "ffmpeg.exe -i %s -ss %f -to %f -c:a copy %s%04d_%d.wav\n", argv[1], 1.0 * vad.speech_start_list[i] / test_sr, 1.0 * vad.speech_end_list[i] / test_sr, wavfilesavepath.c_str(), i, (int)(vad.speech_start_list[i] * 1000.0 / (float)test_sr));
        fprintf(F, "%s", savewavpath);
        printf("start:%f s -> end:%f s\n", 1.0 * vad.speech_start_list[i] / test_sr, 1.0 * vad.speech_end_list[i] / test_sr);
    }
    fclose(F);
    SHELLEXECUTEINFO commend; // describes the command to launch
    memset(&commend, 0, sizeof(SHELLEXECUTEINFO));
    commend.cbSize = sizeof(SHELLEXECUTEINFO);
    commend.fMask = SEE_MASK_NOCLOSEPROCESS;
    commend.lpVerb = L"";
    commend.lpFile = L"run.bat"; // the batch file to execute
    commend.nShow = SW_SHOWDEFAULT;
    ShellExecuteEx(&commend); // run the batch file
    WaitForSingleObject(commend.hProcess, INFINITE); // wait for it to finish
    CloseHandle(commend.hProcess); // release the process handle
    std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
    std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
}
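The start/end decisions inside predict() form a small hysteresis state machine, which is easier to follow without the ONNX plumbing around it. Below is a rough Python sketch of that logic, fed with a plain list of per-window speech probabilities instead of model output. The function name `segment_speech` and the example probabilities are made up for illustration. The 0.15 hysteresis gap and the parameter meanings follow the C++ code, and the start offset is clamped at zero here.

```python
def segment_speech(probs, window=512, threshold=0.2, min_silence=17600, pad=480):
    """Turn per-window speech probabilities into (start, end) sample ranges."""
    segments = []
    triggered = False
    start = temp_end = current = 0
    for p in probs:
        current += window                  # sample index at the end of this window
        if p >= threshold and temp_end:
            temp_end = 0                   # speech resumed, cancel the pending end
        if p >= threshold and not triggered:
            triggered = True               # speech starts in this window
            start = max(0, current - window - pad)
        elif p < threshold - 0.15 and triggered:
            if temp_end == 0:
                temp_end = current         # remember where the silence began
            if current - temp_end >= min_silence:
                segments.append((start, temp_end + pad))
                triggered = False
                temp_end = 0
    return segments

# Two bursts of speech separated by a long enough silence (3 windows >= 1024 samples).
two_bursts = segment_speech([0.0, 0.9, 0.9, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
                            min_silence=1024, pad=0)
# A one-window pause is shorter than min_silence, so it does not split the segment.
short_pause = segment_speech([0.9, 0.0, 0.9, 0.0, 0.0, 0.0], min_silence=1024, pad=0)
print(two_bursts)   # [(512, 2048), (3072, 4096)]
print(short_pause)  # [(0, 2048)]
```

The default arguments correspond to the settings in main(): a 32 ms window at 16 kHz is 512 samples, 1100 ms of silence is 17600 samples, and 30 ms of padding is 480 samples. As in the C++ code, a segment that is still open when the audio ends is not emitted.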
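One thing to watch out for: the official ONNX model expects audio sampled at 16000 Hz (16 kHz) with 16-bit samples. If your audio is in a different format, convert it first with ffmpeg or another tool, otherwise the results will be wrong. My detections were off at first, and I only found out why after asking GPT...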
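Since a wrong input format fails silently, it may be worth checking the file before running detection. Here is a minimal sketch using Python's standard wave module, which only handles WAV files. The helper names `is_vad_ready`, `ffmpeg_convert_args` and `write_silence` are made up for this example, and the mono flag in the ffmpeg arguments is an assumption, since the text above only specifies the sample rate and bit depth.

```python
import os
import tempfile
import wave

def is_vad_ready(path, rate=16000, sample_width_bytes=2):
    """Return True if the WAV file is sampled at `rate` Hz with 16-bit samples."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == rate and wf.getsampwidth() == sample_width_bytes

def ffmpeg_convert_args(src, dst, rate=16000):
    """Build (but do not run) an ffmpeg command that writes 16-bit PCM at `rate` Hz."""
    # "-ac 1" (mono) is an assumption; the article only states the rate and bit depth.
    return ["ffmpeg", "-i", src, "-ar", str(rate), "-ac", "1", "-acodec", "pcm_s16le", dst]

def write_silence(path, rate):
    """Write 10 ms of 16-bit mono silence; used only for the demo below."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * (rate // 100))

# Demo: one file in the expected format, one at 44.1 kHz.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000)
    write_silence(bad, 44100)
    results = (is_vad_ready(good), is_vad_ready(bad))

print(results)  # (True, False)
print(" ".join(ffmpeg_convert_args("input.mp3", "input_16k.wav")))
```

The argument list can be handed to subprocess.run when a file fails the check; it is only printed here so that the example runs without ffmpeg installed.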
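In my tests, the C++ implementation of the same voice activity detection takes only 8.6 seconds. That is more than twice as fast as the ONNX-accelerated Python version (21.04 seconds) and about five times as fast as the original PyTorch version (43.66 seconds).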
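To put the three measurements side by side, here is a small sketch that computes each speedup relative to the first run. The timings are the ones reported in this post; the labels are just names chosen for this example.

```python
# Wall-clock times reported above, in seconds, for the same 24-minute episode.
timings_s = {
    "PyTorch (Python)": 43.66,
    "ONNX (Python)": 21.04,
    "ONNX (C++)": 8.6,
}

baseline = timings_s["PyTorch (Python)"]
speedups = {name: round(baseline / t, 2) for name, t in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {timings_s[name]:.2f} s, {factor:.2f}x")
```

Keep in mind that these are single runs on one machine, so the ratios are indicative rather than precise.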