VAD Algorithm (Voice Activity Detection)

The VAD algorithm works as follows:

  1. The input audio is resampled so that the processed audio has a sample rate of 16000 Hz.
  2. The converted samples are batched into "frames" of frameSamples samples each.
  3. The Silero VAD model is run on each frame and produces a number between 0 and 1 indicating the probability that the frame contains speech.
  4. When the algorithm has not detected speech recently, it is in the not-speaking state. Once it encounters a frame with speech probability greater than positiveSpeechThreshold, it switches to the speaking state. While speaking, it counts frames with speech probability less than negativeSpeechThreshold, and any frame with speech probability greater than positiveSpeechThreshold resets that count. When the count reaches redemptionFrames, the speech audio segment is considered to have ended and the algorithm returns to the not-speaking state. Frames with speech probability between negativeSpeechThreshold and positiveSpeechThreshold are effectively ignored.
  5. When the algorithm detects the end of a speech audio segment (i.e. goes from the state of speaking to not speaking), it counts the number of frames with speech probability greater than positiveSpeechThreshold in the audio segment. If the count is less than minSpeechFrames, then the audio segment is considered a false positive. Otherwise, preSpeechPadFrames frames are prepended to the audio segment and the segment is made accessible through the higher-level API.
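
Steps 4 and 5 can be read as a small state machine over the per-frame probabilities. The following is a minimal sketch in Python, not the library's actual implementation: the names VadOptions and segment_speech are hypothetical, the option values are illustrative rather than documented defaults, and reporting segments as (start, end) frame indices with an exclusive end is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VadOptions:
    # Illustrative values only; they are NOT the library's documented defaults.
    positiveSpeechThreshold: float = 0.5
    negativeSpeechThreshold: float = 0.35
    redemptionFrames: int = 3
    preSpeechPadFrames: int = 1
    minSpeechFrames: int = 2


def segment_speech(probs: List[float], opts: VadOptions) -> List[Tuple[int, int]]:
    """Return accepted segments as (start, end) frame indices, end exclusive."""
    segments: List[Tuple[int, int]] = []
    speaking = False
    start = 0            # index of the frame that triggered the speaking state
    positive_count = 0   # frames above positiveSpeechThreshold in this segment
    redemption = 0       # speech-negative frames since the last positive frame

    for i, p in enumerate(probs):
        if not speaking:
            if p > opts.positiveSpeechThreshold:
                speaking = True
                start = i
                positive_count = 1
                redemption = 0
            continue

        if p > opts.positiveSpeechThreshold:
            positive_count += 1
            redemption = 0  # a positive frame resets the redemption count
        elif p < opts.negativeSpeechThreshold:
            redemption += 1
            if redemption >= opts.redemptionFrames:
                # End of segment: keep it only if it has enough positive frames.
                if positive_count >= opts.minSpeechFrames:
                    padded_start = max(0, start - opts.preSpeechPadFrames)
                    segments.append((padded_start, i + 1))
                speaking = False
        # Probabilities between the two thresholds change nothing.

    return segments


probs = [0.1, 0.2, 0.9, 0.8, 0.4, 0.9, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1]
print(segment_speech(probs, VadOptions()))  # prints [(1, 9)]
```

In this example the first burst has three speech-positive frames and is kept, with one pad frame prepended. The second burst has only one speech-positive frame, which is below minSpeechFrames, so it is discarded as a false positive.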

Configuration

All of the main APIs accept certain common configuration parameters that modify the VAD algorithm.

  • positiveSpeechThreshold: number - determines the threshold over which a probability is considered to indicate the presence of speech.
  • negativeSpeechThreshold: number - determines the threshold under which a probability is considered to indicate the absence of speech.
  • redemptionFrames: number - number of speech-negative frames to wait before ending a speech segment.
  • frameSamples: number - the size of a frame in samples - 1536 by default and probably should not be changed.
  • preSpeechPadFrames: number - number of audio frames to prepend to a speech segment.
  • minSpeechFrames: number - minimum number of speech-positive frames for a speech segment.
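
Because the frame-based parameters are counted in frames, it can help to convert them to milliseconds. The sketch below does only this arithmetic, using the 16000 Hz sample rate and the default frameSamples of 1536 stated above. The helper names are hypothetical, and the redemptionFrames value of 8 is an arbitrary example, not a documented default.

```python
SAMPLE_RATE_HZ = 16000  # sample rate after conversion, per the algorithm description
FRAME_SAMPLES = 1536    # default frameSamples


def frame_duration_ms(frame_samples: int = FRAME_SAMPLES,
                      sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz


def frames_to_ms(n_frames: int,
                 frame_samples: int = FRAME_SAMPLES,
                 sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of n_frames consecutive frames in milliseconds."""
    return n_frames * frame_duration_ms(frame_samples, sample_rate_hz)


print(frame_duration_ms())  # prints 96.0
print(frames_to_ms(8))      # prints 768.0 (hypothetical redemptionFrames = 8)
```

With the default frame size, each frame covers 96 ms of audio, so a redemptionFrames value of 8 would correspond to 768 ms of speech-negative frames before a segment ends.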