The SoX of Silence

This article walks through how the SoX silence filter works and how to use it, including how its parameters can remove or keep silent passages of different lengths, for audio post-production as well as live recording.

SoX is, by their own definition, the Swiss Army knife of audio manipulation.

And no doubt it’s full of fun with slicing and dicing and playback and recording and filtering and effects capabilities.

But SoX is a command line tool, which means obscure syntax and parameters in order to get things done.

I’ve been trying off and on for months to understand the silence filter in SoX, which lets you remove silence from the beginning, middle, or end of the audio. Sounds simple, doesn’t it? Well, it should be.

Below is the man page for the silence filter:

silence [-l] above-periods [duration threshold[d|%] [below-periods duration threshold[d|%]]]

Removes silence from the beginning, middle, or end of the audio. Silence is anything below a specified threshold.

The above-periods value is used to indicate if audio should be trimmed at the beginning of the audio. A value of zero indicates no silence should be trimmed from the beginning. When specifying a non-zero above-periods, it trims audio up until it finds non-silence. Normally, when trimming silence from the beginning of audio the above-periods will be 1, but it can be increased to higher values to trim all audio up to a specific count of non-silence periods. For example, if you had an audio file with two songs that each contained 2 seconds of silence before the song, you could specify an above-periods of 2 to strip out both silence periods and the first song.

When above-periods is non-zero, you must also specify a duration and threshold. Duration indicates the amount of time that non-silence must be detected before it stops trimming audio. By increasing the duration, bursts of noise can be treated as silence and trimmed off.

Threshold is used to indicate what sample value you should treat as silence. For digital audio, a value of 0 may be fine but for audio recorded from analog, you may wish to increase the value to account for background noise.

When optionally trimming silence from the end of the audio, you specify a below-periods count. In this case, below-periods means to remove all audio after silence is detected. Normally, this will be a value of 1, but it can be increased to skip over periods of silence that are wanted. For example, if you have a song with 2 seconds of silence in the middle and 2 seconds at the end, you could set below-periods to a value of 2 to skip over the silence in the middle of the audio.

For below-periods, duration specifies a period of silence that must exist before audio is not copied any more. By specifying a higher duration, silence that is wanted can be left in the audio. For example, if you have a song with an expected 1 second of silence in the middle and 2 seconds of silence at the end, a duration of 2 seconds could be used to skip over the middle silence.

Unfortunately, you must know the length of the silence at the end of your audio file to trim off silence reliably. A work around is to use the silence effect in combination with the reverse effect. By first reversing the audio, you can use the above-periods to reliably trim all audio from what looks like the front of the file. Then reverse the file again to get back to normal.

To remove silence from the middle of a file, specify a below-periods that is negative. This value is then treated as a positive value and is also used to indicate the effect should restart processing as specified by the above-periods, making it suitable for removing periods of silence in the middle of the audio.

The option -l indicates that below-periods duration length of audio should be left intact at the beginning of each period of silence. For example, if you want to remove long pauses between words but do not want to remove the pauses completely.

The period counts are in units of samples. Duration counts may be in the format of hh:mm:ss.frac, or the exact count of samples. Threshold numbers may be suffixed with d to indicate the value is in decibels, or % to indicate a percentage of maximum value of the sample value (0% specifies pure digital silence).

The following example shows how this effect can be used to start a recording that does not contain the delay at the start which usually occurs between ‘pressing the record button’ and the start of the performance:

rec parameters filename other-effects silence 1 5 2%

Huh?

So let’s try to clarify some of the mess from the man page. First, a couple of important notes:

  • When specifying duration, use a trailing zero for whole numbers of seconds (i.e., 1.0 instead of 1 to specify 1 second). If you don’t, SoX assumes you’re specifying a number of samples. Who on earth would want to specify samples instead of seconds? You got me. Alternatively, you can specify durations of time in the format hh:mm:ss.frac. (There’s a quick sketch of the difference just after this list.)
  • Use 0.1% at a minimum for the audio threshold. Even though 0% is supposed to be pure digital silence, with my test file I couldn’t get silence to trim unless I used a threshold larger than 0%. If you’d like, you can specify the threshold in decibels using d (such as -96d or -55d).
  • The realistic values for the above-period parameter are 0 and 1, and values for the below-period parameter are pretty much just -1 and 1. The documentation states that values larger than 1 can be used, but that only really makes sense for files with consistent audio breaks. Just trust me, it’s weird. I’ll get into what those values actually mean in the examples.

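To make those duration and threshold formats concrete, here is a minimal sketch. The filenames are placeholders and the exact numbers are illustrative, not taken from the examples below:

sox in.wav out.wav silence 1 1 0.1%            # bare "1" is read as 1 sample, almost certainly not what you want
sox in.wav out.wav silence 1 1.0 0.1%          # "1.0" means one second of non-silence
sox in.wav out.wav silence 1 00:00:01.0 0.1%   # the same second written in the hh:mm:ss.frac form
sox in.wav out.wav silence 1 1.0 -55d          # threshold given in decibels instead of a percentage
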
Now onto some examples! I’ll be showing you visually what happens to a sound file when we apply the various parameters to the silence filter.

I generated a test sound file with 60 seconds of white noise and then silenced various parts of the clip, leaving me with an audio file that looks like this:

SoX Silence Example (Original File)


Example 1: Trimming silence at the beginning

sox in.wav out1.wav silence 1 0.1 1%

The above-period parameter comes first after the silence keyword, and for the sake of this article, it should be set to 1 if you want to use the filter. This example roughly translates to: trim silence (anything less than 1% volume) until we encounter sound lasting more than 0.1 seconds in duration. The output of this command produces the following:

SoX Silence Example 1 (waveform after: sox in.wav out1.wav silence 1 0.1 1%)

We’ve lopped off the silence at the beginning of the clip. For simplicity’s sake, we’ll refer to anything below the 1% threshold as silence from now on.


Example 2: Ignoring noise bursts

sox in.wav out2.wav silence 1 0.3 1%

By changing the duration parameter to 0.3, we tell SoX to ignore the burst of noise at the beginning of the example clip. This produces the following:

SoX Silence Example 2 (waveform after: sox in.wav out2.wav silence 1 0.3 1%)

We can ignore short pops and clicks in audio by adjusting this duration parameter.


Example 3: Stopping recording when no sound detected

sox in.wav out3.wav silence 1 0.3 1% 1 0.3 1%

Now we introduce the below-period parameter and its respective sub-parameters. Just like the above-period parameter, set it to 1 and call it good. The command above translates to: trim silence until we detect at least 0.3 seconds of noise, and then trim everything after we detect at least 0.3 seconds of silence.

SoX Silence Example 3 (waveform after: sox in.wav out3.wav silence 1 0.3 1% 1 0.3 1%)

This returns a file with just the first 4 seconds of noise (note that we ignore the 0.25-second burst of noise at the beginning). Where’s the rest of the clip? Well, it’s gone. Not super practical for post-production of audio, but it can be useful when recording live audio, so that SoX stops when it doesn’t encounter sound for a certain number of seconds.

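As a rough sketch of that live-recording use (recording.wav and the exact durations here are placeholders, not values from this article), a command along these lines should stop capturing after about three seconds of silence:

rec recording.wav silence 1 0.3 1% 1 3.0 1%
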
So, an aside: if you’re looking to trim silence from the beginning and the end of an audio file, you’ll need to use the reverse filter and a temp file like so:

sox in.wav temp.wav silence 1 0.1 1% reverse
sox temp.wav out.wav silence 1 0.1 1% reverse

Don’t forget to delete that temp.wav file when you’re done.

Jakob points out in the comments that you can trim silence from both ends in one fell swoop by chaining the effects like so:

sox in.wav out.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse

Example 4: Trimming all silence

sox in.wav out4.wav silence 1 0.1 1% -1 0.1 1%

By changing the below-period parameter to -1, we can trim instances of silence in the middle of the clip, by allowing the filter to restart after it detects noise of the specified duration.

SoX Silence Example 4 (waveform after: sox in.wav out4.wav silence 1 0.1 1% -1 0.1 1%)

In my example clip, it’s impossible to detect where the silence used to be, but with an actual podcast or other audio, it should be easier to tell.


Example 5: Ignoring short periods of silence

sox in.wav out5.wav silence 1 0.1 1% -1 0.5 1%

In similar fashion to Example 2, we can instruct SoX to ignore short moments of silence (half a second in this example).

SoX Silence Example 5 (waveform after: sox in.wav out5.wav silence 1 0.1 1% -1 0.5 1%)

When trimming silence from podcasts and the like, this keeps you from removing the moments when someone stops to take a breath, which would make the conversation sound too rushed.


Example 6: Shortening long periods of silence

sox in.wav out6.wav silence -l 1 0.1 1% -1 2.0 1%

So what if you wanted to just shorten long moments of silence rather than remove them entirely? Well, you need to add the -l parameter, and it needs to be placed first, before the other parameters for the filter effect. The example above trims all silence longer than 2 seconds down to only 2 seconds long.

SoX Silence Example 6 (waveform after: sox in.wav out6.wav silence -l 1 0.1 1% -1 2.0 1%)

Note that SoX does nothing to bits of silence shorter than 2 seconds.


Example 7: Shortening long periods of silence and ignoring noise bursts

sox in.wav out7.wav silence -l 1 0.3 1% -1 2.0 1%

Finally, let’s tie it all together by trimming silence longer than 2 seconds down to 2 seconds long, but ignore noise such as pops and clicks amidst the moments of silence.

SoX Silence Example 7 (waveform after: sox in.wav out7.wav silence -l 1 0.3 1% -1 2.0 1%)

As a result you’ll see that we’ve cropped out the 0.25 seconds of noise at the beginning of the clip, but left the 0.5 seconds of noise in the middle.

For actual usage, you’ll probably want to specify something shorter than 0.3 seconds for the duration if you’re just trying to filter out pops and clicks.
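
For instance, a hedged variant of Example 7 (out7b.wav and the 0.05-second duration are hypothetical, not from the original examples) that only treats very short blips as ignorable noise might look like this:

sox in.wav out7b.wav silence -l 1 0.05 1% -1 2.0 1%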


Bonus Example 8: Splitting audio based on silence

sox in.wav out.wav silence 1 0.5 1% 1 5.0 1% : newfile : restart

Using SoX’s newfile pseudo-effect allows us to split an audio file based on periods of silence, and then calling restart starts the effects chain over from the beginning. In this example, SoX will split the audio when it detects 5 or more seconds of silence. You’ll end up with output files named out001.wav, out002.wav, and so on.
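
If you later want to stitch the pieces back together, SoX concatenates multiple input files by default, so something along these lines should work (rejoined.wav is a hypothetical name, and the out0*.wav glob assumes the default numbering shown above):

sox out0*.wav rejoined.wav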


Final Thoughts

There you have it. This is what I know about the silence filter effect in SoX. Example 7, where we trim some but not all of the silence and ignore pops and clicks, is ultimately what I was trying to figure out when writing this article, but I figure the other examples have got to be a good reference for somebody besides me.

The above- and below-period values are still mostly a mystery to me. I may address them in another post, but for now, I’m just going to use this as a cheat sheet in case I forget.

And don’t forget to use the trailing zero when specifying whole seconds. Even while writing this I forgot multiple times.

I welcome thoughts, ideas, comments, and corrections. Please.

(edit 11/14/10 to add names to each of the examples for clarification)
(edit 04/28/11 to add audio splitting example)
(edit 12/06/12 to add one line silence trimming) 

