Speech Analytics Part 1: Basics of Speech Analytics

So what is a sound wave?

Speech signals are sound signals, defined as pressure variations travelling through the air. These pressure variations can be described as waves, which is why they are called sound waves.

A sound wave can be described by five characteristics: wavelength, amplitude, time period, frequency, and velocity (speed).

  1. Wavelength: The minimum distance over which a sound wave repeats itself, that is, the length of one complete wave. It is denoted by the Greek letter λ.

  2. Amplitude: When a wave passes through a medium, the particles of the medium are temporarily displaced from their original undisturbed positions. The maximum displacement of the particles from those positions is called the amplitude of the wave.

  3. Time period: The time required to produce one complete wave (one cycle) is called the time period of the wave.

  4. Frequency: The number of complete waves (cycles) produced in one second is called the frequency of the wave. The SI unit of frequency is the hertz (Hz).

  5. Velocity: The distance travelled by a wave in one second is called the velocity (or speed) of the wave.
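These characteristics are linked: the time period is the reciprocal of the frequency (T = 1/f), and the speed is the product of frequency and wavelength (v = fλ). The sketch below works through one example. The 440 Hz tone and the 343 m/s speed of sound (roughly air at room temperature) are assumed values chosen for illustration, not figures from the text above.

```python
def time_period(frequency_hz):
    """Time for one complete cycle: T = 1 / f."""
    return 1.0 / frequency_hz


def wavelength(speed_m_per_s, frequency_hz):
    """Length of one complete wave: lambda = v / f."""
    return speed_m_per_s / frequency_hz


# Assumed example values (not from the article).
frequency_hz = 440.0
speed_m_per_s = 343.0

period_s = time_period(frequency_hz)
lambda_m = wavelength(speed_m_per_s, frequency_hz)

print(f"time period: {period_s * 1000:.3f} ms")
print(f"wavelength:  {lambda_m:.3f} m")
```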

What is windowing and sampling in sound data?

Sampling is the process of converting an analog signal into a digital signal by taking a fixed number of samples per second from the analog signal. In other words, we convert a continuous audio signal into a discrete one so that it can be stored in memory and processed efficiently.

[Figure: sampling an analog signal]

The key takeaway from the figure above is that we can reconstruct a very similar audio wave even after sampling the analog signal, provided the sampling rate is high enough (at least twice the highest frequency present in the signal). The sampling rate, or sampling frequency, is defined as the number of samples taken per second.
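As a minimal sketch of sampling, the snippet below evaluates a sine wave only at the instants t = n / sample_rate. The function name `sample_sine` and the 5 Hz / 100 Hz values are hypothetical choices made for this illustration.

```python
import math


def sample_sine(frequency_hz, sample_rate_hz, duration_s):
    """Return samples of sin(2*pi*f*t) taken at t = n / sample_rate_hz."""
    num_samples = int(round(sample_rate_hz * duration_s))
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]


# One second of a 5 Hz tone sampled 100 times per second.
samples = sample_sine(frequency_hz=5.0, sample_rate_hz=100.0, duration_s=1.0)
print(len(samples))
```

Here the sampling rate (100 Hz) is well above twice the tone frequency (5 Hz), so the samples trace out the original wave closely.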

Windowing is a classical method in signal processing: it refers to splitting the input signal into temporal segments. Cutting the signal abruptly makes the segment borders appear as discontinuities that are not present in the real-world signal, which is why each segment is usually multiplied by a window function that tapers smoothly towards zero at its edges.


A spoken sentence is a sequence of phonemes, so speech signals are time-variant in character. To extract information from a signal, we must therefore split it into segments short enough that each segment contains only one phoneme.
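The segmentation described above can be sketched as follows. This is an illustrative example under stated assumptions: the Hann window, the frame length of 20 samples and the hop of 10 samples are choices made here, and `split_into_frames` is a hypothetical helper name.

```python
import math


def hann_window(length):
    """Hann window: tapers smoothly to zero at both ends of the frame."""
    return [
        0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def split_into_frames(signal, frame_length, hop_length):
    """Split a signal into overlapping frames and window each frame."""
    window = hann_window(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        segment = signal[start:start + frame_length]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# A 5 Hz tone sampled at 100 Hz, cut into frames of 20 samples
# that overlap by half (hop of 10 samples).
signal = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
frames = split_into_frames(signal, frame_length=20, hop_length=10)
print(len(frames), len(frames[0]))
```

Overlapping the frames means that samples attenuated near the edge of one window sit near the centre of a neighbouring one.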

The next step is to extract information from this waveform. This is done by converting the speech waveform into a parametric representation with a lower data rate, which is then used for subsequent processing and analysis. This stage is usually called front-end signal processing.

Feature extraction and visualizing a sound wave

  1. Time domain: The audio signal is represented by its amplitude as a function of time. In simple words, it is a plot of amplitude against time.

  2. Frequency domain: The audio signal is represented by its amplitude as a function of frequency. Simply put, it is a plot of amplitude against frequency.

  3. STFT and spectrogram

Pitch refers to our perception of the frequency of a tonal sound. The Fourier spectrum of a signal reveals such frequency content. The discrete Fourier transform maps a length-N signal x_n into a complex-valued frequency-domain representation X_k of N coefficients as

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, \dots, N-1.$$

However, since spectra are complex-valued vectors, they are difficult to visualize directly; in practice we plot their magnitude instead.
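To make the transform concrete, here is a minimal sketch that evaluates the standard DFT sum, X_k = Σ x_n · exp(−2πi·k·n/N), directly. It is an O(N²) illustration, not how production libraries compute it (they use the FFT), and the test signal of 4 cycles over 64 samples is an assumption made for the example.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# A sine wave that completes exactly 4 cycles in 64 samples.
N = 64
x = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]

# The coefficients are complex, so we look at their magnitudes.
magnitudes = [abs(X_k) for X_k in dft(x)]

# Search only the first half: for a real signal the second half mirrors it.
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
print(peak_bin)
```

The peak lands in the bin whose index equals the number of cycles in the analysed segment, which is how the spectrum reveals the frequency content.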

By windowing the signal and taking the discrete Fourier transform (DFT) of each window, we obtain the short-time Fourier transform (STFT) of the signal.

Like the spectrum, the output of the STFT is complex-valued, but where the spectrum is a vector, the STFT output is a matrix with one spectrum per window. As a consequence, we cannot visualize the complex-valued output directly. Instead, STFTs are usually visualized using their log-spectra, 20 log10 |X(h,k)|, where h indexes the window and k the frequency bin. Such two-dimensional log-spectra can then be displayed as a heat map known as a spectrogram.
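The following sketch puts the pieces together: window the signal, take the DFT of each window, and convert magnitudes to decibels with 20·log10|X|. It is a plain-Python illustration under assumed parameters (Hann window, 64-sample frames, hop of 32, an 800 Hz sampling rate); the function names `stft` and `log_spectrogram` are local helpers defined here, not library calls.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT, sufficient for short frames."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def stft(signal, frame_length, hop_length):
    """Return one list of complex DFT coefficients per windowed frame."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_length - 1))
        for n in range(frame_length)
    ]
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        frames.append(dft(frame))
    return frames


def log_spectrogram(stft_frames, floor=1e-10):
    """20*log10|X(h,k)| for the non-negative frequency bins of each frame."""
    return [
        [20 * math.log10(max(abs(c), floor)) for c in frame[:len(frame) // 2 + 1]]
        for frame in stft_frames
    ]


# Test signal: 100 Hz for the first half, 200 Hz for the second half.
sample_rate = 800
signal = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]
signal += [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

spec = log_spectrogram(stft(signal, frame_length=64, hop_length=32))

# Each bin is sample_rate / frame_length = 12.5 Hz wide.
first_peak = max(range(len(spec[0])), key=spec[0].__getitem__)
last_peak = max(range(len(spec[-1])), key=spec[-1].__getitem__)
print(first_peak * 12.5, last_peak * 12.5)
```

Each row of `spec` is one column of the spectrogram heat map; the peak moving from one bin to another between the first and last frame is the change a single spectrum of the whole signal would not localise in time.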

[Figure: Sound segment]

[Figure: Magnitude of the DFT of the speech segment, converted to power]

[Figure: Log-spectrum of the speech segment]

Spectrogram: a 2D plot of frequency against time, where the color intensity of each point represents the amplitude of a particular frequency at a particular time. In simple terms, the spectrogram shows how the spectrum of frequencies varies with time.

[Figure: Spectrogram]

  4. Cepstrum, mel spectrogram and MFCC

We now see that the log-spectrum has plenty of structure. It is a more or less continuous signal, owing in large part to the smoothing effect of windowing. It also has a periodic structure, which corresponds to the harmonic structure of the signal caused by the fundamental frequency.

Because the log-spectrum itself behaves like a signal with periodic structure, we can analyze it with a further transform. Specifically, we can take the discrete Fourier transform (DFT) or the discrete cosine transform (DCT) of the log-spectrum to obtain a representation known as the cepstrum.

It is worth repeating that the cepstrum involves two time-frequency transforms. The cepstrum of a time signal is therefore in some sense similar to the time domain. The x-axis of a cepstrum is known as the quefrency axis and is typically expressed in seconds.

Another useful piece of information in the cepstrum is the harmonic structure of the log-spectrum. Recall that the fundamental frequency is visible as a comb structure in the log-spectrum; in the cepstrum, that comb shows up as a peak at the quefrency equal to the pitch period.
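A minimal sketch of the idea, assuming a synthetic frame rather than real speech: an impulse train with a period of 32 samples has a comb-shaped log-spectrum, and the cepstrum (here computed as the real part of the inverse DFT of the log-magnitude spectrum) peaks at quefrency 32. The search range of 20 to 50 samples stands in for a "plausible pitch period" range and is an assumption of this example, as are the helper names.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(N^2) DFT; with inverse=True, the inverse DFT (scaled by 1/N)."""
    N = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
    return [v / N for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-6):
    """Inverse DFT of the log-magnitude spectrum (real part only)."""
    log_spectrum = [math.log(abs(X_k) + floor) for X_k in dft(frame)]
    return [c.real for c in dft(log_spectrum, inverse=True)]


# Synthetic "voiced" frame: one impulse every 32 samples.
N = 256
period = 32
frame = [1.0 if n % period == 0 else 0.0 for n in range(N)]

cepstrum = real_cepstrum(frame)

# Look for the strongest peak inside an assumed pitch-period range.
lo, hi = 20, 50
peak_quefrency = max(range(lo, hi), key=lambda q: cepstrum[q])
print(peak_quefrency)
```

The quefrency here is counted in samples; dividing by the sampling rate converts it to seconds, and the reciprocal of that value is the estimated fundamental frequency.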

[Figure: Sound wave -> Spectrum -> Cepstrum]

Importantly, the log-spectrum also has a macro-level structure: by connecting the peaks of the harmonic structure, we see that the signal forms peaks and valleys, which correspond to the resonances of the vocal tract. These peaks are known as formants, and they can be used to identify vowels.

To further improve on the cepstral representation, we can concentrate on the information that matters for human perception. This can be done by warping the frequency axis with a perceptual scale such as the equivalent rectangular bandwidth (ERB) scale, the Bark scale, or the mel scale.
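As an illustration of such a scale, the sketch below converts between hertz and mel using one widely used formula, mel = 2595 · log10(1 + f / 700). Several variants of the mel scale exist and the article does not say which one it has in mind, so treat this particular formula as an assumption.

```python
import math


def hz_to_mel(frequency_hz):
    """One common mel-scale formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz -> {hz_to_mel(f):8.1f} mel")
```

The scale is close to linear at low frequencies and compresses high frequencies, reflecting that listeners resolve pitch differences more finely at the low end.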

The mel spectrogram is given by:

[Figure: mel spectrogram formula]

MFCC: Finally, by taking the discrete cosine transform (DCT) of the (log) mel-spectrum parameters, we obtain the representation known as mel-frequency cepstral coefficients (MFCCs). The benefit of the DCT at the end is that it approximately decorrelates the parameters, so that the MFCCs are largely uncorrelated with each other.
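The final DCT step can be sketched as below. This uses the unnormalised DCT-II; libraries differ in scaling conventions, and the input values are hypothetical log mel-band energies made up for the example, not data from the article.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II: C_k = sum_n v_n * cos(pi * k * (n + 0.5) / N)."""
    N = len(values)
    return [
        sum(v * math.cos(math.pi * k * (n + 0.5) / N) for n, v in enumerate(values))
        for k in range(N)
    ]


# Hypothetical log mel-band energies: a perfectly flat spectrum.
flat_log_mel = [2.0, 2.0, 2.0, 2.0]
coeffs = dct_ii(flat_log_mel)
print([round(c, 6) for c in coeffs])
```

For a flat input, all the energy ends up in coefficient 0 and the rest vanish: low-order coefficients capture the overall level and smooth envelope, while higher-order ones capture finer detail.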

Other feature extraction techniques: LPC, LPCC, PLP

(You can read more about them here.)

What are deltas?

We describe speech as a sequence of phonemes. So how do we find where one phoneme ends and the next begins in a sound wave? At such transitions the signal features change quickly. A common method for extracting information about these transitions is to take the first difference of the signal features, known as the delta of a feature. Specifically, for a feature f_k at time instant k, the corresponding delta is defined as

$$\Delta_k = f_k - f_{k-1}.$$
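A short sketch of the first difference, delta_k = f_k − f_{k−1}, applied to a feature track. The numbers are hypothetical per-frame values chosen so that the change in the middle stands out.

```python
def deltas(features):
    """First difference of a feature track: delta_k = f_k - f_{k-1}."""
    return [features[k] - features[k - 1] for k in range(1, len(features))]


# Hypothetical feature values, one per frame.
track = [1.0, 1.0, 1.5, 3.0, 3.0]
print(deltas(track))  # [0.0, 0.5, 1.5, 0.0]
```

The delta is zero where the feature is steady and large where it changes, which is why deltas are useful for locating transitions.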

Once we have extracted features from the wave, it is important to clean the signal.

Transforming and pre-processing a speech wave

  1. Filters: As the name suggests, a filter passes only the frequencies above or below a cutoff frequency.

     Low-pass: A low-pass filter passes frequencies below its cutoff frequency and progressively attenuates frequencies above it.

     High-pass: A high-pass filter does the opposite, passing frequencies above the cutoff frequency and progressively attenuating frequencies below it.

  2. Masking: In acoustics, masking occurs when an unwanted sound (noise) reduces or eliminates the perception of another sound. As a transform, applying a mask to a spectrogram along the time axis is called time masking, and applying it along the frequency axis is called frequency masking.

  3. TimeStretch: Stretches a spectrogram in time by a given rate without modifying the pitch.

  4. Amplification / gain: Amplifies or attenuates the whole waveform. In short, it changes the loudness of the sound wave.

  5. Dither: Increases the perceived dynamic range of audio stored at a particular bit depth.

  6. Equalize: Used for normalising the sound wave within a running window frame. It helps to remove noise from the sound.
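Two of the transforms in the list above, filtering and gain, can be sketched in a few lines. The one-pole smoothing filter used here is a deliberately simple stand-in chosen for illustration, not the filter any particular audio library applies; the smoothing factor `alpha`, the 5 Hz and 400 Hz test tones, and the helper names are all assumptions of this example.

```python
import math


def low_pass(signal, alpha):
    """One-pole low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, prev = [], 0.0
    for x in signal:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def high_pass(signal, alpha):
    """Complementary high-pass: the input minus its low-pass component."""
    return [x - y for x, y in zip(signal, low_pass(signal, alpha))]


def apply_gain(signal, gain_db):
    """Scale the whole waveform by a gain in decibels (factor 10^(dB/20))."""
    factor = 10.0 ** (gain_db / 20.0)
    return [x * factor for x in signal]


def peak(signal):
    """Largest absolute sample value."""
    return max(abs(x) for x in signal)


sample_rate = 1000
slow = [math.sin(2 * math.pi * 5 * n / sample_rate) for n in range(1000)]
fast = [math.sin(2 * math.pi * 400 * n / sample_rate) for n in range(1000)]

print("low-pass  peak, 5 Hz vs 400 Hz:",
      round(peak(low_pass(slow, 0.1)), 3), round(peak(low_pass(fast, 0.1)), 3))
print("high-pass peak, 5 Hz vs 400 Hz:",
      round(peak(high_pass(slow, 0.1)), 3), round(peak(high_pass(fast, 0.1)), 3))
print("peak after +20 dB gain:", round(peak(apply_gain(slow, 20.0)), 3))
```

The low-pass output keeps the slow tone almost intact and strongly attenuates the fast one, the high-pass output does the reverse, and a gain of +20 dB multiplies every sample by 10.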

Thus in this blog we have learned

  1. What the important features of a sound wave are
  2. How to visualize and extract information from a sound wave
  3. How to pre-process a sound wave

These are the fundamental steps that have to be completed. Once they are done, the processed wave can be used in a speech analytics project such as:

  1. Speaker identification
  2. Speech-to-text conversion
  3. Speech modulation
  4. Music recommendation

We will look at the practical aspects using the TorchAudio library in the next part.

Translated from: https://medium.com/analytics-vidhya/speech-analytics-part-1-basics-of-speech-analytics-37ba6d5904e2
