[Course] SP Module 3: Digital Speech Signals

Time domain

Sound is a wave of pressure travelling through a medium, such as air. We can plot the variation in pressure (captured by a microphone) against time to visualise the waveform.

Sound source

Air flow from the lungs is the power source for generating a basic source of sound either using the vocal folds or at a constriction made anywhere in the vocal tract.

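There are two principal sources of sound in speech: vibration of the vocal folds, and noise made at a constriction in the vocal tract.

For the vocal folds, what matters is pressure. The air flow from the lungs is slow and is only the power source; it is not itself the sound. The sound is generated by the rapid change in pressure each time the vocal folds open and close, which gives a repeating pulse of sound.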

Periodic signal

The vocal folds block air flow from the lungs, burst open under pressure to create a glottal pulse, then rapidly close. This repeats, creating a periodic signal.

Pitch

Periodic signals are perceived as having a pitch. The physical property of fundamental frequency relates to the perceptual quantity of pitch.

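Pitch is what we hear as a musical note. Its relation to fundamental frequency is logarithmic rather than linear, with base 2: each doubling of frequency raises the pitch by one octave.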
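The base-2 relation between frequency and musical pitch can be sketched in a few lines. This is only an illustration: the function names are made up here, and the 12-semitones-per-octave scale is the usual Western musical convention, which the course notes do not mention.

```python
import math

def octaves_between(f_low_hz, f_high_hz):
    """Pitch interval in octaves: log base 2 of the frequency ratio."""
    return math.log2(f_high_hz / f_low_hz)

def semitones_between(f_low_hz, f_high_hz):
    """Assumed convention: 12 equal semitones per octave."""
    return 12.0 * octaves_between(f_low_hz, f_high_hz)

# Equal frequency ratios give equal pitch steps, even though the
# differences in Hz (110 vs 220) are not equal.
step_a = octaves_between(110.0, 220.0)
step_b = octaves_between(220.0, 440.0)
print(step_a, step_b, semitones_between(220.0, 440.0))
```

Both steps come out as one octave, which is why pitch is better described on a logarithmic scale than in raw Hz.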

Digital signal

To do speech processing with a computer, we need to convert sound first to an analogue electrical signal, and then to a digital representation of that signal.

A digital waveform is a sequence of samples of the analogue waveform. Two things determine the quality of the digital sound: the sampling rate (or sampling frequency), which digitizes time, and the quantization (or bit depth), which digitizes amplitude.
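As a rough sketch of these two steps, the code below samples a sine wave at a fixed rate and then rounds each sample to an integer level. The helper names, the 100 Hz test tone, and the 16 kHz rate are all assumptions chosen for illustration, and the quantizer is the simplest possible one (plain rounding, no dithering).

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Digitize time: take the value of a unit sine at times n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize(x, bit_depth):
    """Digitize amplitude: round a value in [-1, 1] to a signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1
    return round(x * max_level)

samples = sample_sine(100.0, 16000, 160)        # one cycle of 100 Hz at 16 kHz
coarse = [quantize(x, 4) for x in samples]      # 4 bits: levels -7..7
fine = [quantize(x, 16) for x in samples]       # 16 bits: levels -32767..32767
print(len(set(coarse)), len(set(fine)))         # number of distinct levels used
```

With only 4 bits the waveform becomes a coarse staircase; with 16 bits the steps are far smaller relative to the signal.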

Aliasing occurs when the sampling rate is too low for the signal: any component above half the sampling rate (the Nyquist frequency) is recorded as a spurious lower-frequency wave. To avoid aliasing, we must remove every component of the analogue signal above the Nyquist frequency before sampling.
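A small numerical check makes aliasing concrete. In this sketch (the frequencies and names are arbitrary choices), a 7 kHz cosine sampled at 10 kHz lies above the 5 kHz Nyquist frequency, and its samples turn out to be indistinguishable from those of a 3 kHz cosine.

```python
import math

def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Samples of a unit cosine taken at times n / sample_rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

sample_rate_hz = 10000
nyquist_hz = sample_rate_hz / 2                  # 5000 Hz

above_nyquist = sample_cosine(7000, sample_rate_hz, 50)
alias = sample_cosine(sample_rate_hz - 7000, sample_rate_hz, 50)  # 3000 Hz

# The two sample sequences agree up to floating-point rounding.
max_difference = max(abs(a - b) for a, b in zip(above_nyquist, alias))
print(nyquist_hz, max_difference)
```

Once sampled, nothing distinguishes the two tones, which is why the filtering has to happen on the analogue signal before sampling.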

Short-term analysis

Because speech sounds change over time, we need to analyse only short regions of the signal. We convert the speech signal into a sequence of frames.

To define a frame, we apply a window function, which cuts a short section out of the waveform.

Different window functions lead to different results. A simple 0/1 (rectangular) window creates abrupt discontinuities at the edges of the frame, so if we analysed that frame we'd be analysing not only the speech but also those artefacts. Instead, we can use a tapered window: the frame is cut out with a window function that tapers towards the edges. Think of that as a fade-in and a fade-out.

[Figure: https://image.discover304.top/blog-img/s11082210022022.png?imageView2/2/h/300]
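The framing step might look like the sketch below. The notes do not name a particular tapered window, so the Hann window used here is an assumption, as are the 25 ms frame length, the 10 ms frame shift, and the function names.

```python
import math

def hann_window(length):
    """Tapered window (assumed choice): 0 at both edges, close to 1 in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, frame_length, frame_shift):
    """Cut the signal into overlapping frames and taper each one."""
    window = hann_window(frame_length)
    result = []
    for start in range(0, len(signal) - frame_length + 1, frame_shift):
        chunk = signal[start:start + frame_length]
        result.append([s * w for s, w in zip(chunk, window)])
    return result

sample_rate_hz = 16000
# 0.1 s of a 200 Hz sine as a stand-in for speech
signal = [math.sin(2 * math.pi * 200 * n / sample_rate_hz) for n in range(1600)]

# 400 samples = 25 ms frames, shifted by 160 samples = 10 ms
framed = frames(signal, frame_length=400, frame_shift=160)
print(len(framed), len(framed[0]))
```

Every frame now starts and ends at (almost exactly) zero, so the analysis does not see an abrupt edge that was never in the speech.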

Series expansion

Speech is hard to analyse directly in the time domain. So we need to convert it to the frequency domain using Fourier analysis, which is a special case of series expansion.

For an analogue signal, exact reconstruction in general means adding together an infinite number of terms.

A digital signal, however, contains only a finite amount of information, so a finite number of basis functions is enough to reconstruct it exactly.

Another way of saying that is that these basis functions are also digital signals, and the highest possible frequency one is the one at the Nyquist frequency, which is half the sampling rate.

We simply calculate a coefficient for each basis function (one per possible frequency), then add up the weighted basis functions to reconstruct the original signal.

One application is removing noise or unwanted detail: if we stop adding terms early, leaving out the higher-frequency ones, we get a smoother curve.
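The idea of computing one coefficient per frequency, then summing all or only some of the terms, can be sketched with a direct discrete Fourier transform. This version uses complex exponentials as the basis functions (each one bundles a sine and a cosine at the same frequency), which is a standard formulation but an assumption relative to the notes; the names `dft` and `reconstruct` and the test signal are made up for the example.

```python
import cmath
import math

def dft(x):
    """One complex coefficient per basis function (frequency index k)."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]

def reconstruct(coeffs, max_k=None):
    """Add up weighted basis functions, optionally only up to frequency index max_k."""
    n_samples = len(coeffs)
    if max_k is None:
        max_k = n_samples // 2          # keep everything up to Nyquist
    out = []
    for n in range(n_samples):
        total = 0j
        for k in range(n_samples):
            # index k above N/2 is the negative-frequency partner of N - k
            if min(k, n_samples - k) <= max_k:
                total += coeffs[k] * cmath.exp(2j * math.pi * k * n / n_samples)
        out.append((total / n_samples).real)
    return out

n_samples = 64
slow = [math.sin(2 * math.pi * 2 * n / n_samples) for n in range(n_samples)]
signal = [slow[n] + 0.3 * math.sin(2 * math.pi * 20 * n / n_samples)
          for n in range(n_samples)]

coeffs = dft(signal)
exact = reconstruct(coeffs)             # all terms: recovers the signal
smooth = reconstruct(coeffs, max_k=5)   # stop early: only the slow component remains
exact_error = max(abs(a - b) for a, b in zip(exact, signal))
smooth_error = max(abs(a - b) for a, b in zip(smooth, slow))
print(exact_error, smooth_error)
```

Using all 64 terms gives back the signal to within rounding error, while stopping at index 5 drops the fast wiggle and leaves the smooth 2-cycle sine.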

Fourier analysis

We can express any signal as a sum of sine waves that form a series. This takes us from the time domain to the frequency domain.

A spectrum plots magnitude (in dB) against frequency (in kHz).

The basis functions are orthogonal, which means the coefficients are unique: only one set of coefficients reconstructs a given signal.
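Orthogonality can be checked numerically. In this sketch (names and frequencies are illustrative assumptions), the dot product of two sine basis functions at different frequencies is zero over a whole frame, so projecting a signal onto one basis function picks out exactly that component's coefficient and nothing else.

```python
import math

def basis(k, n_samples):
    """Sine basis function completing k whole cycles in the frame."""
    return [math.sin(2 * math.pi * k * n / n_samples) for n in range(n_samples)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

n_samples = 64
different = dot(basis(3, n_samples), basis(5, n_samples))   # about 0
same = dot(basis(3, n_samples), basis(3, n_samples))        # N / 2

# A signal built from two basis functions with known weights
signal = [0.8 * a + 0.2 * b
          for a, b in zip(basis(3, n_samples), basis(5, n_samples))]

# Projection recovers each weight independently of the other
coeff_3 = dot(signal, basis(3, n_samples)) / same
coeff_5 = dot(signal, basis(5, n_samples)) / same
print(different, same, coeff_3, coeff_5)
```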

Frequency domain

We complete our understanding of Fourier analysis with a look at the phase of the component sine waves, and the effect of changing the analysis frame duration.

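We usually discard the phase of the component sine waves and keep only their magnitudes. Phase describes where in its cycle each sine wave starts, and that depends mostly on where the analysis frame happens to fall on the waveform, so it tells us little about the speech sound itself. Exact reconstruction of the waveform would still need the phase.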
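To see what phase does and does not change, the sketch below computes a single Fourier coefficient for two sine waves that differ only in starting phase. The helper `dft_bin`, the frame length, and the 1 radian shift are assumptions for illustration.

```python
import cmath
import math

def dft_bin(x, k):
    """Complex Fourier coefficient of x at frequency index k."""
    n_samples = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
               for n in range(n_samples))

n_samples = 64
k = 4
wave_a = [math.sin(2 * math.pi * k * n / n_samples) for n in range(n_samples)]
# Same frequency and amplitude, but starting 1 radian later in its cycle
wave_b = [math.sin(2 * math.pi * k * n / n_samples + 1.0) for n in range(n_samples)]

coeff_a = dft_bin(wave_a, k)
coeff_b = dft_bin(wave_b, k)
magnitude_a, magnitude_b = abs(coeff_a), abs(coeff_b)
phase_difference = cmath.phase(coeff_b) - cmath.phase(coeff_a)
print(magnitude_a, magnitude_b, phase_difference)
```

The magnitudes are identical, and the whole difference between the two waves shows up in the phase, so a magnitude spectrum looks the same wherever the wave happens to start.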

The longer the analysis frame, the more basis functions there are, and so the more closely spaced they are in frequency.
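The link between frame length and the set of basis functions is simple arithmetic. Assuming a frame of N samples at sampling rate fs, the basis functions sit at multiples of fs / N from 0 Hz up to the Nyquist frequency; the function name and the two frame lengths below are just examples.

```python
def basis_frequencies_hz(frame_length, sample_rate_hz):
    """Frequencies of the basis functions for one frame, from 0 Hz to Nyquist."""
    return [k * sample_rate_hz / frame_length
            for k in range(frame_length // 2 + 1)]

sample_rate_hz = 16000
short_frame = basis_frequencies_hz(160, sample_rate_hz)   # 10 ms frame
long_frame = basis_frequencies_hz(640, sample_rate_hz)    # 40 ms frame

# (number of basis functions, spacing in Hz, highest frequency in Hz)
print(len(short_frame), short_frame[1], short_frame[-1])
print(len(long_frame), long_frame[1], long_frame[-1])
```

Both frames reach the same highest frequency (half the sampling rate), but the longer frame packs four times as many basis functions into that range.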

The effect of analysis frame size

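A spectrum has no time axis. Within one analysis frame, the time-domain waveform is decomposed into a magnitude at each frequency, so we lose the information about when things happen inside the frame. The frame duration therefore sets a trade-off between resolution in time and resolution in frequency.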

Summary

Beyond pitch there is prosody, a collective term for the fundamental frequency, duration, and amplitude of speech sounds (and sometimes also voice quality). When we attempt to generate synthetic speech, we'll have to give it appropriate prosody if we want it to sound natural.

With the frequency domain in hand, the next step is to find evidence there of the periodicity in the speech signal: the harmonics. The spectral envelope is the other half of the picture, telling us what the vocal tract does to that sound source.


Origin: Module 3 – Digital Speech Signals
Translate + Edit: YangSier (Homepage)

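🍀 Ramblings 🍀
Hello everyone, this is YangSier (杨丝儿), currently studying in the UK. My blog focuses on programming, algorithms, robotics, artificial intelligence, mathematics, and more. Follow me for a steady stream of high-quality posts.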
