Speex, Part 2: Codec Description and Related Concepts

zsJum


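Codec description

This chapter describes Speex and its features in more detail.

1. Concepts

Before introducing all of the Speex features, here are some speech-coding concepts that help in understanding the rest of this manual. Some are general concepts in speech/audio processing; others are specific to Speex.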

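Sampling rate

The sampling rate, expressed in hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a fundamental property of signal processing and is described by the sampling theorem. Speex is mainly designed for three sampling rates: 8 kHz, 16 kHz, and 32 kHz, referred to as narrowband, wideband, and ultra-wideband respectively.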
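As a quick illustration of the Fs/2 rule, the following minimal Python sketch computes the Nyquist frequency for the three sampling rates named above. The dictionary and function names are made up for this example and are not part of Speex.

```python
# Band names and sampling rates as listed in the text above.
SPEEX_MODES_HZ = {
    "narrowband": 8000,
    "wideband": 16000,
    "ultra-wideband": 32000,
}


def nyquist_hz(sampling_rate_hz):
    """Highest frequency that can be represented at the given sampling rate."""
    return sampling_rate_hz / 2


for name, fs in SPEEX_MODES_HZ.items():
    print(f"{name}: Fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz")
```

So narrowband audio can carry content up to 4 kHz, wideband up to 8 kHz, and ultra-wideband up to 16 kHz.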

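Quality (variable)

Speex is a lossy codec, which means it achieves compression at the expense of the fidelity of the input speech signal. Unlike some other speech codecs, Speex lets you control the trade-off between quality and bit-rate. Most of the time the Speex encoder is controlled by a quality parameter ranging from 0 to 10. In constant bit-rate (CBR) operation the quality parameter is an integer, while in variable bit-rate (VBR) operation it is a float.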

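Complexity (variable)

The complexity allowed for the Speex encoder is variable. It is controlled by an integer from 1 to 10 that determines how the search is performed, much like the -1 to -9 options of the gzip and bzip2 compression utilities. In normal use, the noise level at complexity 1 is 1 to 2 dB higher than at complexity 10, but complexity 10 requires about 5 times as much CPU as complexity 1. In practice the best trade-off lies between complexity 2 and 4, although higher settings are often useful when encoding non-speech sounds such as DTMF tones.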
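The ranges described for the quality and complexity parameters can be captured in a small validation sketch. `EncoderSettings` and `check_settings` are hypothetical names for this example only and are not the Speex API; the default values are arbitrary, apart from complexity 3 sitting inside the 2 to 4 range recommended above.

```python
from dataclasses import dataclass


@dataclass
class EncoderSettings:
    """Hypothetical settings container (not a Speex type)."""
    quality: float = 8    # 0..10; integer in CBR, float allowed in VBR
    complexity: int = 3   # 1..10
    vbr: bool = False


def check_settings(s):
    """Return a list of problems; an empty list means the settings are in range."""
    problems = []
    if not 0 <= s.quality <= 10:
        problems.append("quality must be between 0 and 10")
    if not s.vbr and s.quality != int(s.quality):
        problems.append("CBR quality must be an integer")
    if s.complexity != int(s.complexity) or not 1 <= s.complexity <= 10:
        problems.append("complexity must be an integer between 1 and 10")
    return problems


print(check_settings(EncoderSettings(quality=7.5, vbr=True)))  # fractional quality is fine in VBR
print(check_settings(EncoderSettings(quality=7.5)))            # but not in CBR
```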

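Variable bit-rate (VBR)

Variable bit-rate allows a codec to change its bit-rate dynamically to adapt to the "difficulty" of the audio being encoded. In the case of Speex, sounds such as vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (for example, s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit-rate for the same quality, or better quality at a given bit-rate. Despite these advantages, VBR has two main drawbacks. First, because only the quality is specified, there is no guarantee about the final average bit-rate. Second, for some real-time applications such as voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.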

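Average bit-rate (ABR)

Average bit-rate solves one of the problems of VBR by dynamically adjusting the VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real time (open-loop), the overall quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.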
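The idea of steering the VBR quality toward a target bit-rate can be sketched with a toy feedback loop. This is not Speex's actual ABR algorithm: the cost model `toy_frame_bits`, the proportional `gain`, the 20 ms frame duration, and the "difficulty" values are all assumptions invented for illustration.

```python
def toy_frame_bits(quality, difficulty):
    """Made-up cost model: harder frames and higher quality cost more bits."""
    return 40 + 30 * quality * difficulty


def abr_run(difficulties, target_bps, frame_ms=20, gain=0.5, start_quality=5.0):
    """Encode a sequence of frames while nudging quality toward the target rate.

    Returns (average bit-rate in bps, final quality).
    """
    target_bits = target_bps * frame_ms / 1000   # bits available per frame
    quality = start_quality
    total_bits = 0.0
    for d in difficulties:
        bits = toy_frame_bits(quality, d)
        total_bits += bits
        # Proportional correction: over budget lowers quality, under budget raises it.
        quality -= gain * (bits - target_bits) / target_bits
        quality = min(10.0, max(0.0, quality))   # keep within the 0..10 range
    duration_s = len(difficulties) * frame_ms / 1000
    return total_bits / duration_s, quality


# Alternate "easy" and "hard" frames (arbitrary values) and aim for 8 kbps.
difficulties = [0.5, 1.5] * 1000
avg_bps, final_quality = abr_run(difficulties, target_bps=8000)
print(f"average bit-rate: {avg_bps:.0f} bps, final quality: {final_quality:.2f}")
```

In this toy model the average ends up close to the target while the per-frame bit count still varies with difficulty, which is the behaviour the paragraph above describes.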

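Voice activity detection (VAD)

When enabled, voice activity detection determines whether the audio being encoded is speech or silence/background noise. VAD is always implicitly active when encoding in VBR, so the option is only useful in non-VBR operation. In that case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called comfort noise generation (CNG).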
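To make the speech versus silence decision concrete, here is a deliberately simple energy-threshold detector. It is only a sketch of the general idea and not how Speex implements VAD; the threshold of -40 dB, the 160-sample frame, and the assumption that samples are normalized to the range -1..1 are choices made for this example.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of a frame in dB (samples assumed to lie in -1..1)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)


def is_speech(frame, threshold_db=-40.0):
    """Toy decision: treat any frame louder than the threshold as speech."""
    return frame_energy_db(frame) > threshold_db


def sine_frame(amplitude, freq_hz=200, fs=8000, n=160):
    """Synthetic test frame: n samples of a sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]


loud = sine_frame(0.5)     # stands in for active speech
quiet = sine_frame(0.001)  # stands in for low-level background noise
print(is_speech(loud), is_speech(quiet))
```

A practical detector has to cope with loud noise and quiet speech, which a fixed energy threshold cannot do.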

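Discontinuous transmission (DTX)

Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the background noise is stationary. In file-based operation, since we cannot simply stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).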
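The two figures quoted above, 5 bits per frame and 250 bps, together imply the frame duration. The short sketch below works through that arithmetic; the function names are illustrative only.

```python
def implied_frame_ms(bits_per_frame, bitrate_bps):
    """Frame duration implied by a per-frame bit count and a bit-rate."""
    return bits_per_frame * 1000 / bitrate_bps


def bitrate_bps(bits_per_frame, frame_ms):
    """Bit-rate produced by sending a fixed number of bits every frame."""
    return bits_per_frame * 1000 / frame_ms


frame_ms = implied_frame_ms(5, 250)
print(f"5 bits at 250 bps implies a {frame_ms:.0f} ms frame")
print(f"that is {1000 / frame_ms:.0f} frames per second")
```

In other words, 5 bits sent 50 times per second gives 250 bps, consistent with 20 ms frames.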

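Perceptual enhancement

Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound objectively further from the original (for example, when considering only SNR), but in the end it still sounds better subjectively.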

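Latency and algorithmic delay

Every speech codec introduces a delay in the transmission. For Speex, this delay equals the frame size plus the amount of look-ahead required to process each frame. In narrowband operation (8 kHz) the delay is 30 ms, while in wideband operation (16 kHz) it is 34 ms. These values do not account for the CPU time it takes to encode or decode the frames.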
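Since the delay is the frame size plus the look-ahead, the look-ahead can be backed out of the quoted totals. The sketch below assumes a 20 ms frame, which is what the DTX figures of 5 bits per frame at 250 bps imply; the text in this section does not state the frame size directly, so treat it as an assumption.

```python
FRAME_MS = 20  # assumption: frame duration implied by 5 bits/frame at 250 bps

# Total algorithmic delay and sampling rate per mode, as quoted in the text.
MODES = {
    "narrowband": {"fs_hz": 8000, "delay_ms": 30},
    "wideband": {"fs_hz": 16000, "delay_ms": 34},
}


def lookahead_ms(delay_ms, frame_ms=FRAME_MS):
    """Look-ahead is whatever remains of the delay after one frame."""
    return delay_ms - frame_ms


def ms_to_samples(duration_ms, fs_hz):
    """Convert a duration in milliseconds to a sample count."""
    return duration_ms * fs_hz // 1000


for name, mode in MODES.items():
    la = lookahead_ms(mode["delay_ms"])
    print(f"{name}: look-ahead {la} ms = {ms_to_samples(la, mode['fs_hz'])} samples")
```

Under that assumption the look-ahead is 10 ms (80 samples) in narrowband and 14 ms (224 samples) in wideband.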

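3. Preprocessor

This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be run on the audio before the encoder. It provides three main functions:

(1) Noise suppression (NS)

(2) Automatic gain control (AGC)

(3) Voice activity detection (VAD)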

Figure 1: Acoustic echo cancellation model

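Noise suppression (NS)

The noise suppressor can be used to reduce the amount of background noise in the input signal before it is sent. This gives higher-quality speech whether or not the denoised signal is then encoded with Speex (or encoded at all). Using the denoised signal with the codec brings an additional benefit: speech codecs in general (Speex included) tend to perform poorly on noisy input and tend to amplify the noise. The noise suppressor greatly reduces this effect.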

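Automatic gain control (AGC)

Automatic gain control deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for VoIP because it removes the need to adjust the microphone gain manually. A secondary advantage is that, with the microphone gain set to a conservative (low) level, it is easier to avoid clipping.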
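The notion of adjusting a signal to a reference volume can be sketched as a per-frame RMS normalization. This is a simplified illustration and not the Speex preprocessor's AGC: the target level of 0.1, the gain cap of 10, and the per-frame (unsmoothed) gain are all assumptions for this example.

```python
import math


def rms(frame):
    """Root-mean-square level of a frame (samples assumed to lie in -1..1)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def apply_agc(frame, target_rms=0.1, max_gain=10.0):
    """Scale a frame toward the target RMS level, with a capped gain."""
    level = rms(frame)
    if level == 0:
        return list(frame)                      # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)    # cap so noise is not boosted without limit
    # Hard-limit to the valid range in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, x * gain)) for x in frame]


# A quiet 200 Hz tone sampled at 8 kHz stands in for a low-gain microphone.
quiet = [0.05 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(160)]
leveled = apply_agc(quiet)
print(f"RMS before: {rms(quiet):.4f}, after: {rms(leveled):.4f}")
```

A real AGC would change the gain gradually over time rather than frame by frame, to avoid audible pumping.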

Voice activity detection (VAD)

The voice activity detector provided by the preprocessor is more advanced than the one provided directly in the codec.

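4. Adaptive jitter buffer

When transmitting voice over UDP or RTP, packets may be lost, arrive with different delays, or even arrive out of order. The purpose of a jitter buffer is to reorder packets and buffer them long enough (but no longer than necessary) before they are sent to be decoded.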
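The reorder-and-buffer behaviour can be sketched with a small priority queue keyed on sequence number. This toy uses a fixed depth, whereas an adaptive jitter buffer adjusts its depth to network conditions; `ToyJitterBuffer` and its methods are hypothetical names and not the Speex jitter buffer API.

```python
import heapq


class ToyJitterBuffer:
    """Fixed-depth reordering buffer keyed on packet sequence number."""

    def __init__(self, depth=2):
        self.depth = depth      # packets held back to absorb reordering
        self._heap = []
        self._last_out = -1     # highest sequence number already released

    def put(self, seq, payload):
        if seq <= self._last_out:
            return              # arrived after its slot was played: too late, drop
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Release the oldest packets once more than `depth` are buffered."""
        out = []
        while len(self._heap) > self.depth:
            seq, payload = heapq.heappop(self._heap)
            self._last_out = seq
            out.append(seq)
        return out

    def flush(self):
        """Release everything that is still buffered, in order."""
        out = []
        while self._heap:
            seq, _ = heapq.heappop(self._heap)
            self._last_out = seq
            out.append(seq)
        return out


def play_order(arrivals, depth=2):
    """Feed packets in arrival order and return the order they are released in."""
    jb = ToyJitterBuffer(depth)
    released = []
    for seq in arrivals:
        jb.put(seq, f"frame-{seq}")
        released.extend(jb.pop_ready())
    return released + jb.flush()


print(play_order([0, 2, 1, 4, 3, 5]))
```

A deeper buffer tolerates more reordering at the cost of more delay, which is why the text stresses buffering no longer than necessary.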

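5. Acoustic echo canceller

In any hands-free communication system (Figure 1), speech from the remote end is played through the local loudspeaker, propagates in the room, and is captured by the microphone. If the audio captured by the microphone is sent directly to the remote end, the remote user hears an echo of their own voice. An acoustic echo canceller is designed to remove the acoustic echo before the signal is sent to the remote end. It is important to understand that the echo canceller is meant to improve the quality at the remote end.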
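One textbook way to remove such an echo is an adaptive filter that learns the loudspeaker-to-microphone path from the far-end signal and subtracts its estimate from the microphone signal. The sketch below uses a plain time-domain NLMS filter as an illustration of that principle; it is not a description of Speex's echo canceller. The echo path, filter length, step size, and the absence of near-end speech are all assumptions made up for this example.

```python
import random


def simulate_echo(far_end, echo_path):
    """Convolve the far-end signal with a (made-up) room impulse response."""
    return [
        sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
        for n in range(len(far_end))
    ]


def nlms_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo from the microphone signal."""
    weights = [0.0] * taps
    history = [0.0] * taps                      # most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                   # what would be sent to the remote end
        norm = sum(h * h for h in history) + eps
        # NLMS update: step along the input vector, normalized by its energy.
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def power(signal):
    return sum(v * v for v in signal) / len(signal)


random.seed(0)
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]   # noise stands in for remote speech
mic = simulate_echo(far_end, [0.0, 0.6, 0.3, -0.1])       # echo only, no local talker
residual = nlms_cancel(far_end, mic)
print(f"echo power: {power(mic[-200:]):.3e}, residual power: {power(residual[-200:]):.3e}")
```

Real rooms have impulse responses far longer than four taps, and handling a local talker speaking at the same time needs additional logic that this sketch leaves out.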

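6. Resampler

In some cases it is useful to convert audio from one sampling rate to another. There are many reasons for doing so: mixing streams that have different sampling rates, supporting sampling rates that the sound card does not support, transcoding, and so on. That is why a resampler is now part of the Speex project. This resampler can convert between any two arbitrary rates (the only requirement is that the ratio be a rational number), and the trade-off between quality and complexity can be controlled.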
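The requirement that the ratio be rational is easy to see with Python's `fractions` module: any pair of integer sampling rates reduces to a ratio of two integers. The rates used below (44100 Hz and 48000 Hz) are example values chosen for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction


def resampling_ratio(in_rate_hz, out_rate_hz):
    """Output/input rate as a reduced fraction."""
    return Fraction(out_rate_hz, in_rate_hz)


def output_length(n_input_samples, in_rate_hz, out_rate_hz):
    """Number of whole output samples produced from n input samples."""
    return n_input_samples * out_rate_hz // in_rate_hz


ratio = resampling_ratio(44100, 48000)
print(f"44100 -> 48000 Hz: ratio {ratio.numerator}/{ratio.denominator}")
print(f"441 input samples -> {output_length(441, 44100, 48000)} output samples")
```

A simple ratio such as 2/1 (8 kHz to 16 kHz) and an awkward one such as 160/147 are both rational, so both fall within what the paragraph above describes.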

Original text

2 Codec description

This section describes Speex and its features in more detail.

2.1 Concepts

Before introducing all the Speex features, here are some concepts in speech coding that help better understand the rest of the manual. Although some are general concepts in speech/audio processing, others are specific to Speex.

Sampling rate

The sampling rate expressed in Hertz (Hz) is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz (Fs/2 is known as the Nyquist frequency).

This is a fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.

Bit-rate

When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps), or generally kilobits per second. It is important to make the distinction between kilobits per second (kbps) and kilobytes per second (kBps).
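The difference between kbps and kBps is a factor of 8, which matters when estimating storage or bandwidth. The sketch below assumes decimal prefixes (1 kilobit = 1000 bits); the 8 kbps and 60 s figures are example values, and the function names are made up for illustration.

```python
BITS_PER_BYTE = 8


def kbps_to_kBps(kilobits_per_second):
    """Convert kilobits per second to kilobytes per second."""
    return kilobits_per_second / BITS_PER_BYTE


def encoded_size_bytes(bitrate_bps, duration_s):
    """Payload size of a stream encoded at a constant bit-rate."""
    return bitrate_bps * duration_s / BITS_PER_BYTE


print(f"8 kbps = {kbps_to_kBps(8)} kBps")
print(f"60 s at 8000 bps = {encoded_size_bytes(8000, 60):.0f} bytes")
```

Confusing the two units would make this one-minute estimate wrong by a factor of eight.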

Quality (variable)

Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.

Complexity (variable)

With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10 in a way that’s similar to the -1 to -9 options to gzip and bzip2 compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about 5 times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.

Variable Bit-Rate (VBR)

Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (e.g. s, f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve lower bit-rate for the same quality, or a better quality for a certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there’s no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.

Average Bit-Rate (ABR)

Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.

Voice Activity Detection (VAD)

When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).

Discontinuous Transmission (DTX)

Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).

Perceptual enhancement

Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound further from the original objectively (e.g. considering only SNR), but in the end it still sounds better (subjective improvement).

Latency and algorithmic delay

Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.

2.3 Preprocessor

This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the audio before running the encoder. The preprocessor provides three main functionalities:

• noise suppression

• automatic gain control (AGC)

• voice activity detection (VAD)

The denoiser can be used to reduce the amount of background noise present in the input signal. This provides higher quality speech whether or not the denoised signal is encoded with Speex (or at all). However, when using the denoised signal with the codec, there is an additional benefit. Speech codecs in general (Speex included) tend to perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly reduces this effect.

Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.

The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the codec.

2.4 Adaptive Jitter Buffer

When transmitting voice (or any content for that matter) over UDP or RTP, packets may be lost, arrive with different delays, or even arrive out of order. The purpose of a jitter buffer is to reorder packets and buffer them long enough (but no longer than necessary) so they can be sent to be decoded.

2.5 Acoustic Echo Canceller

In any hands-free communication system (Fig. 2.1), speech from the remote end is played in the local loudspeaker, propagates in the room and is captured by the microphone. If the audio captured from the microphone is sent directly to the remote end, then the remote user hears an echo of his voice. An acoustic echo canceller is designed to remove the acoustic echo before it is sent to the remote end. It is important to understand that the echo canceller is meant to improve the quality on the remote end.

2.6 Resampler

In some cases, it may be useful to convert audio from one sampling rate to another. There are many reasons for that. It can be for mixing streams that have different sampling rates, for supporting sampling rates that the soundcard doesn’t support, for transcoding, etc. That’s why there is now a resampler that is part of the Speex project. This resampler can be used to convert between any two arbitrary rates (the ratio must only be a rational number) and there is control over the quality/complexity tradeoff.
