The Speex Codec Manual (version 1.2-beta1)

 
Author: Jean-Marc Valin
August 12, 2006
 
Copyright (c) 2002-2006 Jean-Marc Valin/Xiph.org Foundation.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
 

Contents

1 Introduction to Speex
2 Codec description
  2.1 Concepts
  2.2 Codec
  2.3 Preprocessor
  2.4 Adaptive Jitter Buffer
  2.5 Acoustic Echo Canceller
3 Compiling
4 Command-line encoder/decoder
  4.1 speexenc
  4.2 speexdec
5 Programming with Speex (the libspeex API)
  5.1 Encoding
  5.2 Decoding
  5.3 Preprocessor
  5.4 Echo Cancellation
    5.4.1 Troubleshooting
  5.5 Codec Options (speex_*_ctl)
  5.6 Mode queries
  5.7 Preprocessor options
  5.8 Packing and in-band signalling
6 Formats and standards
  6.1 RTP Payload Format
  6.2 MIME Type
  6.3 Ogg file format
7 Introduction to CELP Coding
  7.1 Source-Filter Model of Speech Prediction
  7.2 Linear Prediction (LPC)
  7.3 Pitch Prediction
  7.4 Innovation Codebook
  7.5 Noise Weighting
  7.6 Analysis-by-Synthesis
8 Speex narrowband mode
  8.1 Whole-Frame Analysis
  8.2 Sub-Frame Analysis-by-Synthesis
  8.3 Bit allocation
  8.4 Perceptual enhancement
9 Speex wideband mode (sub-band CELP)
  9.1 Linear Prediction
  9.2 Pitch Prediction
  9.3 Excitation Quantization
  9.4 Bit allocation
A FAQ
B Sample code
  B.1 sampleenc.c
  B.2 sampledec.c

List of Tables

4 Quality versus bit-rate
5 Bit allocation for high-band in wideband mode

 
1 Introduction to Speex
The Speex project (http://www.speex.org/) was started because there was a need for a speech codec that was open-source and free from software patents, which are essential conditions for use in any open-source software. Vorbis already covers general audio, but it is not really suitable for speech. Also, unlike many other speech codecs, Speex is not targeted at cell phones but rather at voice over IP (VoIP) and file-based compression.
As design goals, we wanted to have a codec that would allow both very good quality speech and low bit-rate (unfortunately not at the same time!), which led us to developing a codec with multiple bit-rates. Of course very good quality also meant we had to do wideband (16 kHz sampling rate) in addition to narrowband (telephone quality, 8 kHz sampling rate). Designing for VoIP instead of cell phone use means that Speex must be robust to lost packets, but not to corrupted ones since packets either arrive unaltered or don’t arrive at all. Also, the idea was to have a reasonable complexity and memory requirement without compromising too much on the efficiency of the codec. All this led us to the choice of CELP as the encoding technique to use for Speex.
One of the main reasons is that CELP has long proved that it could do the job and scale well to both low bit-rates (think DoD CELP @ 4.8 kbps) and high bit-rates (think G.728 @ 16 kbps).
The main characteristics can be summarized as follows:
- Free software/open-source, patent and royalty-free
- Integration of narrowband and wideband using an embedded bit-stream
- Wide range of bit-rates available (from 2 kbps to 44 kbps)
- Dynamic bit-rate switching and Variable Bit-Rate (VBR)
- Voice Activity Detection (VAD, integrated with VBR)
- Variable complexity
- Ultra-wideband mode at 32 kHz (up to 48 kHz)
- Intensity stereo encoding option
- Fixed-point implementation (work in progress)
 
This document is organized as follows. Section 2 describes the different Speex features and defines some terms that will be used in later sections. Section 4 provides information about the standard command-line tools, while Section 5 contains information about programming with the Speex API. Section 6 has some information related to Speex and standards. The three last sections describe the internals of the codec and require some signal processing knowledge. Section 7 explains the general idea behind CELP, while Sections 8 and 9 are specific to Speex. Note that if you are only interested in using Speex, those three last sections are not required.
 
2 Codec description
This section describes the main features provided by Speex.
2.1 Concepts
Here are some concepts in speech coding that help better understand the rest of the manual. Emphasis is placed on the Speex features.
 
Sampling rate
Speex is mainly designed for 3 different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband. For a sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz. This is a consequence of Nyquist’s sampling theorem (and Fs/2 is known as the Nyquist frequency).
 
Quality
Speex encoding is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.
 
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10, in a way similar to the -1 to -9 options of the gzip and bzip2 compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about five times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.
 
Variable Bit-Rate (VBR)
Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good quality,
 
while fricatives (e.g. the s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit-rate for the same quality, or a better quality for a given bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there is no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.
 
Average Bit-Rate (ABR)
Average bit-rate solves one of the problems of VBR, as it dynamically adjusts the VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.
 
Voice Activity Detection (VAD)
When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).
 
Discontinuous Transmission (DTX)
Discontinuous transmission is an addition to VAD/VBR operation that allows the transmission to stop completely when the background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).
 
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, tries to reduce (the perception of) the noise produced by the coding/decoding process. In most cases, perceptual enhancement moves the sound further from the original objectively (if measured by SNR), but in the end it still sounds better (a subjective improvement).
 
Algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
 
2.2 Codec
2.3 Preprocessor
This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the audio before running the encoder. The preprocessor provides three main functionalities:
. denoising
. automatic gain control (AGC)
. voice activity detection (VAD)
The denoiser can be used to reduce the amount of background noise present in the input signal. This provides higher quality speech whether or not the denoised signal is encoded with Speex (or at all). However, when using the denoised signal with the codec, there is an additional benefit. Speech codecs in general (Speex included) tend to perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly reduces this effect.
Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.
The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the codec.
 
2.4 Adaptive Jitter Buffer
2.5 Acoustic Echo Canceller
 
3 Compiling
Compiling Speex under UNIX or any platform supported by autoconf (e.g. Win32/cygwin) is as easy as typing:
% ./configure [options]
% make
% make install
The options supported by the Speex configure script are:
--prefix=<path> Specifies where to install Speex
--enable-shared/--disable-shared Whether to compile shared libraries
--enable-static/--disable-static Whether to compile static libraries
--disable-wideband Disable the wideband part of Speex (typically to save space)
--enable-valgrind Enable extra information when (and only when) running with valgrind
--enable-sse Enable use of SSE instructions (x86/float only)
--enable-fixed-point Compile Speex for a processor that does not have a floating point unit (FPU)
--enable-arm4-asm Enable assembly specific to the ARMv4 architecture (gcc only)
--enable-arm5e-asm Enable assembly specific to the ARMv5E architecture (gcc only)
--enable-fixed-point-debug Use only for debugging the fixed-point code (very slow)
--enable-epic-48k Enable a special (and non-compatible) 4.8 kbps narrowband mode
--enable-ti-c55x Enable support for the TI C55x family
--enable-blackfin-asm Enable assembly specific to the Blackfin DSP architecture (gcc only)
--enable-16bit-precision Reduces precision to 16 bits in time-critical areas (fixed-point only)
 
4 Command-line encoder/decoder
The base Speex distribution includes a command-line encoder (speexenc) and decoder (speexdec). This section describes how to use these tools.
 
4.1 speexenc
The speexenc utility is used to create Speex files from raw PCM or wave files. It can be used by calling:
speexenc [options] input_file output_file
The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout.
The valid options are:
--narrowband (-n) Tell Speex to treat the input as narrowband (8 kHz). This is the default
--wideband (-w) Tell Speex to treat the input as wideband (16 kHz)
--ultra-wideband (-u) Tell Speex to treat the input as “ultra-wideband” (32 kHz)
--quality n Set the encoding quality (0-10), default is 8
--bitrate n Encoding bit-rate (use bit-rate n or lower)
--vbr Enable VBR (Variable Bit-Rate), disabled by default
--abr n Enable ABR (Average Bit-Rate) at n kbps, disabled by default
--vad Enable VAD (Voice Activity Detection), disabled by default
--dtx Enable DTX (Discontinuous Transmission), disabled by default
--nframes n Pack n frames in each Ogg packet (this saves space at low bit-rates)
--comp n Set encoding speed/quality tradeoff. The higher the value of n, the slower the encoding (default is 3)
-V Verbose operation, print bit-rate currently in use
--help (-h) Print the help
--version (-v) Print version information
 
Speex comments:
--comment Add the given string as an extra comment. This may be used multiple times.
--author Author of this track.
--title Title for this track.

Raw input options:
--rate n Sampling rate for raw input
--stereo Consider raw input as stereo
--le Raw input is little-endian
--be Raw input is big-endian
--8bit Raw input is 8-bit unsigned
--16bit Raw input is 16-bit signed
 
4.2 speexdec
The speexdec utility is used to decode Speex files and can be used by calling:
speexdec [options] speex_file [output_file]
The value ’-’ for speex_file or output_file corresponds respectively to stdin and stdout. Also, when no output_file is specified, the file is played to the soundcard. The valid options are:
--enh Enable post-filter (default)
--no-enh Disable post-filter
--force-nb Force decoding in narrowband
--force-wb Force decoding in wideband
--force-uwb Force decoding in ultra-wideband
--mono Force decoding in mono
--stereo Force decoding in stereo
--rate n Force decoding at n Hz sampling rate
--packet-loss n Simulate n % random packet loss
-V Verbose operation, print bit-rate currently in use
--help (-h) Print the help
--version (-v) Print version information
 
5 Programming with Speex (the libspeex API)
This section explains how to use the Speex API. Examples of code can also be found in appendix B.
 
5.1 Encoding
In order to encode speech using Speex, you first need to:
#include <speex/speex.h>
You then need to declare a Speex bit-packing struct

SpeexBits bits;

and a Speex encoder state

void *enc_state;

The two are initialized by:
speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most cases, you will need to know the frame size used by the mode you are using. You can get that value in the frame_size variable with:
speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);
In practice, frame_size will correspond to 20 ms when using 8, 16, or 32 kHz sampling rate. Once the initialization is done, for every input frame:
speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *) where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that can be written to byte_ptr without causing an overflow and nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes).
Before calling speex_bits_write, it is possible to find the number of bytes that need to be written by calling speex_bits_nbytes(&bits), which returns the number of bytes. It is still possible to use the speex_encode() function, which takes a (float *) for the audio. However, this would make a future port to an FPU-less platform (like ARM) more complicated. Internally, speex_encode() and speex_encode_int() are processed in the same way. Whether the encoder uses the fixed-point version is decided solely by compile-time flags, not at the API level.
After you’re done with the encoding, free all resources with:
speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
That’s about it for the encoder.
 
5.2 Decoding
In order to decode speech using Speex, you first need to:
#include <speex/speex.h>
You also need to declare a Speex bit-packing struct

SpeexBits bits;

and a Speex decoder state

void *dec_state;

The two are initialized by:
speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);
For wideband decoding, speex_nb_mode will be replaced by speex_wb_mode. If you need to obtain the size of the frames that will be used by the decoder, you can get that value in the frame_size variable with:
speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);
There is also a parameter that can be set for the decoder: whether or not to use a perceptual enhancer. This can be set by:
speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);
where enh is an int with value 0 to have the enhancer disabled and 1 to have it enabled. As of 1.2-beta1, the default is now to enable the enhancer. Again, once the decoder initialization is done, for every input frame:
speex_bits_read_from(&bits, input_bytes, nbBytes);
speex_decode_int(dec_state, &bits, output_frame);
where input_bytes is a (char *) containing the bit-stream data received for a frame, nbBytes is the size (in bytes) of that bit-stream, and output_frame is a (short *) pointing to the area where the decoded speech frame will be written. Passing NULL as the second argument of speex_decode_int (instead of &bits) indicates that we don’t have the bits for the current frame. When a frame is lost, the Speex decoder will do its best to "guess" the correct signal. As with the encoder, the speex_decode() function can still be used, with a (float *) as the output for the audio. After you’re done with the decoding, free all resources with:
speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);
 
5.3 Preprocessor
In order to use the Speex preprocessor, you first need to:
#include <speex/speex_preprocess.h>
Then, a preprocessor state can be created as:
SpeexPreprocessState *preprocess_state = speex_preprocess_state_init(frame_size, sampling);
It is recommended to use the same value for frame_size as is used by the encoder (20 ms). For each input frame, you need to call:
speex_preprocess(preprocess_state, audio_frame, echo_residue);
where audio_frame is used both as input and output and echo_residue is either an array filled by the echo canceller, or NULL if the preprocessor is used without the echo canceller. In cases where the output audio is not useful for a certain frame, it is possible to use instead:
speex_preprocess_estimate_update(preprocess_state, audio_frame, echo_residue);
This call will update all the preprocessor internal state variables without computing the output audio, thus saving some CPU cycles. The behaviour of the preprocessor can be changed using:
speex_preprocess_ctl(preprocess_state, request, ptr);
which is used in the same way as the encoder and decoder equivalents. Options are listed in Section 5.7. The preprocessor state can be destroyed using:
speex_preprocess_state_destroy(preprocess_state);
 
5.4 Echo Cancellation
The Speex library now includes an echo cancellation algorithm suitable for Acoustic Echo Cancellation (AEC). In order to use the echo canceller, you first need to
#include <speex/speex_echo.h>
Then, an echo canceller state can be created by:
SpeexEchoState *echo_state = speex_echo_state_init(frame_size, filter_length);
where frame_size is the amount of data (in samples) you want to process at once and filter_length is the length (in samples) of the echo cancelling filter you want to use (also known as the tail length). It is recommended to use a frame size in the order of 20 ms (or equal to the codec frame size) and to make sure it is easy to perform an FFT of that size (powers of two are better than prime sizes). The recommended tail length is approximately one third of the room reverberation time. For example, in a small room, the reverberation time is in the order of 300 ms, so a tail length of 100 ms is a good choice (800 samples at 8000 Hz sampling rate). Once the echo canceller state is created, audio can be processed by:
speex_echo_cancel(echo_state, input_frame, echo_frame, output_frame, residue);
where input_frame is the audio as captured by the microphone, echo_frame is the signal that was played in the speaker (and needs to be removed) and output_frame is the signal with the echo removed. The residue parameter is optional (you can set it to NULL) and is used to return the estimated power spectrum of the echo residue so it can be removed by the preprocessor (if you wish to use it).
One important thing to keep in mind is the relationship between input_frame and echo_frame. It is important that, at any time, any echo that is present in the input has already been sent to the echo canceller as echo_frame. In other words, the echo canceller cannot remove a signal that it hasn’t yet received. On the other hand, the delay between the input signal and the echo signal must be small enough, because otherwise part of the echo cancellation filter is wasted. In the ideal case, your code would look like:
write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancel(echo_state, input_frame, echo_frame, output_frame, residue);
As stated above, if you wish to further reduce the echo present in the signal, you can do so by passing residue as the last parameter of speex_preprocess() function (see Section 5.3).
As of version 1.2-beta1, there is an alternative, simpler API that can be used instead of speex_echo_cancel(). When audio capture and playback are handled asynchronously (e.g. in different threads or using the poll() or select() system call), it can be difficult to keep track of which input_frame comes with which echo_frame. Instead, the playback context/thread can simply call:
speex_echo_playback(echo_state, echo_frame);
every time an audio frame is played. Then, the capture context/thread calls:
speex_echo_capture(echo_state, input_frame, output_frame, residue);
for every frame captured. Internally, speex_echo_playback() simply buffers the playback frame so it can be used by speex_echo_capture() to call speex_echo_cancel(). When capture and playback are done synchronously, speex_echo_cancel() is still preferred since it gives better control over the exact input/echo timing.
The echo cancellation state can be destroyed with:
speex_echo_state_destroy(echo_state);
It is also possible to reset the state of the echo canceller, so it can be reused without creating another state, with:

speex_echo_state_reset(echo_state);
 
5.4.1 Troubleshooting
There are several things that may prevent the echo canceller from working properly. One of them is a bug (or something suboptimal) in the code, but there are many others you should consider first:

- Using a different soundcard for capture and playback will *not* work, regardless of what you may think. The only exception is if the two cards can be made to have their sampling clocks “locked” on the same clock source.
- The delay between the record and playback signals must be minimal. Any signal played has to “appear” on the playback (far end) signal slightly before the echo canceller “sees” it in the near end signal, but excessive delay means that part of the filter length is wasted. In the worst situations, the delay is longer than the filter length, in which case no echo can be cancelled.
- When it comes to echo tail length (filter length), longer is *not* better. Actually, the longer the tail length, the longer it takes for the filter to adapt. Of course, a tail length that is too short will not cancel enough echo, but the most common problem seen is that people set a very long tail length and then wonder why no echo is being cancelled.
- Non-linear distortion cannot (by definition) be modeled by the linear adaptive filter used in the echo canceller and thus cannot be cancelled. Use good audio gear and avoid saturation/clipping.

Also useful is reading “Echo Cancellation Demystified” by Alexey Frunze (http://www.embeddedstar.com/articles/2003/7/article20030720-1.html), which explains the fundamental principles of echo cancellation. The details of the algorithm described in the article are different, but the general ideas of echo cancellation through adaptive filters are the same.
 
5.5 Codec Options (speex_*_ctl)
The Speex encoder and decoder support many options and requests that can be accessed through the speex_encoder_ctl and speex_decoder_ctl functions. These functions are similar to the ioctl system call and their prototypes are:
void speex_encoder_ctl(void *encoder, int request, void *ptr);
void speex_decoder_ctl(void *decoder, int request, void *ptr);
The different values of request allowed are (note that some only apply to the encoder or the decoder):
SPEEX_SET_ENH** Set perceptual enhancer to on (1) or off (0) (integer)
SPEEX_GET_ENH** Get perceptual enhancer status (integer)
SPEEX_GET_FRAME_SIZE Get the frame size used for the current mode (integer)
SPEEX_SET_QUALITY* Set the encoder speech quality (integer 0 to 10)
SPEEX_GET_QUALITY* Get the current encoder speech quality (integer 0 to 10)
SPEEX_SET_MODE*†
SPEEX_GET_MODE*†
SPEEX_SET_LOW_MODE*†
SPEEX_GET_LOW_MODE*†
SPEEX_SET_HIGH_MODE*†
SPEEX_GET_HIGH_MODE*†
SPEEX_SET_VBR* Set variable bit-rate (VBR) to on (1) or off (0) (integer)
SPEEX_GET_VBR* Get variable bit-rate (VBR) status (integer)
SPEEX_SET_VBR_QUALITY* Set the encoder VBR speech quality (float 0 to 10)
SPEEX_GET_VBR_QUALITY* Get the current encoder VBR speech quality (float 0 to 10)
SPEEX_SET_COMPLEXITY* Set the CPU resources allowed for the encoder (integer 1 to 10)
SPEEX_GET_COMPLEXITY* Get the CPU resources allowed for the encoder (integer 1 to 10)
SPEEX_SET_BITRATE* Set the bit-rate to use to the closest value not exceeding the parameter (integer in bps)
SPEEX_GET_BITRATE Get the current bit-rate in use (integer in bps)
SPEEX_SET_SAMPLING_RATE Set real sampling rate (integer in Hz)
SPEEX_GET_SAMPLING_RATE Get real sampling rate (integer in Hz)
SPEEX_RESET_STATE Reset the encoder/decoder state to its original state (zeros all memories)
SPEEX_SET_VAD* Set voice activity detection (VAD) to on (1) or off (0) (integer)
SPEEX_GET_VAD* Get voice activity detection (VAD) status (integer)
SPEEX_SET_DTX* Set discontinuous transmission (DTX) to on (1) or off (0) (integer)
SPEEX_GET_DTX* Get discontinuous transmission (DTX) status (integer)
SPEEX_SET_ABR* Set average bit-rate (ABR) to a value n in bits per second (integer in bps)
SPEEX_GET_ABR* Get average bit-rate (ABR) setting (integer in bps)
SPEEX_SET_PLC_TUNING* Tell the encoder to optimize encoding for a certain percentage of packet loss (integer in percent)
SPEEX_GET_PLC_TUNING* Get the current tuning of the encoder for PLC (integer in percent)

* applies only to the encoder
** applies only to the decoder
† normally only used internally
 
5.6 Mode queries
Speex modes have a query system similar to the speex_encoder_ctl and speex_decoder_ctl calls. Since modes are read-only, it is only possible to get information about a particular mode. The function used to do that is:
void speex_mode_query(SpeexMode *mode, int request, void *ptr);
The admissible values for request are (unless otherwise noted, the values are returned through ptr):
SPEEX_MODE_FRAME_SIZE Get the frame size (in samples) for the mode
SPEEX_SUBMODE_BITRATE Get the bit-rate for a submode number specified through ptr (integer in bps).
 
5.7 Preprocessor options
SPEEX_PREPROCESS_SET_DENOISE Turns denoising on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_DENOISE Get denoising status (integer)
SPEEX_PREPROCESS_SET_AGC Turns automatic gain control (AGC) on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_AGC Get AGC status (integer)
SPEEX_PREPROCESS_SET_VAD Turns voice activity detection (VAD) on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_VAD Get VAD status (integer)
SPEEX_PREPROCESS_SET_AGC_LEVEL
SPEEX_PREPROCESS_GET_AGC_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB Turns reverberation removal on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_DEREVERB Get reverberation removal status (integer)
SPEEX_PREPROCESS_SET_DEREVERB_LEVEL
SPEEX_PREPROCESS_GET_DEREVERB_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB_DECAY
SPEEX_PREPROCESS_GET_DEREVERB_DECAY
 
5.8 Packing and in-band signalling
Sometimes it is desirable to pack more than one frame per packet (or other basic unit of storage). The proper way to do it is to call speex_encode N times before writing the stream with speex_bits_write. In cases where the number of frames is not determined by an out-of-band mechanism, it is possible to include a terminator code. That terminator consists of the code 15 (decimal) encoded with 5 bits, as shown in Table 4. Note that as of version 1.0.2, calling speex_bits_write automatically inserts the terminator so as to fill the last byte. This doesn’t involve any overhead and makes sure Speex can always detect when there are no more frames in a packet.
It is also possible to send in-band “messages” to the other side. All these messages are encoded as “pseudo-frames” of mode 14 which contain a 4-bit message type code, followed by the message. Table 1 lists the available codes, their meaning and the size of the message that follows. Most of these messages are requests that are sent to the encoder or decoder on the other end, which is free to comply or ignore them. By default, all in-band messages are ignored.
Finally, applications may define custom in-band messages using mode 13. The size of the message in bytes is encoded with 5 bits, so that the decoder can skip it if it doesn’t know how to interpret it.
 
Code  Size (bits)  Content
0     1            Asks decoder to set perceptual enhancement off (0) or on (1)
1     1            Asks (if 1) the encoder to be less “aggressive” due to high packet loss
2     4            Asks encoder to switch to mode N
3     4            Asks encoder to switch to mode N for low-band
4     4            Asks encoder to switch to mode N for high-band
5     4            Asks encoder to switch to quality N for VBR
6     4            Request acknowledge (0=no, 1=all, 2=only for in-band data)
7     4            Asks encoder to set CBR (0), VAD (1), DTX (3), VBR (5), VBR+DTX (7)
8     8            Transmit (8-bit) character to the other end
9     8            Intensity stereo information
10    16           Announce maximum bit-rate acceptable (N in bytes/second)
11    16           reserved
12    32           Acknowledge receiving packet N
13    32           reserved
14    64           reserved
15    64           reserved

Table 1: In-band signalling codes
 
6 Formats and standards
Speex can encode speech in both narrowband and wideband and provides different bitrates. However, not all features need to be supported by a certain implementation or device. In order to be called “Speex compatible” (whatever that means), an implementation must implement at least a basic set of features.
At the minimum, all narrowband modes of operation MUST be supported at the decoder. This includes the decoding of a wideband bit-stream by the narrowband decoder2. If present, a wideband decoder MUST be able to decode a narrowband stream, and MAY either be able to decode all wideband modes or be able to decode the embedded narrowband part of all modes (which includes ignoring the high-band bits).
For encoders, at least one narrowband or wideband mode MUST be supported. The main reason why all encoding modes do not have to be supported is that some platforms may not be able to handle the complexity of encoding in some modes.
 
6.1 RTP Payload Format
The RTP payload draft is included in appendix C and the latest version is available at http://www.speex.org/drafts/latest. This draft has been sent (2003/02/26) to the Internet Engineering Task Force (IETF) and will be discussed at the March 18th meeting in San Francisco.

2. The wideband bit-stream contains an embedded narrowband bit-stream which can be decoded alone.
 
6.2 MIME Type
For now, you should use the MIME type audio/x-speex for Speex-in-Ogg. We will apply for type audio/speex in the near future.
 
6.3 Ogg file format
Speex bit-streams can be stored in Ogg files. In this case, the first packet of the Ogg file contains the Speex header described in table 2. All integer fields in the header are stored as little-endian. The speex_string field must contain “Speex   ” (with 3 trailing spaces), which identifies the bit-stream. The next field, speex_version, contains the version of Speex that encoded the file. For now, refer to speex_header.[ch] for more info. The beginning of stream (b_o_s) flag is set to 1 for the header. The header packet has packetno=0 and granulepos=0.
The second packet contains the Speex comment header. The format used is the Vorbis comment format described here: http://www.xiph.org/ogg/vorbis/doc/v-comment.html
. This packet has packetno=1 and granulepos=0.
The third and subsequent packets each contain one or more (the number is found in the header) Speex frames. These are identified with packetno starting from 2, and the granulepos is the number of the last sample encoded in that packet. The last of these packets has the end of stream (e_o_s) flag set to 1.
 
Field                    Type     Size
speex_string             char[]   8
speex_version            char[]   20
speex_version_id         int      4
header_size              int      4
rate                     int      4
mode                     int      4
mode_bitstream_version   int      4
nb_channels              int      4
bitrate                  int      4
frame_size               int      4
vbr                      int      4
frames_per_packet        int      4
extra_headers            int      4
reserved1                int      4
reserved2                int      4
 
Table 2: Ogg/Speex header packet
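Table 2 maps naturally onto a C struct. The sketch below is illustrative (field names follow speex_header.[ch], but the typedef there may differ), and a robust reader should parse field by field with explicit little-endian conversion rather than rely on struct layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative C view of the 80-byte Ogg/Speex header packet of Table 2.
   No padding occurs on common ABIs because the 28 bytes of char arrays
   keep the int32 fields 4-byte aligned. */
typedef struct {
    char    speex_string[8];        /* "Speex   " (3 trailing spaces) */
    char    speex_version[20];      /* encoder version string         */
    int32_t speex_version_id;
    int32_t header_size;
    int32_t rate;                   /* sampling rate in Hz            */
    int32_t mode;                   /* 0 = narrowband, 1 = wideband   */
    int32_t mode_bitstream_version;
    int32_t nb_channels;
    int32_t bitrate;
    int32_t frame_size;             /* samples per frame              */
    int32_t vbr;
    int32_t frames_per_packet;
    int32_t extra_headers;
    int32_t reserved1;
    int32_t reserved2;
} SpeexOggHeader;
```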
 
7 Introduction to CELP Coding
Speex is based on CELP, which stands for Code Excited Linear Prediction. This section attempts to introduce the principles behind CELP, so if you are already familiar with CELP, you can safely skip to section 8. The CELP technique is based on three ideas:
1. The use of a linear prediction (LP) model to model the vocal tract
2. The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
3. The search performed in closed-loop in a “perceptually weighted domain”

This section describes the basic ideas behind CELP. This is still a work in progress.
 
7.1 Source-Filter Model of Speech Prediction
The source-filter model of speech production assumes that the vocal cords are the source of spectrally flat sound (the excitation signal), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech. While still an approximation, the model is widely used in speech coding because of its simplicity. Its use is also the reason why most speech codecs (Speex included) perform badly on music signals. The different phonemes can be distinguished by their excitation (source) and spectral shape (filter). Voiced sounds (e.g. vowels) have an excitation signal that is periodic and that can be approximated by an impulse train in the time domain or by regularly-spaced harmonics in the frequency domain. On the other hand, fricatives (such as the "s", "sh" and "f" sounds) have an excitation signal that is similar to white Gaussian noise. So-called voiced fricatives (such as "z" and "v") have an excitation signal composed of a harmonic part and a noisy part. The source-filter model is usually tied to the use of linear prediction. The CELP model is based on the source-filter model, as can be seen from the CELP decoder illustrated in Figure 1.
 
7.2 Linear Prediction (LPC)
Linear prediction is at the base of many speech coding techniques, including CELP. The idea behind it is to predict the signal x[n] using a linear combination of its past samples:
Figure 1: The CELP model of speech synthesis (decoder)

y[n] = sum_{i=1..N} a_i x[n-i]

where y[n] is the linear prediction of x[n]. The prediction error is thus given by:

e[n] = x[n] - y[n] = x[n] - sum_{i=1..N} a_i x[n-i]
 
Minimizing the squared prediction error leads to the normal equations Ra = r, where R is the autocorrelation matrix of the signal. Because R is Toeplitz Hermitian, the Levinson-Durbin algorithm can be used, making the solution O(N^2) instead of O(N^3). Also, it can be proven that all the roots of A(z) are within the unit circle, which means that 1/A(z) is always stable. That is the theory; in practice, because of finite precision, two techniques are commonly used to make sure the filter is stable. First, we multiply R(0) by a number slightly above one (such as 1.0001), which is equivalent to adding noise to the signal. Also, we can apply a window to the auto-correlation, which is equivalent to filtering in the frequency domain, reducing sharp resonances.
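As an illustration, here is a minimal textbook Levinson-Durbin recursion (not the Speex fixed-point implementation), solving for the prediction coefficients a_i from the autocorrelations:

```c
#include <assert.h>
#include <math.h>

/* Textbook Levinson-Durbin recursion: given autocorrelations R[0..order],
   computes the coefficients a[1..order] of the predictor
   y[n] = sum_i a[i] x[n-i] that minimizes the squared prediction error,
   in O(order^2) operations. */
void levinson_durbin(const double *R, double *a, int order)
{
    double err = R[0];
    for (int i = 1; i <= order; i++) {
        /* reflection coefficient for stage i */
        double k = R[i];
        for (int j = 1; j < i; j++)
            k -= a[j] * R[i - j];
        k /= err;
        a[i] = k;
        /* in-place symmetric update: a[j] -= k * a[i-j] for j = 1..i-1 */
        for (int j = 1; j < i - j; j++) {
            double tmp = a[j];
            a[j]     -= k * a[i - j];
            a[i - j] -= k * tmp;
        }
        if (i % 2 == 0)
            a[i / 2] -= k * a[i / 2];   /* middle coefficient */
        err *= 1.0 - k * k;
    }
}
```

For an AR(1) signal with autocorrelation R(k) = 0.9^k, the recursion returns a[1] = 0.9 and a[2] = 0, as expected.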
 
7.3 Pitch Prediction
During voiced segments, the speech signal is periodic, so it is possible to take advantage of that property by approximating the excitation signal e[n] by a gain times the past of the excitation:
e[n] ≈ p[n] = b e[n-T]

where T is the pitch period and b is the pitch gain. We call that long-term prediction, since the excitation is predicted from e[n-T] with T >> N.
 
7.4 Innovation Codebook
The final excitation e[n] will be the sum of the pitch prediction and an innovation signal c[n] taken from a fixed codebook, hence the name Code Excited Linear Prediction. The final excitation is given by:

e[n] = p[n] + c[n] = b e[n-T] + c[n]

The quantization of c[n] is where most of the bits in a CELP codec are allocated. It represents the information that couldn’t be obtained either from linear prediction or pitch prediction. In the z-domain we can represent the final signal X(z) as

X(z) = C(z) / (A(z) (1 - b z^-T))
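Putting the pieces together, a toy decoder loop looks as follows. This is a sketch under simplifying assumptions (zero excitation and signal history, single LP filter, no gain quantization), not the actual Speex decoder:

```c
#include <assert.h>
#include <math.h>

/* Toy CELP synthesis for one block: e[n] = b*e[n-T] + c[n] (long-term
   prediction + innovation), then x[n] = e[n] + sum_i a[i]*x[n-i], i.e.
   the synthesis filter 1/A(z) with the text's convention
   A(z) = 1 - sum_i a_i z^-i. History before the block is taken as zero;
   len must not exceed 160. */
void celp_synth(const float *c, float b, int T,
                const float *a, int order, float *x, int len)
{
    float e[160] = {0};
    for (int n = 0; n < len; n++) {
        float pitch = (n - T >= 0) ? b * e[n - T] : 0.0f;
        e[n] = pitch + c[n];                  /* excitation */
        x[n] = e[n];
        for (int i = 1; i <= order && i <= n; i++)
            x[n] += a[i] * x[n - i];          /* 1/A(z) */
    }
}
```

With a single-tap filter a[1] = 0.5 and a unit impulse as innovation, the output decays as 1, 0.5, 0.25, ...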
 
7.5 Noise Weighting
Most (if not all) modern audio codecs attempt to “shape” the noise so that it appears mostly in the frequency regions where the ear cannot detect it. For example, the ear is more tolerant to noise in parts of the spectrum that are louder, and vice versa. In order to maximize speech quality, CELP codecs minimize the mean square of the error (noise) in the perceptually weighted domain. This means that a perceptual noise weighting filter W(z) is applied to the error signal in the encoder. In most CELP codecs, W(z) is a pole-zero weighting filter derived from the linear prediction coefficients (LPC), generally using bandwidth expansion. Let the spectral envelope be represented by the synthesis filter 1/A(z); CELP codecs then typically derive the noise weighting filter as:

W(z) = A(z/g1) / A(z/g2)   (1)

where g1 = 0.9 and g2 = 0.6 in the Speex reference implementation. If a filter A(z) has (complex) poles at p_i in the z-plane, the filter A(z/g) will have its poles at p'_i = g·p_i, making it a flatter version of A(z).
The weighting filter is applied to the error signal used to optimize the codebook search through analysis-by-synthesis (AbS). This results in a spectral shape of the noise that tends towards 1/W(z). While the simplicity of the model has been an important reason for the success of CELP, W(z) remains a very rough approximation of the perceptually optimal noise weighting function. Fig. 2 illustrates the noise shaping that results from Eq. 1. Throughout this document, we refer to W(z) as the noise weighting filter and to 1/W(z) as the noise shaping filter (or curve).
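Since A(z/g) simply scales the i-th LPC coefficient by g^i, both halves of the weighting filter in Eq. 1 can be built from the LPC coefficients with a few multiplies. A sketch (not the Speex source):

```c
#include <assert.h>
#include <math.h>

/* Bandwidth expansion: replacing z by z/g in A(z) scales the i-th LPC
   coefficient by g^i. Computing this for g1 and g2 yields the numerator
   and denominator of W(z) = A(z/g1)/A(z/g2). (Sketch, not libspeex.) */
void bw_expand(const float *a, float *ag, int order, float g)
{
    float gi = 1.0f;
    for (int i = 0; i <= order; i++) {
        ag[i] = a[i] * gi;   /* coefficient of z^-i in A(z/g) */
        gi *= g;
    }
}
```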
 
7.6 Analysis-by-Synthesis
One of the main principles behind CELP is called Analysis-by-Synthesis (AbS), meaning that the encoding (analysis) is performed by perceptually optimising the decoded (synthesis) signal in a closed loop. In theory, the best CELP stream would be produced by trying all possible bit combinations and selecting the one that produces the best-sounding decoded signal. This is obviously not possible in practice for two reasons: the required complexity is beyond any currently available hardware and the “best sounding” selection criterion implies a human listener.
In order to achieve real-time encoding using limited computing resources, the CELP optimisation is broken down into smaller, more manageable, sequential searches using the perceptual weighting function described earlier.
 
8 Speex narrowband mode
This section looks at how Speex works for narrowband (8 kHz sampling rate) operation. The frame size for this mode is 20 ms, corresponding to 160 samples. Each frame is also subdivided into 4 sub-frames of 40 samples each. Many design decisions were based on the original goals and assumptions:
- Minimizing the amount of information extracted from past frames (for robustness to packet loss)
- Dynamically-selectable codebooks (LSP, pitch and innovation)
- Sub-vector fixed (innovation) codebooks
 
8.1 Whole-Frame Analysis
In narrowband, Speex frames are 20 ms long (160 samples) and are subdivided into 4 sub-frames of 5 ms each (40 samples). For most narrowband bit-rates (8 kbps and above), the only parameters encoded at the frame level are the Line Spectral Pairs (LSP) and a global excitation gain g_frame, as shown in Fig. 3. All other parameters are encoded at the sub-frame level.
Linear prediction analysis is performed once per frame using an asymmetric Hamming window centered on the fourth sub-frame. Because linear prediction coefficients (LPC) are not robust to quantization, they are first converted to line spectral pairs (LSP). The LSPs are considered to be associated with the 4th sub-frame, and the LSPs for the first 3 sub-frames are linearly interpolated using the current and previous LSP coefficients. The interpolated LSPs are then converted back to the LPC filter Â(z).
The non-quantized interpolated filter is denoted A(z) and can be used for the weighting filter W(z) because it does not need to be available to the decoder. To make Speex more robust to packet loss, no prediction is applied on the LSP coefficients prior to quantization. The LSPs are encoded using vector quantization (VQ), with 30 bits for the higher quality modes and 18 bits for the lower quality ones.
 
8.2 Sub-Frame Analysis-by-Synthesis
Figure 3: Frame open-loop analysis

Figure 4: Analysis-by-synthesis closed-loop optimization on a sub-frame

The analysis-by-synthesis (AbS) encoder loop is described in Fig. 4. There are three main aspects where Speex significantly differs from most other CELP codecs. First, while most recent CELP codecs make use of fractional pitch estimation with a single gain, Speex uses an integer to encode the pitch period, but uses a 3-tap predictor (3 gains). The adaptive codebook contribution ea[n] can thus be expressed as:
ea[n] = g0 e[n-T-1] + g1 e[n-T] + g2 e[n-T+1]   (2)
where g0, g1 and g2 are the jointly quantized pitch gains and e[n] is the codec excitation memory. It is worth noting that when the pitch is smaller than the sub-frame size, we repeat the excitation at a period T. For example, when n-T+1 ≥ 0, we use n-2T+1 instead. In most modes, the pitch period is encoded with 7 bits in the [17, 144] range, and the gain coefficients are vector-quantized using 7 bits at higher bit-rates (15 kbps narrowband and above) and 5 bits at lower bit-rates (11 kbps narrowband and below).
Many current CELP codecs use moving average (MA) prediction to encode the fixed codebook gain, which provides slightly better coding at the expense of introducing a dependency on previously encoded frames. The second difference is that Speex instead encodes the fixed codebook gain as the product of the global excitation gain g_frame with a sub-frame gain correction g_subframe. This increases robustness to packet loss by eliminating the inter-frame dependency. The sub-frame gain correction is encoded before the fixed codebook is searched (not closed-loop optimized) and uses between 0 and 3 bits per sub-frame, depending on the bit-rate.
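The 3-tap contribution of Eq. (2), including the period repetition for the n-T+1 tap, can be sketched as follows. This is illustrative only; for clarity it handles just that one wrap and assumes the other taps stay in the excitation history:

```c
#include <assert.h>

/* Sketch of the 3-tap adaptive codebook of Eq. (2). exc points just past
   at least T+1 samples of past excitation (negative indices reach into
   history). When the n-T+1 tap would reference a sample that has not yet
   been computed (n-T+1 >= 0), the excitation is repeated at period T by
   using n-2T+1 instead. Assumes T >= len for the other two taps. */
void adaptive_codebook(const float *exc, float *ea, int len,
                       int T, float g0, float g1, float g2)
{
    for (int n = 0; n < len; n++) {
        int i2 = n - T + 1;
        if (i2 >= 0)
            i2 -= T;                      /* repeat at period T */
        ea[n] = g0 * exc[n - T - 1] + g1 * exc[n - T] + g2 * exc[i2];
    }
}
```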
The third difference is that Speex uses sub-vector quantization of the innovation (fixed codebook) signal instead of an algebraic codebook. Each sub-frame is divided into sub-vectors of lengths ranging between 5 and 20 samples. Each sub-vector is chosen from a bitrate-dependent codebook and all sub-vectors are concatenated to form a sub-frame. As an example, the 3.95 kbps mode uses a sub-vector size of 20 samples with 32 entries in the codebook (5 bits). This means that the innovation is encoded with 10 bits per sub-frame, or 2000 bps. On the other hand, the 18.2 kbps mode uses a sub-vector size of 5 samples with 256 entries in the codebook (8 bits), so the innovation uses 64 bits per sub-frame, or 12800 bps.
8.3 Bit allocation
There are 7 different narrowband bit-rates defined for Speex, ranging from 250 bps to 24.6 kbps, although the modes below 5.9 kbps should not be used for speech. The bit allocation for each mode is detailed in table 3. Each frame starts with the mode ID, encoded with 4 bits, which allows a range from 0 to 15, though only the first 7 values are used (the others are reserved). The parameters are listed in the table in the order they are packed in the bit-stream. All frame-based parameters are packed before sub-frame parameters. The parameters for a given sub-frame are all packed before the following sub-frame is packed. Note that “OL” in the parameter description means that the parameter is an open-loop estimation based on the whole frame.
Parameter        Update rate  0  1   2    3    4    5    6    7    8
Wideband bit     frame        1  1   1    1    1    1    1    1    1
Mode ID          frame        4  4   4    4    4    4    4    4    4
LSP              frame        0  18  18   18   18   30   30   30   18
OL pitch         frame        0  7   7    0    0    0    0    0    7
OL pitch gain    frame        0  4   0    0    0    0    0    0    4
OL Exc gain      frame        0  5   5    5    5    5    5    5    5
Fine pitch       sub-frame    0  0   0    7    7    7    7    7    0
Pitch gain       sub-frame    0  0   5    5    5    7    7    7    0
Innovation gain  sub-frame    0  1   0    1    1    3    3    3    0
Innovation VQ    sub-frame    0  0   16   20   35   48   64   96   10
Total            frame        5  43  119  160  220  300  364  492  79
Table 3: Bit allocation for narrowband modes
 
So far, no MOS (Mean Opinion Score) subjective evaluation has been performed for Speex. In order to give an idea of the quality achievable with it, table 4 presents my own subjective opinion. It should be noted that different people will perceive quality differently and that the person who designed the codec often has a bias (one way or another) when it comes to subjective evaluation. Finally, it should be noted that for most codecs (including Speex), the encoding quality sometimes varies depending on the input. Note that the complexity is only approximate (within 0.5 mflops and using the lowest complexity setting). Decoding requires approximately 0.5 mflops in most modes (1 mflops with perceptual enhancement).
 
8.4 Perceptual enhancement
This section was only valid for version 1.1.12 and earlier. It does not apply to version 1.2-beta1 (and later), for which the new perceptual enhancement is not yet documented. This part of the codec only applies to the decoder and can even be changed without affecting inter-operability. For that reason, the implementation provided and described here should only be considered as a reference implementation. The enhancement system is divided into two parts. First, the synthesis filter S(z)= 1/A(z) is replaced by an enhanced filter:
S'(z) = A(z/a2) A(z/a3) / (A(z) A(z/a1))
Mode  Bit-rate (bps)  mflops  Quality/description
0     250             N/A     No transmission (DTX)
1     2,150           6       Vocoder (mostly for comfort noise)
2     5,950           9       Very noticeable artifacts/noise, good intelligibility
3     8,000           10      Artifacts/noise sometimes noticeable
4     11,000          14      Artifacts usually noticeable only with headphones
5     15,000          11      Need good headphones to tell the difference
6     18,200          17.5    Hard to tell the difference even with good headphones
7     24,600          14.5    Completely transparent for voice, good quality music
8     3,950           10.5    Very noticeable artifacts/noise, good intelligibility
9     N/A             N/A     reserved
10    N/A             N/A     reserved
11    N/A             N/A     reserved
12    N/A             N/A     reserved
13    N/A             N/A     Application-defined, interpreted by callback or skipped
14    N/A             N/A     Speex in-band signaling
15    N/A             N/A     Terminator code

Table 4: Quality versus bit-rate
 
where a1 and a2 depend on the mode in use and a3 = (1/r)(1 - (1 - r·a1)/(1 - r·a2)) with r = 0.9. The second part of the enhancement consists of using a comb filter to enhance the pitch in the excitation domain.
 
9 Speex wideband mode (sub-band CELP)
For wideband, the Speex approach uses a quadrature mirror filter (QMF) to split the band in two. The 16 kHz signal is thus divided into two 8 kHz signals, one representing the low band (0-4 kHz), the other the high band (4-8 kHz). The low band is encoded with the narrowband mode described in section 8, in such a way that the resulting “embedded narrowband bit-stream” can also be decoded with the narrowband decoder. Since the low band encoding has already been described, only the high band encoding is described in this section.
 
9.1 Linear Prediction
The linear prediction part used for the high-band is very similar to what is done for narrowband. The only difference is that we use only 12 bits to encode the high-band LSPs, using a multi-stage vector quantizer (MSVQ). The first stage quantizes the 10 coefficients with 6 bits and the error is then quantized using another 6 bits.
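A multi-stage VQ simply quantizes the residual of each stage with the next codebook. The sketch below uses tiny placeholder codebooks; the real 6+6 bit Speex tables each have 64 entries of dimension 10:

```c
#include <assert.h>
#include <float.h>

/* Nearest-neighbour search in a codebook of `entries` vectors of
   dimension `dim`, stored row by row. */
int vq_nearest(const float *x, const float *cb, int entries, int dim)
{
    int best = 0;
    float best_d = FLT_MAX;
    for (int i = 0; i < entries; i++) {
        float d = 0.0f;
        for (int j = 0; j < dim; j++) {
            float err = x[j] - cb[i * dim + j];
            d += err * err;
        }
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

/* Two-stage (multi-stage) VQ: stage 1 picks the nearest entry in cb1,
   stage 2 quantizes the residual with cb2. dim must not exceed 16. */
void msvq_encode(const float *x, const float *cb1, int n1,
                 const float *cb2, int n2, int dim, int *idx)
{
    float res[16];
    idx[0] = vq_nearest(x, cb1, n1, dim);          /* first stage  */
    for (int j = 0; j < dim; j++)
        res[j] = x[j] - cb1[idx[0] * dim + j];     /* residual     */
    idx[1] = vq_nearest(res, cb2, n2, dim);        /* second stage */
}
```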
 
9.2 Pitch Prediction
That part is easy: there’s no pitch prediction for the high-band. There are two reasons for that. First, there is usually little harmonic structure in this band (above 4 kHz). Second, it would be very hard to implement since the QMF folds the 4-8 kHz band into 4-0 kHz (reversing the frequency axis), which means that the location of the harmonics is no longer at multiples of the fundamental (pitch).
 
9.3 Excitation Quantization
The high-band excitation is coded in the same way as for narrowband.
 
9.4 Bit allocation
For the wideband mode, the entire narrowband frame is packed before the high-band is encoded. The narrowband part of the bit-stream is as defined in table 3. The highband follows, as described in table 5. This also means that a wideband frame may be correctly decoded by a narrowband decoder with the only caveat that if more than one frame is packed in the same packet, the decoder will need to skip the high-band parts in order to sync with the bit-stream.
 
Parameter        Update rate  0  1   2    3    4
Wideband bit     frame        1  1   1    1    1
Mode ID          frame        3  3   3    3    3
LSP              frame        0  12  12   12   12
Excitation gain  sub-frame    0  5   4    4    4
Excitation VQ    sub-frame    0  0   20   40   80
Total            frame        4  36  112  192  352
Table 5: Bit allocation for high-band in wideband mode
 
A FAQ
Vorbis is open-source and patent-free; why do we need Speex?
Vorbis is a great project but its goals are not the same as Speex. Vorbis is mostly aimed at compressing music and audio in general, while Speex targets speech only. For that reason Speex can achieve much better results than Vorbis on speech, typically 2-4 times higher compression at equal quality.
 
Isn’t there an open-source implementation of the GSM-FR codec? Why is Speex necessary?
First of all, it’s not clear whether GSM-FR is covered by a Philips patent (see http://kbs.cs.tuberlin.de/~jutta/toast.html). Also, GSM-FR offers mediocre quality at a relatively high bit-rate, while Speex can offer equivalent quality at almost half the bit-rate. Last but not least, Speex offers a wide range of bit-rates and sampling rates, while GSM-FR is limited to 8 kHz speech at 13 kbps.
 
Under what license is Speex released?
As of version 1.0 beta 1, Speex is released under Xiph’s version of the (revised) BSD license (see Appendix D). This license is one of the most permissive open-source licenses.
 
Am I allowed to use Speex in commercial software?
Yes. As long as you comply with the license. This basically means you have to keep the copyright notice and you can’t use our name to promote your product without authorization. For more details, see the license in Appendix D.

Ogg, Speex, Vorbis, what’s the difference?
Ogg is a container format for holding multimedia data. Vorbis is an audio codec that uses Ogg to store its bit-streams as files, hence the name Ogg Vorbis. Speex also uses the Ogg format to store its bit-streams as files, so technically they would be “Ogg Speex” files (I prefer to call them just Speex files). One difference with Vorbis, however, is that Speex is less tied to Ogg. In fact, if you just do Voice over IP (VoIP), you don’t need Ogg at all.
 
What’s the extension for Speex?
Speex files have the .spx extension. Note, however, that the Speex tools (speexenc, speexdec) do not rely on the extension at all, so any extension will work.
 
Can I use Speex for compressing music?
Just as Vorbis is not really adapted to speech, Speex is not really adapted to music. In most cases, you’ll be better off with Vorbis when it comes to music.
 
I converted some MP3s to Speex and the quality is bad. What’s wrong?
This is called transcoding and it will always result in much poorer quality than the original MP3. Unless you have a really good reason (such as size) to do so, never transcode speech. This also holds for self-transcoding (tandeming): if you decode a Speex file and re-encode it at the same bit-rate, you will still lose quality.
 
Does Speex run on Windows?
Compilation on Windows has been supported since version 0.8.0. There are also several front-ends available from the website.
 
Why is encoding so slow compared to decoding?
For most kinds of compression, encoding is inherently slower than decoding. In the case of Speex, encoding consists of finding, for each vector of 5 to 10 samples, the entry that matches the best within a codebook consisting of 16 to 256 entries. On the other hand, at decoding all that needs to be done is look up the right entry in the codebook using the encoded index. Since a lookup is much faster than a search, the decoder works much faster than the encoder.
 
Why is Speex so slow on my iPaq (or insert any platform without an FPU)?
You probably didn’t build Speex with the fixed-point option (--enable-fixed-point). Even if you did, not all modes have been ported to fixed-point arithmetic, so the code may be slowed down by the few float operations left (e.g. in the wideband mode).
 
I’m getting unusual background noise (hiss) when using libspeex in my application. How do I fix that?
One of the causes could be the scaling of the input speech. Speex expects signals to have a ±2^15 (signed short) dynamic range. If the dynamic range of your signals is too small (e.g. ±1.0), you will suffer important quantization noise. A good target is a dynamic range around ±8000, which is large enough, but small enough to make sure there’s no clipping when converting back to signed short.
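For example, if your audio path produces normalized floats in [-1, 1], a simple rescaling before calling speex_encode() avoids the problem. The function name and fixed gain below are illustrative:

```c
#include <assert.h>

/* Rescale normalized float samples (range [-1, 1]) to the ~±8000 range
   suggested above before handing them to speex_encode(). Illustrative
   helper, not part of libspeex. */
void rescale_for_speex(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 8000.0f * in[i];
}
```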
 
I get very distorted speech when using libspeex in my application. What’s wrong?
There are many possible causes. One of them is errors in the way the bits are manipulated. Another possible cause is the use of the same encoder or decoder state for more than one audio stream (channel), which produces strange effects with the filter memories. If the input speech has an amplitude close to ±2^15, it is possible that, at decoding, the amplitude is slightly higher than that, causing clipping when saving as 16-bit PCM.
 
How does Speex compare to other proprietary codecs?
It’s hard to give precise figures since no formal listening tests have been performed yet. All I can say is that in terms of quality, Speex competes on the same ground as other proprietary codecs (not necessarily the best, but not the worst either). Speex also has many features that are not present in most other codecs. These include variable bit-rate (VBR), integration of narrowband and wideband, as well as stereo support. Of course, another area where Speex is really hard to beat is the quality/price ratio. Unlike many very expensive codecs, Speex is free and anyone may distribute or modify it at will.
 
Can Speex pass DTMF?
I guess it all depends on the bit-rate used. Though no formal testing has yet been performed, I’d say it works correctly at 8 kbps and above. Also, make sure you don’t use the lowest complexity setting (see SPEEX_SET_COMPLEXITY or the --comp option), as it causes significant noise.
 
Can Speex pass V.9x modem signals correctly?
If I could do that I’d be very rich by now :-) Seriously, that would break fundamental laws of information theory.
 
What is your (Jean-Marc) relationship with the University of Sherbrooke and how does Speex fit into that?
I completed my Ph.D. at the University of Sherbrooke in 2005, in mobile robotics. Although I did my master’s with the Sherbrooke speech coding group (in speech enhancement, not coding), I was no longer associated with them when developing Speex. It should not be understood that they or the University of Sherbrooke have anything to do with the Speex project. Furthermore, Speex does not make use of any code or proprietary technology developed in the Sherbrooke speech coding group.
 
CELP, ACELP, what’s the difference?
CELP stands for “Code Excited Linear Prediction”, while ACELP stands for “Algebraic Code Excited Linear Prediction”. That means ACELP is a CELP technique that uses an algebraic codebook represented as a sum of unit pulses, thus making the codebook search much more efficient. This technique was invented at the University of Sherbrooke and is now one of the most widely used forms of CELP. Unfortunately, since it is patented, it cannot be used in Speex.
 
 
B Sample code
This section shows sample code for encoding and decoding speech using the Speex API. The commands can be used to encode and decode a file by calling:

% sampleenc in_file.sw | sampledec out_file.sw

where both files are raw (headerless) files with 16 bits per sample (in the machine’s native endianness).
 
B.1 sampleenc.c
sampleenc takes a raw 16 bits/sample file, encodes it and outputs a Speex stream to stdout. Note that the packing used is NOT compatible with that of speexenc/speexdec.
#include <speex/speex.h>
#include <stdio.h>
/*The frame size is hardcoded for this sample code, but it doesn't have to be*/
#define FRAME_SIZE 160
 
int main(int argc, char **argv)
{
     char *inFile;
     FILE *fin;
     short in[FRAME_SIZE];
     float input[FRAME_SIZE];
     char cbits[200];
     int nbBytes;
    
     /*Holds the state of the encoder*/
     void *state;
    
     /*Holds bits so they can be read and written to by the Speex routines*/
     SpeexBits bits;
     int i, tmp;
    
     /*Create a new encoder state in narrowband mode*/
     state = speex_encoder_init(&speex_nb_mode);
    
     /*Set the quality to 8 (15 kbps)*/
     tmp=8;
     speex_encoder_ctl(state, SPEEX_SET_QUALITY, &tmp);
    
     inFile = argv[1];
     fin = fopen(inFile, "rb");
     /*Initialization of the structure that holds the bits*/
     speex_bits_init(&bits);
    
     while (1)
     {
         /*Read a 16 bits/sample audio frame*/
         fread(in, sizeof(short), FRAME_SIZE, fin);
         if (feof(fin))
              break;
        
         /*Copy the 16 bits values to float so Speex can work on them*/
         for (i=0;i<FRAME_SIZE;i++)
              input[i]=in[i];
        
         /*Flush all the bits in the struct so we can encode a new frame*/
         speex_bits_reset(&bits);
        
         /*Encode the frame*/
         speex_encode(state, input, &bits);
        
         /*Copy the bits to an array of char that can be written*/
         nbBytes = speex_bits_write(&bits, cbits, 200);
        
         /*Write the size of the frame first. This is what sampledec expects, but it's likely to be different in your own application*/
         fwrite(&nbBytes, sizeof(int), 1, stdout);
        
         /*Write the compressed data*/
         fwrite(cbits, 1, nbBytes, stdout);
     }
    
     /*Destroy the encoder state*/
     speex_encoder_destroy(state);
    
     /*Destroy the bit-packing struct*/
     speex_bits_destroy(&bits);
    
     fclose(fin);
    
     return 0;
}
 
B.2 sampledec.c
sampledec reads a Speex stream from stdin, decodes it and outputs it to a raw 16 bits/sample file. Note that the packing used is NOT compatible with that of speexenc/speexdec.
#include <speex/speex.h>
#include <stdio.h>
 
/*The frame size is hardcoded for this sample code, but it doesn't have to be*/
#define FRAME_SIZE 160
 
int main(int argc, char **argv)
{
     char *outFile;
     FILE *fout;
    
     /*Holds the audio that will be written to file (16 bits per sample)*/
     short out[FRAME_SIZE];
    
     /*Speex handle samples as float, so we need an array of floats*/
     float output[FRAME_SIZE];
     char cbits[200];
     int nbBytes;
    
     /*Holds the state of the decoder*/
     void *state;
    
     /*Holds bits so they can be read and written to by the Speex routines*/
     SpeexBits bits;
     int i, tmp;
    
     /*Create a new decoder state in narrowband mode*/
     state = speex_decoder_init(&speex_nb_mode);
    
     /*Set the perceptual enhancement on*/
     tmp=1;
    
     speex_decoder_ctl(state, SPEEX_SET_ENH, &tmp);
     outFile = argv[1];
       
     fout = fopen(outFile, "wb");
    
     /*Initialization of the structure that holds the bits*/
     speex_bits_init(&bits);
    
     while (1)
     {
         /*Read the size encoded by sampleenc, this part will likely be different in your application*/
          fread(&nbBytes, sizeof(int), 1, stdin);
          if (feof(stdin))
               break;
          fprintf(stderr, "nbBytes: %d\n", nbBytes);
        
         /*Read the "packet" encoded by sampleenc*/
         fread(cbits, 1, nbBytes, stdin);
        
         /*Copy the data into the bit-stream struct*/
         speex_bits_read_from(&bits, cbits, nbBytes);
        
         /*Decode the data*/
         speex_decode(state, &bits, output);
        
         /*Copy from float to short (16 bits) for output*/
         for (i=0;i<FRAME_SIZE;i++)
              out[i]=output[i];
 
         /*Write the decoded audio to file*/
         fwrite(out, sizeof(short), FRAME_SIZE, fout);
     }
    
     /*Destroy the decoder state*/
     speex_decoder_destroy(state);
    
     /*Destroy the bit-stream struct*/
     speex_bits_destroy(&bits);
    
     fclose(fout);
    
     return 0;
}
 
 