Understand the fundamentals of speech algorithms in an embedded system
Nitin Jain, MindTree Consulting - February 06, 2006
Source: https://www.embedded.com/print/4015932

An enormous number of algorithms are in use today in various electronic systems. Integrating a DSP algorithm into a system and evaluating it there is tricky enough to bring programmers to their knees. To try to simplify this complex technology, let's first start with a relatively simple example.
The audio frequency spectrum, which stretches to 20 kHz, is divided into two bands. The speech components occupy the lower part of the spectrum, from 5 Hz to 7 kHz, while the audio components occupy the remaining high portion (Fig. 1).

Speech processing mainly involves compression/decompression, recognition, conditioning, and enhancement algorithms. Signal-processing algorithms depend on system resources such as available memory and clock cycles. Because these resources relate directly to system cost, they're often in short supply.

Measuring an algorithm's complexity is the first step in analyzing it. This includes counting the clock cycles required and determining the algorithm's processing load, which can vary with the processor employed. The memory requirements, however, don't change with the processor.

Most DSP algorithms work on collections of samples, better known as frames (Fig. 1). This introduces an inevitable delay due to frame collection that’s in addition to the processing delay. Note that the International Telecommunication Union (ITU) standardizes the acceptable delay for an algorithm.

[Fig. 1]

1. Looking at the audio spectrum, basic telephone-quality speech occurs up to 4 kHz. High-quality speech reaches 7 kHz, followed by CD-quality audio.

An algorithm's processing load is typically represented in millions of clocks per second (MCPS), the number of clocks/s from the core that the algorithm needs. Assume an algorithm that processes a frame of 64 samples at 8 kHz and requires 300,000 clocks to process each frame. The time required to collect the frame would be 64/8000, or 8 ms; in other words, 125 frames could be collected in 1 second. To process all the frames, the algorithm would consume 300,000 × 125 = 37,500,000 clocks/s, represented as 37.5 MCPS. Simplifying, the MCPS equation is:

MCPS = (clocks required to process one frame × sampling frequency / frame size) / 1,000,000
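As a quick sanity check, here's a minimal C sketch of this calculation using the article's example numbers; the function name and parameters are illustrative only:

```c
#include <stdio.h>

/* Processing load in MCPS: clocks per frame times frames per second. */
static double mcps(double clocks_per_frame, double sample_rate_hz,
                   double frame_size_samples)
{
    double frames_per_second = sample_rate_hz / frame_size_samples;
    return clocks_per_frame * frames_per_second / 1e6;
}

int main(void)
{
    /* The article's example: 64-sample frames at 8 kHz, 300,000 clocks each. */
    printf("Load: %.1f MCPS\n", mcps(300000.0, 8000.0, 64.0)); /* prints 37.5 */
    return 0;
}
```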

Note that there's another common term used for measuring an algorithm's processing load: MIPS (millions of instructions per second). Calculating MIPS for an algorithm can be tricky. If the processor effectively executes one instruction per cycle, the MCPS and MIPS ratings for that processor are the same; Analog Devices' Blackfin is one such processor. Otherwise, if the processor takes more than one cycle to execute an instruction, a fixed ratio exists between the MCPS and MIPS ratings. For example, the ARM7TDMI effectively requires 1.9 cycles per instruction, so the 37.5-MCPS load above corresponds to roughly 37.5/1.9 ≈ 19.7 MIPS on that core.

The memory considerations for any algorithm are typically separated into code (read-only) and data (read-write) memory. The required amounts can be found by compiling the source code. Note that algorithms perform best when using the fastest memory, which is usually memory internal to the core.

Before integration
The time to start integrating and evaluating any speech algorithm on an embedded system is when the system is in a predictable, stable state. "Stable" means that the audio front end's interrupt structure is consistent; in other words, no bytes are lost and a decent amplitude level is maintained. It's also best to have all the statistics on the system's memory and clock available.

Integrating an algorithm into an existing system is somewhat easier. If the system is in the development phase, it's recommended to test the audio front end thoroughly before integrating or evaluating any algorithms. Within the system, you must verify that no interrupts conflict with each other. If such an issue were to exist, debugging it can be a painful experience.

In a system that will incorporate audio/speech algorithms, robust audio firmware is a must. It must give the algorithms the maximum time and accurate data so they can perform efficiently. One common mistake is to interrupt the core upon each sample's arrival. If the algorithm operates only on frames of a fixed number of samples, per-sample interrupts are redundant. DMA channels and internal FIFOs can collect samples and interrupt the core only after collecting a full frame, as in the sketch below.
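As a sketch of that frame-based approach, assume a hypothetical DMA driver that raises one interrupt per completed 64-sample half-buffer; the names dma_frame_isr, audio_main_loop, and process_speech_frame are illustrative, not a real vendor API:

```c
#include <stdint.h>

#define FRAME_SIZE 64

/* Double buffer: the DMA engine fills one half while the core
 * processes the other. */
static int16_t frames[2][FRAME_SIZE];
static volatile int ready_half = -1;

/* Placeholder for the real per-frame work (codec, EC-NR, and so on). */
static void process_speech_frame(const int16_t *frame, int n)
{
    (void)frame; (void)n;
}

/* Hypothetical ISR, configured to fire once per completed frame
 * rather than once per sample. */
void dma_frame_isr(int completed_half)
{
    ready_half = completed_half;   /* hand the full frame to the main loop */
}

/* The main loop polls for a completed frame and processes it whole. */
void audio_main_loop(void)
{
    for (;;) {
        if (ready_half >= 0) {
            const int16_t *frame = frames[ready_half];
            ready_half = -1;
            process_speech_frame(frame, FRAME_SIZE);
        }
        /* otherwise idle or run lower-priority work */
    }
}
```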

Example algorithms
When developing any telecommunication system, engineers often begin by testing the voice quality with the typical pulse-code modulation (PCM) codec, known as G.711. This narrowband codec restricts the sample amplitude to 8-bit precision and produces a 64-kbit/s stream. The encoder and decoder work on each data sample using a lightweight algorithm with little complexity and almost no processing delay. This gives designers the option to play with it and verify the system.
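To give a feel for how light the per-sample work is, below is a minimal sketch of the widely published μ-law encoder (G.711 also defines an A-law variant); production code should follow the ITU-T G.711 specification and its test vectors rather than this sketch:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal mu-law encoder sketch: one 16-bit sample -> one 8-bit code. */
static uint8_t linear_to_ulaw(int16_t pcm)
{
    const int BIAS = 0x84;     /* standard mu-law bias */
    const int CLIP = 32635;    /* clip before biasing to avoid overflow */
    int sign = (pcm < 0) ? 0x80 : 0x00;
    int mag  = (pcm < 0) ? -(int)pcm : pcm;

    if (mag > CLIP) mag = CLIP;
    mag += BIAS;

    /* Segment number: position of the highest set bit, from bit 14
     * down to bit 7 (the bias guarantees bit 7 or above is set). */
    int exp = 7;
    for (int mask = 0x4000; (mag & mask) == 0 && exp > 0; mask >>= 1)
        exp--;

    int mantissa = (mag >> (exp + 3)) & 0x0F;
    return (uint8_t)~(sign | (exp << 4) | mantissa);  /* G.711 inverts bits */
}

int main(void)
{
    printf("0x%02X\n", linear_to_ulaw(1000));  /* encode one sample */
    return 0;
}
```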

Checking the signal levels, adjusting the hardware codec gains, synchronizing the near- and far-end interrupts, verifying the DMA function, or any other experiment can be accomplished using this basic telephony standard. During this process, don't be surprised to find that the received compressed data arrives in bit-reversed form; a simple bit-reversing routine (a fix-up sketch appears below) will bring it back to the expected state.

Any wideband speech codec could serve as an example of a speech algorithm that's heavy in terms of memory and clock consumption. One example is sub-band ADPCM (adaptive differential PCM), or G.722, which operates on data sampled at 16 kHz and thus covers the entire speech spectrum. It retains the unvoiced frequency components, those between 4 and 7 kHz, that provide high-quality natural speech.
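For the bit-order issue mentioned above, here is a minimal fix-up sketch, assuming the reversal is within each received byte:

```c
#include <stdint.h>
#include <stddef.h>

/* Reverse the bit order within one byte (b7..b0 -> b0..b7). */
static uint8_t reverse_bits(uint8_t b)
{
    b = (uint8_t)((b >> 4) | (b << 4));                   /* swap nibbles */
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2)); /* swap pairs   */
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1)); /* swap bits    */
    return b;
}

/* Fix up a whole buffer of received codes in place. */
void fix_bit_order(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = reverse_bits(buf[i]);
}
```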

Before any codec is integrated into a system, I recommend that the designer do careful testing. While G.711 encoding and decoding can be tested on a sample-by-sample basis, codecs that involve filters and other frequency-domain algorithms are tested differently, using a stream of at least a few thousand samples. Codec verification engages the engineer in unit testing with ITU vectors, signal-level testing, and interoperability testing with other available codecs. Interoperability issues, such as arranging the encoded data in 16-bit words before transmitting and mismatched signal levels, aren't new to system-integration engineers.

Unlike those discussed so far, algorithms that require lots of memory and clock cycles have a big impact on the system. The more compute-intensive algorithms include echo cancellers, noise suppressors, and Viterbi algorithms. Evaluating their performance is not as easy as it is for the speech codecs.

Generally, any telecom system that involves a hands-free or speaker mode employs an acoustic echo canceller, which prevents the far-end party from hearing his own voice as an echo. If the system operates in a noisy environment, a noise-control algorithm may also be needed. The echo canceller-noise reducer (EC-NR) demands lots of memory and clocks from the system. Both time- and frequency-domain techniques can solve the acoustic echo problem, with frequency-domain techniques proven more efficient, at less computational cost (Table 1).

[Table 1]

A frequency-domain technique uses an adaptive FIR filter that updates its coefficients only when it finds that the residual echo error is larger than a threshold. Subtracting the estimated echo from the near-end input signal gives that error. The far-end signal serves as the reference from which these algorithms estimate the echo; providing a proper reference is essential for good echo estimation and cancellation.
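As an illustration of the adaptive-filter idea, here is a minimal time-domain NLMS sketch; the frequency-domain implementations the article favors are more involved, so treat this only as a model of the update-on-residual-error behavior. The step size, threshold, and tap count are illustrative:

```c
#include <stddef.h>

#define TAPS 512   /* e.g., a 64-ms tail at 8 kHz; see the Table 2 discussion */

static float w[TAPS];   /* adaptive FIR coefficients (echo-path model) */
static float x[TAPS];   /* recent far-end (reference) samples          */

/* One sample of echo cancellation: returns error = near-end - estimate.
 * A production version would use a circular buffer instead of shifting. */
float ec_process(float far_end, float near_end)
{
    /* Shift the far-end reference history. */
    for (size_t i = TAPS - 1; i > 0; i--)
        x[i] = x[i - 1];
    x[0] = far_end;

    /* Estimate the echo with the FIR model and subtract it. */
    float est = 0.0f, power = 1e-6f;
    for (size_t i = 0; i < TAPS; i++) {
        est += w[i] * x[i];
        power += x[i] * x[i];
    }
    float err = near_end - est;

    /* NLMS update, applied only while the residual echo is significant. */
    const float mu = 0.1f, threshold = 1e-4f;
    if (err * err > threshold) {
        float g = mu * err / power;
        for (size_t i = 0; i < TAPS; i++)
            w[i] += g * x[i];
    }
    return err;
}
```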

Another factor, echo tail length, is the echo reverberation time, measured in milliseconds. Simply put, it's the time spent in echo formation. The filter length is found by multiplying the echo tail length by the sampling frequency; for example, a 64-ms tail at 8 kHz calls for 64 ms × 8 samples/ms = 512 taps (Table 2).

[Table 2]

One of the basic requisites for an echo-canceller (EC) implementation is to support data sampled at up to 16 kHz, ensuring that wideband speech is covered. Integrating an EC with wideband speech codecs requires some attention. Because the time span covered by a fixed-length filter depends on the sampling frequency, an EC dimensioned to cancel echo up to 72 ms on data sampled at 8 kHz will effectively cancel only half that span when applied to 16-kHz data. And compared with 8 kHz, collecting a frame takes only half the time. Hence, engineers find integrating a half-effective EC with wideband codecs doubly challenging. Designers often raise the core frequency to manage EC efficiently on a system with a 16-kHz sampling rate.

Noise-reduction techniques have been used for many years. Depending on the application, an approach is chosen, implemented, and applied. For example, a technique could exploit the fact that noise is more stationary than human speech: the algorithm models the noise and then subtracts it from the input signal. A reduction of 10 to 30 dB is significant for some applications. A common application that combines EC with noise reduction is a handset placed in speaker mode in a noisy environment, or hands-free mode enabled in a car (Fig. 2).
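As a sketch of that subtractive idea, assume the caller has already transformed each frame into magnitude spectra (the FFT itself is omitted); the smoothing factor, oversubtraction factor, and spectral floor are illustrative:

```c
#include <stddef.h>

/* Update a running noise model during detected silence
 * (exponential smoothing with factor alpha, e.g. 0.95). */
void update_noise_model(float *noise_mag, const float *frame_mag,
                        size_t bins, float alpha)
{
    for (size_t k = 0; k < bins; k++)
        noise_mag[k] = alpha * noise_mag[k] + (1.0f - alpha) * frame_mag[k];
}

/* Spectral subtraction: remove the modeled noise from each bin,
 * keeping a spectral floor to limit artifacts. */
void subtract_noise(float *frame_mag, const float *noise_mag,
                    size_t bins, float oversub, float floor_gain)
{
    for (size_t k = 0; k < bins; k++) {
        float cleaned = frame_mag[k] - oversub * noise_mag[k];
        float floor_v = floor_gain * frame_mag[k];
        frame_mag[k] = (cleaned > floor_v) ? cleaned : floor_v;
    }
}
```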

[Fig. 2]
2. A basic system employs some of the speech compression/decompression and speech enhancement algorithms.

The EC tail-length requirement for the hands-free application is about 50 ms, and the required NR level can vary from 12 to 25 dB, depending on the noise attributes and the expected voice quality. Generally, the higher the noise reduction, the more the speech quality is put at risk. Hence, the level can be selected dynamically to give a reasonable reduction while still maintaining proper voice quality.

The EC noise reduction can require up to 15 or 20 kbytes of system memory. The processing of each 64-sample frame can consume from 1.5 to 3.0 Mclocks, depending on the processor. Evaluating the performance of this combination can be tricky. The steps include tuning the hardware codec gains; finding the correct microphone and speaker placement; synchronizing the far- and near-end speech and interrupts; finding audio hardware with linear attributes; and testing various EC tail lengths and noise-reduction levels to achieve the best echo cancellation and noise reduction.

It's important to consider worst cases when evaluating the complexity of any algorithm. An algorithm's execution time can vary from frame to frame. This data dependency arises because a processor might take more time to multiply two samples of higher amplitude than two samples of lesser amplitude.

Adaptive algorithms can cheat you if you observe the cycles consumed over only a few frames in which the filter coefficients happen not to be updated. Adapting the filter coefficients can take several thousand additional cycles, which must be accounted for. A word of caution: don't rely on a single measurement. Experimenting with a variety of vectors will improve the accuracy of MCPS and performance measurements, as in the sketch below.
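A measurement sketch along those lines, assuming a hypothetical read_cycle_counter() mapped to whatever cycle register the target provides (CYCLES on Blackfin, DWT_CYCCNT on ARM Cortex-M, and so on):

```c
#include <stdint.h>

/* Placeholder: map this to the target's cycle-count register. */
extern uint32_t read_cycle_counter(void);

/* Placeholder for the algorithm under test. */
extern void run_algorithm_frame(const int16_t *frame, int frame_size);

/* Track the worst-case (not just average) clocks per frame across a
 * long test vector; frames that trigger adaptation will dominate. */
uint32_t worst_case_clocks(const int16_t *samples, int n_frames,
                           int frame_size)
{
    uint32_t worst = 0;
    for (int f = 0; f < n_frames; f++) {
        uint32_t t0 = read_cycle_counter();
        run_algorithm_frame(&samples[f * frame_size], frame_size);
        uint32_t used = read_cycle_counter() - t0;
        if (used > worst)
            worst = used;
    }
    return worst;
}
```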

Collecting the bits and pieces
The algorithms discussed are good enough to build a basic telephony system. When the system includes more than one algorithm, the sequence in which they're called is key. A few speech algorithms (such as noise suppressors) introduce non-linearity into their output, which can hamper the performance of other algorithms. That kind of algorithm must be placed as the last module in the speech-enhancement chain.
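A sketch of one reasonable per-frame ordering under that rule, with the non-linear noise suppressor placed last in the enhancement chain; the function names are illustrative:

```c
#include <stdint.h>

/* Placeholders for the modules discussed in the article. */
extern void echo_cancel(int16_t *near_end, const int16_t *far_end_ref, int n);
extern void noise_suppress(int16_t *frame, int n);
extern void speech_encode(const int16_t *frame, uint8_t *packet, int n);

/* Per-frame send path: the linear echo canceller runs first, the
 * non-linear noise suppressor runs last in the enhancement chain, and
 * only then is the cleaned frame handed to the encoder. */
void send_path(int16_t *near_end, const int16_t *far_end_ref,
               uint8_t *packet, int frame_size)
{
    echo_cancel(near_end, far_end_ref, frame_size);
    noise_suppress(near_end, frame_size);
    speech_encode(near_end, packet, frame_size);
}
```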

About the author
Nitin Jain currently works in the research and development group of MindTree Consulting, in Bangalore, India. He holds an engineering degree in electronics and communications. Jain can be reached at nitin_jain@mindtree.com.
