The parts of this post that discuss the paper are not word-for-word translations; some of it is my own restatement based on my understanding, so there may be inaccuracies. Corrections are welcome.
Continuing from the previous post: Section 2.1, Feature Extraction.
The feature extraction stage aims to provide a compact representation of the speech waveform. This form should minimise the loss of information that discriminates between words, and provide a good match with the distributional assumptions made by the acoustic models. For example, if diagonal-covariance Gaussian distributions are used as the state output distributions, the features should be designed to be Gaussian and uncorrelated.
Feature vectors are typically computed every 10 ms using an overlapping analysis window of around 25 ms. (Speech changes rapidly, and the Fourier transform is suited to analysing stationary signals, so the signal is split into frames and a Fourier transform is applied to each frame. Frame lengths of 20–50 ms are common: long enough for a frame to contain several pitch periods, short enough that the signal does not change drastically within it. Each frame is usually multiplied by a smooth window function, which yields a higher-quality spectrum. The frame shift is typically 10 ms, so consecutive frames overlap; without overlap, the signal near frame boundaries, attenuated by the window, would effectively be lost.)
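The framing and windowing described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `frame_signal`, the Hamming window, and the default parameters are my own choices (the paper only specifies the ~25 ms window and 10 ms shift).

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_ms / 1000)   # samples per frame
    shift = int(sr * shift_ms / 1000)       # samples between frame starts
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    window = np.hamming(frame_len)          # smooth taper for a cleaner spectrum
    return np.stack([x[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])

sr = 16000
x = np.random.randn(sr)        # 1 second of noise as a stand-in for speech
frames = frame_signal(x, sr)   # 25 ms at 16 kHz -> 400 samples; 10 ms shift -> 98 frames
```

Each row of `frames` is then ready for an FFT; the 15 ms overlap between neighbouring frames is what preserves the window-attenuated boundary samples.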
The simplest and most commonly used encoding is Mel-frequency cepstral coefficients (MFCCs). These are generated by applying a truncated discrete cosine transform (DCT) to a log spectral estimate computed by smoothing an FFT with around 20 frequency bins distributed non-linearly across the speech spectrum. The non-linear frequency scale is known as the mel scale, and it matches the frequency sensitivity of human hearing. The DCT both smooths the spectral estimate and decorrelates the feature elements. After the cosine transform, the first element represents the average log energy of the frequency bands; it is sometimes replaced by the frame's log energy, or dropped entirely.
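The MFCC pipeline (power spectrum → mel-spaced filterbank → log → truncated DCT) can be sketched as below. This is my own illustrative implementation under common conventions, not the paper's: the mel formula, the triangular filter construction, and defaults like 13 retained coefficients are standard choices but are assumptions here.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centres are spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):               # rising slope of triangle
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):               # falling slope of triangle
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(frame, sr, n_filters=20, n_ceps=13):
    """One windowed frame -> power spectrum -> mel log energies -> truncated DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    log_energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]
```

Truncating the DCT to the first 13 of 20 coefficients is what does the smoothing: the discarded high-order terms carry the fine (largely speaker- and pitch-dependent) spectral detail.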
Further psychoacoustic constraints are incorporated into a related encoding called perceptual linear prediction (PLP). PLP computes linear prediction coefficients from a perceptually weighted, non-linearly compressed power spectrum, and the prediction coefficients are then transformed into cepstral coefficients.
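The two generic steps of that pipeline, fitting LPC coefficients to a power spectrum and converting them to cepstra, can be sketched as follows. Note this is only a partial, illustrative sketch: full PLP also includes critical-band integration and equal-loudness pre-emphasis, which are omitted here, and the function names and model order are my own.

```python
import numpy as np

def lpc_from_power(power_spectrum, order):
    """LPC coefficients from a power spectrum: inverse FFT gives the
    autocorrelation sequence, then Levinson-Durbin solves for the predictor."""
    r = np.fft.irfft(power_spectrum)[:order + 1]  # autocorrelation lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                            # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Cepstra of the all-pole model via the standard recursion
    c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    order = len(a) - 1
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        s = -a[n] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                s -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = s
    return c

# PLP-style use: compress the (perceptually weighted) power spectrum with a
# cube root (the intensity-loudness power law), fit LPC, then take cepstra:
#   a = lpc_from_power(weighted_power ** (1.0 / 3.0), order=12)
#   ceps = lpc_to_cepstrum(a, n_ceps=12)
```

The cube-root compression approximates the non-linear relation between signal intensity and perceived loudness, which is one of the psychoacoustic constraints the paper refers to.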