






Information in a speech signal is traditionally represented as a sequence of feature vectors, each derived from a segment of about 20-30 ms, with the analysis window shifted every 5-10 ms. Typically, the feature vector represents the short-time spectral envelope, which corresponds to the shape of the vocal tract system. In most cases the feature vector consists of mel frequency cepstral coefficients (MFCCs), weighted linear prediction cepstral coefficients (wLPCCs), or some variant of these parameters. Many speech systems, such as speech and speaker recognition systems, are developed using these representations of speech as a sequence of feature vectors belonging to a common feature space. However, knowledge of the speech production mechanism, especially the excitation source information, is not adequately utilized or captured in these representations. It is well known that speech is produced and perceived as a sequence of acoustic events. Information about these events is also derived from the same representation, namely a sequence of feature vectors in a common feature space. But many of the features that make up an acoustic event depend on the distinct production characteristics of that event. These characteristics may require emphasis on different aspects of production, such as the nature of the excitation source and the coupling/decoupling of different parts of the vocal tract system. Representing these events may require different sets of features, and hence it may not be possible to represent them in a common feature space. In other words, the speech signal may carry a significant amount of information that cannot be represented by the spectral envelope and gross excitation information such as voiced/nonvoiced alone. The performance of speech systems may be limited because this additional information is missing from the feature vector representation.
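The conventional framing described above can be sketched as follows. This is a minimal illustration, not the front end used in the thesis: the function names, the 25 ms / 10 ms defaults (chosen from within the stated 20-30 ms and 5-10 ms ranges), the Hamming window and the FFT size are all assumptions. The log magnitude spectrum stands in for the short-time spectral information from which MFCCs would be computed after a mel filterbank and a DCT, which are not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames, one frame per row.

    frame_ms and shift_ms are illustrative defaults (assumptions).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i of the index matrix selects samples [i*shift, i*shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def log_magnitude_spectra(frames, n_fft=512):
    """Log magnitude spectrum of each Hamming-windowed frame."""
    windowed = frames * np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1))
    return np.log(spectrum + 1e-10)  # small offset avoids log(0)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 200 * t)  # 1 s synthetic tone, not real speech
    frames = frame_signal(x, fs)
    print(frames.shape, log_magnitude_spectra(frames).shape)
```

Every frame is processed in the same way regardless of where it falls relative to the excitation, which is the property the abstract goes on to question.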
The objective of this study is to identify some significant acoustic events and the features needed to represent them. Extracting those features requires new signal processing tools in addition to spectral analysis tools. In particular, the information in the source of excitation of the vocal tract system is important in describing many useful events. Some events may need a temporal or spectral resolution that cannot be achieved with standard discrete Fourier transform (DFT) based spectrum analysis. It is also interesting to note that the spectral features sometimes need to be determined only around the instants of significant excitation, rather than over an arbitrarily chosen interval around an arbitrarily chosen instant. In this study a few steady acoustic events are chosen to examine the need for different ways of representing the information in different events. Events occurring in short bursts, such as stops, are not considered. Signal processing methods such as zero-frequency filtering, zero-time liftering and group delay processing are proposed to extract information that cannot be extracted by conventional short-time spectral analysis tools. The acoustic events considered for detailed analysis are: voiced/nonvoiced, voice bars, trills, nasals and fricatives. To describe these events adequately, both excitation source and vocal tract system features are needed. Application of these features for spotting acoustic events in continuous speech is demonstrated. The significance of the additional information derived from the analysis of acoustic events is demonstrated in the context of developing a phone recognizer. The information derived from the excitation-based analysis of acoustic events is considered for refining the baseline phone recognizer at various levels, namely the feature, constraint and decision levels.
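Of the methods named above, zero-frequency filtering is the simplest to sketch. The abstract does not spell out the procedure, so the code below follows the form in which the method is commonly described, as an assumption: difference the signal, pass it through two cascaded resonators with poles at zero frequency, remove the resulting trend by repeated local mean subtraction, and take zero crossings of the result as candidate instants of significant excitation. The function names, the `mean_pitch_ms` parameter, the 1.5 pitch-period window and the three trend-removal passes are illustrative choices, not values taken from the thesis.

```python
import numpy as np

def zero_frequency_filter(speech, sample_rate, mean_pitch_ms=10.0, passes=3):
    """Zero-frequency filtered signal (sketch under stated assumptions)."""
    s = np.asarray(speech, dtype=np.float64)
    # Difference the signal to remove any slowly varying offset.
    x = np.diff(s, prepend=s[0])
    # Each resonator y[n] = x[n] + 2*y[n-1] - y[n-2] has a double pole at
    # z = 1, i.e. two running sums; two resonators give four running sums.
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # The output grows polynomially; subtract a local mean taken over
    # roughly 1.5 average pitch periods, and repeat to remove the trend.
    half = int(round(0.75 * mean_pitch_ms * sample_rate / 1000.0))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y

def positive_zero_crossings(y):
    """Sample indices where y goes from negative to non-negative."""
    return np.flatnonzero((y[:-1] < 0) & (y[1:] >= 0)) + 1

if __name__ == "__main__":
    fs = 8000
    s = np.zeros(4000)
    s[40::80] = 1.0  # synthetic impulse train at 100 Hz, not real speech
    zc = positive_zero_crossings(zero_frequency_filter(s, fs))
    print(np.diff(zc[(zc > 1000) & (zc < 3000)]))  # spacing between crossings
```

Whether the positive-going or the negative-going crossings line up with the excitation instants depends on the polarity of the recorded signal, and the samples near both ends are unreliable because the local mean is computed against zero padding there. Running sums of this kind also lose numerical precision on long recordings, so practical use would process the signal in segments.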
It is shown that appending this additional event knowledge to the existing feature vector can improve the performance of the phone recognizer. This is illustrated first by appending known event labels to the feature vector. The event information extracted by the proposed signal processing methods is then used to study the improvement in performance. The limited acoustic-phonetic segmentation achieved in this thesis by detecting acoustic events in speech is shown to improve the performance of the baseline system in detecting the manner of articulation of a phone.
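Appending event labels to a frame-level feature vector can be sketched as below. The one-hot encoding, the event ordering in `EVENTS`, the use of -1 for frames with no labeled event, and the function name are assumptions made for illustration; the abstract does not state how the labels were encoded.

```python
import numpy as np

# Hypothetical ordering of the events analyzed in the thesis.
EVENTS = ("voiced", "voice_bar", "trill", "nasal", "fricative")

def append_event_labels(features, event_ids, n_events=len(EVENTS)):
    """Append a one-hot event indicator to each frame's feature vector.

    features  : (n_frames, n_dims) array, e.g. cepstral coefficients
    event_ids : (n_frames,) integer array, index into EVENTS or -1 for none
    """
    features = np.asarray(features, dtype=float)
    event_ids = np.asarray(event_ids, dtype=int)
    if features.shape[0] != event_ids.shape[0]:
        raise ValueError("features and event_ids must have one row per frame")
    one_hot = np.zeros((features.shape[0], n_events))
    labeled = event_ids >= 0  # frames marked -1 keep an all-zero indicator
    one_hot[np.flatnonzero(labeled), event_ids[labeled]] = 1.0
    return np.hstack([features, one_hot])

if __name__ == "__main__":
    feats = np.zeros((4, 13))           # placeholder 13-dimensional features
    labels = np.array([0, -1, 3, 4])    # voiced, none, nasal, fricative
    print(append_event_labels(feats, labels).shape)
```

With known labels this gives an upper bound on what the extra dimensions can contribute; replacing `event_ids` with detector output corresponds to the second experiment described above.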

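1 Introduction
2 Review of speech analysis
3 Signal processing methods for excitation-based analysis of speech
4 Voiced/nonvoiced segmentation of speech
5 Analysis of voice bars
6 Analysis of trills
8 Analysis of fricatives
10 Summary and conclusions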








