A Brief Overview of HMM Principles, with Usage Notes

1. Brief overview of the principles

      To build a big-picture understanding of how GMM-HMMs are applied to speech recognition, I spent some time reading parts of the HTK source code (see 用htk完成简单的孤立词识别, simple isolated-word recognition with HTK) and finally formed a rough idea of the algorithm, which is all I was after. It must be said that accessible tutorials on speech recognition are rare online; most are a blizzard of formulas and rarely cover the concrete details, which, admittedly, takes hands-on experience to explain. The points below are meant to give a macro-level impression, with isolated-word recognition as the running example.

  (1) Each word's pronunciation is modeled by its own HMM. Everyone knows an HMM has a state set S, but what does each state actually look like? A number? A vector? A matrix? In fact the states carry no mathematical content of their own; they are just labels. You can name them '1', '2', '3', ... or 'a', 'b', 'c', ... Also, the number of states in each HMM is set by hand from prior knowledge.

  (2) Each HMM state emits an observation, which may be a scalar or a vector; all states must emit observations of the same dimensionality. Suppose we have an audio file of one word. It is first sampled into digital form (A/D conversion), then split into frames, and MFCC features are extracted per frame. If each frame yields a 39-dimensional MFCC vector, the audio file becomes a sequence of N MFCC vectors (N varies across files). When training the HMM parameters (e.g., with the Baum-Welch algorithm), each training input is exactly such an observation sequence. Here the observation attached to a state is a 39-dimensional vector whose components take continuous values, so it must be modeled by a multivariate density, usually a multivariate Gaussian. In the GMM-HMM framework this density is a mixture of K multivariate Gaussians. Once all parameters of a state's K Gaussians are known, the probability that the GMM generates a given observation vector (the MFCC coefficients of one frame) can be evaluated.
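  To make this concrete, here is a minimal MATLAB sketch of evaluating P(x | state) for one frame under a state's K-component GMM. All names and values (D, K, w, mu, Sigma, x) are made up for illustration; in a real system they come from training.

% Hypothetical GMM parameters for one HMM state: K components in D = 39 dims
D = 39; K = 3;
w = ones(1, K) / K;              % mixture weights (sum to 1)
mu = randn(D, K);                % one mean vector per component
Sigma = repmat(eye(D), [1 1 K]); % one covariance matrix per component
x = randn(D, 1);                 % one MFCC frame (the observation)

p = 0;
for k = 1:K
    d = x - mu(:, k);
    % multivariate Gaussian density of x under component k
    pk = exp(-0.5 * d' * (Sigma(:,:,k) \ d)) / sqrt((2*pi)^D * det(Sigma(:,:,k)));
    p = p + w(k) * pk;           % weighted sum of the K Gaussian densities
end
% p is the probability (density) of this frame under the state's GMM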

  (3) Building the HMM for a word requires training samples of that word, labeled in advance: each sample is an audio clip containing only that word's pronunciation. Given multiple such samples, the Baum-Welch algorithm (an instance of EM) trains all the GMM-HMM parameters: the initial state probability vector, the state transition matrix, and each state's observation model (here a GMM, i.e., the weights of the K Gaussians per state plus each Gaussian's mean vector and covariance matrix).
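  As a hedged sketch only, training one word's GMM-HMM with the HMM toolbox demonstrated in section 2 below might look like the following; mfcc_data is a placeholder name for that word's O x T x nex array of MFCC sequences, and the state and mixture counts are chosen by hand.

Q = 5; M = 3; O = 39;                  % states, Gaussians per state, feature dim (set a priori)
prior0 = normalise(rand(Q,1));         % initial state probabilities
transmat0 = mk_stochastic(rand(Q,Q));  % state transition matrix
[mu0, Sigma0] = mixgauss_init(Q*M, reshape(mfcc_data, O, []), 'diag');
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));    % per-state GMM weights
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(mfcc_data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 10);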

  (4) At recognition time, given an input audio clip that contains several words, the simplest approach is to segment it into words by hand first. Then extract each word's MFCC feature sequence, feed that sequence to every (pre-trained) HMM, use the forward algorithm to compute the probability that each HMM generates the sequence, and pick the model with the highest probability; the word that model represents is the recognition result.
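  A sketch of that decision rule with the same toolbox, assuming models is a hypothetical struct array holding each word's trained parameters and testSeq is the O x T MFCC sequence of one segmented word:

numWords = length(models);       % one trained GMM-HMM per vocabulary word
loglik = zeros(1, numWords);
for w = 1:numWords
    loglik(w) = mhmm_logprob(testSeq, models(w).prior, models(w).transmat, ...
        models(w).mu, models(w).Sigma, models(w).mixmat);  % forward-algorithm score
end
[bestLL, best] = max(loglik);    % the word of models(best) is the recognition result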

  (5) When building the acoustic model, Deep Learning methods can replace the GMM in GMM-HMM. A GMM's ability to approximate an arbitrary density is limited by the number of mixture components, so it is a shallow model; a deep network can approximate arbitrary functions and is therefore more expressive. Note that the deep model replacing the GMM here must be generative, e.g., a DBN or DBM, because training the HMM-DL network requires the probability that a given HMM state generates a sample.

  (6) Implementing GMM-HMM in full is still fairly involved.

  (7) HMMs are generally used only when time series are involved, e.g., speech recognition in audio or action recognition in video. If we applied GMM-HMM to static image classification, there would be no temporal information, so the number of HMM states could be set to 1, and the GMM-HMM degenerates into a plain GMM, as sketched below.
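  For instance, a sketch under the assumption that mhmm_logprob (from the toolbox used in section 2) accepts the degenerate 1x1 prior and transition matrix, with mu, Sigma, mixmat being a trained GMM's parameters:

prior = 1; transmat = 1;   % a single state: the HMM structure is trivial
loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);
% with one state this is just the summed GMM log-likelihood of all frames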


2. Usage notes

(1) Hidden Markov model with discrete outputs (DHMM)

Maximum likelihood parameter estimation with EM (Baum-Welch algorithm)

The script dhmm_em_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.

O = 3;
Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
obsmat0 = mk_stochastic(rand(Q,O));

Now we sample nex=20 sequences of length T=10 each from this model, to use as training data.

T = 10;    % sequence length
nex = 20;  % number of training sequences
data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);

Here data is 20x10. Now we make a random guess as to what the parameters are,

prior1 = normalise(rand(Q,1));        % initial guess: state priors
transmat1 = mk_stochastic(rand(Q,Q)); % initial guess: state transition matrix
obsmat1 = mk_stochastic(rand(Q,O));   % initial guess: emission matrix P(observation | state)

and improve our guess using 5 iterations of EM...

[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
% prior2, transmat2, obsmat2 are the trained state priors, transition matrix, and emission matrix

LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.
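For example:

plot(LL, '.-'); xlabel('EM iteration'); ylabel('log-likelihood');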

Sequence classification

To evaluate the log-likelihood of a trained model given test data, proceed as follows:

loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)  % evaluate the trained HMM

Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.
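A minimal sketch of that procedure (variable names such as trainData, testSeq, and k are hypothetical; dhmm_em and dhmm_logprob as above):

for c = 1:k   % trainData{c}: nex x T matrix of discrete sequences for class c
    prior0 = normalise(rand(Q,1));
    transmat0 = mk_stochastic(rand(Q,Q));
    obsmat0 = mk_stochastic(rand(Q,O));
    [LL, hmm(c).prior, hmm(c).transmat, hmm(c).obsmat] = ...
        dhmm_em(trainData{c}, prior0, transmat0, obsmat0, 'max_iter', 10);
end
for c = 1:k   % score the test sequence under every class model
    ll(c) = dhmm_logprob(testSeq, hmm(c).prior, hmm(c).transmat, hmm(c).obsmat);
end
[bestLL, predictedClass] = max(ll);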

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = multinomial_prob(data, obsmat);

Then you can use

[path] = viterbi_path(prior, transmat, B) 

 

(2) Hidden Markov model with mixture-of-Gaussians outputs (GHMM)

Maximum likelihood parameter estimation using EM (Baum-Welch)

Let us generate nex=50 vector-valued sequences of length T=50; each vector has size O=2.

O = 2;
T = 50;
nex = 50;
data = randn(O,T,nex);

Now let us fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.

M = 2;
Q = 2;
left_right = 0;
cov_type = 'full';   % covariance type: 'full', 'diag', or 'spherical'

prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));

[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));

 

Finally, let us improve these parameter estimates using EM.

[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);

Since EM only finds a local optimum, good initialisation is crucial. The initialisation procedure illustrated above is very crude and is probably not adequate for real applications. See the BNT documentation for a real-world example of EM with mixtures of Gaussians.

What to do if the log-likelihood becomes positive?

It is possible for p(x) > 1 if p(x) is a probability density function, such as a Gaussian. (The requirements for a density are p(x) >= 0 for all x and int_x p(x) = 1.) In practice this usually means your covariance is shrinking to a point/delta function, so you should increase the width of the prior (see below), or constrain the matrix to be spherical or diagonal, or clamp it to a large fixed constant (not learn it at all). It is also very helpful to ensure the components of the data vectors have small and comparable magnitudes (use e.g., KPMstats/standardize).
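For instance, a 1-D Gaussian density with a small standard deviation already exceeds 1 at its mean:

sigma = 0.01;
p = 1/(sqrt(2*pi)*sigma)   % density at the mean: about 39.9, legitimately > 1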

This is a well-known pathology of maximum likelihood estimation for Gaussian mixtures: the global optimum may place one mixture component on a single data point, and give it 0 covariance, and hence infinite likelihood. One usually relies on the fact that EM cannot find the global optimum to avoid such pathologies.

What to do if the log-likelihood decreases during EM?

Since I implicitly add a prior to every covariance matrix (see below), what increases is loglik + log(prior), but what I print is just loglik, which may occasionally decrease. This suggests that one of your mixture components is not getting enough data. Try a better initialization or fewer clusters (states).

What to do if the covariance matrix becomes singular?

Estimates of the covariance matrix often become singular if you have too little data, or if too few points are assigned to a cluster center due to a bad initialization of the means. In this case, you should constrain the covariance to be spherical or diagonal, or adjust the prior (see below), or try a better initialization.

How do I add a prior to the covariance matrix?

Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior is initialized to 0.01*I. This is added to the maximum likelihood estimate after every M step. To change this, you will need to modify the mhmm_em function so it calls mixgauss_Mstep with a different value.

Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:

loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = mixgauss_prob(data(:,:,ex), mu, Sigma, mixmat);

where data(:,:,ex) is an O x T matrix, O being the size of the observation vector. Finally, use

[path] = viterbi_path(prior, transmat, B);

(3) HMM with Gaussian outputs

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix.
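A sketch of this special case, reusing the GHMM variables above with M = 1 (assuming the all-ones Q x 1 mixing matrix is accepted by the toolbox, which the same functions above suggest):

M = 1;
[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = ones(Q, 1);   % degenerate mixing matrix: one component per state
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 5);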

Online EM for discrete HMMs / POMDPs

For some applications (e.g., reinforcement learning/ adaptive control), it is necessary to learn a model online. The script dhmm_em_online_demo gives an example of how to do this.



  MFCC:

  For a MATLAB implementation tutorial on MFCC, see Roger Jang (张智星)'s web tutorial on mfcc. Below is the most basic 12-dimensional feature extraction.

function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
% frame2mfcc: Frame to MFCC conversion.
%    Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
%
%    For example:
%        waveFile='what_movies_have_you_seen_recently.wav';
%        [y, fs, nbits]=wavReadInt(waveFile);
%        startIndex=12000;
%        frameSize=512;
%        frame=y(startIndex:startIndex+frameSize-1);
%        frame2mfcc(frame, fs, 20, 12, 1);

%    Roger Jang 20060417

if nargin<1, selfdemo; return; end
if nargin<2, fs=16000; end
if nargin<3, filterNum=20; end
if nargin<4, mfccNum=12; end
if nargin<5, plotOpt=0; end

frameSize=length(frame);
% ====== Preemphasis should be done at wave level
%a=0.95;
%frame2 = filter([1, -a], 1, frame);
frame2=frame;
% ====== Hamming windowing
frame3=frame2.*hamming(frameSize);
% ====== FFT
[fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
% ====== Triangular band-pass filter bank
triFilterBankPrm=getTriFilterBankPrm(fs, filterNum);    % Get parameters for triangular band-pass filter bank
% Triangular bandpass filter.
for i=1:filterNum
    tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i))); % filterNum filter-bank coefficients
end
% ====== DCT
mfcc=zeros(mfccNum, 1); % keep mfccNum DCT coefficients
for i=1:mfccNum
    coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))'; % DCT basis vector for the i-th coefficient
    mfcc(i) = sum(coef.*tbfCoef'); % apply the DCT formula directly
end
% ====== Log energy
%logEnergy=10*log10(sum(frame.*frame));
%mfcc=[logEnergy; mfcc];

if plotOpt
    subplot(2,1,1);
    plot(frame, '.-');
    set(gca, 'xlim', [-inf inf]);
    title('Input frame');
    subplot(2,1,2);
    plot(mfcc, '.-');
    set(gca, 'xlim', [-inf inf]);
    title('MFCC vector');
end

% ====== trimf.m (from fuzzy toolbox)
function y = trimf(x, prm) % triangular membership function: maps frequency to a weight in [0, 1]
a = prm(1); b = prm(2); c = prm(3);
y = zeros(size(x));
% Left and right shoulders (y = 0)
index = find(x <= a | c <= x);
y(index) = zeros(size(index)); % only points inside the triangle get nonzero weight
% Left slope
if (a ~= b)
    index = find(a < x & x < b);
    y(index) = (x(index)-a)/(b-a);
end
% right slope
if (b ~= c)
    index = find(b < x & x < c);
    y(index) = (c-x(index))/(c-b);
end
% Center (y = 1)
index = find(x == b);
y(index) = ones(size(index));

% ====== Self demo
function selfdemo
waveFile='what_movies_have_you_seen_recently.wav';
[y, fs, nbits]=wavReadInt(waveFile);
startIndex=12000;
frameSize=512;
frame=y(startIndex:startIndex+frameSize-1);
feval(mfilename, frame, fs, 20, 12, 1);

ZCR:

  Zero-crossing detection: counts how many times the signal crosses zero within each frame. For the simplest version see: zero crossing rate.

waveFile='csNthu.wav';
frameSize=256;
overlap=0;
[y, fs, nbits]=wavread(waveFile);
frameMat=enframe(y, frameSize, overlap);
frameNum=size(frameMat, 2);
for i=1:frameNum
    frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i));    % zero-mean each frame (remove DC)
end
zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);
sampleTime=(1:length(y))/fs;
frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
subplot(2,1,2); plot(frameTime, zcr, '.-');
xlabel('Time (sec)'); ylabel('Count'); title('ZCR');

EPD:

  Endpoint detection: finds the start and end points of the speech in a recording. See: EPD in Time Domain, the simplest time-domain method.

waveFile='sunday.wav';
[wave, fs, nbits] = wavread(waveFile);
frameSize = 256;
overlap = 128;

wave=wave-mean(wave);                % zero-mean subtraction
frameMat=buffer2(wave, frameSize, overlap);    % frame blocking: each column is one frame
frameNum=size(frameMat, 2);            % no. of frames
volume=frame2volume(frameMat);        % frame energy (sum of absolute values or of squares); row vector
volumeTh1=max(volume)*0.1;            % volume threshold 1
volumeTh2=median(volume)*0.1;            % volume threshold 2
volumeTh3=min(volume)*10;            % volume threshold 3
volumeTh4=volume(1)*5;                % volume threshold 4
index1 = find(volume>volumeTh1); % frames whose volume exceeds the threshold
index2 = find(volume>volumeTh2);
index3 = find(volume>volumeTh3);
index4 = find(volume>volumeTh4);
% frame2sampleIndex() maps frame indices to sample indices
% each endPointX has length 2: the sample indices of the start and end points
endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);

subplot(2,1,1);
time=(1:length(wave))/fs;
plot(time, wave);
ylabel('Amplitude'); title('Waveform');
axis([-inf inf -1 1]);
line(time(endPoint1(  1))*[1 1], [-1, 1], 'color', 'm'); % mark the start/end boundaries
line(time(endPoint2(  1))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(  1))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(  1))*[1 1], [-1, 1], 'color', 'r');
line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');

subplot(2,1,2);
frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
plot(frameTime, volume, '.-');
ylabel('Sum of Abs.'); title('Volume');
axis tight;
line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');

 GMM: 

   GMM is used to fit a data distribution: assume the samples follow a mixture of Gaussians, then learn the mixture's parameters from many samples. In speech, a GMM can model, e.g., the pronunciation of a word or the timbre of a speaker. For the training procedure see: speaker recognition.

function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
% gmmTrain: Parameter training for gaussian mixture model (GMM)
%    Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
%        data: dim x dataNum matrix where each column is a data point
%        gaussianNum: No. of Gaussians or initial centers
%        dispOpt: Option for displaying info during training
%        M: dim x meanNum matrix where each column is a mean vector
%        V: 1 x gaussianNum vector where each element is a variance for a Gaussian
%        W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian

% Roger Jang 20000610

if nargin==0, selfdemo; return; end
if nargin<3, dispOpt=0; end

maxLoopCount = 50;    % Max. iteration
minImprove = 1e-6;    % Min. improvement
minVariance = 1e-6;    % Min. variance
logProb = zeros(maxLoopCount, 1);   % Array for objective function
[dim, dataNum] = size(data);

% Set initial parameters
% Set initial M
%M = data(1+floor(rand(gaussianNum,1)*dataNum),:);    % Randomly select several data points as the centers
if length(gaussianNum)==1,
    % Using vqKmeans to find initial centers
    fprintf('Start KMEANS to find the initial mu...\n');
%    M = vqKmeansMex(data, gaussianNum, 0);
    M = vqKmeans(data, gaussianNum, 0); % k-means clustering into gaussianNum clusters to initialize the means
%    M = vqLBG(data, gaussianNum, 0);
    fprintf('Start GMM training...\n');
    if any(any(~isfinite(M))); keyboard; end
else
    % gaussianNum is in fact the initial centers
    M = gaussianNum;
    gaussianNum = size(M, 2);
end
% Set initial V as the distance to the nearest center
if gaussianNum==1
    V=1;
else
    distance=pairwiseSqrDist(M); % pairwiseSqrDist is a compiled MEX/DLL function
   %distance=pairwiseSqrDist2(M);
   
    distance(1:(gaussianNum+1):gaussianNum^2)=inf;    % Diagonal elements are inf
    [V, index]=min(distance);    % Initial variance for each Gaussian
end
% Set initial W
W = ones(1, gaussianNum)/gaussianNum;    % weight for each Gaussian, initialized uniformly

if dispOpt & dim==2, displayGmm(M, V, data); end
for i = 1:maxLoopCount  % EM iterations to train the parameters
    % Expectation step:
    % P(i,j) is the probability of data(:,j) under the i-th Gaussian
    % prob is the likelihood of each sample under the whole GMM
    [prob, P]=gmmEval(data, M, V, W);
    logProb(i)=sum(log(prob)); % log-likelihood of all samples
    if dispOpt
        fprintf('i = %d, log prob. = %f\n',i-1, logProb(i));
    end
    PW = diag(W)*P;
    BETA=PW./(ones(gaussianNum,1)*sum(PW));    % BETA(i,j) is beta_i(x_j)
    sumBETA=sum(BETA,2);

    % Maximization step:  eqns (2.96) to (2.98) from Bishop p.67:
    M = (data*BETA')./(ones(dim,1)*sumBETA');

   DISTSQ = pairwiseSqrDist(M, data);                    % Distance of M to data
   %DISTSQ = pairwiseSqrDist2(M, data);                    % Distance of M to data
   
    V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance);    % (2.97)
    W = (1/dataNum)*sumBETA;                    % (2.98)

    if dispOpt & dim==2, displayGmm(M, V, data); end
    if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
end
[prob, P]=gmmEval(data, M, V, W);
logProb(i)=sum(log(prob));
fprintf('Iteration count = %d, log prob. = %f\n',i, logProb(i));
logProb(i+1:maxLoopCount) = [];

% ====== Self Demo ======
function selfdemo
%[data, gaussianNum] = dcdata(2);
data = rand(1000,2);
gaussianNum = 8;
data=data';
plotOpt=1;
[M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);

pointNum = 40;
x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
[xx, yy] = meshgrid(x, y);
data = [xx(:) yy(:)]';
z = gmmEval(data, M, V, W);
zz = reshape(z, pointNum, pointNum);
figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
figure; contour(xx, yy, zz, 30); axis image

% ====== Other subfunctions ======
function displayGmm(M, V, data)
% Display function for EM algorithm
figureH=findobj(0, 'tag', mfilename);
if isempty(figureH)
    figureH=figure;
    set(figureH, 'tag', mfilename);
    colordef black
    plot(data(1,:), data(2,:),'.r'); axis image
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
    end
    set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
else
    circleH=findobj(figureH, 'tag', 'circleH');
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
    end
    drawnow
end

 Speaker identification:

   Given speech data from N speakers, a GMM can be trained for each speaker's voice. Then, given an utterance, decide which of the N speakers it most resembles: compute the utterance's likelihood under each of the N GMMs and pick the largest. See: speaker recognition.

function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
% speakerIdentify: speaker identification using GMM parameters
%    Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
%        speakerData: structure array generated by speakerDataRead.m
%        speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
%        useIntGmm: use fixed-point GMM

%    Roger Jang, 20070517, 20080726

if nargin<3, useIntGmm=0; end

% ====== Speaker identification using GMM parameters
speakerNum=length(speakerData);
for i=1:speakerNum
%    fprintf('%d/%d: Recognizing wave files by %s\n', i, speakerNum, speakerData(i).name);
    for j=1:length(speakerData(i).sentence)
%        fprintf('\tSentence %d...\n', j);
        frameNum=size(speakerData(i).sentence(j).fea, 2);
        logProb=zeros(speakerNum, frameNum); % logProb(k,m): log-probability of the m-th frame under speaker k's GMM
        % score this sentence against every candidate speaker
        for k=1:speakerNum,
%            fprintf('\t\tSpeaker %d...\n', k);
        %    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            if ~useIntGmm
            %    logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            else
            %    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            end
        end
        cumLogProb=sum(logProb, 2);
        [maxProb, index]=max(cumLogProb);
        speakerData(i).sentence(j).predictedSpeaker=index; % predicted speaker identity
        speakerData(i).sentence(j).logProb=logProb;
    end
end

% ====== Compute confusion matrix and recognition rate
confusionMatrix=zeros(speakerNum);
for i=1:speakerNum,
    predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
    [index, count]=elementCount(predictedSpeaker);
    confusionMatrix(i, index)=count;
end
recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));

GMM-HMM:

  Training phase: given the HMM's k states, the observations emitted in each state are modeled by a probability distribution, here a GMM; the whole GMM-HMM can be viewed as one generative model. Starting from initial values of the model's five parameter sets (the state priors, the transition matrix, and each state's Gaussian means, covariances, and mixture weights, obtained from randomness combined with the training samples), run EM. E step: compute the distribution of the training samples over the states, i.e., the probability of the samples in each state. M step: re-estimate those five parameter sets from the samples.

  Testing phase (isolated-word recognition): given the frame matrix of an utterance, take one GMM-HMM model and compute the probability of each frame of the utterance under each state of that model; combining these with the model's transition and initial probabilities yields the corresponding clique tree, and graphical-model inference gives the probability that this model generated the utterance. Compare all the GMM-HMM models and output the word of the model with the highest probability.

   References:

     机器学习&数据挖掘笔记_13 (用htk完成简单的孤立词识别)

     http://htk.eng.cam.ac.uk/extensions/

     Roger Jang (张智星)'s web tutorial on MFCC

