HVite Source Code Analysis

Latest recommended article published 2021-06-15 16:37:34

hjx5200

Category: Speech Recognition. Tags: HVite, HMM, HTK, Viterbi algorithm, speech recognition

Article link: https://blog.csdn.net/hjx5200/article/details/116942917

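HVite is a decoding tool: given a speech signal, dictionary information, an acoustic model, and a language model, it outputs the corresponding transcription.

First, the dictionary (Vocab) is structured as follows: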

typedef struct {
   int nwords;          /* total number of words */
   int nprons;          /* total number of prons */
   Word nullWord;       /* dummy null word/node */
   Word subLatWord;     /* special word for HNet subLats */
   Word *wtab;          /* hash table for DictEntry's */
   MemHeap heap;        /* storage for dictionary */
   MemHeap wordHeap;    /* for DictEntry structs  */
   MemHeap pronHeap;    /* for WordPron structs   */
   MemHeap phonesHeap;  /* for arrays of phones   */
} Vocab;

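It holds the number of words, the number of pronunciations, and a hash table of dictionary entries (DictEntry). Each slot of the table is a pointer to a DictEntry (a Word), and entries that fall into the same slot are chained through their next field. The DictEntry structure is as follows: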

typedef struct _DictEntry{
   LabId wordName;  /* word identifier */
   Pron pron;       /* first pronunciation */
   int nprons;      /* number of prons for this word */
   Word next;       /* next word in hash table chain */
   void *aux;       /* hook used by HTK library modules for temp info */
} DictEntry;

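It records the current word's name, its first pronunciation, whether the word has more than one pronunciation (nprons greater than 1), and so on.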
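To make the hash-table-with-chaining layout concrete, here is a minimal sketch of adding and looking up words. It is not HTK's code: the types SketchEntry and SketchVocab, the table size, and the hash function are all hypothetical simplifications, and the word name is a plain C string rather than a LabId.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 997            /* hypothetical table size, not HTK's */

/* Simplified stand-in for DictEntry: the name is a plain string here. */
typedef struct SketchEntry {
   const char *wordName;          /* word identifier */
   int nprons;                    /* number of pronunciations */
   struct SketchEntry *next;      /* next word in the same hash chain */
} SketchEntry;

/* Simplified stand-in for Vocab. */
typedef struct {
   int nwords;                    /* total number of words */
   SketchEntry *wtab[TABLE_SIZE]; /* each slot heads a chain of entries */
} SketchVocab;

/* Hypothetical string hash, used only for this illustration. */
static unsigned sketch_hash(const char *s)
{
   unsigned h = 0;
   while (*s != '\0')
      h = h * 31u + (unsigned char)*s++;
   return h % TABLE_SIZE;
}

/* Insert a caller-owned entry at the head of its chain. */
void sketch_add_word(SketchVocab *voc, SketchEntry *e)
{
   unsigned h;
   assert(voc != NULL && e != NULL);
   h = sketch_hash(e->wordName);
   e->next = voc->wtab[h];
   voc->wtab[h] = e;
   voc->nwords++;
}

/* Walk the chain in the word's slot; return NULL when the word is absent. */
SketchEntry *sketch_get_word(const SketchVocab *voc, const char *name)
{
   SketchEntry *e;
   assert(voc != NULL && name != NULL);
   for (e = voc->wtab[sketch_hash(name)]; e != NULL; e = e->next)
      if (strcmp(e->wordName, name) == 0)
         return e;
   return NULL;
}
```

A caller would zero-initialise a SketchVocab, add entries with sketch_add_word, and then test nprons > 1 on the result of sketch_get_word to see whether a word has multiple pronunciations.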

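Next, the HVite program initializes the HMMSet, loads the model parameters, and so on. These steps were analyzed for the earlier tools and are not repeated here.

In the decoding process, the important data structure that has not been covered before is Lattice. Once the structure and role of Lattice are understood, the Viterbi algorithm is mostly understood as well; Lattice can be regarded as the key to mastering Viterbi.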

typedef struct lattice
{
   MemHeap *heap;               /* Heap lattice uses */
   LatFormat format;	       	/* indicate which fields are valid */
   Vocab *voc;                  /* Dictionary lattice based on */

   int nn;                      /* Number of nodes */
   int na;                      /* Number of arcs */
   LNode *lnodes;               /* Array of lattice nodes */
   LArc *larcs;                 /* Array of lattice arcs */

   LabId subLatId;              /* Lattice Identifier (for SubLats only) */
   SubLatDef *subList;          /* List of sublats in this lattice level */
   SubLatDef *refList;          /* List of all SubLats referring to this lat */
   struct lattice *chain;       /* Linked list used for various jobs */

   char *utterance;		/* Utterance file name (NULL==unknown) */
   char *vocab;			/* Dictionary file name (NULL==unknown) */
   char *hmms;			/* MMF file name (NULL==unknown) */
   char *net;			/* Network file name (NULL==unknown) */

   float acscale;               /* Acoustic scale factor */
   float lmscale;		/* LM scale factor */
   LogFloat wdpenalty;		/* Word insertion penalty */
   float prscale;		/* Pronunciation scale factor */
   HTime framedur;              /* Frame duration in 100ns units */
   float logbase;               /* base of logarithm for likelihoods in lattice files
                                   (1.0 = default (e), 0.0 = no logs) */
   float tscale;                /* time scale factor (default: 1, i.e. seconds) */

   Ptr hook;                    /* User definable hook */

} Lattice;

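Let us first get an overall picture of what a Lattice is and then refine it step by step. The Lattice structure records the number of nodes (nn) and arcs (na), the arrays of nodes and arcs (lnodes and larcs), the sub-lattices at this level (subList), the list of all sub-lattices that refer to this lattice (refList), and a set of scale factors and penalties. It also holds a pointer (voc) to the dictionary on which the lattice is based.

First, let us see what a lattice node looks like.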

typedef struct lnode
{
   int n;              /* Sorted order */

   Word word;          /* Word represented by arc (labels may be on nodes) */
   char *tag;          /* Semantic tag for this node */
   short v;            /* Pronunciation variant number */
   SubLatDef *sublat;  /* SubLat for node (if word==lat->voc->subLatWord) */

   HTime time;         /* Time of word boundary at node */

   ArcId foll;         /* Linked list of arcs following node */
   ArcId pred;         /* Linked list of arcs preceding node */

   double score;       /* Field used for pruning */

   Ptr hook;           /* User definable hook */
}
LNode;

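The most important field is the word the node represents, that is, the Word object. Word is simply a pointer to a DictEntry, so every LNode holds a pointer to a DictEntry. As described above, DictEntry contains a LabId (the word name), a Pron (the word's first pronunciation), and the number of pronunciations.

An LNode can also carry a sub-lattice (sublat). According to the field comment it applies when word == lat->voc->subLatWord; what it is used for in practice, and under which conditions it is enabled, remain open questions here.

The node also records the arcs that leave it and the arcs that enter it. The LNode fields ArcId foll and ArcId pred stand for the outgoing and incoming arcs respectively, and ArcId is a pointer to LArc. The LArc structure is as follows: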

typedef struct larc
{
   NodeId start;       /* Node at start of word */
   NodeId end;         /* Node at end of word */
   LogFloat lmlike;    /* Language model likelihood of word */

   ArcId farc;         /* Next arc following start node */
   ArcId parc;         /* Next arc preceding end node */

   LogFloat aclike;    /* Acoustic likelihood of word */

   short nAlign;       /* Number of alignment records in word */
   LAlign *lAlign;     /* Array[0..nAlign-1] of alignment records */

   float score;        /* Field used for pruning/sorting */
   LogFloat prlike;    /* Pronunciation likelihood of arc */
}
LArc;

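It contains the arc's start and end nodes (start and end) and the language model likelihood of the word (lmlike). It also holds farc and parc. Judging from the field comments these are not arrays: farc points to the next arc following the same start node, and parc to the next arc preceding the same end node, so together with foll and pred in LNode they form linked lists.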
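The linked-list reading of foll/farc and pred/parc can be sketched as follows. This is a simplified illustration, not HTK code: SketchNode, SketchArc, and the helper functions are hypothetical, and all fields unrelated to connectivity are omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchArc  SketchArc;
typedef struct SketchNode SketchNode;

struct SketchNode {
   const char *word;   /* stand-in for the Word pointer */
   SketchArc *foll;    /* head of the list of arcs leaving this node */
   SketchArc *pred;    /* head of the list of arcs entering this node */
};

struct SketchArc {
   SketchNode *start;  /* node at start of word */
   SketchNode *end;    /* node at end of word */
   SketchArc *farc;    /* next arc leaving the same start node */
   SketchArc *parc;    /* next arc entering the same end node */
};

/* Connect s to e with arc a, pushing a onto both nodes' lists. */
void sketch_link(SketchArc *a, SketchNode *s, SketchNode *e)
{
   assert(a != NULL && s != NULL && e != NULL);
   a->start = s;
   a->end = e;
   a->farc = s->foll;
   s->foll = a;
   a->parc = e->pred;
   e->pred = a;
}

/* Count outgoing arcs by following foll and then the farc chain. */
int sketch_count_following(const SketchNode *n)
{
   int count = 0;
   const SketchArc *a;
   assert(n != NULL);
   for (a = n->foll; a != NULL; a = a->farc)
      count++;
   return count;
}

/* Count incoming arcs by following pred and then the parc chain. */
int sketch_count_preceding(const SketchNode *n)
{
   int count = 0;
   const SketchArc *a;
   assert(n != NULL);
   for (a = n->pred; a != NULL; a = a->parc)
      count++;
   return count;
}
```

The same loop shape, for (a = n->foll; a != NULL; a = a->farc), is all that is needed to visit every successor of a node without storing a per-node array.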

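By now a clear picture should have formed: nodes represent words, and arcs connect the words. Each arc has a start node and an end node, and also links to the next arc that shares its start node or its end node. A lattice is really just a graph, and every complete path through that graph is one possible recognition result. If there is no path from SENT-START to SENT-END, the recognition should be considered to have failed.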
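Since each complete path is a candidate result, it helps to see how a path might be scored. The sketch below sums a combined score over the arcs of one path. The combination (scaled acoustic, language model, and pronunciation log likelihoods plus the word insertion penalty) is an assumption inferred from the field names acscale, lmscale, prscale, and wdpenalty in Lattice; the types and functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scale factors, named after the corresponding Lattice fields. */
typedef struct {
   float acscale;     /* acoustic scale factor */
   float lmscale;     /* LM scale factor */
   float wdpenalty;   /* word insertion penalty */
   float prscale;     /* pronunciation scale factor */
} ScoreScales;

/* Per-arc log likelihoods, named after the corresponding LArc fields. */
typedef struct {
   float aclike;      /* acoustic likelihood of word */
   float lmlike;      /* language model likelihood of word */
   float prlike;      /* pronunciation likelihood of arc */
} ScoreArc;

/* Assumed combination: scaled log likelihoods plus a per-word penalty. */
double score_arc(const ScoreScales *s, const ScoreArc *a)
{
   assert(s != NULL && a != NULL);
   return (double)a->aclike * s->acscale
        + (double)a->lmlike * s->lmscale
        + (double)a->prlike * s->prscale
        + s->wdpenalty;
}

/* Score of a path = sum of the combined scores of its n arcs. */
double score_path(const ScoreScales *s, const ScoreArc *arcs, int n)
{
   double total = 0.0;
   int i;
   assert(s != NULL && (arcs != NULL || n == 0));
   for (i = 0; i < n; i++)
      total += score_arc(s, &arcs[i]);
   return total;
}
```

Because the values are log likelihoods, adding arc scores along a path corresponds to multiplying probabilities, and the best hypothesis is the path with the highest total.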

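Imagine that a lattice has already been built from the language model or from the system's task grammar. This lattice has a unique start node and a unique end node. What distinguishes these two nodes is that the start node's list of preceding arcs (pred) is empty, while the end node's list of following arcs (foll) is empty. Starting from the start node, we visit each of its following arcs in turn. To make graph construction easier, the system also inserts some null nodes.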
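The traversal just described can be sketched in a few lines: locate the start node by its empty pred list, then follow the foll/farc lists until a node with an empty foll list is reached. The types and functions are hypothetical simplifications, and the sketch assumes an acyclic lattice with a single end node; it uses plain recursion, which is fine for illustration but exponential in the worst case.

```c
#include <assert.h>
#include <stddef.h>

typedef struct PathArc  PathArc;
typedef struct PathNode PathNode;

struct PathNode {
   PathArc *foll;   /* arcs leaving the node; NULL marks the end node */
   PathArc *pred;   /* arcs entering the node; NULL marks the start node */
};

struct PathArc {
   PathNode *end;   /* node this arc leads to */
   PathArc *farc;   /* next arc leaving the same start node */
};

/* Return the first node with no preceding arcs, or NULL if there is none. */
PathNode *path_find_start(PathNode *nodes, int nn)
{
   int i;
   assert(nodes != NULL);
   for (i = 0; i < nn; i++)
      if (nodes[i].pred == NULL)
         return &nodes[i];
   return NULL;
}

/* Count complete paths from n to a node that has no following arcs. */
long path_count(const PathNode *n)
{
   long total = 0;
   const PathArc *a;
   assert(n != NULL);
   if (n->foll == NULL)
      return 1;               /* reached the end node: one complete path */
   for (a = n->foll; a != NULL; a = a->farc)
      total += path_count(a->end);
   return total;
}
```

Each path counted this way corresponds to one word sequence the lattice allows.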

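Lattice is a word-level network: each node is a word, and the edges connecting words are called arcs. Each word (really a DictEntry) carries the word name (wordName), the pronunciation (pron), the number of pronunciations, and so on.

The figure above is a schematic of a concrete Lattice: each circle is an LNode containing a Word, and each edge is an LArc that holds its start and end nodes.

The next step is to expand the lattice, using the HMM models and the pronunciation dictionary, to produce a Network: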

typedef struct {
   MemHeap *heap;     /* heap for allocating network */
   Vocab *vocab;      /* Dictionary from which words appear */
   Word nullWord;     /* Word for output when word==NULL */
   Boolean teeWords;  /* True if any tee words are present */
   NetNode initial;   /* Initial (dummy) node */
   NetNode final;     /* Final (dummy) node */
   int numNode;
   int numLink;
   MemHeap nodeHeap;  /* a heap for allocating nodes */
   MemHeap linkHeap;  /* a stack for adding the links as needed */
   NetNode *chain;
} Network;

NetNode is the node type of Network:

/* The network nodes themselves just store connectivity info */
struct _NetNode {
   NetNodeType type;    /* Type of this node (includes context) */
   union {
      HLink  hmm;       /* HMM (physical) definition */
      Pron   pron;      /* Word represented (may == null) */
   }
   info;                /* Extra information specific to type of node */
   char    *tag;        /* Semantic tagging information */
   int nlinks;          /* Number of nodes connected to this one */
   NetLink *links;      /* Array[0..nlinks-1] of links to connected nodes */
   NetInst *inst;       /* Model Instance (if one exists, else NULL) */   
   NetNode *chain;
   int aux;
};

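It holds nlinks links; each link (NetLink) points to a connected node (NetNode) and carries the transition likelihood of that link. It also records the type of the node. (Open questions: what do the individual types mean, and what does NetInst represent? The field comment only says it is the model instance, if one exists.)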
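As a small illustration of the links array, the sketch below picks the most likely outgoing link of a node. The NetLink layout (a node pointer plus a log likelihood) follows the description above but is not shown in the listings here, so LinkEdge and LinkNode are hypothetical stand-ins rather than HTK's definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct LinkNode LinkNode;

/* Assumed shape of a link: target node plus a log likelihood. */
typedef struct {
   LinkNode *node;    /* connected node */
   float like;        /* log likelihood attached to the link */
} LinkEdge;

struct LinkNode {
   const char *name;  /* label used only in this sketch */
   int nlinks;        /* number of nodes connected to this one */
   LinkEdge *links;   /* Array[0..nlinks-1] of links */
};

/* Return the link with the highest log likelihood, or NULL for no links. */
const LinkEdge *link_best(const LinkNode *n)
{
   const LinkEdge *best = NULL;
   int i;
   assert(n != NULL);
   for (i = 0; i < n->nlinks; i++)
      if (best == NULL || n->links[i].like > best->like)
         best = &n->links[i];
   return best;
}
```

Unlike the lattice, which chains arcs through farc and parc, a network node here stores its successors in a plain array indexed from 0 to nlinks-1.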

The comment on the function ExpandWordNet reads as follows:

Network *ExpandWordNet(MemHeap *heap,Lattice *lat,Vocab *voc,HMMSet *hset);

/*
ExpandWordNet converts a lattice to a network.

It uses the dictionary voc to expand each word in lat into a series
of pronunciation instances. How this expansion is performed depends
upon the hmms that appear in hset and the value of HNet configuration
parameters, ALLOWCXTEXP, ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and
FORCERIGHTBI.

The expansion proceeds in four stages.
i) Context definition.
It is necessary for the expansion routine to determine how model
   names are constructed from the dictionary entries and whether
   cross word context expansion should be performed.
   Phones in the dictionary are classified as either
   a) Context Free. Phone is skipped when determining context.
   b) Context Indpendent. Phone only exists in CI form.
   c) Context Dependent. Otherwise phone needs modelsname expansion.
   This classification depends on whether a phone appears in the context
   part of the name (and this defines the context name) and whether
   and context dependent versions of the phone exist in the HMMSet.

ii) Determination of network type.
   The default behaviour is to try and produce the simplest network
   possible. So if the dictionary is closed no expansion of phone
   names is used to get model names, otherwise if word internal
   context expansion will find each model this is used otherwise it
   tries full cross word context expansion.
   This behaviour can be modified by the configuration parameters.
   If ALLOWCXTEXP==FALSE no expansion of phone names (from the
   dictionary) is performed and each phone corresponds to the model
   of the same name.
   If ALLOWXWRDEXP==FALSE expansion across word boundaries is blocked
   and although each phone still corresponds to a single model the
   phone labels can be expanded to produce a context dependent model
   name.
   If FORCECXTEXP==TRUE an error will be generated if no context
   expansion is possible.

iii) Network expansion.
For cross word context expansion the initial and final context
   dependent phones (and any preceding/following context independent
   ones) are duplicated several times to allow for different cross
   word contexts. Each pronunciation instance has a word end node
   for each left context in which it appears. (!NULL words just
   have these word nodes).
   Otherwise each word in the lattice is expanded into its different
   pronunciations and these expanded into a node for each phone
   together with a word end node. (Again !NULL words just have the
   word end node).

iv) Linking of models to network nodes.
Model names are determined from the phone name and the surrounding
   context names.
   a) Construct CD name and see if model exists.
   b) Construct CI name and see if model exists.
   If ALLOWCXTEXP==FALSE (a) is skipped and if FORCECXTEXP==TRUE
   (b) is skipped. When no matching model is found and error is
   generated.
   The name for (a) is either a left biphone (when the right context
   is a boundary or FORCELEFTBI==TRUE), a right biphone (when the left
   context is a boundary or FORCERIGHTBI==TRUE) or a triphone.
   The resulting name is of the [left_context-]phone[+right_context]
   with the phone label coming direct from the dictionary and the
   context names coming from (i) above.
   Context free phones are skipped in this process so
   sil aa r sp y uw sp sil
   would be expanded as
   sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil
   if sil was context independent and sp was context free.

[ Stages (iii) and (iv) actually proceed concurrently to allow sharing
of logical models with the same underlying physical model for the first
and last phone of context dependent models ].
*/

/* --- Context handling stuff useful for general network building --- */

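In other words:

The function ExpandWordNet converts a lattice into a network.

It uses the dictionary voc to expand each word in lat into a series of pronunciation instances. How the expansion is performed depends on the HMM set and on the HNet configuration parameters ALLOWCXTEXP, ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and FORCERIGHTBI.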

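The expansion proceeds in four stages:

1) Context definition

The expansion routine must determine how model names are constructed from the dictionary entries and whether cross word context expansion should be performed.

Phones in the dictionary are classified as follows:

a) Context free: the phone is skipped when determining context.

b) Context independent: the phone exists only in CI form.

c) Context dependent: any other phone, which needs model name expansion.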

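2) Determination of network type

By default the system builds the simplest network possible. So if the dictionary is closed, phone names are used as model names without any expansion.

Otherwise, if word internal context expansion finds a model for every phone, that is used; failing that, full cross word context expansion is tried.

Configuration parameters can change this behaviour. If ALLOWCXTEXP is FALSE, no expansion of phone names is performed and each phone in the dictionary corresponds to the HMM of the same name.

If ALLOWXWRDEXP is FALSE, expansion across word boundaries is blocked; each phone still corresponds to a single model, but the phone labels can be expanded to produce a context dependent model name.

If FORCECXTEXP is TRUE, an error is generated when no context expansion is possible.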

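3) Network expansion

For cross word context expansion, the initial and final context dependent phones (and any context independent phones preceding or following them) are duplicated several times to allow for different cross word contexts.

Each pronunciation instance has a word end node for each left context in which it appears. (!NULL words have only these word end nodes.)

Otherwise, each word in the lattice is expanded into its different pronunciations, and each pronunciation into one node per phone plus a word end node. (Again, !NULL words have only the word end node.)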

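4) Linking of models to network nodes

Model names are determined from the phone name and the surrounding context names.

a) Construct the context dependent (CD) name and check whether that model exists;

b) Construct the context independent (CI) name and check whether that model exists.

If ALLOWCXTEXP is FALSE, step a) is skipped; if FORCECXTEXP is TRUE, step b) is skipped. If no matching model is found, an error is generated.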
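The lookup order in steps a) and b) can be sketched as below. The control flow follows the ExpandWordNet comment, but everything else is hypothetical: ModelList stands in for the HMM set as a flat list of names, and pick_model returns NULL where the real routine would report an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an HMM set: just a list of model names. */
typedef struct {
   const char *const *names;
   int n;
} ModelList;

/* Linear search for a model name; returns 1 if present, 0 otherwise. */
static int has_model(const ModelList *m, const char *name)
{
   int i;
   for (i = 0; i < m->n; i++)
      if (strcmp(m->names[i], name) == 0)
         return 1;
   return 0;
}

/* Choose the model for one phone.
   cd_name is the context dependent name, ci_name the plain phone name.
   allow_cxt_exp mirrors ALLOWCXTEXP and force_cxt_exp mirrors FORCECXTEXP.
   Returns the chosen name, or NULL when no model matches. */
const char *pick_model(const ModelList *m,
                       const char *cd_name, const char *ci_name,
                       int allow_cxt_exp, int force_cxt_exp)
{
   assert(m != NULL && cd_name != NULL && ci_name != NULL);
   if (allow_cxt_exp && has_model(m, cd_name))    /* step a) */
      return cd_name;
   if (!force_cxt_exp && has_model(m, ci_name))   /* step b) */
      return ci_name;
   return NULL;                                   /* no match: error case */
}
```

With both flags at their permissive values (allow_cxt_exp set, force_cxt_exp clear), the CD model wins when it exists and the CI model acts as a fallback.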

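The name produced in a) is either a left biphone (when the right context is a boundary or FORCELEFTBI is TRUE), a right biphone (when the left context is a boundary or FORCERIGHTBI is TRUE), or a triphone.

The resulting name has the form [left_context-]phone[+right_context], where the phone label comes directly from the dictionary and the context names come from stage 1) (and therefore depend on whether the neighbouring phones are CI, CD, or context free).

Context free phones, such as sp, are skipped in this process.

For example, suppose the phone sequence is: sil aa r sp y uw sp sil. If sil is context independent and sp is context free, it is expanded as:

sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil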
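The naming rule can be reproduced with a short sketch. It is not HTK's implementation: the classification of sp as context free and sil as context independent is hard-coded for this example, FORCELEFTBI and FORCERIGHTBI are not modelled, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64

/* Assumed classification for this example only: sp is context free,
   sil is context independent, every other phone is context dependent. */
static int is_context_free(const char *p)
{
   return strcmp(p, "sp") == 0;
}

static int is_context_independent(const char *p)
{
   return strcmp(p, "sil") == 0;
}

/* Index of the nearest phone that is not context free, stepping by
   dir (-1 or +1) from position i; returns -1 at a boundary. */
static int find_context(const char *const *phones, int n, int i, int dir)
{
   for (i += dir; i >= 0 && i < n; i += dir)
      if (!is_context_free(phones[i]))
         return i;
   return -1;
}

/* Write the model name for phones[i] into out. */
void expand_phone(const char *const *phones, int n, int i, char out[NAME_LEN])
{
   int l, r;
   assert(phones != NULL && out != NULL && i >= 0 && i < n);
   if (is_context_free(phones[i]) || is_context_independent(phones[i])) {
      snprintf(out, NAME_LEN, "%s", phones[i]);   /* name used unchanged */
      return;
   }
   l = find_context(phones, n, i, -1);
   r = find_context(phones, n, i, +1);
   if (l >= 0 && r >= 0)        /* triphone */
      snprintf(out, NAME_LEN, "%s-%s+%s", phones[l], phones[i], phones[r]);
   else if (l >= 0)             /* right context is a boundary */
      snprintf(out, NAME_LEN, "%s-%s", phones[l], phones[i]);
   else if (r >= 0)             /* left context is a boundary */
      snprintf(out, NAME_LEN, "%s+%s", phones[i], phones[r]);
   else                         /* isolated phone */
      snprintf(out, NAME_LEN, "%s", phones[i]);
}
```

Calling expand_phone for each position of sil aa r sp y uw sp sil yields sil, sil-aa+r, aa-r+y, sp, r-y+uw, y-uw+sil, sp, sil, matching the expansion given in the comment. Note that sil still serves as context for its neighbours even though it is not expanded itself, whereas sp is invisible to the context search.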
