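Kaldi Triphone GMM Study Notes
The main difference between a triphone GMM and a monophone GMM is decision-tree state tying; the principles, programs, and classes involved in updating the GMM parameters are the same for both.
In these notes I first introduce HmmTopology and TransitionModel, the classes that represent the HMM, and then the programs called by the triphone GMM training script train_deltas.sh that either differ from their monophone counterparts or appear only in triphone training. For the remaining GMM material, see the monophone GMM study notes.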
HmmTopology
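Why introduce HmmTopology (HT below) and TransitionModel (TM below)? So far the notes have dealt almost entirely with GMMs and decision trees, so what represents the HMM? In Kaldi the HMM is represented by a TM, and the TM contains an HT object that describes the HMM topology.
During data preparation Kaldi generates the file topo under data/lang, which describes the HMM topology, and the HT object stores the information in topo. That information has two parts: which phones topo covers, stored in the HT data member phones_; and the HMM structure of each phone, determined jointly by the HT data members phone2idx_ and entries_. The figure below breaks down the data members of HmmTopology.
The figure above corresponds to the Kaldi class HmmTopology.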
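The relationship between the three data members can be sketched in a few lines of C++. This is a minimal illustration, not Kaldi's code: the Toy* types and MakeToyTopology() are hypothetical stand-ins, and -1 plays the role of kNoPdf. It shows the one point that matters here, namely that a phone is resolved to its HMM shape through the phone2idx indirection, so several phones can share one entry.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Simplified stand-ins for Kaldi's HmmTopology types (illustrative only).
struct ToyHmmState {
  int pdf_class;                                   // -1 plays the role of kNoPdf
  std::vector<std::pair<int, float>> transitions;  // (destination state, probability)
};
using ToyTopologyEntry = std::vector<ToyHmmState>;

struct ToyTopology {
  std::vector<int> phones;                // phone ids that have a topology
  std::vector<int> phone2idx;             // phone id -> index into entries, -1 if absent
  std::vector<ToyTopologyEntry> entries;  // distinct HMM shapes, shared between phones

  // Resolve a phone to its HMM shape through the phone2idx indirection.
  const ToyTopologyEntry &TopologyForPhone(int phone) const {
    assert(phone >= 0 && phone < static_cast<int>(phone2idx.size()));
    assert(phone2idx[phone] >= 0);
    return entries[phone2idx[phone]];
  }
};

// Build a toy topology in which phones 1 and 2 share one 3-state
// left-to-right HMM followed by a non-emitting final state.
ToyTopology MakeToyTopology() {
  ToyTopologyEntry three_state = {
      {0, {{0, 0.75f}, {1, 0.25f}}},
      {1, {{1, 0.75f}, {2, 0.25f}}},
      {2, {{2, 0.75f}, {3, 0.25f}}},
      {-1, {}}  // final state: no pdf, no outgoing transitions
  };
  ToyTopology topo;
  topo.phones = {1, 2};
  topo.phone2idx = {-1, 0, 0};  // phone 0 is unused; phones 1 and 2 map to entry 0
  topo.entries = {three_state};
  return topo;
}
```

Calling MakeToyTopology().TopologyForPhone(2) returns the same four-state entry as for phone 1, which is why entries_ can be much shorter than phones_.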
TransitionModel
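Both the monophone initialization program gmm-init-mono and the triphone initialization program gmm-init-model call the constructor TransitionModel(const ContextDependencyInterface &ctx_dep, const HmmTopology &hmm_topo) to initialize the TM. We use this constructor as the entry point for studying how each data member of TransitionModel is built.
First, the data members of TM and what each one does: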
HmmTopology topo_;
/// Indexed by transition-state minus one.
std::vector<Tuple> tuples_;
/// Gives the first transition_id of each transition-state; indexed by
/// the transition-state. Array indexed 1..num-transition-states+1 (the last one
/// is needed so we can know the num-transitions of the last transition-state.
std::vector<int32> state2id_;
/// For each transition-id, the corresponding transition
/// state (indexed by transition-id).
std::vector<int32> id2state_;
std::vector<int32> id2pdf_id_;
/// For each transition-id, the corresponding log-prob. Indexed by transition-id.
Vector<BaseFloat> log_probs_;
/// For each transition-state, the log of (1 - self-loop-prob). Indexed by
/// transition-state.
Vector<BaseFloat> non_self_loop_log_probs_;
/// This is actually one plus the highest-numbered pdf we ever got back from the
/// tree (but the tree numbers pdfs contiguously from zero so this is the number
/// of pdfs).
int32 num_pdfs_;
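The constructor's call sequence is shown in the two figures below; it calls ComputeTuples(), ComputeDerived(), InitializeProbs(), and Check() in turn.
ComputeTuples(), as listed in the Kaldi documentation: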
const std::vector<int32> &phones = topo_.GetPhones();
KALDI_ASSERT(!phones.empty());
// this is the case for normal models. but not for chain models
std::vector<std::vector<std::pair<int32, int32> > > pdf_info;
std::vector<int32> num_pdf_classes( 1 + *std::max_element(phones.begin(), phones.end()), -1);
for (size_t i = 0; i < phones.size(); i++)
num_pdf_classes[phones[i]] = topo_.NumPdfClasses(phones[i]);
ctx_dep.GetPdfInfo(phones, num_pdf_classes, &pdf_info);
// pdf_info is list indexed by pdf of which (phone, pdf_class) it
// can correspond to.
std::map<std::pair<int32, int32>, std::vector<int32> > to_hmm_state_list;
// to_hmm_state_list is a map from (phone, pdf_class) to the list
// of hmm-states in the HMM for that phone that that (phone, pdf-class)
// can correspond to.
for (size_t i = 0; i < phones.size(); i++) { // setting up to_hmm_state_list.
int32 phone = phones[i];
const HmmTopology::TopologyEntry &entry = topo_.TopologyForPhone(phone);
for (int32 j = 0; j < static_cast<int32>(entry.size()); j++) { // for each state...
int32 pdf_class = entry[j].forward_pdf_class;
if (pdf_class != kNoPdf) {
to_hmm_state_list[std::make_pair(phone, pdf_class)].push_back(j);
}
}
}
for (int32 pdf = 0; pdf < static_cast<int32>(pdf_info.size()); pdf++) {
for (size_t j = 0; j < pdf_info[pdf].size(); j++) {
int32 phone = pdf_info[pdf][j].first,
pdf_class = pdf_info[pdf][j].second;
const std::vector<int32> &state_vec = to_hmm_state_list[std::make_pair(phone, pdf_class)];
KALDI_ASSERT(!state_vec.empty());
// state_vec is a list of the possible HMM-states that emit this
// pdf_class.
for (size_t k = 0; k < state_vec.size(); k++) {
int32 hmm_state = state_vec[k];
tuples_.push_back(Tuple(phone, hmm_state, pdf, pdf));
}
}
}
}
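The tuple construction above can be condensed into a small standalone sketch. It is an illustration under simplifying assumptions, not Kaldi's implementation: ToyTuple and ComputeToyTuples() are hypothetical names, the topology is reduced to a table of pdf-classes per HMM state, and -1 stands in for kNoPdf. As in the excerpt, forward_pdf and self_loop_pdf are set to the same pdf.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct ToyTuple {
  int phone, hmm_state, forward_pdf, self_loop_pdf;
};

// pdf_info[pdf] lists the (phone, pdf_class) pairs that can map to that pdf.
// state_pdf_class[phone][j] is the pdf-class of HMM state j (-1 = non-emitting).
std::vector<ToyTuple> ComputeToyTuples(
    const std::vector<std::vector<std::pair<int, int>>> &pdf_info,
    const std::map<int, std::vector<int>> &state_pdf_class) {
  // (phone, pdf_class) -> HMM states of that phone that use this pdf-class.
  std::map<std::pair<int, int>, std::vector<int>> to_hmm_state_list;
  for (const auto &kv : state_pdf_class) {
    for (int j = 0; j < static_cast<int>(kv.second.size()); j++) {
      if (kv.second[j] != -1)
        to_hmm_state_list[std::make_pair(kv.first, kv.second[j])].push_back(j);
    }
  }
  // One tuple per (pdf, phone, emitting HMM state) combination.
  std::vector<ToyTuple> tuples;
  for (int pdf = 0; pdf < static_cast<int>(pdf_info.size()); pdf++) {
    for (const auto &phone_and_class : pdf_info[pdf]) {
      auto it = to_hmm_state_list.find(phone_and_class);
      assert(it != to_hmm_state_list.end());
      for (int hmm_state : it->second) {
        ToyTuple t = {phone_and_class.first, hmm_state, pdf, pdf};
        tuples.push_back(t);
      }
    }
  }
  return tuples;
}
```

For a single phone with three emitting states and one pdf per state, the sketch yields three tuples, one per HMM state.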
The function ComputeDerived():
{
state2id_.resize(tuples_.size()+2); // indexed by transition-state, which
// is one based, but also an entry for one past end of list.
int32 cur_transition_id = 1;
num_pdfs_ = 0;
for (int32 tstate = 1;
tstate <= static_cast<int32>(tuples_.size()+1); // not a typo.
tstate++) {
state2id_[tstate] = cur_transition_id;
if (static_cast<size_t>(tstate) <= tuples_.size()) {
int32 phone = tuples_[tstate-1].phone,
hmm_state = tuples_[tstate-1].hmm_state,
forward_pdf = tuples_[tstate-1].forward_pdf,
self_loop_pdf = tuples_[tstate-1].self_loop_pdf;
num_pdfs_ = std::max(num_pdfs_, 1 + forward_pdf);
num_pdfs_ = std::max(num_pdfs_, 1 + self_loop_pdf);
const HmmTopology::HmmState &state = topo_.TopologyForPhone(phone)[hmm_state];
int32 my_num_ids = static_cast<int32>(state.transitions.size());
cur_transition_id += my_num_ids; // # trans out of this state.
}
}
id2state_.resize(cur_transition_id); // cur_transition_id is #transition-ids+1.
id2pdf_id_.resize(cur_transition_id);
for (int32 tstate = 1; tstate <= static_cast<int32>(tuples_.size()); tstate++) {
for (int32 tid = state2id_[tstate]; tid < state2id_[tstate+1]; tid++) {
id2state_[tid] = tstate;
if (IsSelfLoop(tid))
id2pdf_id_[tid] = tuples_[tstate-1].self_loop_pdf;
else
id2pdf_id_[tid] = tuples_[tstate-1].forward_pdf;
}
}
// The following statements put copies a large number in the region of memory
// past the end of the id2pdf_id_ array, while leaving the array as it was
// before. The goal of this is to speed up decoding by disabling a check
// inside TransitionIdToPdf() that the transition-id was within the correct
// range.
int32 num_big_numbers = std::min<int32>(2000, cur_transition_id);
id2pdf_id_.resize(cur_transition_id + num_big_numbers,
std::numeric_limits<int32>::max());
id2pdf_id_.resize(cur_transition_id);
}
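tuples_: each entry is { phone, hmm_state, forward_pdf, self_loop_pdf }.
state2id_: for each transition-state, its first transition-id.
num_pdfs_: one plus the highest-numbered pdf returned by the tree; because the tree numbers pdfs contiguously from zero, this equals the number of pdfs.
id2pdf_id_: for each transition-id, the corresponding pdf-id; this is forward_pdf for a transition to another state and self_loop_pdf for a self-loop.
id2state_: for each transition-id, the corresponding transition-state.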
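A worked example makes the one-based indexing easier to follow. The sketch below mirrors the two loops of ComputeDerived() but takes only the number of transitions per transition-state as input; ToyIndex and BuildToyIndex() are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

struct ToyIndex {
  std::vector<int> state2id;  // 1-based: first transition-id of each transition-state
  std::vector<int> id2state;  // 1-based: transition-state of each transition-id
};

// num_trans[i] is the number of transitions leaving transition-state i + 1.
ToyIndex BuildToyIndex(const std::vector<int> &num_trans) {
  ToyIndex idx;
  int num_tstates = static_cast<int>(num_trans.size());
  // One extra entry past the end, so the last state's range is also bounded.
  idx.state2id.resize(num_tstates + 2);
  int cur_tid = 1;
  for (int tstate = 1; tstate <= num_tstates + 1; tstate++) {
    idx.state2id[tstate] = cur_tid;
    if (tstate <= num_tstates) {
      assert(num_trans[tstate - 1] > 0);
      cur_tid += num_trans[tstate - 1];
    }
  }
  // cur_tid is now (number of transition-ids + 1).
  idx.id2state.resize(cur_tid);
  for (int tstate = 1; tstate <= num_tstates; tstate++) {
    for (int tid = idx.state2id[tstate]; tid < idx.state2id[tstate + 1]; tid++)
      idx.id2state[tid] = tstate;
  }
  return idx;
}
```

With three transition-states of two transitions each, state2id is {0, 1, 3, 5, 7} and id2state is {0, 1, 1, 2, 2, 3, 3}; element 0 of both arrays is unused.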
The function InitializeProbs():
{
log_probs_.Resize(NumTransitionIds()+1); // one-based array, zeroth element empty.
for (int32 trans_id = 1; trans_id <= NumTransitionIds(); trans_id++) {
int32 trans_state = id2state_[trans_id];
int32 trans_index = trans_id - state2id_[trans_state];
const Tuple &tuple = tuples_[trans_state-1];
const HmmTopology::TopologyEntry &entry = topo_.TopologyForPhone(tuple.phone);
KALDI_ASSERT(static_cast<size_t>(tuple.hmm_state) < entry.size());
BaseFloat prob = entry[tuple.hmm_state].transitions[trans_index].second;
if (prob <= 0.0)
KALDI_ERR << "TransitionModel::InitializeProbs, zero "
"probability [should remove that entry in the topology]";
if (prob > 1.0)
KALDI_WARN << "TransitionModel::InitializeProbs, prob greater than one.";
log_probs_(trans_id) = Log(prob);
}
ComputeDerivedOfProbs();
}
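For each transition-id, the function looks up the transition probability prob stored in the topology at the corresponding transition-index, takes its log, and stores the result in log_probs_ under that transition-id.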
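The index arithmetic in InitializeProbs() can be isolated in a short sketch. ToyInitializeProbs() is a hypothetical helper: it assumes that the topology probabilities have already been gathered per transition-state into trans_probs, and it uses std::vector in place of Kaldi's Vector<BaseFloat>.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// state2id and id2state are the 1-based arrays described above.
// trans_probs[i] holds the topology probabilities of the transitions leaving
// transition-state i + 1, in transition-index order.
std::vector<float> ToyInitializeProbs(
    const std::vector<int> &state2id, const std::vector<int> &id2state,
    const std::vector<std::vector<float>> &trans_probs) {
  assert(!id2state.empty());
  int num_tids = static_cast<int>(id2state.size()) - 1;
  std::vector<float> log_probs(num_tids + 1, 0.0f);  // element 0 is unused
  for (int tid = 1; tid <= num_tids; tid++) {
    int tstate = id2state[tid];
    // Position of this transition among those leaving its transition-state.
    int trans_index = tid - state2id[tstate];
    float prob = trans_probs[tstate - 1][trans_index];
    assert(prob > 0.0f);  // zero-probability entries do not belong in a topology
    log_probs[tid] = std::log(prob);
  }
  return log_probs;
}
```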
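Programs in train_deltas.sh specific to triphone GMMs
gmm-init-model
- Example: gmm-init-model tree treeacc topo 1.mdl
- Purpose: initialize the GMMs from the decision tree tree and the tree statistics treeacc.
- Steps:
1. Read tree, treeacc, and topo.
2. Initialize the TransitionModel trans_model from tree and topo; trans_model stores, as Tuples (named Triple in older Kaldi versions), each phone and HMM state together with the corresponding pdf-id.
3. Call InitAmGmm() to initialize am_gmm; if old_tree_filename and old_model_filename are given, call InitAmGmmFromOld() instead. InitAmGmm() assigns the stats to the leaves of the decision tree (one pdf per leaf) and uses the count_, x, and x^2 statistics of each pdf to initialize the parameters weights_, means_invvars_, inv_vars_, and gconsts_ of that pdf's DiagGmm.
4. If --write-occs=1.occs is given, call GetOccs() to obtain the state occupancy of each pdf (the number of observations, that is, frames, assigned to that pdf) and write the occupancies to 1.occs.
5. Write trans_model and am_gmm to 1.mdl, which gives the initial GMM model.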
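Step 3 turns sufficient statistics into single-Gaussian parameters. The sketch below shows that computation for one leaf under stated assumptions: ToyDiagGauss and ToyInitFromStats() are hypothetical names, the weight is fixed at 1 because each pdf starts with one component, and the variance floor of 0.01 is an illustrative default rather than a value taken from these notes.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct ToyDiagGauss {
  float weight;
  std::vector<float> inv_vars;       // 1 / variance, per dimension
  std::vector<float> means_invvars;  // mean / variance, per dimension
  float gconst;                      // data-independent part of the log-likelihood
};

// count: number of frames at this leaf; x: sum of features; x2: sum of squares.
ToyDiagGauss ToyInitFromStats(double count, const std::vector<double> &x,
                              const std::vector<double> &x2,
                              double var_floor = 0.01) {
  assert(count > 0.0 && x.size() == x2.size());
  const double kLog2Pi = 1.8378770664093453;  // log(2 * pi)
  ToyDiagGauss g;
  g.weight = 1.0f;  // single component, so log(weight) contributes 0 to gconst
  double gc = -0.5 * kLog2Pi * static_cast<double>(x.size());
  for (size_t d = 0; d < x.size(); d++) {
    double mean = x[d] / count;
    double var = x2[d] / count - mean * mean;
    if (var < var_floor) var = var_floor;
    g.inv_vars.push_back(static_cast<float>(1.0 / var));
    g.means_invvars.push_back(static_cast<float>(mean / var));
    gc += -0.5 * std::log(var) - 0.5 * mean * mean / var;
  }
  g.gconst = static_cast<float>(gc);
  return g;
}
```

Storing inv_vars and means_invvars instead of the mean and variance lets the per-frame log-likelihood be computed as gconst plus two dot products with the feature vector and its element-wise square.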
gmm-mixup
- Example: gmm-mixup --mix-up=4000 1.mdl 1.occs 2.mdl
gmm-mixup --merge=2000 1.mdl 1.occs 2.mdl
- Purpose: increase the number of GMM mixture components, or merge mixture components.
- Steps:
1. Read trans_model and am_gmm from 1.mdl, and occs from 1.occs.
2. If mixdown != 0, call MergeByCount() on am_gmm; if mixup != 0, call SplitByCount() on am_gmm.
3. Write trans_model and the modified am_gmm to 2.mdl.
AmDiagGmm::SplitByCount()
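SplitByCount() calls AmDiagGmm::GetSplitTargets() on occs to obtain targets[i], the number of mixture components that each DiagGmm i in am_gmm should grow to. GetSplitTargets() gives additional components first to the pdfs with the most observations, using a priority queue. Then, for each DiagGmm i, it calls DiagGmm::Split() with targets[i] to add components. Split() repeatedly picks the component with the largest weight in weights_ and halves that weight, keeping one half and giving the other to a new component; the mean and variance parameters of the split component are copied to the new component, and a random perturbation is then added to the new component's mean and variance parameters.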
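Both halves of this procedure can be sketched briefly. The code below is a simplified illustration, not Kaldi's implementation: ToySplitTargets(), ToyGmm1D, and ToySplitLargest() are hypothetical names, every pdf is assumed to start with one component, the queue priority is assumed to be occupancy per component, the GMM is one-dimensional, and the perturbation is passed in as a fixed number instead of being drawn at random.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Distribute target_total components over the pdfs, always giving the next
// component to the pdf with the highest occupancy per component.
std::vector<int> ToySplitTargets(const std::vector<double> &occs,
                                 int target_total) {
  int num_pdfs = static_cast<int>(occs.size());
  assert(num_pdfs > 0 && target_total >= num_pdfs);
  std::vector<int> targets(num_pdfs, 1);
  std::priority_queue<std::pair<double, int>> queue;  // max-heap
  for (int i = 0; i < num_pdfs; i++) queue.push(std::make_pair(occs[i], i));
  for (int total = num_pdfs; total < target_total; total++) {
    int i = queue.top().second;
    queue.pop();
    targets[i]++;
    queue.push(std::make_pair(occs[i] / targets[i], i));
  }
  return targets;
}

struct ToyGmm1D {
  std::vector<float> weights;
  std::vector<float> means;
};

// Split the largest-weight component: halve its weight, copy its mean to a
// new component, and shift the new component's mean by perturb.
void ToySplitLargest(ToyGmm1D *gmm, float perturb) {
  assert(gmm != nullptr && !gmm->weights.empty());
  assert(gmm->weights.size() == gmm->means.size());
  size_t k = std::max_element(gmm->weights.begin(), gmm->weights.end()) -
             gmm->weights.begin();
  float half_weight = gmm->weights[k] * 0.5f;
  float new_mean = gmm->means[k] + perturb;
  gmm->weights[k] = half_weight;
  gmm->weights.push_back(half_weight);
  gmm->means.push_back(new_mean);
}
```

With occupancies {100, 10} and a target of 4 components, the sketch assigns 3 components to the first pdf and 1 to the second, which shows how rarely seen pdfs are left with few Gaussians.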
convert-ali
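The monophone GMM gives us alignments of the training data, but the TransitionModel of the monophone GMM (tm1) differs from that of the triphone GMM (tm2) in every data member, so alignments expressed in tm1's tids (transition-ids) must be converted to alignments expressed in tm2's tids. That is the job of convert-ali.
Understanding convert-ali requires a clear picture of TransitionModel, so it is best to work through TM before reading the program's code.
In my view the key point is that a tid identifies which HMM state of which phone is active. Given a tid and its TM, id2state_ gives the t-state, state2id_ together with the tid gives the t-idx, and indexing tuples_ by the t-state gives the tuple, which holds the phone and the HMM state-id. However the tid numbering of a feature vector changes, the phone and HMM state that the vector belongs to stay the same.
The conversion from old tids to new tids goes roughly as follows: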
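Parameters
Input:
old_trans_model: the transition model used by the original alignment.
new_trans_model: the transition model we want to use for the new alignment.
new_ctx_dep: the new decision tree.
old_alignment: the alignment we want to convert.
subsample_factor: the frame subsampling factor; normally 1, but may be greater than 1 when converting to a reduced-frame-rate system.
repeat_frames: only relevant when subsample_factor != 1. If true, the alignment frames are repeated subsample_factor times after conversion, so that the new alignment has the same length as the input alignment. [Note: this is actually done by interpolating subsample_factor separately generated alignments, to keep the phone boundaries the same as in the input where possible.]
reorder: true if the pdf-ids in the new alignment should be reordered (relative to the order in which they appear in the HmmTopology object).
phone_map: if non-empty, a map from old phones to new phones.
Output:
new_alignment: the converted alignment.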
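1. The function SplitToPhones(old_trans_model, old_alignment, &old_split_alignment)
SplitToPhones splits the transition-ids in alignment into their individual phones (one vector per phone instance).
On output, the sizes of the vectors in split_alignment sum to the size of alignment. The function returns true on success. If the alignment looks incomplete, for example because it does not end at the end state of a phone, the function still breaks it into phones but returns false. For more serious errors it dies or throws an exception. The function works out by itself whether the graph was created with reordering and handles both cases.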
SplitToPhones:
bool SplitToPhones ( const TransitionModel & trans_model,
const std::vector< int32 > & alignment,
std::vector< std::vector< int32 > > * split_alignment
)
{
KALDI_ASSERT(split_alignment != NULL);
split_alignment->clear();
bool is_reordered = IsReordered(trans_model, alignment);
return SplitToPhonesInternal(trans_model, alignment,
is_reordered, split_alignment);
}
IsReordered()
static bool kaldi::IsReordered ( const TransitionModel & trans_model,
const std::vector< int32 > & alignment
)
{
for (size_t i = 0; i + 1 < alignment.size(); i++) {
int32 tstate1 = trans_model.TransitionIdToTransitionState(alignment[i]),
tstate2 = trans_model.TransitionIdToTransitionState(alignment[i+1]);
if (tstate1 != tstate2) {
bool is_loop_1 = trans_model.IsSelfLoop(alignment[i]),
is_loop_2 = trans_model.IsSelfLoop(alignment[i+1]);
KALDI_ASSERT(!(is_loop_1 && is_loop_2)); // Invalid.
if (is_loop_1) return true; // Reordered. self-loop is last.
if (is_loop_2) return false; // Not reordered. self-loop is first.
}
}
// Just one trans-state in whole sequence.
if (alignment.empty()) return false;
else {
bool is_loop_front = trans_model.IsSelfLoop(alignment.front()),
is_loop_back = trans_model.IsSelfLoop(alignment.back());
if (is_loop_front) return false; // Not reordered. Self-loop is first.
if (is_loop_back) return true; // Reordered. Self-loop is last.
return false; // We really don't know in this case but calling code should
// not care.
}
}
SplitToPhonesInternal()
static bool kaldi::SplitToPhonesInternal
( const TransitionModel & trans_model,
const std::vector< int32 > & alignment,
bool reordered,
std::vector< std::vector< int32 > > * split_output
)
{
if (alignment.empty()) return true; // nothing to split.
std::vector<size_t> end_points; // points at which phones end [in an
// stl iterator sense, i.e. actually one past the last transition-id within
// each phone]..
bool was_ok = true;
for (size_t i = 0; i < alignment.size(); i++) {
int32 trans_id = alignment[i];
if (trans_model.IsFinal(trans_id)) { // is final-prob
if (!reordered) end_points.push_back(i+1);
else { // reordered.
while (i+1 < alignment.size() &&
trans_model.IsSelfLoop(alignment[i+1])) {
KALDI_ASSERT(trans_model.TransitionIdToTransitionState(alignment[i]) ==
trans_model.TransitionIdToTransitionState(alignment[i+1]));
i++;
}
end_points.push_back(i+1);
}
} else if (i+1 == alignment.size()) {
// need to have an end-point at the actual end.
// but this is an error- should have been detected already.
was_ok = false;
end_points.push_back(i+1);
} else {
int32 this_state = trans_model.TransitionIdToTransitionState(alignment[i]),
next_state = trans_model.TransitionIdToTransitionState(alignment[i+1]);
if (this_state == next_state) continue; // optimization.
int32 this_phone = trans_model.TransitionStateToPhone(this_state),
next_phone = trans_model.TransitionStateToPhone(next_state);
if (this_phone != next_phone) {
// The phone changed, but this is an error-- we should have detected this via the
// IsFinal check.
was_ok = false;
end_points.push_back(i+1);
}
}
}
size_t cur_point = 0;
for (size_t i = 0; i < end_points.size(); i++) {
split_output->push_back(std::vector<int32>());
// The next if-statement checks if the initial trans-id at the current end
// point is the initial-state of the current phone if that initial-state
// is emitting (a cursory check that the alignment is plausible).
int32 trans_state =
trans_model.TransitionIdToTransitionState(alignment[cur_point]);
int32 phone = trans_model.TransitionStateToPhone(trans_state);
int32 forward_pdf_class = trans_model.GetTopo().TopologyForPhone(phone)[0].forward_pdf_class;
if (forward_pdf_class != kNoPdf) // initial-state of the current phone is emitting
if (trans_model.TransitionStateToHmmState(trans_state) != 0)
was_ok = false;
for (size_t j = cur_point; j < end_points[i]; j++)
split_output->back().push_back(alignment[j]);
cur_point = end_points[i];
}
return was_ok;
}
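The core of the splitting logic, without the reordering and consistency checks, fits in a few lines. This sketch covers only the non-reordered case and replaces the call to trans_model.IsFinal() with a lookup table; ToySplitToPhones() and the is_final table are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <vector>

// is_final[tid] is true when transition-id tid enters the final state of its
// phone. Returns false if the alignment stops in the middle of a phone.
bool ToySplitToPhones(const std::vector<int> &alignment,
                      const std::vector<bool> &is_final,
                      std::vector<std::vector<int>> *split) {
  assert(split != nullptr);
  split->clear();
  std::vector<int> current;  // transition-ids of the phone being collected
  for (int tid : alignment) {
    assert(tid > 0 && tid < static_cast<int>(is_final.size()));
    current.push_back(tid);
    if (is_final[tid]) {  // the phone ends after this transition
      split->push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) {  // incomplete last phone: keep it, but report failure
    split->push_back(current);
    return false;
  }
  return true;
}
```

As in the Kaldi function, an incomplete alignment is still split into phones and the problem is reported only through the return value.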
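From a tid in the old alignment and the old model we know which HMM state the tid belongs to and its transition-index (trans-idx). The new phone window tells us the current center phone, and the new tree gives the pdf-id for that HMM state in that context, so we obtain the tuple (center phone, hmm-state, pdf-id, pdf-id). Looking this tuple up in the new model gives the new transition-state (t-state), and the t-state together with the trans-idx gives the new tid.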