A deep dive into part-of-speech tagging using the Viterbi algorithm

by Sachin Malhotra and Divya Godayal

Welcome back, Caretaker!

In case you’ve forgotten the problem we were trying to tackle in the previous article, let us revise it for you.

So there’s this naughty kid Peter and he’s going to pester his new caretaker, you!

As a caretaker, one of the most important tasks for you is to tuck Peter in bed and make sure he is sound asleep. Once you’ve tucked him in, you want to make sure that he’s actually asleep and not up to some mischief.

You cannot, however, enter the room again, as that would surely wake Peter up. All you can hear are the noises that might come from the room.

Either the room is quiet or there is noise coming from the room. These are your observations.

All you have as the caretaker are:

  • a set of observations, which is basically a sequence containing noise or quiet over time, and

  • A state diagram provided by Peter’s mom — who happens to be a neurological scientist — that contains all the different sets of probabilities that you can use to solve the problem defined below.

The problem

Given the state diagram and a sequence of N observations over time, we need to determine the state of the baby at the current point in time. Mathematically, we have N observations over times t0, t1, t2, …, tN. We want to find out whether Peter is awake or asleep, or rather which state is more probable at time tN+1.

In case any of this seems like Greek to you, go read the previous article to brush up on the Markov Chain Model, Hidden Markov Models, and Part of Speech Tagging.

In that previous article, we had briefly modeled the problem of Part of Speech tagging using the Hidden Markov Model.

The problem of Peter being asleep or not is just an example problem taken up for a better understanding of some of the core concepts involved in these two articles. At the core, the articles deal with solving the Part of Speech tagging problem using the Hidden Markov Models.

So, before moving on to the Viterbi Algorithm, let’s first look at a much more detailed explanation of how the tagging problem can be modeled using HMMs.

Generative Models and the Noisy Channel Model

A lot of problems in Natural Language Processing are solved using a supervised learning approach.

Supervised problems in machine learning are defined as follows. We assume training examples (x(1), y(1)), …, (x(m), y(m)), where each example consists of an input x(i) paired with a label y(i). We use X to refer to the set of possible inputs, and Y to refer to the set of possible labels. Our task is to learn a function f : X → Y that maps any input x to a label f(x).

In tagging problems, each x(i) would be a sequence of words X1 X2 X3 … Xn(i), and each y(i) would be a sequence of tags Y1 Y2 Y3 … Yn(i) (we use n(i) to refer to the length of the i'th training example). X would refer to the set of all sequences x1 … xn, and Y would be the set of all tag sequences y1 … yn. Our task would be to learn a function f : X → Y that maps sentences to tag sequences.

An intuitive approach to get an estimate for this problem is to use the conditional probability p(y | x), which is the probability of the output y given an input x. The parameters of the model would be estimated using the training samples. Finally, given an unknown input x we would like to find

f(x) = arg max(p(y | x)) ∀y ∊ Y

This is the conditional model for solving this generic problem given the training data. Another approach that is widely adopted in machine learning and natural language processing is to use a generative model.

Rather than directly estimating the conditional distribution p(y|x), in generative models we instead model the joint probability p(x, y) over all the (x, y) pairs.

We can further decompose the joint probability into simpler values using Bayes' rule, p(x, y) = p(y) * p(x | y), where:

  • p(y) is the prior probability of any input belonging to the label y.

  • p(x | y) is the conditional probability of input x given the label y.

We can use this decomposition and the Bayes rule to determine the conditional probability.

Remember, we wanted to estimate the function

f(x) = arg max( p(y|x) ) ∀y ∊ Y
f(x) = arg max( p(y) * p(x | y) )

The reason we skipped the denominator here is that the probability p(x) remains the same no matter which output label is being considered. And so, from a computational perspective, it is treated as a normalization constant and is normally ignored.

Models that decompose a joint probability into terms p(y) and p(x|y) are often called noisy-channel models. Intuitively, when we see a test example x, we assume that it has been generated in two steps:

  1. first, a label y has been chosen with probability p(y)

  2. second, the example x has been generated from the distribution p(x|y). The model p(x|y) can be interpreted as a “channel” which takes a label y as its input, and corrupts it to produce x as its output.

Generative Part of Speech Tagging Model

Let us assume a finite set of words V and a finite set of tags K. Then the set S will be the set of all sequence/tag-sequence pairs <x1, x2, x3 … xn, y1, y2, y3, …, yn> such that n > 0, ∀x ∊ V and ∀y ∊ K.

A generative tagging model is then one where every pair in S is assigned a probability p(<x1, x2, x3 … xn, y1, y2, y3, …, yn>) ≥ 0, and these probabilities sum to 1 over all of S.

Given a generative tagging model, the function that we talked about earlier from input to output becomes

f(x1 … xn) = arg max p(<x1 … xn, y1 … yn>), where the arg max is taken over all possible tag sequences y1 … yn

Thus for any given input sequence of words, the output is the highest probability tag sequence from the model. Having defined the generative model, we need to figure out three different things:

  1. How exactly do we define the generative model probability p(<x1, x2, x3 ... xn, y1, y2, y3, ..., yn>)

  2. How do we estimate the parameters of the model, and

  3. How do we efficiently calculate the most likely tag sequence for a given sentence under the model?

Let us look at how we can answer these three questions side by side, once for our example problem and then for the actual problem at hand: part of speech tagging.

Defining the Generative Model

Let us first look at how we can estimate the probability p(x1 .. xn, y1 .. yn) using the HMM.

We can have any N-gram HMM which considers events in the previous window of size N.

The formulas provided hereafter correspond to a Trigram Hidden Markov Model.

Trigram Hidden Markov Model

A trigram Hidden Markov Model can be defined using

  • A finite set of states.

  • A sequence of observations.

  • q(s|u, v)

    Transition probability: the probability of the state s appearing right after the states u and v in the sequence of states.

  • e(x|s)

    Emission probability: the probability of making an observation x given that the state was s.

Then, the generative model probability would be estimated as

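Under these definitions, the joint probability factors in the standard trigram-HMM way (a sketch, with y0 = y−1 = * and yn+1 = STOP, the special symbols discussed later in this article):

p(x1 … xn, y1 … yn+1) = Π(i = 1..n+1) q(yi | yi−2, yi−1) × Π(i = 1..n) e(xi | yi)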

As for the baby sleeping problem that we are considering, we will have only two possible states: the baby is either awake or asleep. The caretaker can make only two kinds of observations over time: either there is noise coming in from the room, or the room is absolutely quiet. The sequence of observations and states can be represented as follows:

Coming to the part of speech tagging problem, the states would be represented by the actual tags assigned to the words. The words would be our observations. The reason we say that the tags are our states is that in a Hidden Markov Model, the states are always hidden and all we have are the set of observations that are visible to us. Along similar lines, the sequence of states and observations for the part of speech tagging problem would be

Estimating the model's parameters

We will assume that we have access to some training data. The training data consists of a set of examples where each example is a sequence consisting of the observations, every observation being associated with a state. Given this data, how do we estimate the parameters of the model?

Estimating the model’s parameters is done by reading various counts off of the training corpus we have, and then computing maximum likelihood estimates:

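In terms of the counts defined in the list below, these maximum likelihood estimates take the usual ratio form (a small counting sketch in code follows the list):

q(s | u, v) = c(u, v, s) / c(u, v)
e(x | s) = c(s → x) / c(s)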

We already know that the first term represents transition probability and the second term represents the emission probability. Let us look at what the four different counts mean in the terms above.

  1. c(u, v, s) represents the trigram count of states u, v and s. Meaning it represents the number of times the three states u, v and s occurred together in that order in the training corpus.

  2. c(u, v), following along similar lines as the trigram count, is the bigram count of states u and v given the training corpus.

  3. c(s → x) is the number of times in the training set that the state s and observation x are paired with each other. And finally,

  4. c(s) is the number of times the state s occurs in the training corpus, that is, the number of observations labelled with s.

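Here is a minimal counting sketch in Python for these estimates, assuming the tagged corpus is available as a list of sentences, each a list of (word, tag) pairs (the function and variable names are illustrative, not taken from the original code):

from collections import defaultdict

def estimate_parameters(tagged_sentences):
    # One pass over the corpus to read off c(u, v, s), c(u, v), c(s -> x) and c(s),
    # then turn them into maximum likelihood estimates.
    trigram_counts = defaultdict(int)   # c(u, v, s)
    bigram_counts = defaultdict(int)    # c(u, v)
    emission_counts = defaultdict(int)  # c(s -> x)
    tag_counts = defaultdict(int)       # c(s)

    for sentence in tagged_sentences:
        tags = ['*', '*'] + [tag for _, tag in sentence] + ['STOP']
        for word, tag in sentence:
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
        for u, v, s in zip(tags, tags[1:], tags[2:]):
            trigram_counts[(u, v, s)] += 1
            bigram_counts[(u, v)] += 1

    q = {(u, v, s): count / bigram_counts[(u, v)]
         for (u, v, s), count in trigram_counts.items()}
    e = {(s, x): count / tag_counts[s]
         for (s, x), count in emission_counts.items()}
    return q, e

A single call such as estimate_parameters([[('the', 'DT'), ('dog', 'NN'), ('barks', 'VB')]]) returns both dictionaries in one pass over the data, which is the point made below about how cheap this step is.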

Let us look at a sample training set for the toy problem first and see the calculations for transition and emission probabilities using the same.

The BLUE markings represent the transition probability, and RED is for emission probability calculations.

Note that since the example problem only has two distinct states and two distinct observations, and given that the training set is very small, the calculations shown below for the example problem are using a bigram HMM instead of a trigram HMM.

Peter’s mother was maintaining a record of observations and states. And thus she even provided you with a training corpus to help you get the transition and emission probabilities.

Transition Probability Example:
Emission Probability Example:

That was quite simple, since the training set was very small. Let us look at a sample training set for our actual problem of part of speech tagging. Here we can consider a trigram HMM, and we will show the calculations accordingly.

We will use the following sentences as a corpus of training data (the notation word/TAG means word tagged with a specific part-of-speech tag).

The training set that we have is a tagged corpus of sentences. Every sentence consists of words tagged with their corresponding part of speech tags. For example, eat/VB means that the word is “eat” and its part of speech tag in this very context is “VB”, i.e. a verb. Let us look at a sample calculation for transition probability and emission probability, just like we saw for the baby sleeping problem.

Transition Probability

Let’s say we want to calculate the transition probability q(IN | VB, NN). For this, we see how many times we see a trigram (VB,NN,IN) in the training corpus in that specific order. We then divide it by the total number of times we see the bigram (VB,NN) in the corpus.

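In symbols, that is q(IN | VB, NN) = count(VB, NN, IN) / count(VB, NN).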

Emission Probability

Let’s say we want to find out the emission probability e(an | DT). For this, we see how many times the word “an” is tagged as “DT” in the corpus and divide it by the total number of times we see the tag “DT” in the corpus.

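In symbols, that is e(an | DT) = count(DT → an) / count(DT).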

So if you look at these calculations, it shows that calculating the model’s parameters is not computationally expensive. That is, we don’t have to do multiple passes over the training data to calculate these parameters. All we need are a bunch of different counts, and a single pass over the training corpus should provide us with that.

Let’s move on and look at the final step that we need to look at given a generative model. That step is efficiently calculating

We will be looking at the famous Viterbi Algorithm for this calculation.

Finding the most probable sequence — Viterbi Algorithm

Finally, we are going to solve the problem of finding the most likely sequence of labels given a set of observations x1 … xn. That is, we are to find the tag sequence y1 … yn that maximises p(x1 … xn, y1 … yn).

The probability here is expressed in terms of the transition and emission probabilities that we learned how to calculate in the previous section of the article. Just to remind you, the probability of a sequence of labels given a sequence of observations over n time steps is the product of the corresponding transition and emission probabilities.

Before looking at an optimized algorithm to solve this problem, let us first look at a simple brute force approach to this problem. Basically, we need to find out the most probable label sequence given a set of observations out of a finite set of possible sequences of labels. Let’s look at the total possible number of sequences for a small example for our example problem and also for a part of speech tagging problem.

Say we have the following set of observations for the example problem.

Noise     Quiet     Noise

We have two possible labels {Asleep and Awake}. Some of the possible sequence of labels for the observations above are:

Awake      Awake     Awake
Awake      Awake     Asleep
Awake      Asleep    Awake
Awake      Asleep    Asleep

In all we can have 2³ = 8 possible sequences. This might not seem like very many, but if we increase the number of observations over time, the number of sequences would increase exponentially. This is the case when we only had two possible labels. What if we have more? As is the case with part of speech tagging.

For example, consider the sentence

the dog barks

and assuming that the set of possible tags are {D, N, V}, let us look at some of the possible tag sequences:

D     D     D
D     D     N
D     D     V
D     N     D
D     N     N
D     N     V
... etc

Here, we would have 3³ = 27 possible tag sequences. And as you can see, the sentence was extremely short and the number of tags wasn't very many. In practice, we can have sentences that might be much larger than just three words. Then the number of unique labels at our disposal would also be too high to follow this enumeration approach and find the best possible tag sequence this way.

So the exponential growth in the number of sequences implies that for any reasonable length sentence, the brute force approach would not work out as it would take too much time to execute.

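To make the explosion concrete, here is a hedged brute-force sketch in Python for the three-word example above. It assumes bigram transition and emission probabilities are available as dictionaries q and e (illustrative names, not taken from the original code):

from itertools import product

words = ['the', 'dog', 'barks']
tags = ['D', 'N', 'V']

def score(tag_seq, q, e):
    # p(words, tags) under a bigram HMM: a product of transition and emission terms.
    prob, prev = 1.0, '*'
    for word, tag in zip(words, tag_seq):
        prob *= q.get((prev, tag), 0.0) * e.get((tag, word), 0.0)
        prev = tag
    return prob

def brute_force_best(q, e):
    # Enumerate all |K|^n candidate tag sequences: 3^3 = 27 here, but
    # exponential in the sentence length in general (e.g. 3^20 for 20 words).
    return max(product(tags, repeat=len(words)),
               key=lambda seq: score(seq, q, e))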

Instead of this brute force approach, we will see that we can find the highest probable tag sequence efficiently using a dynamic programming algorithm known as the Viterbi Algorithm.

Let us first define some terms that would be useful in defining the algorithm itself. We already know that the probability of a label sequence given a set of observations can be defined in terms of the transition probability and the emission probability. Mathematically, it is the product of those transition and emission terms.

Let us look at a truncated version of this,

r(y1 … yk) = Π(i = 1..k) q(yi | yi−2, yi−1) × Π(i = 1..k) e(xi | yi)

and let us call this the cost of a sequence of length k.

So the definition of “r” simply takes the first k terms from the definition of the probability, where k ∊ {1..n}, for any label sequence y1 … yk.

Next we have the set S(k, u, v), which is the set of all label sequences of length k that end with the bigram (u, v), i.e. all sequences y1 … yk with yk−1 = u and yk = v.

Finally, we define the term π(k, u, v), which is the maximum cost over that set: π(k, u, v) = max of r(y1 … yk) over all sequences in S(k, u, v).

The main idea behind the Viterbi Algorithm is that we can calculate the values of the term π(k, u, v) efficiently in a recursive, memoized fashion. In order to define the algorithm recursively, let us look at the base cases for the recursion.

π(0, *, *) = 1
π(0, u, v) = 0   for any (u, v) ≠ (*, *)

Since we are considering a trigram HMM, we would be considering all of the trigrams as a part of the execution of the Viterbi Algorithm.

Now, we can start the first trigram window from the first three words of the sentence, but then the model would miss out on those trigrams where the first word or the first two words occurred independently. For that reason, we consider two special start symbols, each written as *, and so our sentence becomes

*    *    x1   x2   x3   ......         xn

And the first trigram we consider then would be (*, *, x1) and the second one would be (*, x1, x2).

Now that we have all our terms in place, we can finally look at the recursive definition of the algorithm which is basically the heart of the algorithm.

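A sketch of that recursive definition, consistent with the terms defined above (w ranges over the possible tags at position k − 2):

π(k, u, v) = max over w of [ π(k − 1, w, u) × q(v | w, u) × e(xk | v) ]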

This definition is clearly recursive, because we are trying to calculate one π term and we are using another one with a lower value of k in the recurrence relation for it.

Every sequence would end with a special STOP symbol. For the trigram model, we would also have two special start symbols “*” in the beginning.

Have a look at the pseudo-code for the entire algorithm.

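Here is a minimal Python sketch of that dynamic program for the trigram model, assuming q and e are the probability dictionaries estimated earlier and tags is the tag set K (all names are illustrative):

def viterbi_trigram(sentence, tags, q, e):
    # sentence: list of observed words x1 .. xn
    # tags:     set of possible tags K
    # q[(w, u, v)] is q(v | w, u); e[(v, x)] is e(x | v)
    # '*' is the start symbol and 'STOP' the end symbol.
    n = len(sentence)

    def K(k):
        # Tags allowed at position k; positions 0 and -1 hold the start symbol.
        return {'*'} if k <= 0 else tags

    pi = {(0, '*', '*'): 1.0}   # base case
    bp = {}                     # back-pointers

    for k in range(1, n + 1):
        x = sentence[k - 1]
        for u in K(k - 1):
            for v in K(k):
                best_score, best_w = 0.0, None
                for w in K(k - 2):
                    score = (pi.get((k - 1, w, u), 0.0)
                             * q.get((w, u, v), 0.0)
                             * e.get((v, x), 0.0))
                    if score > best_score:
                        best_score, best_w = score, w
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_w

    # Best final bigram, taking the transition to STOP into account.
    # (This sketch assumes at least one sequence has non-zero probability;
    # smoothing, discussed later, guarantees that.)
    best_score, best_pair = 0.0, None
    for u in K(n - 1):
        for v in K(n):
            score = pi.get((n, u, v), 0.0) * q.get((u, v, 'STOP'), 0.0)
            if score > best_score:
                best_score, best_pair = score, (u, v)

    # Follow the back-pointers to recover the most likely tag sequence.
    y = [None] * (n + 1)
    y[n - 1], y[n] = best_pair
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:], best_score

The two nested loops over u and v plus the inner maximisation over w are what give the O(n|K|³) running time mentioned below.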

The algorithm first fills in the π(k, u, v) values using the recursive definition. It then uses the identity described before to calculate the highest probability for any sequence.

The running time for the algorithm is O(n|K|³), hence it is linear in the length of the sequence, and cubic in the number of tags.

NOTE: We would be showing calculations for the baby sleeping problem and the part of speech tagging problem based off a bigram HMM only. The calculations for the trigram are left to the reader to do themselves. But the code that is attached at the end of this article is based on a trigram HMM. It’s just that the calculations are easier to explain and portray for the Viterbi algorithm when considering a bigram HMM instead of a trigram HMM.

Therefore, before showing the calculations for the Viterbi Algorithm, let us look at the recursive formula based on a bigram HMM.

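As a sketch, mirroring the trigram recursion above but keeping track of only the previous tag:

π(k, v) = max over u of [ π(k − 1, u) × q(v | u) × e(xk | v) ]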

This one is extremely similar to the one we saw before for the trigram model, except that now we are only concerning ourselves with the current label and the one before, instead of two before. The complexity of the algorithm now becomes O(n|K|²).

Calculations for Baby Sleeping Problem

Now that we have the recursive formula ready for the Viterbi Algorithm, let us see a sample calculation, first for the example problem that we had, that is, the baby sleeping problem, and then for the part of speech tagging version.

Note that when we are at this step, that is, the calculations for the Viterbi Algorithm to find the most likely tag sequence given a set of observations over a series of time steps, we assume that transition and emission probabilities have already been calculated from the given corpus. Let’s have a look at a sample of transition and emission probabilities for the baby sleeping problem that we would use for our calculations of the algorithm.

The baby starts by being awake, and remains in the room for three time points, t1 … t3 (three iterations of the Markov chain). The observations are: quiet, quiet, noise. Have a look at the following diagram that shows the calculations for up to two time-steps. The complete diagram with all the final set of values will be shown afterwards.

We have not shown the calculations for the state of “asleep” at k = 2 and the calculations for k = 3 in the above diagram to keep things simple.

Now that we have all these calculations in place, we want to calculate the most likely sequence of states that the baby can be in over the different given time steps. So, for k = 2 and the state of Awake, we want to know the most likely state at k = 1 that transitioned to Awake at k = 2. (k = 2 represents a sequence of states of length 3 starting off from 0 and t = 2 would mean the state at time-step 2. We are given the state at t = 0 i.e. Awake).

Clearly, if the state at time-step 2 was AWAKE, then the state at time-step 1 would have been AWAKE as well, as the calculations point out. So, the Viterbi Algorithm not only helps us find the π(k) values, that is the cost values for all the sequences using the concept of dynamic programming, but it also helps us to find the most likely tag sequence given a start state and a sequence of observations, by storing back-pointers along the way (as in the sketch given earlier).

Calculations for the Part of Speech Tagging Problem

Let us look at a slightly bigger corpus for the part of speech tagging and the corresponding Viterbi graph showing the calculations and back-pointers for the Viterbi Algorithm.

Here is the corpus that we will consider:

Now take a look at the transition probabilities calculated from this corpus.

Here, q0 → VB represents the probability of a sentence starting off with the tag VB, that is the first word of a sentence being tagged as VB. Similarly, q0 → NN represents the probability of a sentence starting with the tag NN. Notice that out of 10 sentences in the corpus, 8 start with NN and 2 with VB and hence the corresponding transition probabilities.

As for the emission probabilities, ideally we should be looking at all the combinations of tags and words in the corpus. Since that would be too much, we will only consider emission probabilities for the sentence that would be used in the calculations for the Viterbi Algorithm.

Time flies like an arrow

The emission probabilities for the sentence above are:

Finally, we are ready to see the calculations for the given sentence, transition probabilities, emission probabilities, and the given corpus.

So, is that all there is to the Viterbi Algorithm?

Take a look at the example below.

The bucket below each word is filled with the possible tags seen next to the word in the training corpus. The given sentence can have multiple combinations of tags depending on which path we take. But there is a catch. Can you figure out what that is?

Were you able to figure it out?

No??

Let me tell you what it is.

There might be some path in the computation graph for which we do not have a transition probability. So our algorithm can just discard that path and take the other path.

In the above diagram, we discard the path marked in red since we do not have q(VB|VB). The training corpus never has a VB followed by VB. So in the Viterbi calculations, we end up taking q(VB|VB) = 0. And if you’ve been following the algorithm along closely, you would find that a single 0 in the calculations would make the entire probability or the maximum cost for a sequence of tags / labels to be 0.

This however means that we are ignoring the combinations which are not seen in the training corpus.

Is that the right way to approach the real world examples?

Consider a small tweak in the above sentence.

In this sentence we do not have any alternative path. Even if we have a Viterbi probability computed up to the word “like”, we cannot proceed further, since both q(VB|VB) = 0 and q(VB|IN) = 0. What do we do now?

The corpus that we considered here was very small. Consider any reasonably sized corpus with a lot of words and we have a major problem of sparsity of data. Take a look below.

That means that we can have a potential 68 billion bigrams, but the number of words in the corpus is just under a billion. That is a huge number of zero transition probabilities to fill up. The problem of sparsity of data is even more severe if we are considering trigrams.

To solve this problem of data sparsity, we resort to a solution called Smoothing.

Smoothing

The idea behind Smoothing is just this:

  1. Discount the existing probability values somewhat, and

  2. Reallocate this probability to the zeroes, i.e. the unseen events.

In this way, we redistribute the non zero probability values to compensate for the unseen transition combinations. Let us consider a very simple type of smoothing technique known as Laplace Smoothing.

Laplace smoothing is also known as add-one smoothing. You will understand exactly why it goes by that name in a moment. Let's revise how the parameters of a trigram HMM are calculated given a training corpus.

The things that can go wrong here are:

  1. c(u, v, s) is 0

  2. c(u, v) is 0

  3. We get an unknown word in the test sentence, and we don’t have any training tags associated with it.

All these can be solved via smoothing. So the Laplace smoothing counts would become

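A sketch of the add-λ form for the transition estimate, consistent with the description that follows (the emission estimate is smoothed analogously):

q(s | u, v) = ( c(u, v, s) + λ ) / ( c(u, v) + λ × V )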

Here V is the total number of tags in our corpus and λ is basically a real value between 0 and 1. It acts like a discounting factor. A λ = 1 value would give us too much of a redistribution of values of probabilities. For example:

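As an illustration with made-up numbers (chosen for concreteness, not taken from the original corpus): suppose V = 500 tags, the bigram (VB, NN) was seen c(VB, NN) = 10 times, and only 5 distinct tags ever followed it. With λ = 1, each of the roughly 495 unseen continuations gets (0 + 1) / (10 + 500) ≈ 0.002, so collectively the unseen events receive about 495/510 ≈ 97% of the probability mass, leaving the continuations actually observed after (VB, NN) to share the remaining few percent.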

Too much of a weight is given to unseen trigrams for λ = 1 and that is why the above mentioned modified version of Laplace Smoothing is considered for all practical applications. The value of the discounting factor is to be varied from one application to another.

Note that λ = 1 would only create a problem if the vocabulary size is too large. For a smaller corpus, λ = 1 would give us a good performance to start off with.

A thing to note about Laplace Smoothing is that it is a uniform redistribution, that is, all the trigrams that were previously unseen would have equal probabilities. So, suppose we are given some data and we observe that

  • Frequency of trigram <gave, the, thing> is zero

  • Frequency of trigram <gave, the, think> is also zero

  • Uniform distribution over unseen events means:

    P(thing|gave, the) = P(think|gave, the)

Does that reflect our knowledge about English usage? Ideally P(thing|gave, the) > P(think|gave, the), but a uniform distribution using Laplace smoothing will not consider this.

This means that millions of unseen trigrams in a huge corpus would have equal probabilities when they are being considered in our calculations. That is probably not the right thing to do. However, it is better than assigning 0 probabilities, which would lead to these trigrams, and eventually some paths in the Viterbi graph, being ignored completely. But this still needs to be worked upon and made better.

There are, however, a lot of different types of smoothing techniques that improve upon the basic Laplace Smoothing technique and help overcome this problem of uniform distribution of probabilities. Some of these techniques are:

  • Good-Turing estimate

  • Jelinek-Mercer smoothing (interpolation)

  • Katz smoothing (backoff)

  • Witten-Bell smoothing

  • Absolute discounting

  • Kneser-Ney smoothing

To read about these different types of smoothing techniques in more detail, refer to this tutorial. Which smoothing technique to choose highly depends upon the type of application at hand, the type of data being considered, and also on the size of the data set.

If you have been following along this lengthy article, then I must say

Let’s move on and look at a slight optimization that we can do to the Viterbi algorithm that can reduce the number of computations and that also makes sense for a lot of data sets out there.

Before that, however, look at the pseudo-code for the algorithm once again.

If we look closely, we can see that for every trigram of words, we are considering all possible sets of tags. That is, if the number of tags is V, then we are considering |V|³ combinations for every trigram of the test sentence.

Ignore the trigram for now and just consider a single word. We would be considering all of the unique tags for a given word in the above mentioned algorithm. Consider a corpus where we have the word “kick” which is associated with only two tags, say {NN, VB}, and the total number of unique tags in the training corpus is around 500 (it's a huge corpus).

Now the problem here is apparent. We might end up assigning a tag that doesn’t make sense with the word under consideration, simply because the transition probability of the trigram ending at the tag was very high, like in the example shown above. Also, it would be computationally inefficient to consider all 500 tags for the word “kick” if it only ever occurs with two unique tags in the entire corpus.

So, the optimization we do is that for every word, instead of considering all the unique tags in the corpus, we just consider the tags that it occurred with in the corpus.

This would work because, for a reasonably large corpus, a given word would ideally occur with all the various set of tags with which it can occur (most of them at-least). Then it would be reasonable to simply consider just those tags for the Viterbi algorithm.

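A sketch of that optimization, assuming the same list-of-(word, tag)-sentences corpus format as in the earlier counting sketch (names are illustrative):

from collections import defaultdict

def build_tag_dictionary(tagged_sentences):
    # Map each word to the set of tags it was actually seen with in the corpus.
    word_tags = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tags[word].add(tag)
    return word_tags

def candidate_tags(word, word_tags, all_tags):
    # Restrict the Viterbi search to the observed tags for this word;
    # fall back to the full tag set for unknown words.
    return word_tags.get(word, all_tags)

Inside the Viterbi loops, K(k) would then return candidate_tags(sentence[k - 1], word_tags, tags) instead of the full tag set.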

As far as the Viterbi decoding algorithm is concerned, the complexity still remains the same because we are always concerned with the worst case complexity. In the worst case, every word occurs with every unique tag in the corpus, and so the complexity remains at O(n|V|³) for the trigram model and O(n|V|²) for the bigram model.

For the recursive implementation of the code, please refer to

DivyaGodayal/HMM-POS-Tagger — An HMM based Part of Speech Tagger implementation using Laplace Smoothing and Trigram HMMs (github.com)

The recursive implementation is done along with Laplace Smoothing.

For the iterative implementation, refer to

edorado93/HMM-Part-of-Speech-Tagger — An HMM based Part of Speech Tagger (github.com)

This implementation is done with the One-Count Smoothing technique, which leads to better accuracy as compared to Laplace Smoothing.

A lot of snapshots of formulas and calculations in the two articles are derived from here.

Do let us know how this blog post helped you, and point out any mistakes you find while reading the article in the comments section below. Also, please recommend (by clapping) and spread the love as much as possible for this post if you think it might be useful for someone.

Translated from: https://www.freecodecamp.org/news/a-deep-dive-into-part-of-speech-tagging-using-viterbi-algorithm-17c8de32e8bc/
