C2W2.Assignment.Parts-of-Speech Tagging (POS).Part3

Article link: https://blog.csdn.net/oldmao_2001/article/details/140578159

Theory lecture: C2W2.Part-of-Speech (POS) Tagging and Hidden Markov Models

Table of contents

3 Viterbi Algorithm
4 Predicting on a data set
- Exercise 08


3 Viterbi Algorithm

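The Viterbi algorithm is a dynamic-programming technique for the decoding problem in a Hidden Markov Model (HMM): given a sequence of observations, find the most likely sequence of hidden states. It is named after Andrew Viterbi, who proposed it in 1967. It has three steps: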

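Initialization: initialize the best_paths and best_probabilities matrices, which are then filled in during the feed-forward step.
Feed forward: at each step, compute the probability of each path and the best path up to that point.
Feed backward: trace back the best path, the one with the highest probability.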

3.1 Initialization

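Write code to initialize two matrices, best_probs and best_paths.

best_probs: each cell holds the probability of going from a POS tag to a word in the corpus.
best_paths: a matrix that helps you trace the best possible path through the corpus.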

Exercise 05

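Both matrices are initialized to 0, except for column 0 of best_probs.

Column 0 of best_probs is initialized under the assumption that the first word of the corpus is preceded by a start token ("--s--").
This makes it possible to look up the transition probability in the A matrix. If it is unclear why a start token is needed, review the notes from the theory lecture.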

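Column 0 of best_probs is initialized as follows:

The probability of the best path from the start index to a given POS tag with integer index $i$ is written $\textrm{best\_probs}[s_{idx}, i]$.
It is estimated as the product of two terms: the probability that the start tag transitions to the POS tag with index $i$, $\mathbf{A}[s_{idx}, i]$, and the probability that this POS tag emits the first word of the corpus, $\mathbf{B}[i, vocab[corpus[0]]]$.
Note that vocab[corpus[0]] refers to the first word of the corpus (the word at position 0).
vocab is a dictionary that returns the unique integer assigned to a given word.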

The formula is:
$\textrm{best\_probs}[s_{idx}, i] = \mathbf{A}[s_{idx}, i] \times \mathbf{B}[i, vocab[corpus[0]]]$
To avoid the numerical underflow caused by multiplying small probabilities, take logarithms so that the product becomes a sum, and store the result in row $i$, column 0 of the matrix:
$best\_probs[i,0] = \log(A[s_{idx}, i]) + \log(B[i, vocab[corpus[0]]])$
Because the logarithm of 0 is undefined (it tends to negative infinity), when $A[s_{idx}, i] == 0$ the code sets $best\_probs[i,0] = float('-inf')$ directly.

The whole function can be summarized as:
$\textrm{if}\ A[s_{idx}, i] \neq 0 : best\_probs[i,0] = \log(A[s_{idx}, i]) + \log(B[i, vocab[corpus[0]]])$

$\textrm{if}\ A[s_{idx}, i] == 0 : best\_probs[i,0] = float('-inf')$
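For a corpus that begins with "Loss tracks upward", the initialization looks like this:
[Figure: initialization of column 0 of best_probs for "Loss tracks upward"]
Positive and negative infinity are written as: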

float('inf')
float('-inf')
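To make the underflow argument concrete, here is a minimal sketch. The list of 100 probabilities of 1e-5 is a made-up input chosen only to force the effect; it does not come from the assignment data.

```python
import math

# Hypothetical input: 100 small probabilities (not taken from the assignment data)
probs = [1e-5] * 100

# Multiplying directly: 1e-500 is far below the smallest positive float,
# so the running product collapses to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps the same information as a finite number
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # a finite negative number, about -1151.29

# Negative infinity compares below every finite log probability,
# which is why it can stand in for "probability zero"
print(float('-inf') < log_sum)   # True
```

The last comparison is what lets the later max-style comparisons ignore impossible transitions without any special casing.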

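The initialize function is explained below.
Input parameters:

states: a list of all possible part-of-speech (POS) tags.
tag_counts: a dictionary mapping each POS tag to its number of occurrences.
A: the transition matrix, of dimension (num_tags, num_tags), giving the probability of moving from one POS tag to another.
B: the emission (observation) matrix, of dimension (num_tags, len(vocab)), giving the probability of a word given a POS tag.
corpus: the list of words to be tagged.
vocab: the vocabulary dictionary, whose keys are words and whose values are indices.

Output:

best_probs: a matrix of dimension (num_tags, len(corpus)) holding, for each POS tag and each word in the sequence, the highest log probability.
best_paths: a matrix of dimension (num_tags, len(corpus)) holding, for each POS tag and each word in the sequence, the best path (as an integer index).

Function logic:

Get the number of POS tags and store it in num_tags.
Initialize the best_probs matrix to zeros, with POS tags as rows and the words of the corpus as columns.
Initialize the best_paths matrix to zeros with an integer data type, with POS tags as rows and the words of the corpus as columns.
Look up the start token --s-- in the states list; its index is s_idx.
Loop over every POS tag:
- Check whether the transition probability from the start token to POS tag i is zero:
  - If it is zero, initialize column 0 of best_probs for POS tag i (the entry for the first word) to negative infinity, which represents a probability of zero in log space.
  - If it is non-zero, compute column 0 of best_probs for POS tag i with the formula. Log probabilities are used to avoid numerical underflow; the formula combines the transition probability from the start token to POS tag i with the probability of observing the first word under POS tag i.
Return the initialized best_probs and best_paths matrices.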
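The per-tag loop described above can also be written as a single array operation. The sketch below is one possible vectorized variant, not the graded solution: the helper name init_first_column and the tiny 3-tag, 2-word matrices are invented for illustration.

```python
import numpy as np

def init_first_column(A, B, s_idx, first_word_idx):
    """Hypothetical helper: log probabilities for column 0 of best_probs."""
    trans = A[s_idx, :]            # P(tag i | start token), one entry per tag
    emit = B[:, first_word_idx]    # P(first word | tag i), one entry per tag
    # np.log(0) evaluates to -inf; silence the divide-by-zero warning it raises
    with np.errstate(divide="ignore"):
        col = np.log(trans) + np.log(emit)
    # Make the zero-transition case explicit, mirroring the if/else in the text
    return np.where(trans == 0, -np.inf, col)

# Toy matrices (assumed values): tag 0 plays the role of the start token
A_toy = np.array([[0.0, 0.5, 0.5],
                  [0.2, 0.4, 0.4],
                  [0.2, 0.4, 0.4]])
B_toy = np.array([[0.5, 0.5],
                  [0.25, 0.75],
                  [0.1, 0.9]])

col0 = init_first_column(A_toy, B_toy, s_idx=0, first_word_idx=0)
print(col0)
```

The explicit loop in the assignment is easier to follow line by line; the vectorized form mainly matters when the number of tags is large.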

import numpy as np

# UNQ_C5 GRADED FUNCTION: initialize
def initialize(states, tag_counts, A, B, corpus, vocab):
    '''
    Input: 
        states: a list of all possible parts-of-speech
        tag_counts: a dictionary mapping each tag to its respective count
        A: Transition Matrix of dimension (num_tags, num_tags)
        B: Emission Matrix of dimension (num_tags, len(vocab))
        corpus: a sequence of words whose POS is to be identified in a list 
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        best_probs: matrix of dimension (num_tags, len(corpus)) of floats
        best_paths: matrix of dimension (num_tags, len(corpus)) of integers
    '''
    # Get the total number of unique POS tags
    num_tags = len(tag_counts)
    
    # Initialize best_probs matrix 
    # POS tags in the rows, number of words in the corpus as the columns
    best_probs = np.zeros((num_tags, len(corpus)))
    
    # Initialize best_paths matrix
    # POS tags in the rows, number of words in the corpus as columns
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    
    # Define the start token
    s_idx = states.index("--s--")
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    
    # Go through each of the POS tags
    for i in range(num_tags):
        
        # Handle the special case when the transition from start token to POS tag i is zero
        if A[s_idx, i] == 0:
            
            # Initialize best_probs at POS tag 'i', column 0, to negative infinity
            best_probs[i, 0] = float("-inf")
        
        # For all other cases when transition from start token to POS tag i is non-zero:
        else:
            
            # Initialize best_probs at POS tag 'i', column 0
            # Check the formula in the instructions above
            best_probs[i, 0] = np.log(A[s_idx, i]) + np.log(B[i, vocab[corpus[0]]])
            
            
    ### END CODE HERE ### 
    return best_probs, best_paths

Run:

best_probs, best_paths = initialize(states, tag_counts, A, B, prep, vocab)

Save the results:

import pickle
with open('./support_files/best_probs_initilized.pkl', 'wb') as file:
    # Serialize the best_probs matrix and save it to the file with pickle.dump()
    pickle.dump(best_probs, file)

with open('./support_files/best_paths_initilized.pkl', 'wb') as file:
    # Serialize the best_paths matrix and save it to the file with pickle.dump()
    pickle.dump(best_paths, file)
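A saved matrix can be read back with pickle.load. The following is a small round-trip sketch; the temporary directory, the file name best_paths_demo.pkl and the 2x3 demo array are placeholders and not part of the assignment files.

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder matrix standing in for a saved best_paths array
best_paths_demo = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "best_paths_demo.pkl")  # hypothetical file name

    # Save, then load the same object again
    with open(demo_path, "wb") as f:
        pickle.dump(best_paths_demo, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

# The restored array has the same contents and dtype as the original
print(np.array_equal(best_paths_demo, restored))  # True
```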

Test:

# Test the function
print(f"best_probs[0,0]: {best_probs[0,0]:.4f}")
print(f"best_paths[2,3]: {best_paths[2,3]:.4f}")

Result:
best_probs[0,0]: -22.6098
best_paths[2,3]: 0.0000

3.2 Viterbi Forward

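Implement the viterbi_forward function to continue filling the best_probs and best_paths matrices.

Walk through the corpus.
For each word, compute the probability of every possible tag.
Unlike the earlier predict_pos algorithm (section 1.2 of Part1), this includes the path leading up to that (word, tag) combination.

Here is an example with the three-word corpus "Loss tracks upward":

For readability, the figure shows only some of the states (POS tags).
In the figure below, the first word 'Loss' has already been initialized.
The algorithm computes the candidate tag probabilities for the second word and every word after it.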

[Figure: Viterbi forward pass for "Loss tracks upward", showing a subset of the POS tags]

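Compute the probability that the POS tag of the second word ('tracks') is VBZ (verb, third person singular present).

In the best_probs matrix, go to the column of the second word ('tracks') and row 40 (VBZ); this cell is highlighted in light orange in the figure.
Examine every path that starts from a tag of the first word ('Loss') and choose the most likely one.
The path from ('Loss', NN) to ('tracks', VBZ) is one example of such a path.
On the path from ('Loss', NN) to ('tracks', VBZ), the log probability that the first word 'Loss' has the POS tag NN is $-14.32$. The best_probs matrix holds this value in column 'Loss', row 'NN'.
Find the probability of NN transitioning to VBZ. To do so, go to the A transition matrix, row 'NN' and column 'VBZ'. The value is $4.37e-02$, circled in the figure, so the running sum is $-14.32 + \log(4.37e-02)$.
Find the log probability that the tag VBZ emits the word 'tracks'. Look at row 'VBZ' and column 'tracks' of the B emission matrix. The value $4.61e-04$ is circled in the figure, so the sum becomes $-14.32 + \log(4.37e-02) + \log(4.61e-04)$.
This sum is $-25.13$. Store $-25.13$ in the best_probs matrix at row 'VBZ', column 'tracks' (the cell highlighted in light orange in the figure).
All the other paths in best_probs are computed in the same way. Note that $-25.13$ is greater than all the other values in the 'tracks' column of best_probs, so the most likely path to 'VBZ' comes from 'NN'. 'NN' is in row 20 of best_probs, so $20$ is the most likely path.
Store the most likely path, $20$, in the best_paths table. It is highlighted in light orange in the figure.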
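The arithmetic in this walkthrough is easy to check. The short sketch below uses only the three numbers quoted in the text; the variable names are chosen here for readability.

```python
import math

prev_log_prob = -14.32   # best_probs entry for ('Loss', NN), from the text
transition = 4.37e-02    # A entry for NN -> VBZ, from the text
emission = 4.61e-04      # B entry for (VBZ, 'tracks'), from the text

# Log-space score of the path ('Loss', NN) -> ('tracks', VBZ)
prob = prev_log_prob + math.log(transition) + math.log(emission)

print(round(prob, 2))  # -25.13
```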

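For the word at index $i$ of the corpus, the previous word at index $i-1$, the current POS tag $j$ and the previous POS tag $k$, the probability and path are computed as:
$\mathrm{prob} = \mathbf{best\_probs}_{k, i-1} + \log(\mathbf{A}_{k, j}) + \log(\mathbf{B}_{j, vocab(corpus_{i})})$
where $corpus_{i}$ is the word at index $i$ of the corpus and $vocab$ is the dictionary that maps each word to its integer index.
$\mathrm{path} = k$
where $k$ is the integer ID of the previous POS tag.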
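For a single word, the formula can be evaluated for all pairs $(k, j)$ at once with NumPy broadcasting. This is only a sketch of one column update under assumed inputs: the helper name forward_step and the two-tag toy numbers are not from the assignment.

```python
import numpy as np

def forward_step(prev_col, A, emit_col):
    """Hypothetical helper: compute one column of best_probs and best_paths.

    prev_col[k] is best_probs[k, i-1]; emit_col[j] is B[j, vocab[corpus[i]]].
    """
    # scores[k, j] = best_probs[k, i-1] + log(A[k, j]) + log(B[j, word i])
    with np.errstate(divide="ignore"):   # allow log(0) = -inf without a warning
        scores = prev_col[:, None] + np.log(A) + np.log(emit_col)[None, :]
    # For each current tag j: best score, and the previous tag k that achieved it
    return scores.max(axis=0), scores.argmax(axis=0)

# Toy values (assumed): two tags, previous column given as log probabilities
prev_col = np.log(np.array([0.72, 0.04]))
A_toy = np.array([[0.3, 0.7],
                  [0.6, 0.4]])
emit_col = np.array([0.1, 0.8])

new_probs, new_paths = forward_step(prev_col, A_toy, emit_col)
print(np.exp(new_probs))  # approximately [0.0216, 0.4032]
print(new_paths)          # [0 0]
```

The triple loop in the exercise computes the same quantities one (k, j) pair at a time, which is slower but mirrors the pseudocode directly.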

Exercise 06

Implement the viterbi_forward function; the pseudocode below can serve as a guide:

for each word in the corpus

    for each POS tag type that this word may be
    
        for POS tag type that the previous word could be
        
            compute the probability that the previous word had a given POS tag, that the current word has a given POS tag, and that the POS tag would emit this current word.
            
            retain the highest probability computed for the current word
            
            set best_probs to this highest probability
            
            set best_paths to the index 'k', representing the POS tag of the previous word which produced the highest probability

Code:

import math

# UNQ_C6 GRADED FUNCTION: viterbi_forward
def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab, verbose=True):
    '''
    Input: 
        A, B: The transition and emission matrices respectively
        test_corpus: a list containing a preprocessed corpus
        best_probs: an initialized matrix of dimension (num_tags, len(corpus))
        best_paths: an initialized matrix of dimension (num_tags, len(corpus))
        vocab: a dictionary where keys are words in vocabulary and value is an index 
    Output: 
        best_probs: a completed matrix of dimension (num_tags, len(corpus))
        best_paths: a completed matrix of dimension (num_tags, len(corpus))
    '''
    # Get the number of unique POS tags (which is the num of rows in best_probs)
    num_tags = best_probs.shape[0]
    
    # Go through every word in the corpus starting from word 1
    # Recall that word 0 was initialized in `initialize()`
    for i in range(1, len(test_corpus)): 
        
        # Print number of words processed, every 5000 words
        if i % 5000 == 0 and verbose:
            print("Words processed: {:>8}".format(i))
            
       ### START CODE HERE (Replace instances of 'None' with your code EXCEPT the first 'best_path_i = None') ###
        # For each unique POS tag that the current word can be
        for j in range(num_tags): # complete this line
            
            # Initialize best_prob for word i to negative infinity
            best_prob_i = float("-inf")
            
            # Initialize best_path for current word i to None
            best_path_i = None

            # For each POS tag that the previous word can be:
            for k in range(num_tags): # complete this line
            
                # Calculate the probability = 
                # best probs of POS tag k, previous word i-1 + 
                # log(prob of transition from POS k to POS j) + 
                # log(prob that emission of POS j is word i)
                prob = best_probs[k,i-1]+math.log(A[k,j]) +math.log(B[j,vocab[test_corpus[i]]])

                # check if this path's probability is greater than
                # the best probability up to and before this point
                if prob > best_prob_i: # complete this line
                    
                    # Keep track of the best probability
                    best_prob_i = prob
                    
                    # keep track of the POS tag of the previous word
                    # that is part of the best path.  
                    # Save the index (integer) associated with 
                    # that previous word's POS tag
                    best_path_i = k

            # Save the best probability for the 
            # given current word's POS tag
            # and the position of the current word inside the corpus
            best_probs[j,i] = best_prob_i
            
            # Save the unique integer ID of the previous POS tag
            # into best_paths matrix, for the POS tag of the current word
            # and the position of the current word inside the corpus.
            best_paths[j,i] = best_path_i

        ### END CODE HERE ###
    return best_probs, best_paths

Run:

best_probs, best_paths = viterbi_forward(A, B, prep, best_probs, best_paths, vocab)

Result:
Words processed: 5000
Words processed: 10000
Words processed: 15000
Words processed: 20000
Words processed: 25000
Words processed: 30000
Test:

# Test this function 
print(f"best_probs[0,1]: {best_probs[0,1]:.4f}") 
print(f"best_probs[0,4]: {best_probs[0,4]:.4f}")

Result:
best_probs[0,1]: -24.7822
best_probs[0,4]: -49.5601

3.3 Viterbi Backward

Use the completed best_paths and best_probs matrices and walk backward to obtain the predicted POS tag of each word.
The example "Loss tracks upward" is used again below.

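The POS tag of 'upward' is RB.

In the best_probs table, pick the most likely POS tag for the last word in the corpus, 'upward'.
Look for the row with the highest probability in the 'upward' column.
In row 28 of best_probs the estimated log probability is -34.99, which is greater than the other values in that column. The most likely POS tag for 'upward' is therefore RB (adverb), at row 28 of best_probs.
The variable z is an array that stores the unique integer ID of the predicted POS tag for each word in the corpus. Position 2 of z stores the value 28, meaning that 'upward' (the word at index 2 of the corpus) most likely has the POS tag with unique ID 28, i.e. RB.
The variable pred holds the POS tags as strings, so pred stores the string RB at index 2.

The POS tag of 'tracks' is VBZ.

The next step is the previous word ('tracks'). Since the most likely POS tag for 'upward' is RB, which is uniquely identified by integer ID 28, go to row 28, column 2 of the best_paths matrix. The value stored there is the unique ID of the POS tag of the previous word. In this example the stored value is 40, the unique ID of the POS tag VBZ (verb, third person singular present).
So the previous word, 'tracks' at index 1 of the corpus, most likely has the POS tag with unique ID 40, i.e. VBZ.
Store the value 40 at position 1 of array z, and store the string VBZ in array pred to record that 'tracks' most likely has the POS tag VBZ.

The POS tag of 'Loss' is NN.

In column 1 of best_paths, the unique ID stored in row 40 is 20. 20 is the unique ID of the POS tag NN.
Store 20 at position 0 of array z, and store NN at position 0 of array pred.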

Exercise 07

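Implement the viterbi_backward function, which returns a list of the predicted POS tags for the words.
Notes:
Indexing starts at 0, not 1;
m is the number of words in the corpus;
corpus indices range from 0 to m - 1;
the column indices of the best_probs and best_paths matrices also range from 0 to m - 1.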

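Step 1:
Loop over all rows (POS tags) in the last column of best_probs and find the row (POS tag) with the maximum value.
Use the states list to convert the unique integer ID into the tag (its string representation).

For the three-word corpus above:

z[2] = 28: for the word 'upward' at position 2 of the corpus, the POS tag ID is 28. Store 28 at position 2 of z.
states[28] is 'RB': POS tag ID 28 refers to the POS tag 'RB'.
pred[2] = 'RB': in array pred, store the POS tag of the word 'upward'.

Step 2:

Start at the last column of best_paths, using the most likely POS tag that best_probs gave for the last word in the corpus.
Then use best_paths to find the most likely POS tag for the previous word.
Update the POS tag of each word in z and pred.

For the three-word example above, read column 2 of best_paths and fill in position 1 of z.
z[1] = best_paths[z[2],2]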
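The backward walk for the three-word example can be replayed on toy matrices. In the sketch below only the indices 28, 40 and 20 and the value -34.99 come from the text; the matrix height of 46 rows and the id_to_tag mapping are assumptions made so that the example runs on its own.

```python
import numpy as np

num_tags, m = 46, 3   # 46 rows is an assumption; any height above 40 works here

# Every cell starts as "impossible" so that only the entries set below matter
best_probs = np.full((num_tags, m), -np.inf)
best_paths = np.zeros((num_tags, m), dtype=int)

best_probs[28, 2] = -34.99   # RB is the best tag for 'upward' (word 2)
best_paths[28, 2] = 40       # RB at word 2 was reached from VBZ at word 1
best_paths[40, 1] = 20       # VBZ at word 1 was reached from NN at word 0

# Partial ID-to-tag mapping, restricted to the three tags named in the text
id_to_tag = {20: "NN", 28: "RB", 40: "VBZ"}

# Step 1: best tag for the last word
z = [None] * m
z[m - 1] = int(np.argmax(best_probs[:, m - 1]))

# Step 2: follow best_paths from the last word back to word 1
for i in range(m - 1, 0, -1):
    z[i - 1] = int(best_paths[z[i], i])

pred = [id_to_tag[tag_id] for tag_id in z]
print(z)     # [20, 40, 28]
print(pred)  # ['NN', 'VBZ', 'RB']
```

Note that the loop stops at i = 1: column 0 of best_paths carries no predecessor, so there is nothing to look up for the first word.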

# UNQ_C7 GRADED FUNCTION: viterbi_backward
def viterbi_backward(best_probs, best_paths, corpus, states):
    '''
    This function returns the best path.
    
    '''
    # Get the number of words in the corpus
    # which is also the number of columns in best_probs, best_paths
    m = best_paths.shape[1] 
    
    # Initialize array z, same length as the corpus
    z = [None] * m
    
    # Get the number of unique POS tags
    num_tags = best_probs.shape[0]
    
    # Initialize the best probability for the last word
    best_prob_for_last_word = float('-inf')
    
    # Initialize pred array, same length as corpus
    pred = [None] * m
    
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    ## Step 1 ##
    
    # Go through each POS tag for the last word (last column of best_probs)
    # in order to find the row (POS tag integer ID) 
    # with highest probability for the last word
    for k in range(num_tags): # Replace None in this line with the proper range.

        # If the probability of POS tag at row k 
        # is better than the previously best probability for the last word:
        if best_probs[k,-1] > best_prob_for_last_word: # Replace None in this line with the proper condition.
            
            # Store the new best probability for the last word
            best_prob_for_last_word = best_probs[k,-1]

            # Store the unique integer ID of the POS tag
            # which is also the row number in best_probs
            z[m - 1] = k
            
    # Convert the last word's predicted POS tag
    # from its unique integer ID into the string representation
    # using the 'states' list
    # store this in the 'pred' array for the last word
    # (use z[m - 1], the best row found above, not the loop variable k)
    pred[m - 1] = states[z[m - 1]]
    
    ## Step 2 ##
    # Find the best POS tags by walking backward through the best_paths
    # From the last word in the corpus down to word 1,
    # so that z[i - 1] never wraps around to the last position
    for i in range(m - 1, 0, -1):
        # Retrieve the unique integer ID of
        # the POS tag for the word at position 'i' in the corpus
        pos_tag_for_word_i = z[i]
        
        # In best_paths, go to the row representing the POS tag of word i
        # and the column representing the word's position in the corpus
        # to retrieve the predicted POS for the word at position i-1 in the corpus
        z[i - 1] = best_paths[pos_tag_for_word_i, i]
        
        # Get the previous word's POS tag in string form
        # Use the 'states' list, 
        # where the key is the unique integer ID of the POS tag,
        # and the value is the string representation of that POS tag
        pred[i - 1] = states[z[i - 1]]
        
     ### END CODE HERE ###
    return pred

Test:

# Run and test your function
pred = viterbi_backward(best_probs, best_paths, prep, states)
m=len(pred)
print('The prediction for pred[-7:m-1] is: \n', prep[-7:m-1], "\n", pred[-7:m-1], "\n")
print('The prediction for pred[0:8] is: \n', pred[0:7], "\n", prep[0:7])

Result:
The prediction for pred[-7:m-1] is:
['see', 'them', 'here', 'with', 'us', '.']
['VB', 'PRP', 'RB', 'IN', 'PRP', '.']

The prediction for pred[0:8] is:
['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN']
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken']

4 Predicting on a data set

Compute the accuracy of the predictions against the ground-truth labels y.

print('The third word is:', prep[3])
print('Your prediction is:', pred[3])
print('Your corresponding label y is: ', y[3])

Result:
The third word is: temperature
Your prediction is: NN
Your corresponding label y is: temperature NN
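Each label holds a word and its tag on one line, so scoring a prediction comes down to splitting the label and comparing tags. A minimal sketch follows; the trailing newline in the label string and the blank-line case are assumptions about the raw file format.

```python
# Assumed raw label format: word, tab, tag, newline
label = "temperature\tNN\n"
prediction = "NN"

# str.split() with no argument splits on any whitespace and drops the newline
word, tag = label.split()
is_correct = prediction == tag
print(word, tag, is_correct)  # temperature NN True

# A blank line yields no fields, so it has to be skipped rather than unpacked
blank_fields = "\n".split()
print(len(blank_fields))  # 0
```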

Exercise 08

# UNQ_C8 GRADED FUNCTION: compute_accuracy
def compute_accuracy(pred, y):
    '''
    Input: 
        pred: a list of the predicted parts-of-speech 
        y: a list of lines where each word is separated by a '\t' (i.e. word \t tag)
    Output: 
        accuracy: fraction of validly labeled words whose predicted tag matches the label
    '''
    num_correct = 0
    total = 0
    
    # Zip together the prediction and the labels
    # (the loop variable is named `label` so it does not shadow the list `y`)
    for prediction, label in zip(pred, y):
        ### START CODE HERE (Replace instances of 'None' with your code) ###
        # Split the label into the word and the POS tag
        word_tag_tuple = label.split()
        
        # Check that there is actually a word and a tag
        # no more and no less than 2 items
        if len(word_tag_tuple)!=2: # complete this line
            continue 

        # store the word and tag separately
        word, tag = word_tag_tuple
        
        # Check if the POS tag label matches the prediction
        if prediction == tag: # complete this line
            
            # count the number of times that the prediction
            # and label match
            num_correct += 1
            
        # keep track of the total number of examples (that have valid labels)
        total += 1
        
        ### END CODE HERE ###
    return num_correct/total

Run:

print(f"Accuracy of the Viterbi algorithm is {compute_accuracy(pred, y):.4f}")

Result:
Accuracy of the Viterbi algorithm is 0.9530

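Further notes:

This assignment predicts POS tags by walking forward through the corpus and using knowledge of the previous word. Other implementations use bidirectional POS tagging.
Bidirectional POS tagging requires knowing both the previous word and the next word in the corpus when predicting the POS tag of the current word.
It gives you more information about the POS tag than knowing the previous word alone.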