NLP Study Notes (6): Part-of-Speech Tagging with Viterbi Decoding

Latest recommended article published on 2023-04-19 17:08:20

两个幽灵

Category: Deep Learning

Original link: www.bilibili.com

Part-of-Speech Tagging

Theory

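We compute the tagging with a Markov model (the hidden Markov model formulation). Let $\bold{w}=\{w_1,w_2,...,w_n\}$ be the word sequence and $\bold{z}=\{z_1,z_2,...,z_n\}$ be the corresponding sequence of part-of-speech tags.

Then $\hat{\bold{z}}=\mathop{\text{argmax}}\limits_{\bold{z}}\left[\sum\limits_{i=1}^n\log p(w_i|z_i)+\sum\limits_{i=1}^n\log p(z_i|z_{i-1})\right]$, where $z_0$ denotes the special <START> tag. The first sum collects the emission terms (how likely each word is given its tag) and the second collects the transition terms (how likely each tag is given the previous tag).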
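
To make the objective concrete, here is a minimal sketch that scores one candidate tag sequence. The probability tables are made-up toy values chosen for illustration, not estimates from the dataset, and the helper name `sequence_log_score` is hypothetical.

```python
import math

# Toy probabilities, invented for illustration only (not from the dataset).
emission = {            # p(word | tag)
    ("the", "DT"): 0.5,
    ("dog", "NN"): 0.1,
    ("barks", "VBZ"): 0.05,
}
transition = {          # p(tag | previous tag)
    ("<START>", "DT"): 0.4,
    ("DT", "NN"): 0.6,
    ("NN", "VBZ"): 0.3,
}

def sequence_log_score(words, tags):
    """Sum of log emission and log transition terms for one tagging."""
    score = 0.0
    prev = "<START>"  # plays the role of z_0
    for word, tag in zip(words, tags):
        score += math.log(emission[(word, tag)])    # log p(w_i | z_i)
        score += math.log(transition[(prev, tag)])  # log p(z_i | z_{i-1})
        prev = tag
    return score

score = sequence_log_score(["the", "dog", "barks"], ["DT", "NN", "VBZ"])
print(score)
```

Decoding amounts to finding the tag sequence that maximizes this score. Enumerating every sequence is exponential in the sentence length, which is why a dynamic-programming approach such as Viterbi is used instead.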

The dataset is located at: 自然语言处理训练营\资料\Lesson9-CaseStudy-Viterbi (CSDN link)

Code

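# %% Load the training set (words and tags)
tags = {'<START>'}
words = {'<START>'}

PATH_TO_TRAIN_DATA = r'xxx\traindata.txt'  # TODO: upload the download link to CSDN
with open(PATH_TO_TRAIN_DATA, 'r') as f:
    for line in f:
        # The part before the first '/' is the word, the part after it is the tag
        items = line.split('/')
        word, tag = items[0], items[1].rstrip()
        words.add(word)
        tags.add(tag)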

id_to_tag = list(tags)
id_to_word = list(words)
tag_to_id = {tag: i for i, tag in enumerate(id_to_tag)}
word_to_id = {word: i for i, word in enumerate(id_to_word)}

vocab_size = len(word_to_id)
tag_size = len(tag_to_id)

del tags
del words

# %% Build the transition and emission probability matrices
import numpy as np
tag_to_tag_prob = np.zeros((tag_size, tag_size))    # counts of previous tag -> tag
tag_to_word_prob = np.zeros((tag_size, vocab_size))  # counts of tag -> word

prev_tag_id = tag_to_id['<START>']
with open(PATH_TO_TRAIN_DATA) as f:
    for line in f:
        items = line.split('/')
        word_id, tag_id = word_to_id[items[0]], tag_to_id[items[1].rstrip()]
        tag_to_tag_prob[prev_tag_id, tag_id] += 1
        tag_to_word_prob[tag_id, word_id] += 1
        prev_tag_id = tag_id
        # A period ends the sentence, so the next word follows <START>
        if word_id == word_to_id["."]:
            prev_tag_id = tag_to_id['<START>']

# <START> never emits a real word; give it one count so its row sum is not zero
tag_to_word_prob[tag_to_id['<START>'], word_to_id['<START>']] = 1
# Normalize each row of counts into a probability distribution
tag_to_word_prob = tag_to_word_prob / np.sum(tag_to_word_prob, axis=1, keepdims=True)
tag_to_tag_prob = tag_to_tag_prob / np.sum(tag_to_tag_prob, axis=1, keepdims=True)
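
The row normalization above divides by the row sum, which yields NaN for any row whose counts are all zero. The following is a small sketch of a guarded variant on a toy count matrix. The helper name `normalize_rows` and the toy counts are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def normalize_rows(counts):
    """Turn each row of counts into probabilities; all-zero rows stay zero."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide only where the row sum is positive; other entries keep the 0 from `out`.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts),
                     where=row_sums > 0)

# Toy counts: the middle row was never observed.
toy_counts = np.array([[2, 1, 1],
                       [0, 0, 0],
                       [0, 3, 1]])
probs = normalize_rows(toy_counts)
print(probs)
```

Whether an unobserved row should stay at zero or be smoothed in some other way is a modelling choice. This sketch only avoids the 0/0 division.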

Viterbi Decoding

Define $\text{dp}[i][\text{tag}_j]$ as the score of the best path from the start to position $i$ that ends with tag $j$.

$\text{dp}[i][\text{tag}_j]=\max\limits_{\text{tag}_k\in\text{tags}}\left[\text{dp}[i-1][\text{tag}_k]+\log p(\text{tag}_j|\text{tag}_k)+\log p(\text{word}_i|\text{tag}_j)\right]$
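
The recurrence can be checked by hand on a tiny model. Below is a sketch with two tags and two words, where the maximization over the previous tag is vectorized with NumPy. All probabilities are invented toy values, and the function name `viterbi_log` and its arguments are hypothetical.

```python
import numpy as np

def viterbi_log(log_start, log_trans, log_emit, obs):
    """Return the best tag-id path for the observed word ids `obs`."""
    n_tags = log_trans.shape[0]
    seq_len = len(obs)
    dp = np.full((seq_len, n_tags), -np.inf)
    ptr = np.zeros((seq_len, n_tags), dtype=int)
    dp[0] = log_start + log_emit[:, obs[0]]
    for i in range(1, seq_len):
        # cand[k, j] = dp[i-1][k] + log p(tag_j | tag_k)
        cand = dp[i - 1][:, None] + log_trans
        ptr[i] = cand.argmax(axis=0)                    # best previous tag k for each j
        dp[i] = cand.max(axis=0) + log_emit[:, obs[i]]  # add log p(word_i | tag_j)
    # Follow the back-pointers from the best final tag.
    path = [int(dp[-1].argmax())]
    for i in range(seq_len - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1]

# Toy model, invented for illustration: 2 tags, 2 words.
log_start = np.log(np.array([0.6, 0.4]))    # p(tag | <START>)
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.4, 0.6]]))  # p(tag_j | tag_k), row k, column j
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.2, 0.8]]))   # p(word | tag), row tag, column word

path = viterbi_log(log_start, log_trans, log_emit, [0, 1, 1])
print(path)
```

Because the emission term does not depend on the previous tag, it can be added after taking the maximum, which is what the vectorized step does.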

Code

# %% Viterbi algorithm
def viterbi(x):
    # Convert the sentence to word ids (every word must be in the training vocabulary)
    x = [word_to_id[word] for word in x.split(' ')]
    seq_len = len(x)
    # Score of moving from <START> to each possible first tag
    dp = np.zeros((seq_len, tag_size))
    for j in range(tag_size):
        dp[0][j] = np.log(tag_to_tag_prob[tag_to_id['<START>'], j]) \
                   + np.log(tag_to_word_prob[j, x[0]])
    # ptr[i][j] stores the best previous tag for tag j at position i
    # (np.int was removed in NumPy 1.24, so the built-in int is used)
    ptr = np.zeros((seq_len, tag_size), dtype=int)
    # Score of moving from the (i-1)-th tag to the i-th tag
    for i in range(1, seq_len):
        for j in range(tag_size):
            dp[i][j] = -np.inf
            for k in range(tag_size):
                score = dp[i - 1][k] + np.log(tag_to_tag_prob[k][j]) + np.log(tag_to_word_prob[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k
    # Follow the back-pointers in ptr to recover the best sequence
    best_seq = [0] * seq_len
    best_seq[seq_len - 1] = np.argmax(dp[seq_len - 1])
    for i in range(seq_len - 2, -1, -1):
        best_seq[i] = ptr[i + 1, best_seq[i + 1]]

    for i in best_seq:
        print(id_to_tag[i])


# %%
x = "Social Security number , passport number and details about the services provided for the payment"
viterbi(x)

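Running this raises RuntimeWarning: divide by zero encountered in log, because some probabilities are 0 and $\log 0$ gets evaluated. Despite the warning, np.log returns the correct value -inf, so the code is usable as is. Alternatively, the log can be special-cased to return a very small number when its argument is 0.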
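
Two ways of handling the zero probabilities can be sketched as follows. The helper name `safe_log` and the floor value `1e-12` are assumptions chosen for illustration. Note that clipping changes the scores slightly, whereas silencing the warning keeps the exact -inf.

```python
import numpy as np

def safe_log(p, floor=1e-12):
    """Log of p with probabilities clipped from below at `floor` (assumed value)."""
    return np.log(np.maximum(p, floor))

probs = np.array([0.5, 0.0, 0.25])

# Option 1: clip zeros to a tiny probability so the log stays finite.
floored = safe_log(probs)

# Option 2: keep the exact -inf but suppress the divide-by-zero warning locally.
with np.errstate(divide="ignore"):
    exact = np.log(probs)

print(floored)
print(exact)
```

With the second option the Viterbi comparison `score > dp[i][j]` still behaves as before, since -inf is never greater than any score.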

Output

NNP
NNP
NN
,
NN
NN
CC
NNS
IN
DT
NNS
VBN
IN
DT
NN
