NLP Assignment 01: Part-of-Speech Tagging with an HMM

Latest recommended article published on 2024-11-07 18:48:12

林希～


Article tags: natural language processing, machine learning, artificial intelligence

Article link: https://blog.csdn.net/qq_55567564/article/details/130249745

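Assignment Header

Course this assignment belongs to	Natural Language Processing
Assignment requirements	https://bbs.csdn.net/topics/614556240
My goal in this course	Master the material and apply it in real-world practice
How this assignment helps me reach that goal	Understand the HMM and use it to implement part-of-speech tagging
Reference	http://t.csdn.cn/KqT17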

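I. Assignment Content

1. Train the model on the "1998 People's Daily part-of-speech tagged corpus".

2. Estimate the HMM parameters from the data: the set of all POS tags Q, the set of all words V, the initial probability vector π (named `pi` in the code), the tag-to-tag transition matrix A, and the tag-to-word emission matrix B. The parameters may be estimated by relative frequency, but Laplace (add-one) smoothing must then be applied so that unseen events do not receive probability zero.

3. At prediction time, decode with the Viterbi algorithm and report the POS tagging result for the test sentence: "那个球状闪电呈橘红色，拖着一条不太长的尾迹，在夜空中沿一条变换的曲线飘行着。"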

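II. The HMM

The hidden Markov model (HMM) is a basic statistical model. It was introduced in the late 1960s and has since been applied to speech recognition, natural language processing, bioinformatics, pattern recognition, and other fields. Although neural networks have replaced HMMs for some tasks, the HMM remains an important model in machine learning and is well worth studying.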
[Figure 1: structure of an HMM, showing the hidden state sequence Zi and the observation sequence Xi]
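An HMM describes a Markov process whose states are hidden (not directly observed). A Markov process is a stochastic process in which the conditional probability distribution of the future state depends only on the current state. In other words, an HMM is a probabilistic model of sequences: a hidden Markov chain generates an unobservable random sequence of states, and each state in turn generates an observation, which yields the observed random sequence.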

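In Figure 1, $Z_i$ is the random sequence of states, or state sequence for short, and $X_i$ is the random sequence of observations generated by the states, or observation sequence for short. The state sequence $Z_i$ satisfies the Markov property, and each observation $X_i$ depends on $Z_i$. When the $Z_i$ are unobserved, $X_i$ and $X_{i+1}$ are not independent, because they are linked through the hidden chain.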
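
The two assumptions above (the Markov property for the states, and each observation depending only on its own state) can be summarized as a factorization of the joint distribution. As a sketch for a sequence of length $T$, using the $Z_t$ and $X_t$ notation of Figure 1:

$$
P(X_1,\dots,X_T,\,Z_1,\dots,Z_T) \;=\; P(Z_1)\,P(X_1 \mid Z_1)\,\prod_{t=2}^{T} P(Z_t \mid Z_{t-1})\,P(X_t \mid Z_t)
$$

Summing this expression over all state sequences gives the marginal probability of the observations, which in general no longer factorizes. That is why neighbouring observations are dependent when the states are hidden.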

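III. HMM Parameters

An HMM is determined by the initial probability distribution $\pi$, the state transition probability distribution $A$, and the observation (emission) probability distribution $B$. If $\lambda$ denotes an HMM, we can write $\lambda=(A,B,\pi)$. The values that the states and observations can take are denoted as follows: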

$Q$ is the set of all possible states, $Q=\{q_1,q_2,\dots,q_N\}$
$V$ is the set of all possible observations, $V=\{v_1,v_2,\dots,v_M\}$

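Here $N$ is the number of possible states and $M$ is the number of possible observations. In some settings the observations are continuous rather than drawn from a discrete set $V$; that case is not considered here.
Next, let $I$ denote a state sequence of length $T$, $I=(i_1,i_2,\dots,i_T)$, and let $O$ denote the corresponding observation sequence, $O=(o_1,o_2,\dots,o_T)$.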
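The state transition probability matrix is $A=[a_{ij}]_{N\times N}$, where $a_{ij}=P(i_{t+1}=q_j \mid i_t=q_i)$. That is, $a_{ij}$ is the probability of moving to state $q_j$ at time $t+1$ given state $q_i$ at time $t$.
The observation (emission) probability matrix is $B=[b_{ik}]_{N\times M}$, where $b_{ik}=P(o_t=v_k \mid i_t=q_i)$. That is, $b_{ik}$ is the probability of generating observation $v_k$ given state $q_i$ at time $t$.
Finally, the initial state probability vector is $\pi=(\pi_i)_N$, where $\pi_i=P(i_1=q_i)$. That is, $\pi_i$ is the probability of being in state $q_i$ at time $t=1$.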
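
To make the notation concrete, here is a minimal sketch of a toy $\lambda=(A,B,\pi)$ stored as nested dictionaries, the same layout the code in this post uses. The tags, words, and probabilities are invented for illustration and are not estimated from any corpus; `is_distribution` and `joint_probability` are hypothetical helper names.

```python
import math

# Toy model, purely illustrative: all values below are made up.
Q = ["n", "v"]          # states (POS tags), N = 2
V = ["dogs", "bark"]    # observations (words), M = 2
pi = {"n": 0.8, "v": 0.2}                                   # pi[i] = P(i_1 = q_i)
A = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}  # A[i][j] = a_ij
B = {"n": {"dogs": 0.7, "bark": 0.3},
     "v": {"dogs": 0.2, "bark": 0.8}}                       # B[i][k] = b_ik


def is_distribution(row):
    """True if the values are non-negative and sum to 1 (up to rounding)."""
    return all(p >= 0 for p in row.values()) and math.isclose(sum(row.values()), 1.0)


def joint_probability(states, observations):
    """P(I, O | lambda) for one state sequence I and one observation sequence O."""
    p = pi[states[0]] * B[states[0]][observations[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][observations[t]]
    return p


# pi and every row of A and B must be a probability distribution.
print(is_distribution(pi), all(is_distribution(A[q]) and is_distribution(B[q]) for q in Q))
# 0.8 * 0.7 * 0.7 * 0.8, i.e. approximately 0.3136
print(joint_probability(["n", "v"], ["dogs", "bark"]))
```

Each row of $A$ and $B$ is a distribution conditioned on the current state, which is why the rows, not the columns, must sum to 1.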

IV. Code Implementation
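
The training code in this section reads a variable `sentences` that the post never defines. From the way it is used, it is a list of sentences, each a list of `(word, tag)` pairs. The sketch below shows one way such a list might be built. It assumes the common plain-text layout of the corpus, with whitespace-separated `word/tag` tokens and bracketed compounds such as `[中央/n  电台/n]nt`. The helper names `parse_line` and `load_sentences` are hypothetical, and the sample lines are made up for illustration.

```python
def parse_line(line):
    """Split one corpus line into (word, tag) pairs.

    Assumed format: tokens look like 'word/tag'; compound brackets such as
    '[中央/n' and '电台/n]nt' are stripped and only the inner tags are kept.
    """
    pairs = []
    for token in line.split():
        token = token.lstrip("[")
        if "]" in token:
            token = token.split("]")[0]    # drop the compound tag after ']'
        if "/" not in token:
            continue                       # not a word/tag token
        word, tag = token.rsplit("/", 1)
        if word and tag:
            pairs.append((word, tag))
    return pairs


def load_sentences(lines):
    """Turn an iterable of corpus lines into a list of tagged sentences."""
    sentences = []
    for line in lines:
        pairs = parse_line(line)
        if pairs:                          # skip blank lines
            sentences.append(pairs)
    return sentences


# Illustrative sample lines, not taken verbatim from the corpus.
sample = ["迈向/v  充满/v  希望/n  的/u  新/a  世纪/n", "", "[中央/n  电台/n]nt  播出/v"]
sentences = load_sentences(sample)
print(sentences[1])
```

Because an open file is itself an iterable of lines, it can be passed to `load_sentences` directly; the file path and encoding depend on the copy of the corpus you have. If each line in your copy begins with an ID token, this sketch keeps it as an ordinary word, so you may want to drop the first pair of every sentence.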

# `sentences` is a list of sentences; each sentence is a list of (word, tag) pairs.
# Collect the vocabulary (words) and the tag set (tags)
words = set()
tags = set()
for words_with_tag in sentences:
    for word_with_tag in words_with_tag:
        word, tag = word_with_tag
        words.add(word)
        tags.add(tag)
words = list(words)
tags = list(tags)
# Count the tag-to-tag transition matrix A, the tag-to-word emission matrix B,
# and the initial vector pi. Start with all counts at zero.
A = {tag: {tag: 0 for tag in tags} for tag in tags}
B = {tag: {word: 0 for word in words} for tag in tags}
pi = {tag: 0 for tag in tags}
# Accumulate the counts for pi, A and B
for words_with_tag in sentences:
    if not words_with_tag:
        continue  # skip empty sentences instead of failing on index 0
    head_word, head_tag = words_with_tag[0]
    pi[head_tag] += 1
    B[head_tag][head_word] += 1
    for i in range(1, len(words_with_tag)):
        A[words_with_tag[i-1][1]][words_with_tag[i][1]] += 1
        B[words_with_tag[i][1]][words_with_tag[i][0]] += 1
# Apply Laplace (add-one) smoothing and convert the counts to probabilities
sum_pi_tag = sum(pi.values())
for tag in tags:
    pi[tag] = (pi[tag] + 1) / (sum_pi_tag + len(tags))
    sum_A_tag = sum(A[tag].values())
    sum_B_tag = sum(B[tag].values())
    for next_tag in tags:
        A[tag][next_tag] = (A[tag][next_tag] + 1) / (sum_A_tag + len(tags))
    for word in words:
        B[tag][word] = (B[tag][word] + 1) / (sum_B_tag + len(words))
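
Written as formulas, the smoothing step in the code above amounts to the following estimates. The notation is introduced here only for illustration: $c_1(q_i)$ is the number of sentences that start with tag $q_i$, $S$ is the number of sentences, $c(q_i \to q_j)$ is the number of times tag $q_j$ directly follows tag $q_i$, and $c(q_i, v_k)$ is the number of times word $v_k$ is tagged $q_i$.

$$
\hat{\pi}_i = \frac{c_1(q_i) + 1}{S + N},
\qquad
\hat{a}_{ij} = \frac{c(q_i \to q_j) + 1}{\sum_{j'=1}^{N} c(q_i \to q_{j'}) + N},
\qquad
\hat{b}_{ik} = \frac{c(q_i, v_k) + 1}{\sum_{k'=1}^{M} c(q_i, v_{k'}) + M}
$$

Adding 1 to every count and adding the number of outcomes ($N$ or $M$) to the denominator keeps each row summing to 1 and ensures that no probability is zero, so the logarithms taken during decoding are always defined for words in the vocabulary.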

V. Decoding with the Viterbi Algorithm
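
Viterbi decoding finds the most probable tag sequence by dynamic programming. In log space the recursion is $\delta_1(j)=\log\pi_j+\log b_j(o_1)$ and $\delta_t(j)=\max_i\,[\delta_{t-1}(i)+\log a_{ij}]+\log b_j(o_t)$, with a backpointer recording the maximizing $i$ at each step. The sketch below runs that recursion on a toy model. All tags, words, and probabilities are invented for illustration, and the function name `viterbi` and its signature are assumptions made for this example rather than part of the assignment code.

```python
import math


def viterbi(observations, states, pi, A, B):
    """Most probable state sequence for a non-empty observation list (log space)."""
    # delta[t][s]: best log-probability of any path that ends in state s at time t
    delta = [{s: math.log(pi[s]) + math.log(B[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(observations)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: delta[t - 1][r] + math.log(A[r][s]))
            delta[t][s] = (delta[t - 1][best_prev] + math.log(A[best_prev][s])
                           + math.log(B[s][observations[t]]))
            back[t][s] = best_prev
    # Pick the best final state, then follow the backpointers to the start.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy parameters, made up for illustration only.
states = ["n", "v"]
pi = {"n": 0.8, "v": 0.2}
A = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}
B = {"n": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
     "v": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}

print(viterbi(["dogs", "bark"], states, pi, A, B))  # ['n', 'v']
```

Working in log space replaces products with sums, which avoids numerical underflow on long sentences without changing which path is best.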

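```python
import math


def decode_by_viterbi(sentence):
    """Tag a whitespace-segmented sentence; returns a list of (word, tag) pairs."""
    words = sentence.split()
    sen_length = len(words)
    if sen_length == 0:
        return []
    # Assumption (not in the original code): a word unseen in training falls back to
    # the smallest smoothed emission probability of each tag instead of raising KeyError.
    unk = {tag: min(B[tag].values()) for tag in tags}
    # T1[i][tag]: best log-probability of any tag path ending in `tag` at position i
    # T2[i][tag]: previous tag on that best path (backpointer)
    T1 = [{tag: float('-inf') for tag in tags} for _ in range(sen_length)]
    T2 = [{tag: None for tag in tags} for _ in range(sen_length)]
    # Initialization: first word
    for tag in tags:
        T1[0][tag] = math.log(pi[tag]) + math.log(B[tag].get(words[0], unk[tag]))
    # Recursion: remaining words
    for i in range(1, sen_length):
        for tag in tags:
            emit = math.log(B[tag].get(words[i], unk[tag]))
            for pre_tag in tags:
                current_prob = T1[i-1][pre_tag] + math.log(A[pre_tag][tag]) + emit
                if current_prob > T1[i][tag]:
                    T1[i][tag] = current_prob
                    T2[i][tag] = pre_tag
    # Termination: best tag for the last word
    last_step_tag = max(T1[-1], key=T1[-1].get)
    # Backtrack through the backpointers
    step = sen_length - 1
    result = [last_step_tag]
    while step > 0:
        last_step_tag = T2[step][last_step_tag]
        result.append(last_step_tag)
        step -= 1
    result.reverse()
    return list(zip(words, result))
```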