游离态GLZ's NLP Task 2: Part-of-Speech Tagging with the Viterbi Algorithm

游离态GLZ不可能是金融技术宅

Category: NLP. Tags: algorithms, NLP

Article link: https://blog.csdn.net/qq_37477357/article/details/107255847

1. Basic analysis of the part-of-speech tagging task

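A sketch of the usual formulation, assuming a bigram hidden Markov model (which is what the parameters A, B and pi built in the next section correspond to). Here $w_1,\dots,w_T$ are the words of a sentence and $z_1,\dots,z_T$ their tags; this notation is an assumption of the sketch, not something recovered from the original figure.

$$
\hat{z} = \arg\max_{z} P(z \mid w) = \arg\max_{z} P(w \mid z)\,P(z)
\approx \arg\max_{z} \; P(z_1) \prod_{i=2}^{T} P(z_i \mid z_{i-1}) \prod_{i=1}^{T} P(w_i \mid z_i)
$$

Taking logarithms turns the products into sums, which is the score the decoder maximizes:

$$
\hat{z} = \arg\max_{z} \left[ \log \pi_{z_1} + \sum_{i=2}^{T} \log B_{z_{i-1}, z_i} + \sum_{i=1}^{T} \log A_{z_i, w_i} \right]
$$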

2. Data to build from the training set

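A: word-to-tag count table (how often each word occurs with each tag)
B: tag-to-tag count table under the bigram model (how often tag $z_i$ follows tag $z_{i-1}$)
pi: count table of the tags of sentence-initial words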
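As a small illustration of what these three tables hold, the following sketch counts them on a hand-made corpus. The corpus and the names `toy_corpus`, `emit`, `trans` and `start` are assumptions made for the example; the rule that a `.` token ends a sentence mirrors the code in this section, which reads `traindata.txt` instead.

```python
import numpy as np

# Hypothetical toy corpus: one (word, tag) pair per token; "." ends a sentence.
toy_corpus = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", "."),
    ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", "."),
    ("dogs", "NNS"), ("bark", "VBP"), (".", "."),
]

# Map every word and tag to an integer id.
w2i = {w: i for i, w in enumerate(sorted({w for w, _ in toy_corpus}))}
t2i = {t: i for i, t in enumerate(sorted({t for _, t in toy_corpus}))}

n_tags, n_words = len(t2i), len(w2i)
emit = np.zeros((n_tags, n_words))   # plays the role of A: emit[tag][word]
trans = np.zeros((n_tags, n_tags))   # plays the role of B: trans[prev_tag][tag]
start = np.zeros(n_tags)             # plays the role of pi: start[tag]

prev = None  # tag id of the previous token, None at a sentence start
for word, tag in toy_corpus:
    w, t = w2i[word], t2i[tag]
    emit[t, w] += 1
    if prev is None:
        start[t] += 1
    else:
        trans[prev, t] += 1
    prev = None if word == "." else t

print(start[t2i["DT"]], start[t2i["NNS"]])   # 2.0 1.0
print(trans[t2i["DT"], t2i["NN"]])           # 2.0
```

Dividing `start` by its sum and each row of `emit` and `trans` by its row sum gives the probabilities. Note that the row of `trans` for the `.` tag stays all zero, because nothing follows a sentence-final token.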

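# Assign an integer id to every word and every tag, and keep dictionaries for converting in both directions
def generate_id_dict():
    word2id, id2word = {}, {}  # word2id: {apple: 0, banana: 1, ...}  id2word: {1: apple, 2: banana, ...}
    tag2id, id2tag = {}, {}    # tag2id: {NNP: 0, VB: 1, ...}  id2tag: {1: NNP, 2: VB, ...}
    with open("traindata.txt") as file:
        for line in file:
            # each line of traindata.txt is expected to look like "word/tag"
            line = line.split('/')
            word, tag = line[0], line[1].rstrip()

            # A word or tag seen for the first time gets the current dictionary size as its id.
            # Note: the reverse dictionaries are filled after the insertion, so their keys are id + 1;
            # the lookup at the end of the Viterbi function compensates with value + 1.
            if word not in word2id:
                word2id[word] = len(word2id)
                id2word[len(word2id)] = word

            if tag not in tag2id:
                tag2id[tag] = len(tag2id)
                id2tag[len(tag2id)] = tag

    return [word2id, id2word, tag2id, id2tag]

word2id, id2word, tag2id, id2tag = generate_id_dict()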

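import numpy as np

def generate_parameter():
    # number of tags
    N = len(tag2id)
    # number of words
    M = len(word2id)

    # initialize A, B, pi
    A = np.zeros((N, M))  # A[tag][word]: emission counts
    B = np.zeros((N, N))  # B[prev_tag][tag]: transition counts
    pi = np.zeros(N)      # pi[tag]: counts of sentence-initial tags

    # count A, B, pi
    with open("traindata.txt") as file:
        prev_tag = ""
        for line in file:
            line = line.split('/')
            wordId, tagId = word2id[line[0]], tag2id[line[1].rstrip()]

            # update depending on whether this is the first word of a sentence
            A[tagId][wordId] += 1
            if prev_tag == "":
                pi[tagId] += 1
            else:
                B[tag2id[prev_tag]][tagId] += 1

            # update prev_tag; a '.' token ends the sentence
            if line[0] == '.':
                prev_tag = ""
            else:
                prev_tag = line[1].rstrip()

    # turn counts into probabilities
    pi /= pi.sum()
    for i in range(N):
        A[i] /= A[i].sum()
        # a tag that is never followed by another tag has an all-zero row; leave it as zeros
        if B[i].sum() > 0:
            B[i] /= B[i].sum()

    return [N, M, A, B, pi]

N, M, A, B, pi = generate_parameter()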
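One practical caveat: any word/tag or tag/tag pair that never occurs in the training data ends up with probability 0, and the decoder in the next section takes the logarithm of these entries. Additive (Laplace) smoothing is one common workaround. It is not part of the original post; the helper name `normalize_rows` and the default `alpha=1.0` are assumptions made for this sketch.

```python
import numpy as np

def normalize_rows(counts, alpha=1.0):
    """Turn a count matrix into row-wise probabilities with additive smoothing.

    Every cell receives `alpha` pseudo-counts, so no probability is exactly 0
    and an all-zero row becomes a uniform distribution.
    """
    smoothed = np.asarray(counts, dtype=float) + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Hypothetical counts; the second row stands for a tag that was never followed by another tag.
counts = np.array([[2.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])
probs = normalize_rows(counts, alpha=1.0)
print(probs)
```

With smoothing, every row sums to 1 and `np.log(probs)` is finite everywhere. The price is that a little probability mass moves from observed to unobserved events, so `alpha` is usually kept small.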

3. Finding the optimal path with the Viterbi algorithm

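Written out, the dynamic program that the code in this section implements looks as follows. Indices are 0-based as in the code, $x_i$ is the id of the $i$-th word, $\delta_i(j)$ stands for `dp[i][j]` and $\psi_i(j)$ for `ptr[i][j]`; the symbols $\delta$ and $\psi$ are notation assumed for this sketch.

$$
\delta_0(j) = \log \pi_j + \log A_{j, x_0}
$$

$$
\delta_i(j) = \max_{k} \left[ \delta_{i-1}(k) + \log B_{k, j} \right] + \log A_{j, x_i},
\qquad
\psi_i(j) = \arg\max_{k} \left[ \delta_{i-1}(k) + \log B_{k, j} \right],
\qquad i = 1, \dots, T-1
$$

The best tag sequence is then read off backwards:

$$
\hat{z}_{T-1} = \arg\max_{j} \delta_{T-1}(j),
\qquad
\hat{z}_i = \psi_{i+1}(\hat{z}_{i+1}), \quad i = T-2, \dots, 0
$$

Each of the $T$ positions compares $N$ predecessors for each of $N$ tags, so the running time is $O(T N^2)$ rather than the $O(N^T)$ of enumerating every tag sequence.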

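def viterbi(x):
    # convert the English sentence into a list of word ids (every word must occur in the training data)
    x = [word2id[word] for word in x.split(" ")]
    T = len(x)
    # dp[i][j]: best log-score of any tag path for words 0..i that ends with tag j
    dp = np.zeros((T, N))
    # ptr records the choices:
    # if word i takes tag j, the best path gives word i-1 the tag ptr[i][j]
    ptr = np.zeros((T, N), dtype=int)

    # base case
    # np.log(0) is -inf (NumPy emits a RuntimeWarning), which rules out unseen pairs
    for j in range(N):
        dp[0][j] = np.log(pi[j]) + np.log(A[j][x[0]])

    for i in range(1, T):  # each word
        for j in range(N):  # each tag
            dp[i][j] = -np.inf
            for k in range(N):  # every transition between positions i-1 and i
                score = dp[i-1][k] + np.log(B[k][j]) + np.log(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k

    best_seq = [0 for i in range(T)]
    # tag of the last word
    best_seq[T-1] = np.argmax(dp[T-1])

    # walk backwards to recover the tag of every earlier word
    for i in range(T-2, -1, -1):
        best_seq[i] = ptr[i+1][best_seq[i+1]]

    # convert tag ids to tag names and return the sequence
    # (id2tag is keyed by id + 1, see generate_id_dict, hence value + 1)
    for i, value in enumerate(best_seq):
        best_seq[i] = id2tag[value+1]
    return best_seq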
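To check the decoder independently of the training file, here is a minimal self-contained sketch on a hand-made model. The function name `viterbi_ids`, the two tags, the two words and all probabilities are made up for illustration. It also replaces the inner loop over `k` with NumPy broadcasting, which is a variation on the code above, not the original implementation.

```python
import numpy as np

def viterbi_ids(obs, pi, A, B):
    """Return the most likely tag-id sequence for a list of word ids.

    pi[j]: start probability, A[j][w]: emission probability,
    B[k][j]: transition probability from tag k to tag j.
    """
    T, N = len(obs), len(pi)
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    dp = np.full((T, N), -np.inf)
    ptr = np.zeros((T, N), dtype=int)
    dp[0] = log_pi + log_A[:, obs[0]]

    for i in range(1, T):
        # scores[k, j] = dp[i-1][k] + log B[k][j]
        scores = dp[i - 1][:, None] + log_B
        ptr[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0) + log_A[:, obs[i]]

    # backtrack from the best final tag
    path = [int(dp[T - 1].argmax())]
    for i in range(T - 1, 0, -1):
        path.append(int(ptr[i][path[-1]]))
    return path[::-1]

# Hypothetical toy model: tags 0 = "N", 1 = "V"; words 0 = "fish", 1 = "swim"
toy_pi = np.array([0.8, 0.2])
toy_A = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
toy_B = np.array([[0.3, 0.7],
                  [0.6, 0.4]])

print(viterbi_ids([0, 1], toy_pi, toy_A, toy_B))  # [0, 1]
```

For the input `[0, 1]` ("fish swim") the path N, V has probability 0.8 · 0.7 · 0.7 · 0.8 = 0.3136, larger than any of the other three paths, so the result can be verified by hand.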

Testing with an input sentence

x = "Social Security number , passport number and details about the services provided for the payment"
viterbi(x)

Test result: [screenshot of the predicted tag sequence; image not preserved]
