The Viterbi shortest-path problem - statistical word segmentation (unigram)

K5niper

Category: Natural Language Processing

Article link: https://blog.csdn.net/zhaoyin214/article/details/103138158

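This article introduces the Viterbi algorithm for statistical word segmentation: build a lexicon, compute the probability of each candidate segmentation, and segment the text by finding the shortest path. For the given string "我们学习人工智能，人工智能是未来", it shows how to enumerate all possible segmentations and how to use the Viterbi algorithm to find the optimal one. It also discusses the algorithm's time and space complexity.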

Summary generated automatically by CSDN.

Statistical word segmentation (unigram)

1 Lexicon

Let $m$ be the length of the longest word in the lexicon and $n$ the length of the input string; in general $n \gg m$.
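The bound $m$ matters because no lexicon word is longer than $m$ characters, so at each position of the input only substrings of length at most $m$ need to be looked up. The following is a minimal sketch of that idea, not the article's own code: the function name `candidate_words`, the toy lexicon, and the sample string are assumptions made for illustration.

```python
def candidate_words(text, lexicon, max_len):
    """Return (start, end, word) for every lexicon word found in text.

    Only substrings of length <= max_len are examined, so at most
    len(text) * max_len lookups are made.
    """
    candidates = []
    for start in range(len(text)):
        # No lexicon word is longer than max_len, so longer substrings are skipped.
        last_end = min(start + max_len, len(text))
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in lexicon:
                candidates.append((start, end, word))
    return candidates


# Hypothetical toy lexicon, used only for this illustration.
toy_lexicon = {"今", "今天", "天", "天气", "气"}
toy_max_len = max(len(w) for w in toy_lexicon)
spans = candidate_words("今天天气", toy_lexicon, toy_max_len)
for span in spans:
    print(span)
```

With a hash-based lexicon (a `dict` or `set`), each lookup takes constant time on average, so collecting all candidate words costs on the order of $n \cdot m$ lookups rather than $n^2$.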

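import xlrd  # note: xlrd >= 2.0 reads only .xls files; reading .xlsx requires xlrd < 2.0

def create_dic_words(file_path, sheet_index=0):
    """Read the first column of a sheet into a dict: word -> default probability."""
    workbook = xlrd.open_workbook(filename=file_path)
    worksheet = workbook.sheet_by_index(sheet_index)

    dic_words = {}
    max_len_word = 0
    for idx in range(worksheet.nrows):
        word = worksheet.row(idx)[0].value.strip()
        dic_words[word] = 0.00001  # default probability for every lexicon word

        # Track the length of the longest word in the lexicon.
        len_word = len(word)
        if len_word > max_len_word:
            max_len_word = len_word

    return dic_words, max_len_word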

# TODO: Step 1: read all Chinese words from 综合类中文词库.xlsx.
#  Hint: which data structure is best for storing this lexicon? Consider the cost of each word lookup.
dic_path = "./综合类中文词库.xlsx"
dic_words, max_len_word = create_dic_words(file_path=dic_path)    # words read from the lexicon file

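# Probabilities of individual words. To keep the problem simple, only a small subset is listed.
# Words that appear in the lexicon but not here all get probability 0.00001,
# e.g. p("学院") = p("概率") = ... = 0.00001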

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005, "天气": 0.06, "真":0.04, "好": 0.05,
    "真好": 0.04, "啊": 0.01, "真好啊": 0.02, "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06,
    "有": 0.05, "很": 0.03, "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01, "程": 0.005,
    "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005, "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}

print(sum(word_prob.values()))  # the listed probabilities should sum to about 1
print(max_len_word)

# Override the default probability for the words listed in word_prob.
for key, value in word_prob.items():
    dic_words[key] = value
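With word probabilities in place, unigram segmentation can be cast as a shortest-path problem: treat each position in the string as a node, each lexicon word as an edge with weight $-\log p(\text{word})$, and look for the cheapest path from position $0$ to position $n$. The sketch below illustrates this under stated assumptions and is not necessarily the author's implementation: the function name `viterbi_segment`, the small probability table (a subset of `word_prob` above), and the sample sentence are chosen here for illustration.

```python
import math


def viterbi_segment(text, word_probs, max_len):
    """Segment text by minimising the sum of -log p(word) with dynamic programming.

    Returns the list of words, or None if text cannot be covered by lexicon words.
    """
    n = len(text)
    best_cost = [math.inf] * (n + 1)  # best_cost[i]: cheapest segmentation of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word ending at i
    best_cost[0] = 0.0
    for end in range(1, n + 1):
        # Only words of length <= max_len can end at position `end`.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best_cost[start] < math.inf:
                cost = best_cost[start] - math.log(word_probs[word])
                if cost < best_cost[end]:
                    best_cost[end] = cost
                    back[end] = start
    if math.isinf(best_cost[n]):
        return None
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Subset of the probabilities listed above; the sample sentence is an assumption.
toy_probs = {
    "今": 0.01, "今天": 0.07, "天": 0.005, "气": 0.005, "天气": 0.06,
    "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
}
toy_max_len = max(len(w) for w in toy_probs)
result = viterbi_segment("今天天气真好啊", toy_probs, toy_max_len)
print(result)
```

Working with $-\log p$ turns the product of word probabilities into a sum, which avoids floating-point underflow on long inputs and makes "most probable segmentation" equivalent to "shortest path". Because each position considers at most $m$ incoming edges, this sketch runs in $O(n \cdot m)$ time and $O(n)$ extra space.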