NLP: Shortest-Path Word Segmentation (Part 5)

Dynamic programming + Viterbi shortest path + first-order Markov chain

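Shortest-path segmentation picks the most probable segmentation of a sentence. First, the sentence is fully segmented to enumerate every candidate word. A word graph is then built with dynamic programming, the weight of every edge is computed with a first-order Markov chain, and the shortest path through the graph is selected. It is a segmentation method that combines mechanical rules with statistics.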
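The full-segmentation step can be sketched as follows. This is a minimal illustration, not HanLP's implementation: the function name `full_segment`, the `max_len` limit, and the toy dictionary are all assumptions made for the example (a real segmenter would typically look words up in a trie-based dictionary).

```python
def full_segment(sentence, dictionary, max_len=4):
    """Enumerate every dictionary word that occurs in the sentence.

    Returns a list of (start_offset, word) pairs, ordered by start offset
    and then by word length. `max_len` is an assumed upper bound on word length.
    """
    candidates = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word in dictionary:
                candidates.append((start, word))
    return candidates


# Toy dictionary (hypothetical): just the words needed for the example sentence.
TOY_DICTIONARY = {
    "李", "胜", "胜利", "利", "说", "的", "的确",
    "确", "确实", "实", "实在", "在", "在理", "理",
}

candidates = full_segment("李胜利说的确实在理", TOY_DICTIONARY)
for start, word in candidates:
    print(start, word)
```

Each candidate then becomes a vertex of the word graph, and two vertices are connected when one word ends exactly where the next one starts.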

 

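Add B and E at the head and tail of the sentence, then find the shortest path from B to E. The numbers in the figure are occurrence counts, used only to make the labeling easier: since a higher count corresponds to a lower weight, the path with the largest counts in the figure is the shortest path being sought. The counts are labeled arbitrarily and were not looked up in a dictionary.

For the full implementation, see https://github.com/hankcs/HanLP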

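PS:
First-order Markov chain
P(AB) = P(A)•P(B|A)
The following calculation comes from Professor Zhang Huaping's open-source ICTCLAS:
weight = -Math.log(smoothing parameter * total frequency of word A / (total frequency of all words) + (1 - smoothing parameter) * ((1 - smoothing factor) * frequency of word B following word A / total frequency of word A + smoothing factor));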
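As a rough sketch, the weight formula above can be transcribed into Python like this. The function name, parameter names, and default values are assumptions chosen for illustration and are not taken from ICTCLAS or HanLP. Taking the negative logarithm turns a product of probabilities into a sum of weights, so the most probable word sequence becomes the path with the smallest total weight.

```python
import math


def edge_weight(freq_a, freq_ab, total_freq,
                smoothing_param=0.1, smoothing_factor=1e-5):
    """Weight of the edge from word A to word B.

    freq_a     -- total frequency of word A
    freq_ab    -- frequency with which word B follows word A
    total_freq -- total frequency of all words
    The two smoothing defaults are illustrative placeholders (assumptions).
    """
    # Smoothed unigram term: how common A is overall.
    unigram = smoothing_param * freq_a / total_freq
    # Smoothed bigram term: how often B follows A, never exactly zero.
    bigram = (1 - smoothing_param) * (
        (1 - smoothing_factor) * freq_ab / freq_a + smoothing_factor
    )
    return -math.log(unigram + bigram)


# A pair seen together often gets a lower weight than a pair never seen together.
frequent = edge_weight(freq_a=100, freq_ab=50, total_freq=10000)
unseen = edge_weight(freq_a=100, freq_ab=0, total_freq=10000)
print(frequent, unseen)
```

The smoothing factor keeps the argument of the logarithm strictly positive, so an unseen word pair receives a large but finite weight instead of an infinite one.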

Word graph printout: ======== printed by end node ========
to: 1, from: 0, weight:04.58, word:始##始@李
to: 2, from: 1, weight:10.90, word:李@胜
to: 3, from: 1, weight:10.90, word:李@胜利
to: 4, from: 2, weight:11.41, word:胜@利
to: 5, from: 3, weight:11.35, word:胜利@说
to: 5, from: 4, weight:04.47, word:利@说
to: 6, from: 5, weight:03.24, word:说@的
to: 7, from: 5, weight:08.46, word:说@的确
to: 8, from: 6, weight:05.68, word:的@确
to: 9, from: 6, weight:05.67, word:的@确实
to: 10, from: 7, weight:11.23, word:的确@实
to: 10, from: 8, weight:11.46, word:确@实
to: 11, from: 7, weight:11.23, word:的确@实在
to: 11, from: 8, weight:11.46, word:确@实在
to: 12, from: 9, weight:03.17, word:确实@在
to: 12, from: 10, weight:11.21, word:实@在
to: 13, from: 9, weight:10.87, word:确实@在理
to: 13, from: 10, weight:11.21, word:实@在理
to: 14, from: 11, weight:11.18, word:实在@理
to: 14, from: 12, weight:07.00, word:在@理
to: 15, from: 13, weight:11.61, word:在理@末##末
to: 15, from: 14, weight:05.59, word:理@末##末
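The shortest path through the printed word graph can be recovered with a small dynamic-programming pass. The sketch below is not HanLP's code. It assumes that each log line `word:A@B` describes an edge from the vertex for word A (`from`) to the vertex for word B (`to`), and that vertex ids are already in topological order (every edge goes from a lower id to a higher one, which holds for this printout). The names `parse_edges` and `shortest_path` are made up for the example.

```python
import re

LOG = """\
to: 1, from: 0, weight:04.58, word:始##始@李
to: 2, from: 1, weight:10.90, word:李@胜
to: 3, from: 1, weight:10.90, word:李@胜利
to: 4, from: 2, weight:11.41, word:胜@利
to: 5, from: 3, weight:11.35, word:胜利@说
to: 5, from: 4, weight:04.47, word:利@说
to: 6, from: 5, weight:03.24, word:说@的
to: 7, from: 5, weight:08.46, word:说@的确
to: 8, from: 6, weight:05.68, word:的@确
to: 9, from: 6, weight:05.67, word:的@确实
to: 10, from: 7, weight:11.23, word:的确@实
to: 10, from: 8, weight:11.46, word:确@实
to: 11, from: 7, weight:11.23, word:的确@实在
to: 11, from: 8, weight:11.46, word:确@实在
to: 12, from: 9, weight:03.17, word:确实@在
to: 12, from: 10, weight:11.21, word:实@在
to: 13, from: 9, weight:10.87, word:确实@在理
to: 13, from: 10, weight:11.21, word:实@在理
to: 14, from: 11, weight:11.18, word:实在@理
to: 14, from: 12, weight:07.00, word:在@理
to: 15, from: 13, weight:11.61, word:在理@末##末
to: 15, from: 14, weight:05.59, word:理@末##末
"""

LINE = re.compile(r"to:\s*(\d+), from:\s*(\d+), weight:([\d.]+), word:(.+)@(.+)")


def parse_edges(log):
    """Turn the printout into (from_id, to_id, weight, from_word, to_word) tuples."""
    edges = []
    for line in log.strip().splitlines():
        to_id, from_id, weight, word_a, word_b = LINE.match(line.strip()).groups()
        edges.append((int(from_id), int(to_id), float(weight), word_a, word_b))
    return edges


def shortest_path(edges, start, end):
    """Viterbi-style pass over a DAG whose vertex ids are topologically ordered."""
    words, incoming = {}, {}
    for from_id, to_id, weight, word_a, word_b in edges:
        words[from_id], words[to_id] = word_a, word_b
        incoming.setdefault(to_id, []).append((from_id, weight))
    dist, prev = {start: 0.0}, {}
    for node in sorted(incoming):
        # Keep only the cheapest way of reaching this vertex.
        reachable = [(dist[f] + w, f) for f, w in incoming[node] if f in dist]
        if reachable:
            dist[node], prev[node] = min(reachable)
    # Walk the back-pointers from the end vertex to the start vertex.
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[end], [words[n] for n in path]


total, words = shortest_path(parse_edges(LOG), start=0, end=15)
print(round(total, 2), "/".join(words[1:-1]))
```

On the weights printed above, this pass selects 李/胜利/说/的/确实/在/理 with a total weight of about 51.50.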

 

Reposted from: https://www.cnblogs.com/hx78/p/7309535.html
