A Study of How Moses Decoding Works: Decoding

the3gwireless

Category: Localization. Tags: natural language processing

Article link: https://blog.csdn.net/the3gwireless/article/details/9303145


Sentence to translate: this is a small house

I. Load the language model (see the previous article)
II. Load the phrase table (see the previous article)

III. Decoding

1. Split "this is a small house" into phrases as follows:

this, this is, this is a, is, is a, is a small, a, a small, a small house, small, small house, house...
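
The splitting step can be sketched in a few lines of Python. This is only an illustration, not the decoder's own code: the function name `enumerate_spans` and the `max_len` parameter are hypothetical. With `max_len=3` the sketch happens to produce exactly the twelve phrases listed above; without a limit it produces all 15 contiguous spans of a five-word sentence.

```python
def enumerate_spans(words, max_len=None):
    """Return (start, end, phrase) for every contiguous span; end is inclusive."""
    n = len(words)
    if max_len is None:
        max_len = n  # no limit: spans may be as long as the sentence
    spans = []
    for start in range(n):
        for end in range(start, min(n, start + max_len)):
            spans.append((start, end, " ".join(words[start:end + 1])))
    return spans


words = "this is a small house".split()
for start, end, phrase in enumerate_spans(words, max_len=3):
    # Same "[phrase ; start-end]" shape as the headers in the listings below.
    print(f"[{phrase} ; {start}-{end}]")
```
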
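2. Using the loaded phrase table, generate translation options for each of the phrases above. The options are pruned according to the ttable-limit parameter in moses.ini (for example, ttable-limit = 10), so only the 10 best-scoring options per phrase are kept. In the listings below the options are ordered by fullScore. For example: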
[this ; 0-0]
这 :0-0 : transScore=-0.17631271, ngramScore=0.0, fullScore=-0.4629019
此 :0-0 : transScore=-0.08303408, ngramScore=0.0, fullScore=-0.5492474
本 :0-0 : transScore=-0.15286377, ngramScore=0.0, fullScore=-0.5936865
这种 :0-0 : transScore=-0.28396568, ngramScore=0.0, fullScore=-0.67154866
该 :0-0 : transScore=-0.26927504, ngramScore=0.0, fullScore=-0.69091433
这个 :0-0 : transScore=-0.33787173, ngramScore=0.0, fullScore=-0.69774145
这一 :0-0 : transScore=-0.3776023, ngramScore=0.0, fullScore=-0.70066893
, 这 :0-1 : transScore=-0.5075526, ngramScore=0.0, fullScore=-0.76311255
这样 :0-0 : transScore=-0.42533168, ngramScore=0.0, fullScore=-0.8186509
, 这种 :0-1 : transScore=-0.61747473, ngramScore=0.0, fullScore=-0.93730986

[this is ; 0-1]
这是 :0-0 1-1 : transScore=-0.13185829, ngramScore=0.0, fullScore=-0.50451744
, 这是 :0-1 1-2 : transScore=-0.42933097, ngramScore=-0.08881984, fullScore=-0.60304374
时 , 这是 :0-2 1-3 : transScore=-0.5500913, ngramScore=-0.4338539, fullScore=-0.8187349
这 :0-0 : transScore=-0.6021664, ngramScore=0.0, fullScore=-0.8887556
这就是 :0-0 1-1 1-2 : transScore=-0.46652117, ngramScore=-0.08029249, fullScore=-0.9260251
这是一种 :0-0 1-1 1-2 1-3 : transScore=-0.68230426, ngramScore=-0.24731043, fullScore=-0.9609399
这一 :0-0 1-1 : transScore=-0.6766306, ngramScore=0.0, fullScore=-0.9996972
是 :0-0 1-0 : transScore=-0.8496002, ngramScore=0.0, fullScore=-1.0277934
这是一 :0-0 1-1 : transScore=-0.6806995, ngramScore=-0.15030792, fullScore=-1.0329995
我是 :0-0 1-1 : transScore=-0.6666119, ngramScore=0.0, fullScore=-1.0341808

[this is a ; 0-2]
这是一种 :0-0 1-1 2-2 2-3 : transScore=-0.36358035, ngramScore=-0.24731043, fullScore=-0.64221597
这是一个 :0-0 1-1 2-2 : transScore=-0.21216409, ngramScore=-0.2582455, fullScore=-0.6724017
这是一 :0-0 1-1 2-2 : transScore=-0.34449613, ngramScore=-0.15030792, fullScore=-0.69679624
这是 :0-0 1-1 2-1 : transScore=-0.35286462, ngramScore=0.0, fullScore=-0.7255238
, 这是一个 :0-1 1-2 2-3 : transScore=-0.4826488, ngramScore=-0.3470653, fullScore=-0.7439401
, 这是一种 :0-1 1-2 2-3 2-4 : transScore=-0.857316, ngramScore=-0.33613026, fullScore=-0.9370052
, 这是一 :0-1 1-2 2-3 : transScore=-0.8450625, ngramScore=-0.23912776, fullScore=-0.9984162
时 , 这是一个 :0-2 1-3 2-4 : transScore=-0.7015024, ngramScore=-0.6920994, fullScore=-1.0577245
这是一次 :0-0 1-1 2-2 2-3 : transScore=-0.5953888, ngramScore=-0.4648093, fullScore=-1.0915234
这一 :0-0 2-1 : transScore=-0.7742093, ngramScore=0.0, fullScore=-1.0972759

[is ; 1-1]
是 :0-0 : transScore=-0.14137065, ngramScore=0.0, fullScore=-0.31956387
为 :0-0 : transScore=-0.3349248, ngramScore=0.0, fullScore=-0.64058596
的是 :0-1 : transScore=-0.47829998, ngramScore=0.0, fullScore=-0.6865634
被 :0-0 : transScore=-0.38559377, ngramScore=0.0, fullScore=-0.6995167
时 :0-0 : transScore=-0.41431916, ngramScore=0.0, fullScore=-0.7161525
时 , :0-0 : transScore=-0.5482109, ngramScore=0.0, fullScore=-0.7243346
都 :0-0 : transScore=-0.47560424, ngramScore=0.0, fullScore=-0.7339959
就是 :0-0 0-1 : transScore=-0.47240877, ngramScore=0.0, fullScore=-0.7436588
已 :0-0 : transScore=-0.38845202, ngramScore=0.0, fullScore=-0.7609691
会 :0-0 : transScore=-0.4524555, ngramScore=0.0, fullScore=-0.7685862

[a ; 2-2]
的 :0-0 : transScore=-0.35968602, ngramScore=0.0, fullScore=-0.41077167
一 :0-0 : transScore=-0.24054566, ngramScore=0.0, fullScore=-0.47820085
一个 :0-0 : transScore=-0.17229484, ngramScore=0.0, fullScore=-0.48406982
一种 :0-0 0-1 : transScore=-0.32365996, ngramScore=0.0, fullScore=-0.58680063
了 :0-0 : transScore=-0.46956912, ngramScore=0.0, fullScore=-0.6240009
在 :0-0 : transScore=-0.42969412, ngramScore=0.0, fullScore=-0.6272611
的一 :0-0 0-1 : transScore=-0.4632901, ngramScore=0.0, fullScore=-0.65581596
的一个 :0-0 0-1 : transScore=-0.4457689, ngramScore=0.0, fullScore=-0.7019355
了一个 :0-0 0-1 : transScore=-0.46170878, ngramScore=0.0, fullScore=-0.7208837
的一种 :0-1 0-2 : transScore=-0.58426756, ngramScore=-0.11560842, fullScore=-0.7217348

[small ; 3-3]
小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617
小的 :0-0 : transScore=-0.34542146, ngramScore=0.0, fullScore=-0.77285254
的小 :0-1 : transScore=-0.42255682, ngramScore=0.0, fullScore=-0.8358102
小 , :0-0 : transScore=-0.50137746, ngramScore=0.0, fullScore=-0.9609436
一小 :0-1 : transScore=-0.6135466, ngramScore=0.0, fullScore=-1.0145993
, 小 :0-1 : transScore=-0.6167816, ngramScore=0.0, fullScore=-1.1090256
少 :0-0 : transScore=-0.56998473, ngramScore=0.0, fullScore=-1.1505985
很小 :0-0 0-1 : transScore=-0.47863993, ngramScore=0.0, fullScore=-1.1824566
一小的 :0-1 : transScore=-0.7328835, ngramScore=-0.23370379, fullScore=-1.1969728
的小的 :0-1 : transScore=-0.73342204, ngramScore=-0.22558104, fullScore=-1.2015893

[house ; 4-4]
房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848
内部 :0-0 : transScore=-0.49974135, ngramScore=0.0, fullScore=-1.0323822
家中 :0-0 : transScore=-0.48957917, ngramScore=0.0, fullScore=-1.070193
房子 :0-0 : transScore=-0.5858134, ngramScore=0.0, fullScore=-1.0896761
罩 :0-0 : transScore=-0.5311585, ngramScore=0.0, fullScore=-1.1117723
屋 :0-0 : transScore=-0.5533101, ngramScore=0.0, fullScore=-1.133924
的房子 :0-1 : transScore=-0.5701914, ngramScore=0.0, fullScore=-1.1583592
家里 :0-0 : transScore=-0.70180434, ngramScore=0.0, fullScore=-1.205667
房屋 :0-0 : transScore=-0.69276404, ngramScore=0.0, fullScore=-1.2416928
房 :0-0 : transScore=-0.7778376, ngramScore=0.0, fullScore=-1.3104784
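
The pruning in step 2 amounts to sorting the options for a phrase and keeping the first few. Below is a minimal sketch, assuming the ranking key is fullScore (which is what the ordering of the listings suggests); the class `TranslationOption` and the function `prune_options` are hypothetical names, and only four of the options for "is" are copied in to keep it short.

```python
from dataclasses import dataclass


@dataclass
class TranslationOption:
    target: str
    full_score: float  # log-domain score: closer to 0 is better


def prune_options(options, ttable_limit):
    """Keep the ttable_limit best options, best first."""
    ranked = sorted(options, key=lambda opt: opt.full_score, reverse=True)
    return ranked[:ttable_limit]


# Four of the options listed above for [is ; 1-1], deliberately out of order.
options_for_is = [
    TranslationOption("被", -0.6995167),
    TranslationOption("是", -0.31956387),
    TranslationOption("的是", -0.6865634),
    TranslationOption("为", -0.64058596),
]
kept = prune_options(options_for_is, ttable_limit=2)
print([opt.target for opt in kept])  # ['是', '为']
```
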

3. Compute the future cost of every source span as follows:
future cost from 0 to 0 is -0.4629019
future cost from 0 to 1 is -0.50451744
future cost from 0 to 2 is -0.6422157
future cost from 0 to 3 is -1.1676118
future cost from 0 to 4 is -2.0654967
future cost from 1 to 1 is -0.31956387
future cost from 1 to 2 is -0.73033553
future cost from 1 to 3 is -1.2557317
future cost from 1 to 4 is -2.1536164
future cost from 2 to 2 is -0.41077167
future cost from 2 to 3 is -0.93616784
future cost from 2 to 4 is -1.8340526
future cost from 3 to 3 is -0.52539617
future cost from 3 to 4 is -1.423281
future cost from 4 to 4 is -0.8978848
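
The table above is consistent with the usual dynamic program for future cost estimation: the cost of a span is the better of (a) the best fullScore among its own translation options and (b) the best sum of the costs of two adjacent sub-spans. The sketch below is an assumption about how such a table can be computed, not a copy of the decoder's code. It uses only the spans whose options are listed in step 2, and `future_cost_table` is a hypothetical name.

```python
def future_cost_table(n, best_option_score):
    """Return {(start, end): cost} for all spans of an n-word sentence."""
    cost = {}
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            end = start + length - 1
            # (a) translate the whole span with its single best option, if any
            best = best_option_score.get((start, end), float("-inf"))
            # (b) or split it into two shorter spans that are already costed
            for split in range(start, end):
                best = max(best, cost[(start, split)] + cost[(split + 1, end)])
            cost[(start, end)] = best
    return cost


# Best fullScore per span, copied from the first line of each listing in step 2.
best_option_score = {
    (0, 0): -0.4629019,   # this
    (0, 1): -0.50451744,  # this is
    (0, 2): -0.64221597,  # this is a
    (1, 1): -0.31956387,  # is
    (2, 2): -0.41077167,  # a
    (3, 3): -0.52539617,  # small
    (4, 4): -0.8978848,   # house
}
cost = future_cost_table(5, best_option_score)
for (start, end), value in sorted(cost.items()):
    print(f"future cost from {start} to {end} is {value:.7f}")
```

Up to rounding in the last digit, the printed values agree with the table above. For instance, the cost from 1 to 2 is the sum of the costs for "is" and "a", since no single option for "is a" beats that sum.
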

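Note that the future cost is a very important quantity. A hypothesis is not ranked on the score of the phrases it has already translated alone; the future cost of the source words it has not yet covered is added to give the overall score. This keeps promising hypotheses from being pruned early just because they happened to translate a difficult part of the sentence first.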
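
How the future cost enters the overall score can be illustrated as follows. This is a sketch under one assumption: the future cost of a partial hypothesis is the sum of the table entries for each maximal run of uncovered words. That assumption reproduces the numbers in the decoding log further down (for example, score -0.2710216 + future cost -2.1536164 = -2.424638 for the first hypothesis). The function name `remaining_future_cost` is hypothetical.

```python
def remaining_future_cost(covered, future_cost):
    """Sum the future cost of every maximal run of uncovered source words."""
    total = 0.0
    n = len(covered)
    i = 0
    while i < n:
        if covered[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and not covered[j + 1]:
            j += 1  # extend the uncovered run as far as it goes
        total += future_cost[(i, j)]
        i = j + 1
    return total


# Entries copied from the future cost table above (only the ones needed here).
future_cost = {
    (1, 4): -2.1536164,
    (2, 4): -1.8340526,
    (2, 2): -0.41077167,
    (4, 4): -0.8978848,
}

# "this" translated, words 1..4 still open.
fc = remaining_future_cost([True, False, False, False, False], future_cost)
print(-0.2710216 + fc)  # overall score used to rank the hypothesis

# Words 0, 1 and 3 translated: the gaps are "a" (2-2) and "house" (4-4).
print(remaining_future_cost([True, True, False, True, False], future_cost))
```
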

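4. Start decoding
Create an object holding (number of words in the sentence + 1) stacks; six in this example. The first stack is the initial stack and contains a single object, the root (empty) hypothesis.
The second stack holds all translation hypotheses that cover one source word. Hypotheses are built by expanding earlier hypotheses with the translation options listed above, e.g. the possible translations of "is", of "this", of "house", and so on.
Likewise, the third stack holds all hypotheses that cover two source words, e.g. this is, small house, is this ...
The fourth stack holds all hypotheses that cover three source words, e.g. this is a, a small house, is a small, small a is ...
The fifth stack holds all hypotheses that cover four source words, e.g. this is a small, is a small house, small a house is ...
The sixth stack holds all hypotheses that cover five source words, e.g. this is a small house, a small house is this ...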

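Note: hypotheses are expanded stack by stack, in order from the first stack to the last. Along the way the score of each hypothesis is computed and the stacks are pruned according to two parameters, a maximum stack size and a beam width (beamWidth); the purpose is simply to drop the worse hypotheses. Another important concept is recombination: when different search paths lead to equivalent hypotheses, they can be merged and the worse one can be safely discarded. For example: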

creating hypothesis 1 from 0 ( <s> )
      base score 0.0
      covering 0-0: this
      translated as: 这 :0-0 : transScore=-0.17631271, ngramScore=0.0, fullScore=-0.4629019
      score -0.2710216 + future cost -2.1536164 = -2.424638
      unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-3.3599875, p4=-0.55711037, p5=-0.72308487, p6=-1.9655273, p7=-1.9556606, p8=0.9998963>>
added hyp to stack, best on stack, now size 1

creating hypothesis 2 from 0 ( <s> )
      base score 0.0
      covering 0-0: this
      translated as: 此 :0-0 : transScore=-0.08303408, ngramScore=0.0, fullScore=-0.5492474
      score -0.4973371 + future cost -2.1536164 = -2.6509535
      unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-7.406447, p4=-0.3108975, p5=-0.3400173, p6=-0.99388206, p7=-0.99157476, p8=0.9998963>>
added hyp to stack, now size 2

creating hypothesis 3 from 0 ( <s> )
      base score 0.0
      covering 0-0: this
      translated as: 本 :0-0 : transScore=-0.15286377, ngramScore=0.0, fullScore=-0.5936865
      score -0.6334932 + future cost -2.1536164 = -2.7871096
      unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-8.246221, p4=-0.36767662, p5=-0.49863848, p6=-1.8613604, p7=-1.847803, p8=0.9998963>>
added hyp to stack, now size 3

....

creating hypothesis 18 from 0 ( <s> )
      base score 0.0
      covering 0-1: this is
      translated as: 是 :0-0 1-0 : transScore=-0.8496002, ngramScore=0.0, fullScore=-1.0277934
      score -1.0970447 + future cost -1.8340526 = -2.9310973
      unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-5.293809, p4=-6.979709, p5=-7.6139755, p6=-4.1303096, p7=-2.383111, p8=0.9998963>>
added hyp to stack, now size 7

....

creating hypothesis 1741 from 3 ( ... 本  )
      base score -0.6334932
      covering 1-1: is
      translated as: 是 :0-0 : transScore=-0.14137065, ngramScore=0.0, fullScore=-0.31956387
      score -1.5968711 + future cost -1.8340526 = -3.4309237
      unweighted feature scores: <<p0=0.0, p1=-2.0, p2=0.0, p3=-12.7939005, p4=-1.1784818, p5=-1.3951486, p6=-3.0983973, p7=-3.5451927, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 18, recombining

creating hypothesis 1742 from 3 ( ... 本  )
      base score -0.6334932
      covering 1-1: is
      translated as: 为 :0-0 : transScore=-0.3349248, ngramScore=0.0, fullScore=-0.64058596
      score -1.9178934 + future cost -1.8340526 = -3.751946
      unweighted feature scores: <<p0=0.0, p1=-2.0, p2=0.0, p3=-14.407804, p4=-2.525175, p5=-2.659915, p6=-4.372409, p7=-4.8097196, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 1702, recombining

creating hypothesis 1743 from 3 ( ... 本  )
      base score -0.6334932
      covering 1-1: is
      translated as: 的是 :0-1 : transScore=-0.47829998, ngramScore=0.0, fullScore=-0.6865634
      score -1.9474349 + future cost -1.8340526 = -3.7814875
      unweighted feature scores: <<p0=0.0, p1=-3.0, p2=0.0, p3=-15.127386, p4=-1.9342322, p5=-1.3951486, p6=-7.3437715, p7=-4.990869, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 1703, recombining

....

creating hypothesis 2575 from 1062 ( ... 为  )
      base score -10.722214
      covering 0-0: this
      translated as: 该 :0-0 : transScore=-0.26927504, ngramScore=0.0, fullScore=-0.69091433
      score -20.614857 + future cost -1.3086565 = -21.923513
      unweighted feature scores: <<p0=-8.0, p1=-3.0, p2=0.0, p3=-20.860205, p4=-4.4369936, p5=-5.5207715, p6=-5.7828665, p7=-6.230873, p8=2.9996889>>
discarded, too bad for stack

creating hypothesis 2576 from 1062 ( ... 为  )
      base score -10.722214
      covering 0-0: this
      translated as: 这个 :0-0 : transScore=-0.33787173, ngramScore=0.0, fullScore=-0.69774145
      score -20.621683 + future cost -1.3086565 = -21.93034
      unweighted feature scores: <<p0=-8.0, p1=-3.0, p2=0.0, p3=-20.078125, p4=-3.5377154, p5=-4.6409087, p6=-7.7203236, p7=-8.149171, p8=2.9996889>>
discarded, too bad for stack

creating hypothesis 2577 from 1062 ( ... 为  )
      base score -10.722214
      covering 0-0: this
      translated as: 这一 :0-0 : transScore=-0.3776023, ngramScore=0.0, fullScore=-0.70066893
      score -20.62461 + future cost -1.3086565 = -21.933268
      unweighted feature scores: <<p0=-8.0, p1=-4.0, p2=0.0, p3=-21.773006, p4=-3.9289849, p5=-4.709192, p6=-8.007003, p7=-11.231518, p8=2.9996889>>
discarded, too bad for stack

..........
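
The three outcomes seen in the log ("added hyp to stack", "equiv hypo exists ... recombining", "discarded, too bad for stack") can be mimicked with a small sketch. Everything here is an assumption made for illustration: the recombination key (covered words plus last target word), the interpretation of the beam as a log-domain margin `beam_margin` below the best score on the stack, the values 5.0 and 100, and the last two totals in the usage lines. The real equivalence test depends on the language model order and other decoder state.

```python
def add_to_stack(stack, key, total, beam_margin, max_size):
    """stack maps a recombination key to the best total score seen for it."""
    best = max(stack.values(), default=float("-inf"))
    if total < best - beam_margin:
        return "discarded"            # too far below the best on this stack
    if key in stack:
        # An equivalent hypothesis exists: keep only the better of the two.
        stack[key] = max(stack[key], total)
        return "recombined"
    stack[key] = total
    if len(stack) > max_size:
        worst = min(stack, key=stack.get)
        del stack[worst]              # histogram pruning: drop the worst entry
    return "added"


stack = {}
key_a = (frozenset({0, 1}), "是")
# Totals (score + future cost) of hypotheses 18 and 1741 from the log above.
print(add_to_stack(stack, key_a, -2.9310973, beam_margin=5.0, max_size=100))  # added
print(add_to_stack(stack, key_a, -3.4309237, beam_margin=5.0, max_size=100))  # recombined
# A made-up total far below the best one on the stack.
key_b = (frozenset({0, 1}), "为")
print(add_to_stack(stack, key_b, -30.0, beam_margin=5.0, max_size=100))       # discarded
```

Keeping only the better of two equivalent hypotheses is safe because any continuation of the worse one could be applied to the better one with the same additional score.
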

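Once all hypotheses have been processed, the hypotheses on the last stack (those that cover every source word) are sorted and the best translation is output: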

total hypotheses considered = 14381
         number not built = 0
   number discarded early = 0
         number discarded = 4067
      number recombined = 9373
            number pruned = 228
time to collect opts 0.093 (6%)
      create hyps    0.0 (0%)
      estimate score  0.0 (0%)
      calc lm       0.064 (4%)
      other hyp score 0.0 (0%)
      manage stacks 0.389 (28%)
      other          0.798 (59%)
total source words = 5
   words deleted = 0 ()
words inserted = 0 ()
Search took 1.297 seconds
这是一个 |0-2|小 |3-3|房间|4-4|
BEST TRANSLATION: 这是一个小房间 [11111]  [total=-3.274862] <<p0=0.0, p1=-5.0, p2=0.0, p3=-30.024462, p4=-5.826433, p5=-8.26403, p6=-3.249391, p7=-8.328738, p8=2.9996889>>
Sentence Decoding Time: : [3.797] seconds
Source and Target Units:this is a small house ] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][ ] 小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617:[3..3][] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][ ] 小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617:[3..3][ ] 这是一个 :0-0 1-1 2-2 : transScore=-0.21216409, ngramScore=-0.2582455, fullScore=-0.6724017:[0..2][
Translation took 1.359 seconds
Finished translating
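
Reading off the result can be sketched as: pick the best-scoring hypothesis on the last stack, then follow back-pointers to the root to recover the phrase sequence. The names `Node` and `best_translation` are hypothetical. The winning path and its total are taken from the output above; the competing path ending in 房子 and its total of -3.5 are made up so that there is something to compare against.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    target: str
    span: Tuple[int, int]           # source words covered by this phrase
    prev: Optional["Node"] = None   # parent hypothesis; None means the root


def best_translation(last_stack):
    """last_stack is a list of (total_score, node); return the best path."""
    total, node = max(last_stack, key=lambda item: item[0])
    path = []
    while node is not None:
        path.append(node)
        node = node.prev
    path.reverse()  # back-pointers run from last phrase to first
    return total, path


first = Node("这是一个", (0, 2))
second = Node("小", (3, 3), first)
winner = Node("房间", (4, 4), second)
rival = Node("房子", (4, 4), second)  # hypothetical competitor

total, path = best_translation([(-3.5, rival), (-3.274862, winner)])
print(" ".join(f"{n.target} |{n.span[0]}-{n.span[1]}|" for n in path))
print("BEST TRANSLATION:", "".join(n.target for n in path), f"[total={total}]")
```
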

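Note: I have not yet studied reordering, so I removed reordering from moses.ini for this run. I will post a follow-up once I have looked into it.
This write-up is fairly rough and parts of it are from memory, so there may be mistakes. Corrections are welcome. Thanks!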