A Study of How Moses Decoding Works - Decoding

Sentence to translate: this is a small house

I. Load the language model (see the earlier article)
II. Load the phrase table (see the earlier article)

III. Decoding

1. Split "this is a small house" into phrases as follows:

this, this is, this is a, is, is a, is a small, a, a small, a small house, small, small house, house...
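
The splitting step can be sketched in a few lines of Python. This is only an illustration, not Moses code: the function name enumerate_spans and the max_len parameter are assumptions, and max_len=3 is chosen simply because it reproduces the twelve phrases listed above.

```python
def enumerate_spans(words, max_len):
    """Return (start, end, phrase) for every contiguous span of at most max_len words.

    max_len is a hypothetical parameter; it is not taken from moses.ini.
    """
    spans = []
    for start in range(len(words)):
        for end in range(start, min(start + max_len, len(words))):
            spans.append((start, end, " ".join(words[start:end + 1])))
    return spans


spans = enumerate_spans("this is a small house".split(), max_len=3)
for start, end, phrase in spans:
    # Same "[phrase ; start-end]" layout as the option listing in the next step.
    print(f"[{phrase} ; {start}-{end}]")
```

Each span keeps its start and end word positions, because later steps refer to phrases by the source words they cover (for example 0-1 for "this is").
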
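2. Using the loaded phrase table, generate the translation options for each of the phrases above. The options are pruned according to the ttable-limit parameter in moses.ini (for example, 10), so that only the 10 best-scoring options per phrase are kept. For example: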
[this ; 0-0]
        这 :0-0 : transScore=-0.17631271, ngramScore=0.0, fullScore=-0.4629019
        此 :0-0 : transScore=-0.08303408, ngramScore=0.0, fullScore=-0.5492474
        本 :0-0 : transScore=-0.15286377, ngramScore=0.0, fullScore=-0.5936865
        这种 :0-0 : transScore=-0.28396568, ngramScore=0.0, fullScore=-0.67154866
        该 :0-0 : transScore=-0.26927504, ngramScore=0.0, fullScore=-0.69091433
        这个 :0-0 : transScore=-0.33787173, ngramScore=0.0, fullScore=-0.69774145
        这 一 :0-0 : transScore=-0.3776023, ngramScore=0.0, fullScore=-0.70066893
        , 这 :0-1 : transScore=-0.5075526, ngramScore=0.0, fullScore=-0.76311255
        这样 :0-0 : transScore=-0.42533168, ngramScore=0.0, fullScore=-0.8186509
        , 这种 :0-1 : transScore=-0.61747473, ngramScore=0.0, fullScore=-0.93730986

[this is ; 0-1]
        这 是 :0-0 1-1 : transScore=-0.13185829, ngramScore=0.0, fullScore=-0.50451744
        , 这 是 :0-1 1-2 : transScore=-0.42933097, ngramScore=-0.08881984, fullScore=-0.60304374
        时 , 这 是 :0-2 1-3 : transScore=-0.5500913, ngramScore=-0.4338539, fullScore=-0.8187349
        这 :0-0 : transScore=-0.6021664, ngramScore=0.0, fullScore=-0.8887556
        这 就 是 :0-0 1-1 1-2 : transScore=-0.46652117, ngramScore=-0.08029249, fullScore=-0.9260251
        这 是 一 种 :0-0 1-1 1-2 1-3 : transScore=-0.68230426, ngramScore=-0.24731043, fullScore=-0.9609399
        这 一 :0-0 1-1 : transScore=-0.6766306, ngramScore=0.0, fullScore=-0.9996972
        是 :0-0 1-0 : transScore=-0.8496002, ngramScore=0.0, fullScore=-1.0277934
        这 是 一 :0-0 1-1 : transScore=-0.6806995, ngramScore=-0.15030792, fullScore=-1.0329995
        我 是 :0-0 1-1 : transScore=-0.6666119, ngramScore=0.0, fullScore=-1.0341808

[this is a ; 0-2]
        这 是 一 种 :0-0 1-1 2-2 2-3 : transScore=-0.36358035, ngramScore=-0.24731043, fullScore=-0.64221597
        这 是 一个 :0-0 1-1 2-2 : transScore=-0.21216409, ngramScore=-0.2582455, fullScore=-0.6724017
        这 是 一 :0-0 1-1 2-2 : transScore=-0.34449613, ngramScore=-0.15030792, fullScore=-0.69679624
        这 是 :0-0 1-1 2-1 : transScore=-0.35286462, ngramScore=0.0, fullScore=-0.7255238
        , 这 是 一个 :0-1 1-2 2-3 : transScore=-0.4826488, ngramScore=-0.3470653, fullScore=-0.7439401
        , 这 是 一 种 :0-1 1-2 2-3 2-4 : transScore=-0.857316, ngramScore=-0.33613026, fullScore=-0.9370052
        , 这 是 一 :0-1 1-2 2-3 : transScore=-0.8450625, ngramScore=-0.23912776, fullScore=-0.9984162
        时 , 这 是 一个 :0-2 1-3 2-4 : transScore=-0.7015024, ngramScore=-0.6920994, fullScore=-1.0577245
        这 是 一 次 :0-0 1-1 2-2 2-3 : transScore=-0.5953888, ngramScore=-0.4648093, fullScore=-1.0915234
        这 一 :0-0 2-1 : transScore=-0.7742093, ngramScore=0.0, fullScore=-1.0972759

[is ; 1-1]
        是 :0-0 : transScore=-0.14137065, ngramScore=0.0, fullScore=-0.31956387
        为 :0-0 : transScore=-0.3349248, ngramScore=0.0, fullScore=-0.64058596
        的 是 :0-1 : transScore=-0.47829998, ngramScore=0.0, fullScore=-0.6865634
        被 :0-0 : transScore=-0.38559377, ngramScore=0.0, fullScore=-0.6995167
        时 :0-0 : transScore=-0.41431916, ngramScore=0.0, fullScore=-0.7161525
        时 , :0-0 : transScore=-0.5482109, ngramScore=0.0, fullScore=-0.7243346
        都 :0-0 : transScore=-0.47560424, ngramScore=0.0, fullScore=-0.7339959
        就 是 :0-0 0-1 : transScore=-0.47240877, ngramScore=0.0, fullScore=-0.7436588
        已 :0-0 : transScore=-0.38845202, ngramScore=0.0, fullScore=-0.7609691
        会 :0-0 : transScore=-0.4524555, ngramScore=0.0, fullScore=-0.7685862

[a ; 2-2]
        的 :0-0 : transScore=-0.35968602, ngramScore=0.0, fullScore=-0.41077167
        一 :0-0 : transScore=-0.24054566, ngramScore=0.0, fullScore=-0.47820085
        一个 :0-0 : transScore=-0.17229484, ngramScore=0.0, fullScore=-0.48406982
        一 种 :0-0 0-1 : transScore=-0.32365996, ngramScore=0.0, fullScore=-0.58680063
        了 :0-0 : transScore=-0.46956912, ngramScore=0.0, fullScore=-0.6240009
        在 :0-0 : transScore=-0.42969412, ngramScore=0.0, fullScore=-0.6272611
        的 一 :0-0 0-1 : transScore=-0.4632901, ngramScore=0.0, fullScore=-0.65581596
        的 一个 :0-0 0-1 : transScore=-0.4457689, ngramScore=0.0, fullScore=-0.7019355
        了 一个 :0-0 0-1 : transScore=-0.46170878, ngramScore=0.0, fullScore=-0.7208837
        的 一 种 :0-1 0-2 : transScore=-0.58426756, ngramScore=-0.11560842, fullScore=-0.7217348

[small ; 3-3]
        小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617
        小 的 :0-0 : transScore=-0.34542146, ngramScore=0.0, fullScore=-0.77285254
        的 小 :0-1 : transScore=-0.42255682, ngramScore=0.0, fullScore=-0.8358102
        小 , :0-0 : transScore=-0.50137746, ngramScore=0.0, fullScore=-0.9609436
        一 小 :0-1 : transScore=-0.6135466, ngramScore=0.0, fullScore=-1.0145993
        , 小 :0-1 : transScore=-0.6167816, ngramScore=0.0, fullScore=-1.1090256
        少 :0-0 : transScore=-0.56998473, ngramScore=0.0, fullScore=-1.1505985
        很 小 :0-0 0-1 : transScore=-0.47863993, ngramScore=0.0, fullScore=-1.1824566
        一 小 的 :0-1 : transScore=-0.7328835, ngramScore=-0.23370379, fullScore=-1.1969728
        的 小 的 :0-1 : transScore=-0.73342204, ngramScore=-0.22558104, fullScore=-1.2015893

[house ; 4-4]
        房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848
        内部 :0-0 : transScore=-0.49974135, ngramScore=0.0, fullScore=-1.0323822
        家中 :0-0 : transScore=-0.48957917, ngramScore=0.0, fullScore=-1.070193
        房子 :0-0 : transScore=-0.5858134, ngramScore=0.0, fullScore=-1.0896761
        罩 :0-0 : transScore=-0.5311585, ngramScore=0.0, fullScore=-1.1117723
        屋 :0-0 : transScore=-0.5533101, ngramScore=0.0, fullScore=-1.133924
        的 房子 :0-1 : transScore=-0.5701914, ngramScore=0.0, fullScore=-1.1583592
        家里 :0-0 : transScore=-0.70180434, ngramScore=0.0, fullScore=-1.205667
        房屋 :0-0 : transScore=-0.69276404, ngramScore=0.0, fullScore=-1.2416928
        房 :0-0 : transScore=-0.7778376, ngramScore=0.0, fullScore=-1.3104784
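
The effect of ttable-limit can be sketched as a plain top-N selection. The function prune_options is a hypothetical helper, and ranking by fullScore is an assumption based on the order of the listings above, not a statement about the exact score Moses uses internally.

```python
def prune_options(options, ttable_limit):
    """Keep the ttable_limit best options, highest log score first.

    options is a list of (target_phrase, full_score) pairs.
    """
    ranked = sorted(options, key=lambda opt: opt[1], reverse=True)
    return ranked[:ttable_limit]


# (target phrase, fullScore) pairs copied from the [house ; 4-4] listing,
# deliberately given out of order.
house_options = [
    ("房子", -1.0896761),
    ("房间", -0.8978848),
    ("家中", -1.070193),
    ("内部", -1.0323822),
]
print(prune_options(house_options, ttable_limit=2))
```

With a limit of 2, only 房间 and 内部 survive, which are the first two entries of the listing.
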

3. Compute the future cost of every source span (from word i to word j), as follows:
future cost from 0 to 0 is -0.4629019
future cost from 0 to 1 is -0.50451744
future cost from 0 to 2 is -0.6422157
future cost from 0 to 3 is -1.1676118
future cost from 0 to 4 is -2.0654967
future cost from 1 to 1 is -0.31956387
future cost from 1 to 2 is -0.73033553
future cost from 1 to 3 is -1.2557317
future cost from 1 to 4 is -2.1536164
future cost from 2 to 2 is -0.41077167
future cost from 2 to 3 is -0.93616784
future cost from 2 to 4 is -1.8340526
future cost from 3 to 3 is -0.52539617
future cost from 3 to 4 is -1.423281
future cost from 4 to 4 is -0.8978848

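Note: the future cost is a very important quantity. A hypothesis is not ranked by the score of the phrases it has already translated alone; the future cost of the words it has not yet covered is added to give an overall score. The purpose is to keep promising hypotheses from being pruned right at the start.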
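
The table above is consistent with a simple dynamic program: the cost of a span is the better of (a) the best translation option for the whole span and (b) the best way of splitting the span into two cheaper parts. The sketch below is an illustration under assumptions, not Moses code: future_costs is a hypothetical name, the input scores are the top fullScore values from the listings in step 2, and spans with no listing (such as "is a") are treated as having no option of their own.

```python
def future_costs(n, best_option):
    """cost[(i, j)] = best achievable log score for source words i..j inclusive."""
    cost = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # (a) translate the whole span with one phrase, if an option exists
            best = best_option.get((i, j), float("-inf"))
            # (b) or combine the costs of two adjacent sub-spans
            for k in range(i, j):
                best = max(best, cost[(i, k)] + cost[(k + 1, j)])
            cost[(i, j)] = best
    return cost


# Best fullScore per span, copied from the option listings in step 2.
best_option = {
    (0, 0): -0.4629019,   # this      -> 这
    (0, 1): -0.50451744,  # this is   -> 这 是
    (0, 2): -0.64221597,  # this is a -> 这 是 一 种
    (1, 1): -0.31956387,  # is        -> 是
    (2, 2): -0.41077167,  # a         -> 的
    (3, 3): -0.52539617,  # small     -> 小
    (4, 4): -0.8978848,   # house     -> 房间
}
cost = future_costs(5, best_option)
for (i, j), value in sorted(cost.items()):
    print(f"future cost from {i} to {j} is {value:.7f}")
```

Up to rounding, the computed values agree with the table above. The same numbers reappear in the decoder log further down: hypothesis 1 covers only word 0, so its overall score is its own score plus the future cost from 1 to 4, that is -0.2710216 + (-2.1536164) = -2.424638.
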

4. Start decoding
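Create (number of words in the sentence + 1) stacks; in this example that is 6. The first stack is the initial stack and holds a single object, the root hypothesis.
The second stack holds all translation hypotheses that cover one source word. Hypotheses are built by expanding with the translation options listed earlier, for example the various translations of is, the various translations of this, the various translations of house, ...
Likewise, the third stack holds all hypotheses that cover two source words, for example: this is, small house, is this, ...
The fourth stack holds all hypotheses that cover three source words, for example: this is a, a small house, is a small, small a is, ...
The fifth stack holds all hypotheses that cover four source words, for example: this is a small, is a small house, small a house is, ...
The sixth stack holds all hypotheses that cover five source words, for example: this is a small house, a small house is this, ...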
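
The stack layout described above can be sketched as follows. This is a simplified illustration, not the Moses implementation: the Hypothesis class and decode function are hypothetical, each option's fullScore from step 2 is used as its score (a real decoder also rescores with the language model in context), and there is no pruning, recombination or reordering limit here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    coverage: frozenset                   # indices of source words translated so far
    target: tuple                         # target words produced so far
    score: float                          # accumulated log score
    prev: Optional["Hypothesis"] = None   # hypothesis this one was expanded from


def decode(n, options):
    """options maps (start, end) -> list of (target_phrase, log_score)."""
    stacks = [[] for _ in range(n + 1)]   # stack k holds hypotheses covering k words
    stacks[0].append(Hypothesis(frozenset(), (), 0.0))   # the root
    for stack in stacks[:-1]:             # expand the stacks in order
        for hyp in stack:
            for (start, end), candidates in options.items():
                span = frozenset(range(start, end + 1))
                if span & hyp.coverage:   # some of these words are already covered
                    continue
                for phrase, log_score in candidates:
                    new = Hypothesis(hyp.coverage | span,
                                     hyp.target + tuple(phrase.split()),
                                     hyp.score + log_score,
                                     hyp)
                    stacks[len(new.coverage)].append(new)
    return stacks


# One option per span, taken from the listings in step 2.
options = {
    (0, 0): [("这", -0.4629019)],
    (0, 1): [("这 是", -0.50451744)],
    (0, 2): [("这 是 一个", -0.6724017)],
    (1, 1): [("是", -0.31956387)],
    (2, 2): [("一个", -0.48406982)],
    (3, 3): [("小", -0.52539617)],
    (4, 4): [("房间", -0.8978848)],
}
stacks = decode(5, options)
for k, stack in enumerate(stacks):
    print(f"stack {k}: {len(stack)} hypotheses")
```

A hypothesis always lands in the stack whose index equals the number of source words it covers, so a phrase that covers two words (such as "this is") moves a hypothesis two stacks ahead in one step.
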

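Note: hypotheses are expanded stack by stack, in order from the first stack to the last. Along the way the score of each hypothesis is computed and the stacks are pruned according to two parameters, the maximum stack size and the beam width (beamWidth), so that poor hypotheses are dropped. Another important concept is recombination: equivalent hypotheses that were reached through different search paths are merged, and the worse one can be safely discarded. For example: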

creating hypothesis 1 from 0 ( <s> )
        base score 0.0
        covering 0-0: this
        translated as: 这 :0-0 : transScore=-0.17631271, ngramScore=0.0, fullScore=-0.4629019
        score -0.2710216 + future cost -2.1536164 = -2.424638
        unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-3.3599875, p4=-0.55711037, p5=-0.72308487, p6=-1.9655273, p7=-1.9556606, p8=0.9998963>>
added hyp to stack, best on stack, now size 1

creating hypothesis 2 from 0 ( <s> )
        base score 0.0
        covering 0-0: this
        translated as: 此 :0-0 : transScore=-0.08303408, ngramScore=0.0, fullScore=-0.5492474
        score -0.4973371 + future cost -2.1536164 = -2.6509535
        unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-7.406447, p4=-0.3108975, p5=-0.3400173, p6=-0.99388206, p7=-0.99157476, p8=0.9998963>>
added hyp to stack, now size 2

creating hypothesis 3 from 0 ( <s> )
        base score 0.0
        covering 0-0: this
        translated as: 本 :0-0 : transScore=-0.15286377, ngramScore=0.0, fullScore=-0.5936865
        score -0.6334932 + future cost -2.1536164 = -2.7871096
        unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-8.246221, p4=-0.36767662, p5=-0.49863848, p6=-1.8613604, p7=-1.847803, p8=0.9998963>>
added hyp to stack, now size 3

....

creating hypothesis 18 from 0 ( <s> )
        base score 0.0
        covering 0-1: this is
        translated as: 是 :0-0 1-0 : transScore=-0.8496002, ngramScore=0.0, fullScore=-1.0277934
        score -1.0970447 + future cost -1.8340526 = -2.9310973
        unweighted feature scores: <<p0=0.0, p1=-1.0, p2=0.0, p3=-5.293809, p4=-6.979709, p5=-7.6139755, p6=-4.1303096, p7=-2.383111, p8=0.9998963>>
added hyp to stack, now size 7

....

creating hypothesis 1741 from 3 ( ... 本  )
        base score -0.6334932
        covering 1-1: is
        translated as: 是 :0-0 : transScore=-0.14137065, ngramScore=0.0, fullScore=-0.31956387
        score -1.5968711 + future cost -1.8340526 = -3.4309237
        unweighted feature scores: <<p0=0.0, p1=-2.0, p2=0.0, p3=-12.7939005, p4=-1.1784818, p5=-1.3951486, p6=-3.0983973, p7=-3.5451927, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 18, recombining

creating hypothesis 1742 from 3 ( ... 本  )
        base score -0.6334932
        covering 1-1: is
        translated as: 为 :0-0 : transScore=-0.3349248, ngramScore=0.0, fullScore=-0.64058596
        score -1.9178934 + future cost -1.8340526 = -3.751946
        unweighted feature scores: <<p0=0.0, p1=-2.0, p2=0.0, p3=-14.407804, p4=-2.525175, p5=-2.659915, p6=-4.372409, p7=-4.8097196, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 1702, recombining

creating hypothesis 1743 from 3 ( ... 本  )
        base score -0.6334932
        covering 1-1: is
        translated as: 的 是 :0-1 : transScore=-0.47829998, ngramScore=0.0, fullScore=-0.6865634
        score -1.9474349 + future cost -1.8340526 = -3.7814875
        unweighted feature scores: <<p0=0.0, p1=-3.0, p2=0.0, p3=-15.127386, p4=-1.9342322, p5=-1.3951486, p6=-7.3437715, p7=-4.990869, p8=1.9997926>>
equiv hypo exists
worse than matching hyp 1703, recombining

....

creating hypothesis 2575 from 1062 ( ... 为  )
        base score -10.722214
        covering 0-0: this
        translated as: 该 :0-0 : transScore=-0.26927504, ngramScore=0.0, fullScore=-0.69091433
        score -20.614857 + future cost -1.3086565 = -21.923513
        unweighted feature scores: <<p0=-8.0, p1=-3.0, p2=0.0, p3=-20.860205, p4=-4.4369936, p5=-5.5207715, p6=-5.7828665, p7=-6.230873, p8=2.9996889>>
discarded, too bad for stack

creating hypothesis 2576 from 1062 ( ... 为  )
        base score -10.722214
        covering 0-0: this
        translated as: 这个 :0-0 : transScore=-0.33787173, ngramScore=0.0, fullScore=-0.69774145
        score -20.621683 + future cost -1.3086565 = -21.93034
        unweighted feature scores: <<p0=-8.0, p1=-3.0, p2=0.0, p3=-20.078125, p4=-3.5377154, p5=-4.6409087, p6=-7.7203236, p7=-8.149171, p8=2.9996889>>
discarded, too bad for stack

creating hypothesis 2577 from 1062 ( ... 为  )
        base score -10.722214
        covering 0-0: this
        translated as: 这 一 :0-0 : transScore=-0.3776023, ngramScore=0.0, fullScore=-0.70066893
        score -20.62461 + future cost -1.3086565 = -21.933268
        unweighted feature scores: <<p0=-8.0, p1=-4.0, p2=0.0, p3=-21.773006, p4=-3.9289849, p5=-4.709192, p6=-8.007003, p7=-11.231518, p8=2.9996889>>
discarded, too bad for stack

..........
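
The three outcomes seen in the log (added, recombined, discarded) can be sketched with one small function. Everything here is an assumption made for illustration: add_to_stack is a hypothetical helper, beam_width is treated as a margin in log-score units below the best hypothesis on the stack, and two hypotheses are treated as equivalent when they cover the same source words and end in the same target word, which is consistent with hypothesis 1741 being recombined with hypothesis 18 above.

```python
def add_to_stack(stack, hyp, max_stack_size, beam_width):
    """stack maps a recombination key to the best hypothesis with that key.

    hyp is a dict with 'id', 'coverage', 'last_word' and 'total'
    (total = score + future cost). Returns 'added', 'recombined' or 'discarded'.
    """
    best = max((h["total"] for h in stack.values()), default=float("-inf"))
    if hyp["total"] < best - beam_width:          # too far below the best: beam pruning
        return "discarded"
    key = (hyp["coverage"], hyp["last_word"])     # assumed equivalence test
    rival = stack.get(key)
    if rival is not None:                         # equivalent hypothesis exists
        if hyp["total"] > rival["total"]:
            stack[key] = hyp                      # keep only the better of the two
        return "recombined"
    stack[key] = hyp
    if len(stack) > max_stack_size:               # stack too long: drop the worst
        worst_key = min(stack, key=lambda k: stack[k]["total"])
        del stack[worst_key]
        if worst_key == key:
            return "discarded"
    return "added"


# Totals copied from the log above; the parameter values are hypothetical.
h18 = {"id": 18, "coverage": frozenset({0, 1}), "last_word": "是", "total": -2.9310973}
h1741 = {"id": 1741, "coverage": frozenset({0, 1}), "last_word": "是", "total": -3.4309237}
stack2 = {}
print(add_to_stack(stack2, h18, max_stack_size=100, beam_width=10.0))
print(add_to_stack(stack2, h1741, max_stack_size=100, beam_width=10.0))

h1 = {"id": 1, "coverage": frozenset({0}), "last_word": "这", "total": -2.424638}
h2 = {"id": 2, "coverage": frozenset({0}), "last_word": "此", "total": -2.6509535}
h3 = {"id": 3, "coverage": frozenset({0}), "last_word": "本", "total": -2.7871096}
stack1 = {}
for h in (h1, h2, h3):
    print(h["id"], add_to_stack(stack1, h, max_stack_size=100, beam_width=0.3))
```

Recombination is safe because the two hypotheses can be extended in exactly the same ways from that point on, so the worse one can never overtake the better one.
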

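Once all hypotheses have been expanded, the hypotheses in the last stack (those that cover every source word) are sorted and the best translation is output: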


total hypotheses considered = 14381
           number not built = 0
     number discarded early = 0
           number discarded = 4067
          number recombined = 9373
              number pruned = 228
time to collect opts    0.093 (6%)
        create hyps     0.0 (0%)
        estimate score  0.0 (0%)
        calc lm         0.064 (4%)
        other hyp score 0.0 (0%)
        manage stacks   0.389 (28%)
        other           0.798 (59%)
total source words = 5
     words deleted = 0 ()
    words inserted = 0 ()
Search took 1.297 seconds
这 是 一个 |0-2|小 |3-3|房间|4-4|
BEST TRANSLATION: 这 是 一个 小 房间 [11111]  [total=-3.274862] <<p0=0.0, p1=-5.0, p2=0.0, p3=-30.024462, p4=-5.826433, p5=-8.26403, p6=-3.249391, p7=-8.328738, p8=2.9996889>>
Sentence Decoding Time: : [3.797] seconds
Source and Target Units:this is a small house ] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][ ] 小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617:[3..3][] 房间 :0-0 : transScore=-0.3652439, ngramScore=0.0, fullScore=-0.8978848:[4..4][ ] 小 :0-0 : transScore=-0.15287913, ngramScore=0.0, fullScore=-0.52539617:[3..3][ ] 这 是 一个 :0-0 1-1 2-2 : transScore=-0.21216409, ngramScore=-0.2582455, fullScore=-0.6724017:[0..2][
Translation took 1.359 seconds
Finished translating
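
The final selection can be sketched as picking the best hypothesis on the last stack and following its back-pointers to the root. The Hyp class and best_translation function are hypothetical; the winning segments and the total of -3.274862 are taken from the output above, while the competing hypothesis and its score are invented purely so that there is something to compare against.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Hyp:
    phrase: str                        # target phrase added by this step
    span: Optional[Tuple[int, int]]    # source words covered by this step
    total: Optional[float] = None      # total score; only needed on the last stack
    prev: Optional["Hyp"] = None       # back-pointer to the previous hypothesis


def best_translation(final_stack):
    """Return (total, segments) for the highest-scoring complete hypothesis."""
    best = max(final_stack, key=lambda h: h.total)
    segments = []
    node = best
    while node.prev is not None:       # stop at the root, which has no back-pointer
        segments.append((node.phrase, node.span))
        node = node.prev
    segments.reverse()                 # back-pointers run from last phrase to first
    return best.total, segments


root = Hyp("", None)
a = Hyp("这 是 一个", (0, 2), prev=root)
b = Hyp("小", (3, 3), prev=a)
winner = Hyp("房间", (4, 4), total=-3.274862, prev=b)
rival = Hyp("房子", (4, 4), total=-3.5, prev=b)   # hypothetical competitor, invented score

total, segments = best_translation([rival, winner])
print(" ".join(f"{p} |{s[0]}-{s[1]}|" for p, s in segments), total)
```
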

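Note: I have not studied reordering yet, so I disabled reordering in moses.ini for this run. I will post about it once I have looked into it.
This write-up is fairly rough and parts of it are from memory, so there may be mistakes. Corrections are welcome, thank you!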
