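In short, IBM Model 1 aims to estimate t(e|f) = c(e,f) / c(f), where c(e,f) is the number of times the target word e is aligned with the foreign word f, and c(f) is the number of times f is aligned with any word. However, c(e,f) and c(f) cannot be counted directly, because the word alignments are not observed in the training data, so the problem is decomposed as follows:
p(a|e,f) is the word alignment model and t(e|f) is the translation model; the expected counts are accumulated over the corpus, and the parameters are solved iteratively with the EM algorithm.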
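The role of the alignment model can be sketched in a few lines. This is a minimal illustration, assuming the usual IBM Model 1 setting in which every alignment position is equally likely a priori, so the posterior for each target word is simply its t(e|f) value normalized over the foreign words in the sentence. The function name `alignment_posterior` and the values in `toy_t` are hypothetical and chosen only for illustration.

```python
def alignment_posterior(e_sentence, f_sentence, t):
    """Return post[j][i] = P(e_j is aligned to f_i | e, f).

    Assumes a uniform alignment prior (IBM Model 1), so each target
    word's posterior is t(e|f) normalized over the foreign sentence.
    """
    posterior = []
    for e in e_sentence:
        scores = [t[(e, f)] for f in f_sentence]
        total = sum(scores)
        posterior.append([s / total for s in scores])
    return posterior


# Hypothetical translation table: toy_t[(e, f)] stands for t(e|f).
toy_t = {
    ("a", "一本"): 0.5, ("a", "书"): 0.25,
    ("book", "一本"): 0.25, ("book", "书"): 0.5,
}

e_sent = ["a", "book"]
f_sent = ["一本", "书"]
post = alignment_posterior(e_sent, f_sent, toy_t)
for e, row in zip(e_sent, post):
    print(e, [round(p, 3) for p in row])
```

Each row sums to 1, and these fractional values are exactly the "soft" counts that the EM procedure accumulates into c(e,f) and c(f) in place of the unobserved hard counts.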
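This post uses the chain rule of probability to derive the objective function of IBM Model 1 and walks through a code-level implementation. These are only my study notes, so please point out anything I have misunderstood.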
1. Key concepts
Finally, here is the reference code:
import operator
from functools import reduce

# Toy parallel corpus: each Chinese sentence is paired with its English translation.
CORPUS_CH = [['一本', '书'], ['一本', '杂志'], ['这本', '书'], ['这本', '杂志'], ]
CORPUS_EN = [['a', 'book'], ['a', 'magazine'], ['this', 'book'], ['this', 'magazine'], ]

# Vocabularies on both sides.
f_word_lst = list(set(reduce(operator.add, CORPUS_CH)))
e_word_lst = list(set(reduce(operator.add, CORPUS_EN)))

# T["f|e"] holds the translation probability of Chinese word f given English word e.
T = {}
for k in range(5):
    # C["e f"] is the expected count of the pair (e, f); C["e"] is the expected count of e.
    C = {}
    for f_sentence, e_sentence in zip(CORPUS_CH, CORPUS_EN):
        if k == 0:
            # First iteration: initialize every co-occurring pair uniformly.
            for fi in f_sentence:
                for ej in e_sentence:
                    fi_ej_key = "%s|%s" % (fi, ej)
                    if fi_ej_key not in T:
                        T[fi_ej_key] = 1.0 / len(e_word_lst)
        # E-step: distribute each f word over the e words of the same sentence.
        for fi in f_sentence:
            sum_t = sum(T["%s|%s" % (fi, ej)] for ej in e_sentence)
            for ej in e_sentence:
                delta = T["%s|%s" % (fi, ej)] / sum_t
                # print('delta={}'.format(delta))
                C["%s %s" % (ej, fi)] = C.get("%s %s" % (ej, fi), 0) + delta
                C["%s" % (ej)] = C.get("%s" % (ej), 0) + delta
    # Print the table as it stood at the start of this iteration.
    print("---iteration: %s---" % (k))
    for key in T:
        print(key, ":", T[key])
    # M-step: re-estimate T from the expected counts.
    for f in f_word_lst:
        for e in e_word_lst:
            if "%s %s" % (e, f) in C and "%s" % (e) in C:
                T["%s|%s" % (f, e)] = C["%s %s" % (e, f)] / C["%s" % (e)]

# Print the final table after the last update.
print("---iteration: %s---" % (k + 1))
for key in T:
    print(key, ":", T[key])
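To check what the loop above learns, the same procedure can be wrapped in a function and inspected. The following is a sketch rather than the original implementation: the name `train_ibm1`, the tuple keys `(f, e)` and the use of `defaultdict` are my own assumptions, while the corpus and the update rule follow the reference code.

```python
from collections import defaultdict


def train_ibm1(corpus_f, corpus_e, iterations=5):
    """Estimate t[(f, e)] = t(f|e) with EM, following the reference code."""
    e_vocab = {e for sent in corpus_e for e in sent}
    # Uniform initialization over co-occurring pairs only.
    t = {}
    for f_sent, e_sent in zip(corpus_f, corpus_e):
        for f in f_sent:
            for e in e_sent:
                t[(f, e)] = 1.0 / len(e_vocab)

    for _ in range(iterations):
        pair_count = defaultdict(float)  # expected count of (f, e)
        e_count = defaultdict(float)     # expected count of e
        # E-step: split each f word fractionally among the e words.
        for f_sent, e_sent in zip(corpus_f, corpus_e):
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm
                    pair_count[(f, e)] += delta
                    e_count[e] += delta
        # M-step: normalize the expected counts.
        for (f, e), count in pair_count.items():
            t[(f, e)] = count / e_count[e]
    return t


corpus_ch = [['一本', '书'], ['一本', '杂志'], ['这本', '书'], ['这本', '杂志']]
corpus_en = [['a', 'book'], ['a', 'magazine'], ['this', 'book'], ['this', 'magazine']]
t = train_ibm1(corpus_ch, corpus_en)

# Pick the most probable Chinese word for each English word.
best = {}
for (f, e), p in t.items():
    if e not in best or p > best[e][1]:
        best[e] = (f, p)
for e in sorted(best):
    print(e, "->", best[e][0], round(best[e][1], 3))
```

On this toy corpus each English word ends up preferring the Chinese word it always co-occurs with, and for every e the values t(f|e) sum to 1 after each M-step.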