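An algorithm for building semantic trees from sentences in the Stanford Natural Language Inference (SNLI) dataset
I set out to build a gun and got stuck making the bullets. I am currently implementing a computational-semantics model based on the Stanford Natural Language Inference (SNLI) dataset, and building semantic trees from the dataset's sentences, which looked trivial, turned out to be surprisingly hard to implement. I finally got it working, so I am writing the idea down here before I forget it.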
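Problem description
The task is simply to turn a sentence in sentence_binary_parse (binary bracketed) format into a semantic tree. Because deep learning is so dominant, none of the baseline models I looked at bother with structural information such as a semantic tree; they all feed the word vectors into a neural network in sentence order. So I had to do it myself. Consider the following sentence: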
u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )',
Our goal is to convert it into the following tree structure:
        Children
        /      \
    waving      at
      |           \
and - smiling    camera
This looks like a routine data-structure exercise.
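For context, here is a minimal sketch of how the token sequence might be obtained from one record. It assumes the JSON-lines release of SNLI, where each line is a JSON object with a `sentence1_binary_parse` field; the record below is abbreviated by hand, and `tokenize_binary_parse` is a hypothetical helper name, not part of any library.

```python
import json

# One hand-abbreviated SNLI-style record (real records carry more fields).
line = '{"sentence1_binary_parse": "( Children ( ( ( smiling and ) waving ) ( at camera ) ) )"}'

def tokenize_binary_parse(parse):
    """Split a binary parse into bracket and word tokens (they are space-separated)."""
    return parse.split()

record = json.loads(line)
tokens = tokenize_binary_parse(record["sentence1_binary_parse"])
print(tokens)
# A well-formed parse has as many opening as closing brackets.
print(tokens.count('(') == tokens.count(')'))
```

Everything that follows works on this flat token list rather than on the raw string.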
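Difficulty analysis
But on closer inspection, thinking about the implementation reveals several difficulties:
- A parent node does not necessarily appear before its child nodes (which makes this different from reconstructing a binary tree from its pre-order/post-order/in-order traversals);
- A single node at a given level may contain more than one word;
- This is not a binary tree, and the number of child nodes is not fixed either.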
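As a result, the approaches I first considered ran into bottlenecks during implementation:
- Sequential traversal: this needs an extra table recording which nodes sit at each level, and it still runs into the problem of deciding which parent a child node belongs to;
- Per-path recording: this is a bit cleaner, defining a path as one top-down trajectory from the root to a leaf; but merging the paths is complicated, and it still cannot get around parent and child nodes being out of order.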
Final approach
In the end I noticed a subtle detail. Define a bracket-counting variable $N_{\text{shift}}$, call it the "bracket drift count", and update it for each token $s$ as follows:
$$
\begin{cases} N_{\text{shift}} = N_{\text{shift}}+1, & s = \text{'('},\\ N_{\text{shift}} = N_{\text{shift}}-1, & s = \text{')'}, \end{cases}
$$
The rule really is that simple. When the first new word is found (that is, the first child node), we define another variable $N'_{\text{shift}}$ that records the current value of $N_{\text{shift}}$, and then keep updating $N_{\text{shift}}$; whenever $N_{\text{shift}} = N'_{\text{shift}}$ occurs again, we have found a new child node. This can be checked against the example sentence:
u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )',
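To do that check by hand, a small sketch can list the value of the count at the moment each word is read. The function name `bracket_drift_trace` is hypothetical; the update rule is the one given above.

```python
def bracket_drift_trace(parse):
    """Return (word, N_shift) pairs: the bracket count at the moment each word is read."""
    n_shift = 0
    trace = []
    for token in parse.split():
        if token == '(':
            n_shift += 1
        elif token == ')':
            n_shift -= 1
        else:
            trace.append((token, n_shift))
    return trace

example = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'
for word, n_shift in bracket_drift_trace(example):
    print(word, n_shift)
# Children 1
# smiling 4
# and 4
# waving 3
# at 3
# camera 3
```

`waving` and `at` are read at the same count, which is what marks them as siblings, while `smiling` and `and` sit one bracket deeper even though they come earlier in the sentence.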
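Another potential issue is handling a node made up of several words; let us call such a node a parallel node, which is in fact what it is grammatically as well. Finally, we can define a recursive algorithm that solves the problem: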
BRACKETS = ('(', ')')

class SemancticTree(object):
    """Semantic tree of one sentence."""
    def __init__(self, sentence):
        super(SemancticTree, self).__init__()
        self.sentence = sentence
        self.ROOT = 'EMPTY'
        self.PRONOUN = []

class SemancticTreeNode(object):
    """Node of a SemancticTree; `word` may hold several space-joined words."""
    def __init__(self, word):
        super(SemancticTreeNode, self).__init__()
        self.word = word
        self.next = []        # child nodes
        self.prev = 'EMPTY'   # parent node; stays 'EMPTY' for a root
        self.end_i = 0        # index of the last token merged into this node

def make_sentences_trees_pair_list(sentence):
    r"""
    Resolve a sentence into a semantic tree and return all of its nodes in
    sentence order; the root is the node whose `prev` is still 'EMPTY'.
    u'( They ( are ( smiling ( at ( their parents ) ) ) ) )'
        They
         |
        are
         |
        smiling
         |
        at
          \
           their-parents
    """
    sentence_arr = sentence.split()
    TREE = []
    make_sentences_trees_from_sentence(sentence_arr, TREE, 0, len(sentence_arr))
    return TREE

def complete_multi_strings_node(sentence_arr, START_INDEX, NOW_NODE):
    """Merge the words directly following START_INDEX into one parallel node."""
    END_INDEX = START_INDEX
    for i in range(START_INDEX + 1, len(sentence_arr)):
        if sentence_arr[i] in BRACKETS:
            break
        NOW_NODE += ' ' + sentence_arr[i]
        END_INDEX = i
    return NOW_NODE, END_INDEX

def make_sentences_trees_from_sentence(sentence_arr, TREE, START_INDEX, END_INDEX):
    """Build the nodes for sentence_arr[START_INDEX:END_INDEX] and return its head nodes."""
    heads = []                  # word nodes lying directly in this span
    subnodes_of_this_root = []  # heads handed up by the nested brackets
    INDEX = START_INDEX
    while INDEX < END_INDEX:
        if sentence_arr[INDEX] == ')':
            raise ValueError('unbalanced brackets in sentence')
        if sentence_arr[INDEX] == '(':
            # Bracket drift count: scan until the count falls back to the value
            # it had before this bracket opened; that token closes the sub-bracket.
            SUB_BRACKET_NUM = 0
            SHIFT_BRACKET_NUM = 1
            CLOSE_INDEX = INDEX
            while SHIFT_BRACKET_NUM != SUB_BRACKET_NUM:
                CLOSE_INDEX += 1
                if CLOSE_INDEX >= END_INDEX:
                    raise ValueError('unbalanced brackets in sentence')
                if sentence_arr[CLOSE_INDEX] == '(':
                    SHIFT_BRACKET_NUM += 1
                if sentence_arr[CLOSE_INDEX] == ')':
                    SHIFT_BRACKET_NUM -= 1
            # Rec entry point: the inside of the sub-bracket is handled the same way.
            subnodes_of_this_root += make_sentences_trees_from_sentence(
                sentence_arr, TREE, INDEX + 1, CLOSE_INDEX)
            INDEX = CLOSE_INDEX + 1
        else:
            node = SemancticTreeNode(sentence_arr[INDEX])
            TREE.append(node)
            node.word, node.end_i = complete_multi_strings_node(sentence_arr, INDEX, node.word)
            heads.append(node)
            INDEX = node.end_i + 1
    if not heads:
        # No word at this level: hand the nested heads up to the enclosing bracket.
        return subnodes_of_this_root
    for SUB_NODE in subnodes_of_this_root:
        SUB_NODE.prev = heads[0]
        heads[0].next.append(SUB_NODE)
    return heads
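As a cross-check, here is a compact stack-based sketch of the same idea. It is an alternative written for illustration, not the code above, and the names `Node`, `build_tree` and `render` are hypothetical. It rests on three assumptions of mine: consecutive words inside one bracket form a parallel node; the word group lying directly inside a bracket is that bracket's head; and a bracket with no word of its own passes the heads of its sub-brackets up to the enclosing bracket.

```python
class Node:
    def __init__(self, word):
        self.word = word        # one word, or several space-joined words
        self.children = []

def _resolve(frame):
    """Attach the heads passed up from sub-brackets to the bracket's own word, if any."""
    direct, passed_up = frame
    if not direct:
        return passed_up
    direct[0].children.extend(passed_up)
    return direct

def build_tree(parse):
    """Return the root nodes for a space-separated bracketed parse."""
    # One frame per open bracket: [direct word nodes, heads passed up from sub-brackets].
    # The stack height minus one plays the role of the bracket drift count.
    stack = [[[], []]]
    prev_was_word = False
    for token in parse.split():
        if token == '(':
            stack.append([[], []])
            prev_was_word = False
        elif token == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced brackets')
            heads = _resolve(stack.pop())
            stack[-1][1].extend(heads)
            prev_was_word = False
        elif prev_was_word:
            stack[-1][0][-1].word += ' ' + token   # extend the parallel node
        else:
            stack[-1][0].append(Node(token))
            prev_was_word = True
    if len(stack) != 1:
        raise ValueError('unbalanced brackets')
    return _resolve(stack[0])

def render(node, indent=0):
    """Return the tree as indented lines, one node per line."""
    lines = ['  ' * indent + node.word]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines

for root in build_tree('( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'):
    print('\n'.join(render(root)))
```

Under these assumptions `at camera` comes out as a single parallel node rather than as `at` with the child `camera`; splitting such pairs would need an extra rule for choosing a head word inside a two-word bracket.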