I was recently reading a relation extraction paper that uses dependency trees. While writing preprocessing code to convert the dependency trees produced by tools such as stanfordNLP into adjacency matrices, I found the conversion straightforward for a single sentence, meaning input with only one ROOT. For cross-sentence input, however, the result was wrong, so I am recording the fix here.
1. Get the dependency parsing output with stanfordNLP
# input sentence; nlp is a StanfordCoreNLP client (e.g. from the stanfordcorenlp package)
sentence = 'I like swimming . But Tom likes singing. Can you help me ?'
print('Tokenize:', nlp.word_tokenize(sentence))
# ['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.', 'Can', 'you', 'help', 'me', '?']
print('Dependency Parsing:', nlp.dependency_parse(sentence))
# [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5), ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]
(1) First, the dependency parsing result can be read as follows: ('ROOT', 0, 2) means the second word (like) is the root of "I like swimming ."; ('nsubj', 2, 1) means the head (parent node) of the first word (I) is the second word (like); and so on.
(2) Second, note that the parser splits the input at sentence-final punctuation into three independent sentences, which is why there are three ROOT entries. After each new ROOT, the second and third items of the triples restart their indexing from 1.
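This observation is easy to verify programmatically. The following sketch (not from the original post) splits the raw output at each ROOT entry and confirms that every chunk contains one triple per token and re-indexes from 1:

```python
# Split the raw dependency output at each ('ROOT', 0, x) entry and check
# that every sub-sentence's token indices restart from 1.
dep_outputs = [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4),
               ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5),
               ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]

chunks = []
for dep in dep_outputs:
    if dep[0] == 'ROOT':        # a new sub-sentence starts at every ROOT entry
        chunks.append([])
    chunks[-1].append(dep)

print(len(chunks))                                 # 3 sub-sentences
print([len(c) for c in chunks])                    # one triple per token: [4, 5, 5]
print([min(d[2] for d in c) for c in chunks])      # each chunk re-indexes from 1: [1, 1, 1]
```

A useful side effect: because each sub-sentence contributes exactly one triple per token, the position of a ROOT entry in dep_outputs equals the number of tokens in all preceding sub-sentences, which is exactly the offset the correction below relies on.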
2. Convert the dependency parsing output and build the head list
dep_outputs=[('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5), ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]
tokens=['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.', 'Can', 'you', 'help', 'me', '?']
'''find the indices of the ROOT entries'''
root_index = []
for i in range(len(dep_outputs)):
    if dep_outputs[i][0] == 'ROOT':
        root_index.append(i)
'''rewrite the dependency triples with global token indices'''
new_dep_outputs = []
for i in range(len(dep_outputs)):
    for index in root_index:
        if i + 1 > index:
            # offset = index of the current sub-sentence's ROOT entry, which
            # equals the number of tokens in all preceding sub-sentences
            tag = index
    if dep_outputs[i][0] == 'ROOT':
        dep_output = (dep_outputs[i][0], dep_outputs[i][1], dep_outputs[i][2] + tag)
    else:
        dep_output = (dep_outputs[i][0], dep_outputs[i][1] + tag, dep_outputs[i][2] + tag)
    new_dep_outputs.append(dep_output)
print(new_dep_outputs)
# [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 7), ('cc', 7, 5), ('nsubj', 7, 6), ('obj', 7, 8), ('punct', 7, 9), ('ROOT', 0, 12), ('aux', 12, 10), ('nsubj', 12, 11), ('obj', 12, 13), ('punct', 12, 14)]
In new_dep_outputs, the indices now increase monotonically across the whole input.
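The nested loop above works, but as a side note, the same correction can be done in a single pass by keeping a running token offset. This is a hypothetical alternative, not from the original post:

```python
# Single-pass index correction: accumulate the token count of finished
# sub-sentences and shift every subsequent triple by that offset.
dep_outputs = [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4),
               ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5),
               ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]

new_dep_outputs = []
offset = 0   # tokens consumed by previous sub-sentences
count = 0    # tokens seen in the current sub-sentence
for rel, head, dep in dep_outputs:
    if rel == 'ROOT':
        offset += count   # a new sub-sentence starts: roll the counter over
        count = 0
    count += 1
    # head index 0 marks ROOT and must stay 0; all other heads get shifted
    new_head = head + offset if head != 0 else 0
    new_dep_outputs.append((rel, new_head, dep + offset))

print(new_dep_outputs)
# [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 7), ...]
```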
Build the head list:
head_list = []
for i in range(len(tokens)):
    for dep_output in new_dep_outputs:
        if dep_output[-1] == i + 1:
            head_list.append(int(dep_output[1]))
print(head_list)
# [2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
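The double loop above scans all triples once per token. Since every token appears exactly once as a dependent, the same head list can be read off in linear time via a dictionary; this is an equivalent sketch, not from the original post:

```python
# Map each dependent index to its head, then read heads off in token order.
new_dep_outputs = [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4),
                   ('ROOT', 0, 7), ('cc', 7, 5), ('nsubj', 7, 6), ('obj', 7, 8), ('punct', 7, 9),
                   ('ROOT', 0, 12), ('aux', 12, 10), ('nsubj', 12, 11), ('obj', 12, 13), ('punct', 12, 14)]
tokens = ['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.',
          'Can', 'you', 'help', 'me', '?']

head_of = {dep: head for _, head, dep in new_dep_outputs}
head_list = [head_of[i + 1] for i in range(len(tokens))]
print(head_list)
# [2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
```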
3. From head_list to the adjacency matrix
import numpy as np

def head_to_adj(head, max_sent_len):
    ret = np.zeros((max_sent_len, max_sent_len), dtype=np.float32)
    for i in range(len(head)):
        j = head[i]
        if j != 0:  # head 0 marks a ROOT token, which has no parent edge
            ret[i, j - 1] = 1
            ret[j - 1, i] = 1  # the tree is treated as an undirected graph
    return ret
#max_sent_len=15
#head=[2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
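Putting the pieces together, the call that produces the matrix above can be sanity-checked as follows (the function is reproduced here only so the snippet runs standalone):

```python
import numpy as np

def head_to_adj(head, max_sent_len):
    ret = np.zeros((max_sent_len, max_sent_len), dtype=np.float32)
    for i in range(len(head)):
        j = head[i]
        if j != 0:           # skip ROOT tokens: they have no parent edge
            ret[i, j - 1] = 1
            ret[j - 1, i] = 1
    return ret

head = [2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
adj = head_to_adj(head, 15)

# The matrix is symmetric, and each non-ROOT token contributes one
# undirected edge: 14 tokens - 3 ROOTs = 11 edges, i.e. 22 nonzero cells.
print(np.array_equal(adj, adj.T))   # True
print(int(adj.sum()) // 2)          # 11
```

Note that because the three sub-trees are connected only through their own ROOTs, the resulting graph has three disconnected components, one per sub-sentence.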