Converting stanfordNLP dependency parsing output into an adjacency matrix, specifically for multi-sentence inputs

I was recently reading a relation extraction paper that uses dependency trees. While writing preprocessing code to convert the dependency tree produced by tools such as stanfordNLP into an adjacency matrix, I found that the conversion is straightforward for a single sentence, that is, an input with only one ROOT. For cross-sentence inputs, however, the result comes out wrong, so this post records how I fixed it.

1. Use stanfordNLP to get the dependency parsing output

# Assumes the stanfordcorenlp Python wrapper and a local CoreNLP installation;
# the path below is a placeholder.
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'/path/to/stanford-corenlp')

# input sentence
sentence = 'I like swimming . But Tom likes singing. Can you help me ?'

print('Tokenize:', nlp.word_tokenize(sentence))
# ['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.', 'Can', 'you', 'help', 'me', '?']

print('Dependency Parsing:', nlp.dependency_parse(sentence))
# [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5), ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]

(1) First, the dependency parsing output can be read as follows: each triple is (relation, head index, dependent index), with word indices starting at 1 and index 0 reserved for ROOT. So ('ROOT', 0, 2) means the second word (like) is the root of the sentence "I like swimming ."; ('nsubj', 2, 1) means the head of the first word (I) is the second word, like; and so on.
(2) Second, note that the parser splits the input at sentence-final punctuation into three separate sentences, which is why three ROOT triples appear, and the second and third items of the triples restart from index 1 at every new ROOT. The sketch below makes this restart explicit.
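To see the restart concretely, here is a small sketch (not part of the original preprocessing) that splits the flat output into one chunk per ROOT and prints each relation together with the words it connects. It relies on the fact that every token appears in exactly one triple as a dependent, so the number of triples before a chunk equals the number of tokens before that sentence:

dep_outputs = [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4),
               ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5),
               ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]
tokens = ['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.',
          'Can', 'you', 'help', 'me', '?']

# Start a new chunk at every ROOT triple.
chunks = []
for dep in dep_outputs:
    if dep[0] == 'ROOT':
        chunks.append([])
    chunks[-1].append(dep)

# The token offset of a chunk is the number of triples that precede it.
offset = 0
for chunk in chunks:
    for rel, head, dep in chunk:
        head_word = 'ROOT' if head == 0 else tokens[offset + head - 1]
        dep_word = tokens[offset + dep - 1]
        print(rel, head_word, '->', dep_word)
    offset += len(chunk)
# e.g. the second chunk prints: ROOT ROOT -> likes, cc likes -> But, nsubj likes -> Tom, ...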

2. Convert the dependency parsing output and build the head list

dep_outputs=[('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 3), ('cc', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5), ('ROOT', 0, 3), ('aux', 3, 1), ('nsubj', 3, 2), ('obj', 3, 4), ('punct', 3, 5)]
tokens=['I', 'like', 'swimming', '.', 'But', 'Tom', 'likes', 'singing', '.', 'Can', 'you', 'help', 'me', '?']

'''Find the positions of the ROOT triples in dep_outputs'''
root_index = []
for i in range(len(dep_outputs)):
    if dep_outputs[i][0] == 'ROOT':
        root_index.append(i)

'''Shift the indices in each triple by the token offset of its sentence.
Since every token appears exactly once as a dependent, the position of a ROOT
triple in dep_outputs equals the number of tokens in the preceding sentences.'''
new_dep_outputs = []
for i in range(len(dep_outputs)):
    # tag = position of the most recent ROOT at or before triple i,
    # i.e. the token offset of the current sentence
    for index in root_index:
        if i + 1 > index:
            tag = index

    if dep_outputs[i][0] == 'ROOT':
        dep_output = (dep_outputs[i][0], dep_outputs[i][1], dep_outputs[i][2] + tag)
    else:
        dep_output = (dep_outputs[i][0], dep_outputs[i][1] + tag, dep_outputs[i][2] + tag)
    new_dep_outputs.append(dep_output)

print(new_dep_outputs)
# [('ROOT', 0, 2), ('nsubj', 2, 1), ('obj', 2, 3), ('punct', 2, 4), ('ROOT', 0, 7), ('cc', 7, 5), ('nsubj', 7, 6), ('obj', 7, 8), ('punct', 7, 9), ('ROOT', 0, 12), ('aux', 12, 10), ('nsubj', 12, 11), ('obj', 12, 13), ('punct', 12, 14)]

In new_dep_outputs, the word indices now increase monotonically across the whole input instead of restarting at each sentence.
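The same shift can also be written by tracking a running offset directly while iterating; this is just an equivalent reformulation of the loop above, not a change in behavior:

new_dep_outputs = []
offset = 0
for i, (rel, head, dep) in enumerate(dep_outputs):
    if rel == 'ROOT':
        offset = i          # tokens consumed by the preceding sentences
        new_dep_outputs.append((rel, 0, dep + offset))
    else:
        new_dep_outputs.append((rel, head + offset, dep + offset))
# produces the same list as new_dep_outputs above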


Compute the head list:

'''head_list[i] is the (1-based) index of the head of token i+1; 0 marks a ROOT'''
head_list = []
for i in range(len(tokens)):
    for dep_output in new_dep_outputs:
        if dep_output[-1] == i + 1:
            head_list.append(int(dep_output[1]))

print(head_list)
# [2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
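As a quick sanity check (not in the original post), each token can be printed together with its head word, reusing the tokens and head_list variables from above:

for word, head in zip(tokens, head_list):
    head_word = 'ROOT' if head == 0 else tokens[head - 1]
    print(word, '<-', head_word)
# I <- like, like <- ROOT, ..., But <- likes, Tom <- likes, ...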

3. From head_list to the adjacency matrix

import numpy as np

def head_to_adj(head, max_sent_len):
    '''Build an undirected adjacency matrix: token i is linked to its head
    head[i] (1-based); head 0 means ROOT and adds no edge. Positions beyond
    len(head) are padding and stay all-zero.'''
    ret = np.zeros((max_sent_len, max_sent_len), dtype=np.float32)
    for i in range(len(head)):
        j = head[i]
        if j != 0:
            ret[i, j - 1] = 1
            ret[j - 1, i] = 1

    return ret

max_sent_len = 15
head = [2, 0, 2, 2, 7, 7, 0, 7, 7, 12, 12, 0, 12, 12]
print(head_to_adj(head, max_sent_len))
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
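If the matrix feeds a GCN-style model, self-loops are often added so that each token is also connected to itself. A small variant of head_to_adj along those lines (an assumption about the downstream model, not something the steps above require) could look like this:

import numpy as np

def head_to_adj_with_loops(head, max_sent_len, self_loop=True):
    '''Same undirected adjacency as head_to_adj, optionally with a self-loop
    on every real (non-padding) token position.'''
    ret = np.zeros((max_sent_len, max_sent_len), dtype=np.float32)
    for i in range(len(head)):
        j = head[i]
        if j != 0:
            ret[i, j - 1] = 1
            ret[j - 1, i] = 1
        if self_loop:
            ret[i, i] = 1
    return ret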