Automatically Extracting Structured Triples from Text

Reference [1]: ICDM2019 Knowledge Graph Contest: Team UWA
Reference [2]: Seq2KG: An End-to-End Neural Model for Domain Agnostic Knowledge Graph (not Text Graph) Construction from Text
GitHub: https://github.com/Michael-Stewart-Webdev/Seq2KG


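Drawbacks of existing work:
  OpenIE: it can build triples automatically, but it generates too many of them and their quality is not guaranteed.
Method:
  The paper looks for the best-performing setup and finds that a deliberately simple, pipeline-based approach can achieve the best performance. The main tools and techniques are:
(1) NLTK: https://www.nltk.org/
(2) SpaCy: https://spacy.io/
(3) Tokenisation
(4) POS tagging
(5) NER
(6) Coreference resolution
(7) Noun/verb phrase chunking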

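  The goal of the paper is to apply a pipeline-based method that automatically converts a given text into a set of triples. The pipeline has seven steps (see the figure above):
(1) Text cleaning: normalise hyphens, quotation marks and similar characters, and fix the spacing at sentence boundaries.
(2) Text processing: use SpaCy for tokenisation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing.
  For example, for the sentences: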

“Ford Motor Company is an American multinational automaker that has its main headquarters in Dearborn, Michigan, a suburb of Detroit. The company was founded by Henry Ford and incorporated on June 16, 1903.”

the processing result is:

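(3) Chunking: some noun or verb phrases can be treated as a single unit. For example, the noun phrase "an American multinational automaker" is made up of several words, and the verb phrase "was founded by" is made up of a verb plus other words. Words therefore need to be combined into noun phrases and verb phrases, which is called chunking. The chunked result for the example sentences above is shown in the figure: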


  The chunking algorithm is:


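  When chunking entities, tokens are merged if they are enclosed in parentheses "()", if two neighbouring nouns are joined by "of", or if two nouns are adjacent. For verb phrases, a VERB is merged with a PART or ADP that directly precedes or follows it, and two adjacent VERBs are merged.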
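
The merge rules above can be sketched on a toy token list. This is a minimal illustration, not the paper's algorithm: the `(text, tag)` tuples, the tag names and the helper names `merge_pairs`, `merge_of` and `chunk` are all assumptions made for this example, and only the rules for adjacent entities, "of", consecutive verbs and a verb followed by ADP/PART are covered.

```python
def merge_pairs(tokens, should_merge, merged_tag):
    """Greedily merge adjacent tokens, left to right, when should_merge(left, right) holds."""
    out = []
    for tok in tokens:
        if out and should_merge(out[-1], tok):
            left_text, _ = out.pop()
            out.append((left_text + " " + tok[0], merged_tag))
        else:
            out.append(tok)
    return out


def merge_of(tokens):
    """Merge ENTITY 'of' ENTITY into a single ENTITY token."""
    out = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i][1] == "ENTITY"
                and tokens[i + 1][0].lower() == "of"
                and tokens[i + 2][1] == "ENTITY"):
            out.append((" ".join(t[0] for t in tokens[i:i + 3]), "ENTITY"))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def chunk(tokens):
    """Apply the merge rules in a fixed (assumed) order."""
    tokens = merge_of(tokens)
    # Consecutive verbs, e.g. "was" + "founded"
    tokens = merge_pairs(tokens, lambda l, r: l[1] == "VERB" and r[1] == "VERB", "VERB")
    # Verb followed by a preposition or particle, e.g. "was founded" + "by"
    tokens = merge_pairs(tokens, lambda l, r: l[1] == "VERB" and r[1] in ("ADP", "PART"), "VERB")
    # Two adjacent entities
    tokens = merge_pairs(tokens, lambda l, r: l[1] == "ENTITY" and r[1] == "ENTITY", "ENTITY")
    return tokens


sentence = [("The company", "ENTITY"), ("was", "VERB"), ("founded", "VERB"),
            ("by", "ADP"), ("Henry Ford", "ENTITY")]
phrase = [("a suburb", "ENTITY"), ("of", "ADP"), ("Detroit", "ENTITY")]
print(chunk(sentence))
print(chunk(phrase))
```

In the real pipeline the tags would come from SpaCy's POS tagger and NER rather than being written by hand.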

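(4) Coreference Resolution: coreference is resolved with the NeuralCoref tool (https://github.com/huggingface/neuralcoref). Coreferent items in the triples are resolved by replacing each mention with the phrase it refers to. The paper does not consider pronoun resolution.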
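
Replacing mentions inside triples can be pictured with a plain dictionary lookup. The sketch below assumes the coreference tool has already produced a mapping from each mention to its representative phrase; `coref_map` and `resolve_triples` are hypothetical names used only here.

```python
def resolve_triples(triples, coref_map):
    """Replace head and tail mentions with their representative phrase, if one is known."""
    return [[coref_map.get(head, head), rel, coref_map.get(tail, tail)]
            for head, rel, tail in triples]


# Hypothetical mapping: mention -> phrase it refers to
coref_map = {"The company": "Ford Motor Company"}
triples = [["The company", "was founded by", "Henry Ford"]]
print(resolve_triples(triples, coref_map))
```

Mentions that do not appear in the mapping are left unchanged, so triples without any coreference pass through as they are.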

**(5) Triple Mapping:** the algorithm for building triples is shown below:

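  First, head and tail entities and their relations are extracted from each sentence. Next, a graph is built from these triples to reveal relations between named entities that appear in separate sentences. Edges between entities can also be built from prepositions such as in, on and at.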
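
One way to picture the graph step: if a chain of triples leads from one entity to another and the last hop is a preposition, a direct edge labelled with that preposition can be added. The sketch below is a simplified, standard-library-only illustration; the function names are assumptions, and it does not reproduce every detail of the source code further down (for instance, it does not restrict the search to capitalised entities).

```python
from collections import deque

PREPOSITIONS = {"in", "on", "at"}


def build_graph(triples):
    """Directed graph as {head: {tail: relation}}."""
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, {})[tail] = rel
        graph.setdefault(tail, {})
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def infer_preposition_triples(triples):
    """Add (source, prep, target) when a multi-hop path ends with a preposition edge."""
    graph = build_graph(triples)
    inferred = []
    for source in graph:
        for target in graph:
            if source == target:
                continue
            path = shortest_path(graph, source, target)
            if path is None or len(path) <= 2:
                continue  # unreachable, or already directly connected
            last_rel = graph[path[-2]][path[-1]]
            new_triple = [source, last_rel, target]
            if last_rel in PREPOSITIONS and new_triple not in triples and new_triple not in inferred:
                inferred.append(new_triple)
    return inferred


triples = [
    ["Ford Motor Company", "has", "main headquarters"],
    ["main headquarters", "in", "Dearborn"],
]
print(infer_preposition_triples(triples))
```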

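**(6) Triple Filtering:** uninformative triples are filtered out. For example, a triple is dropped when its head or tail is a stop word. The stop words include those defined in NLTK plus the names of weekdays and months.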
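
A minimal sketch of this filter, assuming a tiny hand-written stop list in place of the NLTK one so that the example runs without downloads; `STOP_WORDS`, `keep_triple` and `filter_triples` are hypothetical names.

```python
# Tiny illustrative subset; the pipeline itself uses NLTK's English stop words
STOP_WORDS = {"it", "he", "she", "they", "this", "that", "many", "us"}
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
BLOCKED = STOP_WORDS | WEEKDAYS | MONTHS


def keep_triple(head, rel, tail):
    """A triple survives only if neither its head nor its tail is a blocked word."""
    return head.lower() not in BLOCKED and tail.lower() not in BLOCKED


def filter_triples(triples):
    return [t for t in triples if keep_triple(*t)]


candidates = [
    ["The firm", "said", "Tuesday"],
    ["Oliver", "opened", "Fifteen"],
    ["it", "appointed", "KPMG"],
]
print(filter_triples(candidates))
```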

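**(7) Article Removal:** at this point the entities in a triple may still carry articles and similar words, so some tokens are removed from the head and tail entities: articles (e.g. a, an, the), possessive pronouns (e.g. its, their) and demonstrative pronouns (e.g. that, these).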
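
The token removal can be sketched as a word-level filter over the head and tail. The word list and the helper names below are assumptions for illustration; note that this sketch compares case-insensitively, which is a choice made here and differs from the case-sensitive comparison in the source code further down.

```python
# Illustrative list: articles, possessive pronouns, demonstrative pronouns
REMOVABLE = {"a", "an", "the", "its", "their", "this", "that", "these", "those"}


def strip_determiners(entity):
    """Drop removable words from an entity phrase (case-insensitive by assumption)."""
    return " ".join(word for word in entity.split() if word.lower() not in REMOVABLE)


def clean_triple(triple):
    head, rel, tail = triple
    return [strip_determiners(head), rel, strip_determiners(tail)]


print(clean_triple(["The company", "showcased", "its latest Dynasty series"]))
```

A triple whose head or tail becomes empty after this step would need to be dropped, which the source code handles with its final `if subj and pred and obj` check.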

**Visualisation:** each triple is displayed with a visualisation tool (https://github.com/Michael-Stewart-Webdev/text2kg-visualisation).

Degree and betweenness are computed with NetworkX (https://networkx.github.io).

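  • degree: the number of edges connected to a node.
  • betweenness: for every pair of nodes, count the shortest paths between them; the fraction of those shortest paths that pass through a node c is used as the betweenness of c.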
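
To make the two definitions concrete, here is a small standard-library sketch for an undirected graph. It is not NetworkX's implementation; it computes an unnormalised betweenness by summing, over every pair of other nodes, the share of their shortest paths that pass through c. The graph and the function names are assumptions for illustration.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance and number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]
    return dist, sigma


def degree(adj):
    """Number of edges attached to each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def betweenness(adj):
    """Unnormalised betweenness for an undirected graph given as {node: set(neighbours)}."""
    info = {node: bfs_counts(adj, node) for node in adj}
    score = {node: 0.0 for node in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for c in adj:
            if c in (s, t) or c not in dist_s or c not in dist_t:
                continue
            # c lies on a shortest s-t path exactly when the distances add up
            if dist_s[c] + dist_t[c] == dist_s[t]:
                score[c] += sigma_s[c] * sigma_t[c] / sigma_s[t]
    return score


# Hypothetical star-shaped graph built from three triples about one entity
adj = {
    "Ford Motor Company": {"Dearborn", "Henry Ford", "automaker"},
    "Dearborn": {"Ford Motor Company"},
    "Henry Ford": {"Ford Motor Company"},
    "automaker": {"Ford Motor Company"},
}
print(degree(adj))
print(betweenness(adj))
```

The hub node gets the largest degree and betweenness, so it is the one a visualisation would emphasise.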

Further reading (degree centrality, closeness centrality, betweenness centrality): https://blog.csdn.net/cyydjt/article/details/87783587

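  Degree and betweenness are computed for every node. A larger value means a more important node, which is made more prominent in the visualisation (for example, by drawing the node with a larger radius).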

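  Next, in the relation extraction stage, each raw relation phrase is mapped to a relation label from the SemEval corpus. The text is converted to the SemEval data format and a pretrained Att-Bi-LSTM model predicts the corresponding relation label.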

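  In the entity recognition part, each entity is labelled with its type (PER, ORG, LOC, MISC or O). Concretely, SpaCy is first run on the original text to obtain a set of entities. The head and tail of every triple are then matched against that set one by one, and when a head or tail is very close to one of the entities in edit distance, that entity's type is assigned to it.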
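
The matching step can be sketched with a plain Levenshtein distance. The 0.3 threshold on length-normalised distance, the entity list and the function names are assumptions made for this example; the text above does not specify how "very close" is decided.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def assign_type(mention, typed_entities, max_ratio=0.3):
    """Return the type of the closest entity, or 'O' if nothing is close enough."""
    best_type, best_ratio = "O", max_ratio
    for text, label in typed_entities:
        longest = max(len(mention), len(text), 1)
        ratio = edit_distance(mention.lower(), text.lower()) / longest
        if ratio <= best_ratio:
            best_type, best_ratio = label, ratio
    return best_type


# Hypothetical NER output for the example sentences: (entity text, type)
typed_entities = [("Henry Ford", "PER"), ("Ford Motor Company", "ORG"), ("Dearborn", "LOC")]
for mention in ["Henry Ford", "the Ford Motor Company", "main headquarters"]:
    print(mention, "->", assign_type(mention, typed_entities))
```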


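spacy + neuralcoref installation notes:

Install spacy:

pip3 install spacy==2.1.0 # spacy 2.1 is recommended; other versions can fail with obscure errors
python3 -m spacy download en_core_web_sm # English model
python3 -m spacy download zh_core_web_sm # Chinese model (published for spaCy 2.3 and later, so this step may fail under 2.1.0)
pip3 install neuralcoref==4.0.0 # coreference resolution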

neuralcoref usage test:

import neuralcoref
import spacy
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp) # add coreference resolution to spacy's processing pipeline
text = "My sister has a dog. She loves him"
doc = nlp(text) # run the pipeline on the text

Results:

doc._.has_coref # True: whether any coreference was found in the text
doc._.coref_resolved # My sister has a dog. My sister loves a dog.
doc._.coref_clusters # [My sister:[My sister, She], a dog:[a dog, him]]
for cluster in doc._.coref_clusters:
	cluster.main # My sister
	cluster.mentions # [My sister, She]

Source code:

(1) process.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from triples_from_text import extract_triples

def process_all():
    # Keep reading text from stdin until an empty line or EOF is entered
    while True:
        try:
            text = input("input a text:")
        except EOFError:
            break
        if not text.strip():
            break
        triples = extract_triples(text)
        print("\n\n===============the result=============\n\n")
        print(triples)

# Interactive entry point: extract triples from each text typed by the user
if __name__ == "__main__":
    process_all()
    print("Finished the process.")

(2) triples_from_text.py

# -*- coding: utf-8 -*-
import os
import pandas as pd
import re
import spacy
from spacy.attrs import intify_attrs
nlp = spacy.load("en_core_web_sm")

import neuralcoref

import networkx as nx
# import matplotlib.pyplot as plt

#nltk.download('stopwords')
from nltk.corpus import stopwords
all_stop_words = ['many', 'us', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday',
                  'today', 'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august',
                  'september', 'october', 'november', 'december', 'today', 'old', 'new']
all_stop_words = sorted(list(set(all_stop_words + list(stopwords.words('english')))))

abspath = os.path.abspath('') ## Absolute path of the current working directory
#print(abspath)
os.chdir(abspath)

### ==================================================================================================
# Tagger

def get_tags_spacy(nlp, text):
    doc = nlp(text) # Run the pipeline to get a Doc object
    entities_spacy = [] # Entities that Spacy NER found
    for ent in doc.ents: # doc.ents holds the entity spans found by NER
        entities_spacy.append([ent.text, ent.start_char, ent.end_char, ent.label_])
    return entities_spacy

def tag_all(nlp, text, entities_spacy):
    # Make sure neuralcoref is in the pipeline exactly once
    if ('neuralcoref' in nlp.pipe_names):
        nlp.remove_pipe('neuralcoref')
    neuralcoref.add_to_pipe(nlp) # Add neural coref to SpaCy's pipe
    doc = nlp(text)
    return doc

def filter_spans(spans):
    # Filter a sequence of spans so they don't contain overlaps
    get_sort_key = lambda span: (span.end - span.start, span.start)
    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
    result = []
    seen_tokens = set()
    for span in sorted_spans:
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            result.append(span)
            seen_tokens.update(range(span.start, span.end))
    return result

def tag_chunks(doc):
    spans = list(doc.ents) + list(doc.noun_chunks)
    spans = filter_spans(spans)
    with doc.retokenize() as retokenizer:
        string_store = doc.vocab.strings
        for span in spans:
            start = span.start
            end = span.end
            retokenizer.merge(doc[start: end], attrs=intify_attrs({'ent_type': 'ENTITY'}, string_store))

def tag_chunks_spans(doc, spans, ent_type):
    spans = filter_spans(spans)
    with doc.retokenize() as retokenizer:
        string_store = doc.vocab.strings
        for span in spans:
            start = span.start
            end = span.end
            retokenizer.merge(doc[start: end], attrs=intify_attrs({'ent_type': ent_type}, string_store))

def clean(text):
    # Text cleaning
    # str.strip() takes a set of characters, not a regex, so list only the punctuation to trim
    text = text.strip('[](),- :\'\"\n')
    text = text.replace('—', ' - ')
    text = re.sub('([A-Za-z0-9\)]{2,}\.)([A-Z]+[a-z]*)', r"\g<1> \g<2>", text, flags=re.UNICODE)
    text = re.sub('([A-Za-z0-9]{2,}\.)(\"\w+)', r"\g<1> \g<2>", text, flags=re.UNICODE)
    text = re.sub('([A-Za-z0-9]{2,}\.\/)(\w+)', r"\g<1> \g<2>", text, flags=re.UNICODE)
    text = re.sub('([[A-Z]{1}[[.]{1}[[A-Z]{1}[[.]{1}) ([[A-Z]{1}[a-z]{1,2} )', r"\g<1> . \g<2>", text, flags=re.UNICODE)
    text = re.sub('([A-Za-z]{3,}\.)([A-Z]+[a-z]+)', r"\g<1> \g<2>", text, flags=re.UNICODE)
    text = re.sub('([[A-Z]{1}[[.]{1}[[A-Z]{1}[[.]{1}) ([[A-Z]{1}[a-z]{1,2} )', r"\g<1> . \g<2>", text, flags=re.UNICODE)
    text = re.sub('([A-Za-z0-9]{2,}\.)([A-Za-z]+)', r"\g<1> \g<2>", text, flags=re.UNICODE)
    
    text = re.sub('’', "'", text, flags=re.UNICODE)           # curly apostrophe
    text = re.sub('‘', "'", text, flags=re.UNICODE)           # curly apostrophe
    text = re.sub('“', ' "', text, flags=re.UNICODE)
    text = re.sub('”', ' "', text, flags=re.UNICODE)
    text = re.sub("\|", ", ", text, flags=re.UNICODE)
    text = text.replace('\t', ' ')
    text = re.sub('…', '.', text, flags=re.UNICODE)           # elipsis
    text = re.sub('…', '.', text, flags=re.UNICODE)          
    text = re.sub('–', '-', text)           # long hyphen
    text = re.sub('\s+', ' ', text, flags=re.UNICODE).strip()
    text = re.sub(' – ', ' . ', text, flags=re.UNICODE).strip()

    return text

def tagger(text):  
    df_out = pd.DataFrame(columns=['Document#', 'Sentence#', 'Word#', 'Word', 'EntityType', 'EntityIOB', 'Lemma', 'POS', 'POSTag', 'Start', 'End', 'Dependency'])
    corefs = [] # All (mention, main mention) coreference pairs
    text = clean(text) # Text cleaning

    nlp = spacy.load("en_core_web_sm")
    entities_spacy = get_tags_spacy(nlp, text) # Entities found by SpaCy NER
    #print("SPACY entities:\n", ([ent for ent in entities_spacy]), '\n\n')
    document = tag_all(nlp, text, entities_spacy) # Run the pipeline with coreference resolution added
    #for token in document:
    #    print([token.i, token.text, token.ent_type_, token.ent_iob_, token.lemma_, token.pos_, token.tag_, token.idx, token.idx+len(token)-1, token.dep_])
    
    ### Coreferences
    # 
    if document._.has_coref:
        for cluster in document._.coref_clusters:
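            main = cluster.main # The representative mention of the cluster
            for m in cluster.mentions: # Every mention in the cluster (including the main one)
                if (str(m).strip() == str(main).strip()): # Skip the main mention itself
                    continue
                corefs.append([str(m), str(main)]) # Record the (mention, main mention) pair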
    tag_chunks(document)    
    
    # chunk - something OF something (noun chunking)
    spans_change = []
    for i in range(2, len(document)):
        w_left = document[i-2]
        w_middle = document[i-1]
        w_right = document[i]
        if w_left.dep_ == 'attr':
            continue
        if w_left.ent_type_ == 'ENTITY' and w_right.ent_type_ == 'ENTITY' and (w_middle.text == 'of'): # or w_middle.text == 'for'): #  or w_middle.text == 'with'
            spans_change.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change, 'ENTITY')
    
    # chunk verbs with multiple words: 'were exhibited' (verb chunking)
    spans_change_verbs = []
    for i in range(1, len(document)):
        w_left = document[i-1]
        w_right = document[i]
        if w_left.pos_ == 'VERB' and (w_right.pos_ == 'VERB'):
            spans_change_verbs.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change_verbs, 'VERB')

    # chunk: verb + adp; verb + part 
    spans_change_verbs = []
    for i in range(1, len(document)):
        w_left = document[i-1]
        w_right = document[i]
        if w_left.pos_ == 'VERB' and (w_right.pos_ == 'ADP' or w_right.pos_ == 'PART'):
            spans_change_verbs.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change_verbs, 'VERB')

    # chunk: adp + verb; part  + verb
    spans_change_verbs = []
    for i in range(1, len(document)):
        w_left = document[i-1]
        w_right = document[i]
        if w_right.pos_ == 'VERB' and (w_left.pos_ == 'ADP' or w_left.pos_ == 'PART'):
            spans_change_verbs.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change_verbs, 'VERB')
    
    # chunk verbs with multiple words: 'were exhibited'
    spans_change_verbs = []
    for i in range(1, len(document)):
        w_left = document[i-1]
        w_right = document[i]
        if w_left.pos_ == 'VERB' and (w_right.pos_ == 'VERB'):
            spans_change_verbs.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change_verbs, 'VERB')

    # chunk all between LRB- -RRB- (something between brackets)
    start = 0
    end = 0
    spans_between_brackets = []
    for i in range(0, len(document)):
        if ('-LRB-' == document[i].tag_ or r"(" in document[i].text):
            start = document[i].i
            continue
        if ('-RRB-' == document[i].tag_ or r')' in document[i].text):
            end = document[i].i + 1
        if (end > start and not start == 0):
            span = document[start:end]
            try:
                assert (u"(" in span.text and u")" in span.text)
            except:
                pass
                #print(span)
            spans_between_brackets.append(span)
            start = 0
            end = 0
    tag_chunks_spans(document, spans_between_brackets, 'ENTITY')
            
    # chunk entities: merge two adjacent entities
    spans_change_verbs = []
    for i in range(1, len(document)):
        w_left = document[i-1]
        w_right = document[i]
        if w_left.ent_type_ == 'ENTITY' and w_right.ent_type_ == 'ENTITY':
            spans_change_verbs.append(document[w_left.i : w_right.i + 1])
    tag_chunks_spans(document, spans_change_verbs, 'ENTITY')
    
    doc_id = 1
    count_sentences = 0
    prev_dep = 'nsubj'
    for token in document:
        if (token.dep_ == 'ROOT'):
            if token.pos_ == 'VERB':
                # Store each token's pipeline output as a DataFrame row; columns: ['Document#', 'Sentence#', 'Word#', 'Word', 'EntityType', 'EntityIOB', 'Lemma', 'POS', 'POSTag', 'Start', 'End', 'Dependency']
                df_out.loc[len(df_out)] = [doc_id, count_sentences, token.i, token.text, token.ent_type_, token.ent_iob_, token.lemma_, token.pos_, token.tag_, token.idx, token.idx+len(token)-1, token.dep_]
            else:
                df_out.loc[len(df_out)] = [doc_id, count_sentences, token.i, token.text, token.ent_type_, token.ent_iob_, token.lemma_, token.pos_, token.tag_, token.idx, token.idx+len(token)-1, prev_dep]
        else:
            df_out.loc[len(df_out)] = [doc_id, count_sentences, token.i, token.text, token.ent_type_, token.ent_iob_, token.lemma_, token.pos_, token.tag_, token.idx, token.idx+len(token)-1, token.dep_]
                  
        if (token.text == '.'):
            count_sentences += 1
        prev_dep = token.dep_
        
    return df_out, corefs

### ==================================================================================================
### triple extractor

def get_predicate(s):
    pred_ids = {}
    for w, index, spo in s:
        if spo == 'predicate' and w != "'s" and w != "\"": #= 11.95
            pred_ids[index] = w
    predicates = {}
    for key, value in pred_ids.items():
        predicates[key] = value
    return predicates

def get_subjects(s, start, end, adps):
    subjects = {}
    for w, index, spo in s:
        if index >= start and index <= end:
            if 'subject' in spo or 'entity' in spo or 'object' in spo:
                subjects[index] = w
    return subjects
    
def get_objects(s, start, end, adps):
    objects = {}
    for w, index, spo in s:
        if index >= start and index <= end:
            if 'object' in spo or 'entity' in spo or 'subject' in spo:
                objects[index] = w
    return objects

def get_positions(s, start, end):
    adps = {}
    for w, index, spo in s:        
        if index >= start and index <= end:
            if 'of' == spo or 'at' == spo:
                adps[index] = w
    return adps

def create_triples(df_text, corefs):
    ## Build triples
    sentences = [] # all sentences
    aSentence = [] # the current sentence
    
    for index, row in df_text.iterrows():
        d_id, s_id, word_id, word, ent, ent_iob, lemma, cg_pos, pos, start, end, dep = row.items()
        if 'subj' in dep[1]:
            aSentence.append([word[1], word_id[1], 'subject'])
        elif 'ROOT' in dep[1] or 'VERB' in cg_pos[1] or pos[1] == 'IN':
            aSentence.append([word[1], word_id[1], 'predicate'])
        elif 'obj' in dep[1]:
            aSentence.append([word[1], word_id[1], 'object'])
        elif ent[1] == 'ENTITY':
            aSentence.append([word[1], word_id[1], 'entity'])        
        elif word[1] == '.':
            sentences.append(aSentence)
            aSentence = []
        else:
            aSentence.append([word[1], word_id[1], pos[1]])
    
    relations = []
    #loose_entities = []
    for s in sentences:
        if len(s) == 0: continue
        preds = get_predicate(s) # Get all verbs
        """
        if preds == {}: 
            preds = {p[1]:p[0] for p in s if (p[2] == 'JJ' or p[2] == 'IN' or p[2] == 'CC' or
                     p[2] == 'RP' or p[2] == ':' or p[2] == 'predicate' or
                     p[2] =='-LRB-' or p[2] =='-RRB-') }
            if preds == {}:
                #print('\npred = 0', s)
                preds = {p[1]:p[0] for p in s if (p[2] == ',')}
                if preds == {}:
                    ents = [e[0] for e in s if e[2] == 'entity']
                    if (ents):
                        loose_entities = ents # not significant for now
                        #print("Loose entities = ", ents)
        """
        if preds:
            if (len(preds) == 1):
                #print("preds = ", preds)
                predicate = list(preds.values())[0]
                if (len(predicate) < 2):
                    predicate = 'is'
                #print(s)
                ents = [e[0] for e in s if e[2] == 'entity']
                #print('ents = ', ents)
                for i in range(1, len(ents)):
                    relations.append([ents[0], predicate, ents[i]])

            pred_ids = list(preds.keys())
            pred_ids.append(s[0][1])
            pred_ids.append(s[len(s)-1][1])
            pred_ids.sort()
                    
            for i in range(1, len(pred_ids)-1):
                predicate = preds[pred_ids[i]]
                adps_subjs = get_positions(s, pred_ids[i-1], pred_ids[i])
                subjs = get_subjects(s, pred_ids[i-1], pred_ids[i], adps_subjs)
                adps_objs = get_positions(s, pred_ids[i], pred_ids[i+1])
                objs = get_objects(s, pred_ids[i], pred_ids[i+1], adps_objs)
                for k_s, subj in subjs.items():                
                    for k_o, obj in objs.items():
                        obj_prev_id = int(k_o) - 1
                        if obj_prev_id in adps_objs: # at, in, of
                            relations.append([subj, predicate + ' ' + adps_objs[obj_prev_id], obj])
                        else:
                            relations.append([subj, predicate, obj])
    
    ### Clean up the coreference pairs: each entry is [mention, main mention]
    coreferences = []
    for val in corefs:
        if val[0].strip() != val[1].strip():
            if len(val[0]) <= 50 and len(val[1]) <= 50:
                co_word = val[0]
                real_word = val[1].strip('[,- \'\n]*')
                real_word = re.sub("'s$", '', real_word, flags=re.UNICODE)
                if (co_word != real_word):
                    coreferences.append([co_word, real_word])
            else:
                co_word = val[0]
                real_word = ' '.join((val[1].strip('[,- \'\n]*')).split()[:7])
                real_word = re.sub("'s$", '', real_word, flags=re.UNICODE)
                if (co_word != real_word):
                    coreferences.append([co_word, real_word])
                
    # Resolve corefs
    triples_object_coref_resolved = []
    triples_all_coref_resolved = []
    for s, p, o in relations:
        coref_resolved = False
        for co in coreferences:
            if (s == co[0]):
                subj = co[1]
                triples_object_coref_resolved.append([subj, p, o])
                coref_resolved = True
                break
        if not coref_resolved:
            triples_object_coref_resolved.append([s, p, o])

    for s, p, o in triples_object_coref_resolved:
        coref_resolved = False
        for co in coreferences:
            if (o == co[0]):
                obj = co[1]
                triples_all_coref_resolved.append([s, p, obj])
                coref_resolved = True
                break
        if not coref_resolved:
            triples_all_coref_resolved.append([s, p, o])
    return(triples_all_coref_resolved)

### ==================================================================================================
## Get more using Network shortest_paths

def get_graph(triples):
    G = nx.DiGraph()
    for s, p, o in triples:
        G.add_edge(s, o, key=p)
    return G

def get_entities_with_capitals(G):
    entities = []
    for node in G.nodes():
        if (any(ch.isupper() for ch in list(node))):
            entities.append(node)
    return entities

def get_paths_between_capitalised_entities(triples):
    
    g = get_graph(triples)
    ents_capitals = get_entities_with_capitals(g)
    paths = []
    #print('\nShortest paths among capitalised words -------------------')
    for i in range(0, len(ents_capitals)):
        n1 = ents_capitals[i]
        for j in range(1, len(ents_capitals)):
            try:
                n2 = ents_capitals[j]
                path = nx.shortest_path(g, source=n1, target=n2)
                if path and len(path) > 2:
                    paths.append(path)
                path = nx.shortest_path(g, source=n2, target=n1)
                if path and len(path) > 2:
                    paths.append(path)
            except Exception:
                continue
    return g, paths

def get_paths(doc_triples):
    triples = []
    g, paths = get_paths_between_capitalised_entities(doc_triples)
    for p in paths:
        path = [(u, g[u][v]['key'], v) for (u, v) in zip(p[0:], p[1:])]
        length = len(p)
        if (path[length-2][1] == 'in' or path[length-2][1] == 'at' or path[length-2][1] == 'on'):
            if [path[0][0], path[length-2][1], path[length-2][2]] not in triples:
                triples.append([path[0][0], path[length-2][1], path[length-2][2]])
        elif (' in' in path[length-2][1] or ' at' in path[length-2][1] or ' on' in path[length-2][1]):
            # Check for the triple that is actually appended, so duplicates are not added
            if [path[0][0], 'in', path[length-2][2]] not in triples:
                triples.append([path[0][0], 'in', path[length-2][2]])
    for t in doc_triples:
        if t not in triples:
            triples.append(t)
    return triples

def get_center(nodes):
    center = ''
    if (len(nodes) == 1):
        center = nodes[0]
    else:   
        # Capital letters and longer is preferred
        cap_ents = [e for e in nodes if any(x.isupper() for x in e)]
        if (cap_ents):
            center = max(cap_ents, key=len)
        else:
            center = max(nodes, key=len)
    return center

def connect_graphs(mytriples):
    G = nx.DiGraph()
    for s, p, o in mytriples:
        G.add_edge(s, o, p=p)        
    
    """
    # Get components
    graphs = list(nx.connected_component_subgraphs(G.to_undirected()))
    
    # Get the largest component
    largest_g = max(graphs, key=len)
    largest_graph_center = ''
    largest_graph_center = get_center(nx.center(largest_g))
    
    # for each graph, find the centre node
    smaller_graph_centers = []
    for g in graphs:        
        center = get_center(nx.center(g))
        smaller_graph_centers.append(center)

    for n in smaller_graph_centers:
        if (largest_graph_center is not n):
            G.add_edge(largest_graph_center, n, p='with')
    """
    return G
        
def rank_by_degree(mytriples): #, limit):
    G = connect_graphs(mytriples)
    degree_dict = dict(G.degree(G.nodes()))
    nx.set_node_attributes(G, degree_dict, 'degree')
    
    # Use this to draw the graph
    #draw_graph_centrality(G, degree_dict)

    Egos = nx.DiGraph()
    for a, data in sorted(G.nodes(data=True), key=lambda x: x[1]['degree'], reverse=True):
        ego = nx.ego_graph(G, a)
        Egos.add_edges_from(ego.edges(data=True))
        Egos.add_nodes_from(ego.nodes(data=True))
        
        #if (nx.number_of_edges(Egos) > 20):
        #    break
       
    ranked_triples = []
    for u, v, d in Egos.edges(data=True):
        ranked_triples.append([u, d['p'], v])
    return ranked_triples

# Extract triples
def extract_triples(text):
    df_tagged, corefs = tagger(text) # Run the pipeline; returns per-token features and the coreference pairs
    doc_triples = create_triples(df_tagged, corefs)
    all_triples = get_paths(doc_triples)
    filtered_triples = []    
    for s, p, o in all_triples:
        if ([s, p, o] not in filtered_triples):
            if s.lower() in all_stop_words or o.lower() in all_stop_words:
                continue
            elif s == p:
                continue
            if s.isdigit() or o.isdigit():
                continue
            if '%' in o or '%' in s: #= 11.96
                continue
            if (len(s) < 2) or (len(o) < 2):
                continue
            if (s.islower() and len(s) < 4) or (o.islower() and len(o) < 4):
                continue
            if s == o:
                continue            
            subj = s.strip('[,- :\'\"\n]*')
            pred = p.strip('[- :\'\"\n]*.')
            obj = o.strip('[,- :\'\"\n]*')
            
            for sw in ['a', 'an', 'the', 'its', 'their', 'his', 'her', 'our', 'all', 'old', 'new', 'latest', 'who', 'that', 'this', 'these', 'those']:
                subj = ' '.join(word for word in subj.split() if not word == sw)
                obj = ' '.join(word for word in obj.split()  if not word == sw)
            subj = re.sub("\s\s+", " ", subj)
            obj = re.sub("\s\s+", " ", obj)
            
            if subj and pred and obj:
                filtered_triples.append([subj, pred, obj])

    #TRIPLES = rank_by_degree(filtered_triples)
    return filtered_triples

def draw_graph_centrality(G, dictionary):
    # plt.figure(figsize=(12,10))
    # pos = nx.spring_layout(G)
    # #print("Nodes\n", G.nodes(True))
    # #print("Edges\n", G.edges())
    
    # nx.draw_networkx_nodes(G, pos, 
    #         nodelist=dictionary.keys(),
    #         with_labels=False,
    #         edge_color='black',
    #         width=1,
    #         linewidths=1,
    #         node_size = [v * 150 for v in dictionary.values()],
    #         node_color='blue',
    #         alpha=0.5)
    # edge_labels = {(u, v): d["p"] for u, v, d in G.edges(data=True)}
    # #print(edge_labels)
    # nx.draw_networkx_edge_labels(G, pos,
    #                        font_size=10,
    #                        edge_labels=edge_labels,
    #                        font_color='blue')
    # nx.draw(G, pos, with_labels=True, node_size=1, node_color='blue')
    pass
    
if __name__ == "__main__":
    """
    Celebrity chef Jamie Oliver's British restaurant chain has become insolvent, putting 1,300 jobs at risk. The firm said Tuesday that it had gone into administration, a form of bankruptcy protection, and appointed KPMG to oversee the process.The company operates 23 Jamie's Italian restaurants in the U.K. The company had been seeking buyers amid increased competition from casual dining rivals, according to The Guardian. Oliver began his restaurant empire in 2002 when he opened Fifteen in London. Oliver, known around the world for his cookbooks and television shows, said he was "deeply saddened by this outcome and would like to thank all of the staff and our suppliers who have put their hearts and souls into this business for over a decade. "He said "I appreciate how difficult this is for everyone affected." I’m devastated that our much-loved UK restaurants have gone into administration.
    """
    """BYD debuted its E-SEED GT concept car and Song Pro SUV alongside its all-new e-series models at the Shanghai International Automobile Industry Exhibition. The company also showcased its latest Dynasty series of vehicles, which were recently unveiled at the company’s spring product launch in Beijing."""
    text = """
    BYD debuted its E-SEED GT concept car and Song Pro SUV alongside its all-new e-series models at the Shanghai International Automobile Industry Exhibition. The company also showcased its latest Dynasty series of vehicles, which were recently unveiled at the company’s spring product launch in Beijing. A total of 23 new car models were exhibited at the event, held at Shanghai’s National Convention and Exhibition Center, fully demonstrating the BYD New Architecture (BNA) design, the 3rd generation of Dual Mode technology, plus the e-platform framework. Today, China’s new energy vehicles have entered the ‘fast lane’, ushering in an even larger market outbreak. Presently, we stand at the intersection of old and new kinetic energy conversion for mobility, but also a new starting point for high-quality development. To meet the arrival of complete electrification, BYD has formulated a series of strategies, and is well prepared.
    """
    """
    An arson fire caused an estimated $50,000 damage at a house on Mt. Soledad that was being renovated, authorities said Friday.San Diego police were looking for the arsonist, described as a Latino man who was wearing a red hat, blue shirt and brown pants, and may have driven away in a small, black four-door car.A resident on Palomino Court, off Soledad Mountain Road, called 9-1-1 about 9:45 a.m. to report the house next door on fire, with black smoke coming out of the roof, police said. Firefighters had the flames knocked down 20 minutes later, holding the damage to the attic and roof, said City spokesperson Alec Phillip. No one was injured.Metro Arson Strike Team investigators were called and they determined the blaze had been set intentionally, Phillip said.Police said one or more witnesses saw the suspect run south from the house and possibly leave in the black car.
    """
    mytriples = extract_triples(text)
    
    print('\n\nFINAL TRIPLES = ', len(mytriples))
    for t in mytriples:
        print(t)

Results:
Input sentences:

Walter Otto Davis was a Welsh professional footballer who played at centre forward for Millwall foouth East London, which was Founded as Millwall Rovers in 1885.

My sister has a dog, and she loves him.

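  Drawback: for some candidate relation words, such as "in", the method mechanically assumes that the entities on either side of the word are the head and the tail, which is not always true. In the first example, the triple ["East London", "was founded as", "Millwall Rovers"] is clearly wrong.
  Advantage: triples are extracted directly from text, with no need to align against an existing knowledge base.
  Outlook: how can this be applied to Chinese text?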
