OAG-BERT (Open Academic Graph BERT)

Introduction:

The CogDL library provides two versions of OAG-BERT. OAG-BERT is a heterogeneous entity-augmented academic language model: it understands not only academic text but also the heterogeneous entity knowledge in OAG (the Open Academic Graph).

Original paper: "OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Model"

Version 1: The vanilla version

This is the basic version of OAG-BERT.

Similar to SciBERT, we pre-train the BERT model on the academic text corpus of the Open Academic Graph, including paper titles, abstracts and bodies.

OAG-BERT is used in the same way as an ordinary SciBERT or BERT model.

For example, you can use the following code to encode two text sequences and retrieve their outputs.

# import the module
from cogdl import oagbert
# load the tokenizer and the model
tokenizer, bert_model = oagbert()
# example sentences
sequence = ["CogDL is developed by KEG, Tsinghua.",
            "OAGBert is developed by KEG, Tsinghua."]
# tokenize the sentences; a list of strings can be passed to the tokenizer
tokens = tokenizer(sequence, return_tensors="pt", padding=True)
# feed the resulting tokens into bert_model; ** unpacks the dict into keyword arguments
outputs = bert_model(**tokens)

Note: the ** in front of the argument unpacks the dictionary returned by the tokenizer into keyword arguments for the model.
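For illustration, the call above is roughly equivalent to passing the tokenizer outputs explicitly. This is a minimal sketch, assuming the tokenizer returns a HuggingFace-style dictionary with input_ids, token_type_ids and attention_mask:

# explicit equivalent of bert_model(**tokens) (sketch; assumes HuggingFace-style tokenizer output)
outputs = bert_model(
    input_ids=tokens["input_ids"],
    token_type_ids=tokens["token_type_ids"],
    attention_mask=tokens["attention_mask"],
)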

Version 2: The entity-augmented version

This is an extended version of the vanilla OAG-BERT.

We incorporate rich entity information from the Open Academic Graph, such as authors and fields of study.

Therefore, you can encode various types of entities with OAG-BERT v2. For example, to encode the BERT paper, you can use the following code.

# import the module, together with PyTorch
from cogdl import oagbert
import torch
# load the model; note that the "oagbert-v2" version is selected here
tokenizer, model = oagbert("oagbert-v2")
# set the title, abstract, authors, venue and other information
title = 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
abstract = 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation...'
authors = ['Jacob Devlin', 'Ming-Wei Chang', 'Kenton Lee', 'Kristina Toutanova']
venue = 'north american chapter of the association for computational linguistics'
affiliations = ['Google']
concepts = ['language model', 'natural language inference', 'question answering']
# build model inputs
input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = model.build_inputs(
    title=title, abstract=abstract, venue=venue, authors=authors, concepts=concepts, affiliations=affiliations
)
# run a forward pass of the model
sequence_output, pooled_output = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
)
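As a quick check, you can inspect the two returned tensors: sequence_output holds the token-level representations and pooled_output a single vector for the whole paper. A minimal sketch on top of the example above (the exact sequence length depends on the tokenized input):

# sequence_output: [batch_size, seq_len, hidden_size]; pooled_output: [batch_size, hidden_size]
print(sequence_output.shape, pooled_output.shape)
# the pooled vector can be used as a paper embedding, e.g. for retrieval or clustering
paper_embedding = pooled_output.squeeze(0)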

Besides, you can also directly use some of the built-in functions of OAG-BERT v2, such as decode_beamsearch, to generate entities based on existing context.

For example, to generate two-token concepts for the BERT paper, you can use the following code:

# switch to evaluation mode
model.eval()
# beam-search decode a two-token FOS (field-of-study) span from the other paper fields
candidates = model.decode_beamsearch(
    title=title,
    abstract=abstract,
    venue=venue,
    authors=authors,
    affiliations=affiliations,
    decode_span_type='FOS',
    decode_span_length=2,
    beam_width=8,
    force_forward=False
)
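A minimal usage sketch for the result, assuming candidates is an iterable of decoded concept candidates (the exact return format may also include scores):

# print the generated concept candidates (sketch; format assumption as noted above)
for cand in candidates:
    print(cand)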

OAG-BERT outperforms other academic language models on a wide range of entity-aware tasks, while maintaining its performance on vanilla NLP tasks.

Others

We also release two other V2 versions for users.

One is a generation-based version, which can be used to generate text based on other information.

For example, use the following code to automatically generate paper titles from abstracts.

from cogdl import oagbert

tokenizer, model = oagbert('oagbert-v2-lm')
model.eval()

for seq, prob in model.generate_title(abstract="To enrich language models with domain knowledge is crucial but difficult. Based on the world's largest public academic graph Open Academic Graph (OAG), we pre-train an academic language model, namely OAG-BERT, which integrates massive heterogeneous entities including paper, author, concept, venue, and affiliation. To better endow OAG-BERT with the ability to capture entity information, we develop novel pre-training strategies including heterogeneous entity type embedding, entity-aware 2D positional encoding, and span-aware entity masking. For zero-shot inference, we design a special decoding strategy to allow OAG-BERT to generate entity names from scratch. We evaluate the OAG-BERT on various downstream academic tasks, including NLP benchmarks, zero-shot entity inference, heterogeneous graph link prediction, and author name disambiguation. Results demonstrate the effectiveness of the proposed pre-training approach to both comprehending academic texts and modeling knowledge from heterogeneous entities. OAG-BERT has been deployed to multiple real-world applications, such as reviewer recommendations for NSFC (National Nature Science Foundation of China) and paper tagging in the AMiner system. It is also available to the public through the CogDL package."):
    print('Title: %s' % seq)
    print('Perplexity: %.4f' % prob)

# One of our generations: "pre-training oag-bert: an academic language model for enriching academic texts with domain knowledge"

In addition, we fine-tuned OAG-BERT to compute paper similarity based on the name disambiguation task.

The following code is an example of using OAG-BERT to compute paper similarity:

import os
from cogdl import oagbert
import torch
import torch.nn.functional as F
import numpy as np


# load the fine-tuned similarity model
tokenizer, model = oagbert("oagbert-v2-sim")
model.eval()

# Paper 1
title = 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
abstract = 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation...'
authors = ['Jacob Devlin', 'Ming-Wei Chang', 'Kenton Lee', 'Kristina Toutanova']
venue = 'north american chapter of the association for computational linguistics'
affiliations = ['Google']
concepts = ['language model', 'natural language inference', 'question answering']

# encode first paper
input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = model.build_inputs(
    title=title, abstract=abstract, venue=venue, authors=authors, concepts=concepts, affiliations=affiliations
)
_, paper_embed_1 = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
)

# Positive Paper 2
title = 'Attention Is All You Need'
abstract = 'We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely...'
authors = ['Ashish Vaswani', 'Noam Shazeer', 'Niki Parmar', 'Jakob Uszkoreit']
venue = 'neural information processing systems'
affiliations = ['Google']
concepts = ['machine translation', 'computation and language', 'language model']

input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = model.build_inputs(
    title=title, abstract=abstract, venue=venue, authors=authors, concepts=concepts, affiliations=affiliations
)
# encode second paper
_, paper_embed_2 = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
)

# Negative Paper 3
title = "Traceability and international comparison of ultraviolet irradiance"
abstract = "NIM took part in the CIPM Key Comparison of ″Spectral Irradiance 250 to 2500 nm″. In UV and NIR wavelength, the international comparison results showed that the consistency between Chinese value and the international reference one"
authors =  ['Jing Yu', 'Bo Huang', 'Jia-Lin Yu', 'Yan-Dong Lin', 'Cai-Hong Dai']
venue = 'Jiliang Xuebao/Acta Metrologica Sinica'
affiliations = ['Department of Electronic Engineering']
concepts = ['Optical Division']

input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = model.build_inputs(
    title=title, abstract=abstract, venue=venue, authors=authors, concepts=concepts, affiliations=affiliations
)
# encode third paper
_, paper_embed_3 = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
)

# calculate text similarity
# normalize
paper_embed_1 = F.normalize(paper_embed_1, p=2, dim=1)
paper_embed_2 = F.normalize(paper_embed_2, p=2, dim=1)
paper_embed_3 = F.normalize(paper_embed_3, p=2, dim=1)

# cosine sim.
sim12 = torch.mm(paper_embed_1, paper_embed_2.transpose(0, 1))
sim13 = torch.mm(paper_embed_1, paper_embed_3.transpose(0, 1))
print(sim12, sim13)
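Since the three papers above are encoded with the same boilerplate, you may prefer to wrap the encoding step in a small helper. This is a sketch built on the same model.build_inputs / model.bert.forward calls as above; the function name encode_paper is our own:

def encode_paper(model, title, abstract, venue, authors, concepts, affiliations):
    # build the entity-aware inputs and run a single forward pass (batch size 1)
    input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, \
        position_ids_second, masked_positions, num_spans = model.build_inputs(
            title=title, abstract=abstract, venue=venue,
            authors=authors, concepts=concepts, affiliations=affiliations
        )
    _, pooled = model.bert.forward(
        input_ids=torch.LongTensor(input_ids).unsqueeze(0),
        token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
        attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
        output_all_encoded_layers=False,
        checkpoint_activations=False,
        position_ids=torch.LongTensor(position_ids).unsqueeze(0),
        position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
    )
    # L2-normalize so that a dot product gives cosine similarity
    return F.normalize(pooled, p=2, dim=1)

With such a helper, each paper can be encoded in one line, e.g. paper_embed_1 = encode_paper(model, title, abstract, venue, authors, concepts, affiliations).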

This fine-tuning was performed on the name disambiguation task from WhoIsWho.

Papers written by the same author are treated as positive pairs, and the rest as negative pairs.

We sample 0.4 million positive pairs and 1.6 million negative pairs and use contrastive learning to fine-tune OAG-BERT (version 2).

For 50% of the instances we use only the paper title, while the other 50% use all the heterogeneous information.

We use Mean Reciprocal Rank (MRR) to evaluate performance, where higher values indicate better results. The performance on the test set is shown below.
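For reference, MRR averages the reciprocal of the rank at which the first correct paper is retrieved for each query. A minimal sketch of the metric itself, not the official evaluation script:

def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the first correct result for each query
    return sum(1.0 / r for r in ranks) / len(ranks)

# example: the correct papers are ranked 1st, 3rd and 2nd for three queries
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611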

 
