Reproducing "Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme"
Paper overview
Title: Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme (ACL 2017 Outstanding Paper)
Link: https://www.aclweb.org/anthology/P17-1113.pdf
Data processing (word2vec.py)
Dataset
The paper uses the public NYT dataset generated by distant supervision (Ren et al., 2017). The training set contains 353k triplets in total and the test set contains 3,880 triplets; the relation set has 24 relation types. The training and test sets are stored in "data/demo/train.json" and "data/demo/test.json" respectively.
Building the corpus
Extract the "sentText" field from every line of train.json and test.json and write the sentences to corpus.txt.
Extraction code (the make_corpus function):
import json

def func(fin, fout):
for line in fin:
line = line.strip()
if not line:
continue
sentence = json.loads(line)
sentence = sentence["sentText"].strip().strip('"').lower()
fout.write(sentence + '\n')
def make_corpus():
with open('data/demo/corpus.txt', 'wt', encoding='utf-8') as fout:
with open('data/demo/train.json', 'rt', encoding='utf-8') as fin:
func(fin, fout)
with open('data/demo/test.json', 'rt', encoding='utf-8') as fin:
func(fin, fout)
Word embeddings with word2vec
The embeddings are trained with LineSentence and Word2Vec from gensim.models.word2vec:
from gensim.models.word2vec import LineSentence, Word2Vec

# Turn the raw corpus into an iterator of sentences; each iteration yields one sentence as a list of (utf-8) words.
sentences = LineSentence('data/demo/corpus.txt')
# sentences: may be a plain list; for large corpora, BrownCorpus, Text8Corpus or LineSentence is recommended.
# - sg: training algorithm. 0 (default) is CBOW; sg=1 uses skip-gram.
# - vector_size: dimensionality of the word vectors, default 100. Larger sizes need more training data but give better vectors; tens to a few hundred is typical.
# - window: maximum distance between the current word and the predicted word within a sentence.
# - alpha: learning rate.
# - seed: seed for the random number generator, used when initializing the word vectors.
# - min_count: vocabulary truncation; words appearing fewer than min_count times are dropped (default 5).
# - max_vocab_size: RAM limit while building the vocabulary. If there are more unique words than this, the least frequent ones are pruned. Roughly 1GB of RAM per 10 million words; None means no limit.
# - sample: threshold for randomly down-sampling high-frequency words, default 1e-3, useful range (0, 1e-5).
# - workers: number of worker threads used for training.
# - hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
# - negative: if > 0, negative sampling is used with this many noise words.
# - cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use their mean. Only applies to CBOW.
# - hashfxn: hash function used to initialize the weights; defaults to Python's built-in hash.
# - epochs: number of training epochs, default 5 (called iter in older gensim versions).
# - trim_rule: vocabulary trimming rule specifying which words to keep and which to discard. Either None (min_count is used) or a callable returning utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
# - sorted_vocab: if 1 (default), sort the vocabulary by descending frequency before assigning word indices.
# - batch_words: number of words passed to worker threads per batch, default 10000.
model = Word2Vec(sentences, sg=1, vector_size=300, workers=4, epochs=8, negative=8)
# model.wv holds the trained KeyedVectors; model.wv.vectors is the 2D NumPy matrix of word vectors.
word_vectors = model.wv
# print(type(word_vectors))
word_vectors.save('data/demo/word2vec')
word_vectors.save_word2vec_format('data/demo/word2vec.txt', fvocab='data/demo/vocab.txt')
vocab.txt stores the corpus frequency of every word in the vocabulary.
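As a quick sanity check, the saved vectors can be loaded back with gensim's KeyedVectors (a small sketch using the paths written above; the probe word is only an example and must exist in the vocabulary):

from gensim.models import KeyedVectors

# load the binary KeyedVectors written by word_vectors.save(...)
wv = KeyedVectors.load('data/demo/word2vec')
print(wv.vectors.shape)                       # (vocab_size, 300)
print(wv.most_similar('president', topn=3))   # nearest neighbours of a probe word
# the text format can be reloaded together with the frequency file
wv_txt = KeyedVectors.load_word2vec_format('data/demo/word2vec.txt', fvocab='data/demo/vocab.txt')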
Tagging (data.py)
The Index class (utils.py)
The Index class keeps two data structures, a dict key2idx and a list idx2key, and provides the following operations:
- add: insert a key and return its index
- __getitem__: map a key (str) to its index, or an index (int) back to its key
- __len__: return the number of stored keys
- save: write all keys and their indices to a file
- load: read keys back from a file
Implementation:
class Index(object):
def __init__(self):
self.key2idx = {}
self.idx2key = []
    # insert a key and return its index
def add(self, key):
if key not in self.key2idx:
self.key2idx[key] = len(self.idx2key)
self.idx2key.append(key)
return self.key2idx[key]
def __getitem__(self, key):
if isinstance(key, str):
return self.key2idx[key]
if isinstance(key, int):
return self.idx2key[key]
def __len__(self):
return len(self.idx2key)
def save(self, f):
with open(f, 'wt', encoding='utf-8') as fout:
for index, key in enumerate(self.idx2key):
fout.write(key + '\t' + str(index) + '\n')
def load(self, f):
with open(f, 'rt', encoding='utf-8') as fin:
for line in fin:
line = line.strip()
if not line:
continue
key = line.split()[0]
self.add(key)
The relation labels, entity labels and tag set are kept in Index objects to simplify later processing.
relation_labels = Index()
entity_labels = Index()
tag_set = Index()
tag_set.add("O")
Taking the sentence in the figure below as an example:
- relation_labels stores "Country-President" and "Company-Founder"
- entity_labels stores "Country", "Person", etc. ("Trump" has type "Person" and "United States" has type "Country")
- tag_set stores "O", "B-CP-1", "E-CP-1", "B-CF-1", "E-CF-1", "B-CF-2", "I-CF-2", "E-CF-2", and so on
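A minimal usage sketch of the Index class (a toy example, separate from the tag_set built above; the label strings are the abbreviated ones from the paper's figure):

tag_set = Index()
tag_set.add("O")          # index 0
tag_set.add("B-CP-1")     # index 1
tag_set.add("B-CP-1")     # adding an existing key returns the same index, 1
print(tag_set["B-CP-1"])  # 1   (str -> index)
print(tag_set[0])         # 'O' (index -> key)
print(len(tag_set))       # 2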
Tagging scheme
The tagging scheme proposed in the paper is illustrated in the figure:
Each word is assigned a tag that contributes to the extraction result. The tag "O" stands for "Other", meaning the word is not part of any extracted triplet. Every other tag consists of three parts: the word's position in the entity, the relation type, and the relation role. The "BIES" signs (Begin, Inside, End, Single) encode the position of the word within its entity; the relation type comes from a predefined set of relations; and the relation role is encoded by "1" or "2". An extraction result is represented as a triplet (Entity1, RelationType, Entity2): "1" means the word belongs to the first entity of the triplet, "2" that it belongs to the second entity of that relation type.
The figure above shows an example of this tagging method. The input sentence contains two triplets, {United States, Country-President, Trump} and {Apple Inc, Company-Founder, Steven Paul Jobs}, where "Country-President" and "Company-Founder" are predefined relation types. The words "United", "States", "Trump", "Apple", "Inc", "Steven", "Paul" and "Jobs" are all relevant to the final extraction result, so they receive the special tags, while every other word is tagged "O". For example, "United" is the first word of the entity "United States" and participates in the "Country-President" relation, so its tag is "B-CP-1"; "Trump", the other entity paired with "United States", is tagged "S-CP-2".
Implementation
Main function
relation_labels = Index()
entity_labels = Index()
tag_set = Index()
tag_set.add("O")
with open("overlap.txt", "wt", encoding="utf-8") as fout:
    train = []
    with open('data/demo/train.json', 'rt', encoding='utf-8') as fin:
        # Inputs
        # fin: input file stream, here the opened train.json
        # relation_labels: Index holding the relation labels seen so far, e.g. "Country-President", "Company-Founder"
        # entity_labels: Index holding the entity type labels seen so far, e.g. PERSON, LOCATION
        # tag_set: Index holding the tags that appear in the sentences
        # train: the list defined above, which collects the tagging result of each sentence
        # fout: output file stream; sentences with overlapping annotations are written to overlap.txt
        # Output
        # res: number of overlaps found in fin
        res = prepare_data_set(fin, relation_labels, entity_labels, tag_set, train, fout)
        print("# of overlaps in train data: {}".format(res))
    # save the train list as train.pk for later use during model training
    save(train, 'data/demo/train.pk')
    # the test set is processed in the same way
    test = []
    with open('data/demo/test.json', 'rt', encoding='utf-8') as fin:
        res = prepare_data_set(fin, relation_labels, entity_labels, tag_set, test, fout)
        print("# of overlaps in test data: {}".format(res))
    save(test, 'data/demo/test.pk')
# save relation_labels, entity_labels and tag_set to text files
# using the save method of the Index class
relation_labels.save('data/demo/relation_labels.txt')
entity_labels.save('data/demo/entity_labels.txt')
tag_set.save("data/demo/tag2id.txt")
# of overlaps in train data: 42924
# of overlaps in test data: 18
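The save and load helpers used here and later in train.py come from utils.py, which is not shown in this write-up; a minimal pickle-based sketch of what they are assumed to do:

import pickle

def save(obj, path):
    # serialize a Python object (e.g. the list of (sentence, tags) pairs) to disk
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load(path):
    # read a previously saved object back
    with open(path, 'rb') as f:
        return pickle.load(f)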
prepare_data_set implements the main tagging logic:
def prepare_data_set(fin, relation_labels, entity_labels, tag_set, dataset, fout):
    num_overlap = 0
    # read fin line by line
    for line in fin:
        overlap = False
        line = line.strip()
        if not line:
            continue
        # parse the JSON string into a dict
        sentence = json.loads(line)
        # In train.json and test.json the triplet annotations are stored under the "relationMentions" field,
        # a list containing every entity1-relation-entity2 pair of the sentence.
        for relation_mention in sentence["relationMentions"]:
            # record the relation label of each pair in relation_labels
            relation_labels.add(relation_mention["label"])
            # generate all possible tags for this relation type and store them in tag_set
            make_tag_set(tag_set, relation_mention["label"])
        # "entityMentions" is likewise a list, holding the start position, label and text of every entity in the sentence
        for entity_mention in sentence["entityMentions"]:
            # store the entity labels, e.g. PERSON, LOCATION, in entity_labels
            entity_labels.add(entity_mention["label"])
        # strip() without arguments removes leading and trailing whitespace
        sentence_text = sentence["sentText"].strip().strip('"').split()
        # skip sentences with more than MAX_SENT_LENGTH words (a constant defined elsewhere in data.py)
        length_sent = len(sentence_text)
        if length_sent > MAX_SENT_LENGTH:
            continue
        # initialize every word's tag to "O"
        tags_idx = [tag_set["O"]] * length_sent
        # walk over sentence["relationMentions"] again to fill in the tags
        for relation_mention in sentence["relationMentions"]:
            # skip pairs whose relation label is "None"
            if relation_mention["label"] == "None":
                continue
            em1_text = relation_mention["em1Text"].split()
            # update tags_idx for the two entities;
            # res1/res2 indicate whether tagging entity 1 / entity 2 caused a conflict (True = conflict)
            res1 = update_tag_seq(em1_text, sentence_text, relation_mention["label"], 1, tag_set, tags_idx)
            em2_text = relation_mention["em2Text"].split()
            res2 = update_tag_seq(em2_text, sentence_text, relation_mention["label"], 2, tag_set, tags_idx)
            if res1 or res2:
                num_overlap += 1
                overlap = True
        # sentence_idx: the word-id sequence of the sentence; the conversion from sentence_text to ids
        # (via the word-level vocabulary built for word2vec, including padding/unknown-word handling) is omitted from this excerpt
        # tags_idx: the tag-id sequence of the sentence
        dataset.append((sentence_idx, tags_idx))
        # write conflicting sentences to overlap.txt
        if overlap:
            fout.write(line + "\n")
    return num_overlap
The make_tag_set function generates the possible tags for a relation type:
def make_tag_set(tag_set, relation_label):
    if relation_label == "None":
        return
    # For a given relation, e.g. "Country-President", generate every combination of
    # entity position (B/I/E/S) and entity role (1/2)
    for pos in "BIES":
        for role in "12":
            # store pos-relation_label-role in tag_set
            tag_set.add("-".join([pos, relation_label, role]))
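For example, a single (hypothetical, hyphen-free) relation label "CP" expands into eight tags:

tag_set = Index()
tag_set.add("O")
make_tag_set(tag_set, "CP")
print(tag_set.idx2key)
# ['O', 'B-CP-1', 'B-CP-2', 'I-CP-1', 'I-CP-2', 'E-CP-1', 'E-CP-2', 'S-CP-1', 'S-CP-2']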
The update_tag_seq function writes one entity's tags into the tag sequence of a sentence:
def update_tag_seq(em_text, sentence_text, relation_label, role, tag_set, tags_idx):
    overlap = False
    # find the position where entity em_text first occurs in sentence_text
    start = search(em_text, sentence_text)
    # build the "single word" tag first
    tag = "-".join(["S", relation_label, str(role)])
    # the entity is a single word, e.g. "Asia"
    if len(em_text) == 1:
        # if the tag at that position is not "O", another entity already claimed it: a tagging conflict
        if tags_idx[start] != tag_set["O"]:
            overlap = True
        tags_idx[start] = tag_set[tag]
    # the entity consists of several words, e.g. "Omar Vizquel"
    else:
        # tag the first word of the entity
        tag = "B" + tag[1:]
        if tags_idx[start] != tag_set["O"]:
            overlap = True
        # look up the tag id in tag_set and write it at position start
        tags_idx[start] = tag_set[tag]
        # tag the last word of the entity
        tag = "E" + tag[1:]
        end = start + len(em_text) - 1
        if tags_idx[end] != tag_set["O"]:
            overlap = True
        tags_idx[end] = tag_set[tag]
        # tag the words in the middle of the entity
        tag = "I" + tag[1:]
        for index in range(start + 1, end):
            if tags_idx[index] != tag_set["O"]:
                overlap = True
            tags_idx[index] = tag_set[tag]
    # return whether a conflict occurred while tagging this entity
    return overlap
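A small end-to-end check of the tagging helpers (it relies on the search helper shown in the next snippet; "CP" is again a hypothetical hyphen-free relation label):

tag_set = Index()
tag_set.add("O")
make_tag_set(tag_set, "CP")
sentence = "Trump is president of the United States".split()
tags_idx = [tag_set["O"]] * len(sentence)
update_tag_seq(["United", "States"], sentence, "CP", 1, tag_set, tags_idx)
update_tag_seq(["Trump"], sentence, "CP", 2, tag_set, tags_idx)
print([tag_set[t] for t in tags_idx])
# ['S-CP-2', 'O', 'O', 'O', 'O', 'B-CP-1', 'E-CP-1']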
The search function finds the position of the first occurrence of a word sequence in a sentence:
def search(pat, txt):  # return the index of the first occurrence of pat in txt, or -1 if it is absent
i, N = 0, len(txt)
j, M = 0, len(pat)
while i < N and j < M:
if txt[i] == pat[j]:
j = j + 1
else:
i -= j
j = 0
i = i + 1
if j == M:
return i - M
else:
return -1
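For example:

print(search(["United", "States"], "the president of the United States".split()))  # 4
print(search(["Apple", "Inc"], "he founded Microsoft".split()))                    # -1 (not found)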
Output files
relation_labels.txt:
entity_labels.txt:
tag2id.txt:
Seq2Seq model (model.py)
The network is built with PyTorch; detailed documentation and short examples for every function used here can be found on the official PyTorch website.
Network architecture
The model proposed in the paper is shown in the figure:
As the figure shows, each word of the input sentence is processed as follows:
- it is first mapped to an embedding vector W in the Embedding Layer
- W is fed into the Encoding Layer, a BiLSTM, which outputs the hidden vector h
- h is fed into the Decoding Layer, an LSTM variant; its hidden vector h' is passed through tanh to give the tag vector T
- T goes through a softmax to produce the predicted tag
Encoder
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Parameter, init

class Encoder(nn.Module):
    # The encoder consists of an embedding layer, a dropout layer and a BiLSTM, all built from torch.nn modules.
    # embed_size: dimensionality of the embedding vector W, i.e. the BiLSTM input size
    # weight: pre-trained word vector matrix (from word2vec.py)
    # dropout: dropout rate
    # hidden_size: dimensionality of the BiLSTM hidden vector h per direction
    def __init__(self, embed_size, weight, dropout, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        # initialize the embedding with the pre-trained vectors; freeze=False means they are fine-tuned during training
        self.embed = nn.Embedding.from_pretrained(weight, freeze=False)
        self.drop = nn.Dropout(dropout)
        self.biLSTM = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=True)

    # forward pass; X has shape (batch_size, seq_len), e.g. torch.Size([32, 30])
    def forward(self, X):
        batch_size = X.size(0)
        seq_len = X.size(1)
        embeddings = self.embed(X)
        # (batch_size, seq_len, embed_size), e.g. torch.Size([32, 30, 300])
        embeddings = self.drop(embeddings)
        # randomly initialize the hidden state and the cell state
        # shape: [num_layers(=1) * num_directions(=2), batch_size, hidden_size]
        hidden_state = torch.randn(1 * 2, batch_size, self.hidden_size, device=X.device)
        cell_state = torch.randn(1 * 2, batch_size, self.hidden_size, device=X.device)
        # flatten_parameters() lays the LSTM parameters out as one contiguous chunk of memory
        # (similar in spirit to tensor.contiguous()), which improves efficiency
        self.biLSTM.flatten_parameters()
        outputs, (h_n, c_n) = self.biLSTM(embeddings, (hidden_state, cell_state))
        # outputs holds the hidden states of every time step with both directions concatenated,
        # while h_n only holds the final hidden state of each direction
        # outputs shape: [batch_size, seq_len, hidden_size * 2], e.g. torch.Size([32, 30, 600])
        return outputs
Decoder
The paper states that the decoding layer is an LSTM, but the formulas it gives differ from a standard LSTM cell, so the cell is implemented by hand from the gate equations (see the linked blog post for a detailed derivation).
The formulas given in the paper are:
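The paper's equation figure is not reproduced here; written out from the implementation below, the decoder cell computes, for each time step $t$ (with $x_t$ the encoder output at step $t$, $T_t$ the tag vector, and $h^d_t$, $c^d_t$ the decoder hidden and cell states):

$$
\begin{aligned}
i_t &= \sigma(W_{wi} x_t + W_{ti} T_{t-1} + W_{hi} h^d_{t-1} + b_i) \\
f_t &= \sigma(W_{wf} x_t + W_{tf} T_{t-1} + W_{hf} h^d_{t-1} + b_f) \\
g_t &= \tanh(W_{wg} x_t + W_{tg} T_{t-1} + W_{hg} h^d_{t-1} + b_g) \\
o_t &= \sigma(W_{wo} x_t + W_{to} T_{t-1} + W_{ho} h^d_{t-1} + b_o) \\
c^d_t &= f_t \odot c^d_{t-1} + i_t \odot g_t \\
h^d_t &= o_t \odot \tanh(c^d_t) \\
T_t &= \tanh(h^d_t)
\end{aligned}
$$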
Implementation:
class Decoder(nn.Module):
    # input_size: dimensionality of the encoder output h
    # hidden_size: dimensionality of the decoder hidden vector h'
    def __init__(self, input_size, hidden_size):
        super(Decoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # weight matrices w and biases b of the gates; the names follow the paper's notation
        # (note: the w_t* matrices multiply T, so this implicitly assumes input_size == hidden_size, which holds for the settings used here)
        # input gate
        self.w_wi = Parameter(Tensor(hidden_size, input_size))
        self.w_hi = Parameter(Tensor(hidden_size, hidden_size))
        self.w_ti = Parameter(Tensor(hidden_size, input_size))
        self.b_i = Parameter(Tensor(hidden_size, 1))
        # forget gate
        self.w_wf = Parameter(Tensor(hidden_size, input_size))
        self.w_hf = Parameter(Tensor(hidden_size, hidden_size))
        self.w_tf = Parameter(Tensor(hidden_size, input_size))
        self.b_f = Parameter(Tensor(hidden_size, 1))
        # output gate
        self.w_wo = Parameter(Tensor(hidden_size, input_size))
        self.w_ho = Parameter(Tensor(hidden_size, hidden_size))
        self.w_to = Parameter(Tensor(hidden_size, input_size))
        self.b_o = Parameter(Tensor(hidden_size, 1))
        # cell candidate
        self.w_wg = Parameter(Tensor(hidden_size, input_size))
        self.w_hg = Parameter(Tensor(hidden_size, hidden_size))
        self.w_tg = Parameter(Tensor(hidden_size, input_size))
        self.b_g = Parameter(Tensor(hidden_size, 1))
        self.reset_weights()

    # initialize the weights uniformly in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)]
    def reset_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            init.uniform_(weight, -stdv, stdv)

    # forward pass; X is the encoder output of shape (batch_size, seq_len, input_size)
    def forward(self, X):
        batch_size = X.size(0)
        seq_len = X.size(1)
        # decoder hidden state, cell state and tag vector, stored as (hidden_size, batch) columns
        h_t = torch.zeros(self.hidden_size, batch_size, device=X.device)
        c_t = torch.zeros(self.hidden_size, batch_size, device=X.device)
        T = torch.zeros(self.hidden_size, batch_size, device=X.device)
        T_seq = []
        # seq_len is the sentence length
        for t in range(seq_len):
            # encoder output at step t, transposed to (input_size, batch), e.g. torch.Size([600, 32])
            x = X[:, t, :].t()
            # the gate equations from the paper
            # input gate
            i = torch.sigmoid(self.w_wi @ x + self.w_ti @ T + self.w_hi @ h_t + self.b_i)
            # forget gate
            f = torch.sigmoid(self.w_wf @ x + self.w_tf @ T + self.w_hf @ h_t + self.b_f)
            # cell candidate
            g = torch.tanh(self.w_wg @ x + self.w_tg @ T + self.w_hg @ h_t + self.b_g)
            # output gate
            o = torch.sigmoid(self.w_wo @ x + self.w_to @ T + self.w_ho @ h_t + self.b_o)
            c_next = f * c_t + i * g
            h_next = o * torch.tanh(c_next)
            # tag vector of this step, shape (hidden_size, batch), e.g. torch.Size([600, 32])
            T_next = torch.tanh(h_next)
            # carry the states over to the next step
            # (the original excerpt only carried T; updating h_t and c_t as well matches the gate equations above)
            T = T_next
            h_t = h_next
            c_t = c_next
            # collect the tag vector of this step as (1, batch_size, hidden_size)
            T_seq.append(T_next.t().unsqueeze(0))
        T_seq = torch.cat(T_seq, dim=0)
        # return shape (batch_size, seq_len, hidden_size)
        return T_seq.transpose(0, 1)
Model assembly
class Model(nn.Module):
    # tagnum: number of tags, i.e. the size of the softmax output
    # The model chains the Encoder and Decoder above and maps the decoder output to tagnum classes with a linear layer
    def __init__(self, embed_size, weight, dropout, en_hidden_size, de_input_size, de_hidden_size, tagnum):
        super(Model, self).__init__()
        self.encoder = Encoder(embed_size=embed_size, weight=weight, dropout=dropout, hidden_size=en_hidden_size)
        self.decoder = Decoder(input_size=de_input_size, hidden_size=de_hidden_size)
        self.tagnum = tagnum
        self.hidden2tag = nn.Linear(de_hidden_size, self.tagnum)

    # forward pass
    def forward(self, X):
        encoder_output = self.encoder(X)
        output = self.decoder(encoder_output)
        y = self.hidden2tag(output)
        # shape (batch_size, seq_len, tagnum), e.g. torch.Size([32, 40, 193])
        # normalize over dim 2, the tag dimension
        return F.log_softmax(y, dim=2)
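A quick shape check with random inputs (a sketch; the vocabulary size and tag count below are made up):

vocab_size, embed_size, tagnum = 1000, 300, 193
weight = torch.randn(vocab_size, embed_size)   # stand-in for the pre-trained embedding matrix
model = Model(embed_size=embed_size, weight=weight, dropout=0.5,
              en_hidden_size=300, de_input_size=600, de_hidden_size=600, tagnum=tagnum)
X = torch.randint(0, vocab_size, (32, 30))     # a batch of 32 sentences of length 30
print(model(X).shape)                          # torch.Size([32, 30, 193])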
Model training (train.py)
Loading the data
# imports used throughout train.py
# (Index, load and time_display are assumed to live in the project's utils.py, and Model in model.py)
import bisect
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pack_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import Subset, random_split, BatchSampler, SubsetRandomSampler

# Fix the seed of the (CPU) random number generator; torch.manual_seed also returns a torch.Generator.
# With a fixed seed the sequence of random numbers drawn afterwards is reproducible.
random_seed = 1111
torch.manual_seed(random_seed)
# The device object tells PyTorch where to place tensors (CPU or a particular GPU).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load the data prepared earlier
tag_set = Index()
tag_set.load("data/demo/tag2id.txt")
relation_labels = Index()
relation_labels.load('data/demo/relation_labels.txt')
train_data = load('data/demo/train.pk')
test_data = load('data/demo/test.pk')
val_size = int(0.01 * len(train_data))
# randomly split off a validation set
train_data, val_data = random_split(train_data, [len(train_data) - val_size, val_size])
# group the sentences into buckets by length
train_data_groups = group(train_data, [10, 20, 30, 40, 50, 60])
val_data_groups = group(val_data, [10, 20, 30, 40, 50, 60])
test_data_groups = group(test_data, [10, 20, 30, 40, 50, 60])
# load the word embeddings
word_embeddings = torch.tensor(np.load("data/demo/word2vec.vectors.npy"))
word_embedding_size = word_embeddings.size(1)
Hyperparameters
Parameter | Value |
---|---|
Word embedding dimension | 300 |
Dropout rate | 0.5 |
BiLSTM hidden size | 300 |
Decoder LSTM input size | 600 |
Decoder LSTM hidden size | 600 |
Initial learning rate | 4 |
Epochs | 30 |
Batch size | 32 |
Gradient clipping threshold | 0.35 |
Training
The paper trains the model with the following biased loss function:
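Reconstructed from the paper's description (the original formula image is not reproduced here), the biased objective is

$$
L = \max \sum_{j=1}^{|D|} \sum_{t=1}^{L_j} \Big( \log p\big(y_t^{(j)} \mid x_j, \Theta\big) \cdot I(O) + \alpha \cdot \log p\big(y_t^{(j)} \mid x_j, \Theta\big) \cdot \big(1 - I(O)\big) \Big),
$$

where $I(O) = 1$ if the gold tag is 'O' and $0$ otherwise, so the bias weight $\alpha > 1$ falls on the relational tags. The code below realizes this bias with NLLLoss class weights: 1 for 'O' and 10 for every other tag.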
Here |D| is the size of the training set, L_j is the length of sentence x_j, y_t^(j) is the tag of the t-th word of x_j, and p_t^(j) is the normalized tag probability produced by the softmax above. I(O) is a switching function that separates the loss on tag 'O' from the loss on the relational tags that carry the extraction result.
# resume from a saved model if one exists, otherwise build a new one
if os.path.exists("model.pt"):
    model = torch.load('model.pt')
else:
    model = Model(embed_size=300, weight=word_embeddings, dropout=0.5, en_hidden_size=300,
                  de_input_size=600, de_hidden_size=600, tagnum=len(tag_set)).to(device)
# give "O" and the other tags different loss weights:
# weight 1 for "O" and weight 10 for every other tag
weight = [10.0] * len(tag_set)
weight[tag_set["O"]] = 1
weight = torch.tensor(weight).to(device)
# see https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html
criterion = nn.NLLLoss(weight, reduction='sum')
optimizer = getattr(optim, 'SGD')(model.parameters(), lr=4)
clip = 0.35
log_interval = 100
epochs = 30
batch_size = 32
best_val_loss = None
lr = 4
all_val_loss = []
all_precision = []
all_recall = []
all_f1 = []
# At any point you can hit Ctrl + C to break out of training early.
try:
    start_time = time.time()
    print("-" * 118)
    # training loop
    for epoch in range(1, epochs + 1):
        train()
        val_loss, precision, recall, f1 = evaluate(val_data_groups, val_data)
        elapsed = time.time() - start_time
        print("-" * 118)
        print("| End of Epoch {:2d} | Elapsed Time {:s} | Validation Loss {:5.3f} | Precision {:5.3f} "
              "| Recall {:5.3f} | F1 {:5.3f} |".format(epoch, time_display(elapsed),
                                                       val_loss, precision, recall, f1))
        print("-" * 118)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open("model.pt", 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr = lr / 4.0
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
        all_val_loss.append(val_loss)
        all_precision.append(precision)
        all_recall.append(recall)
        all_f1.append(f1)
except KeyboardInterrupt:
    print('-' * 118)
    print('Exiting from training early')
def group(data, breakpoints):
    # one bucket per length interval, plus one for sentences longer than the last breakpoint
    groups = [[] for _ in range(len(breakpoints) + 1)]
    # enumerate() pairs each item of an iterable with its index
    for idx, item in enumerate(data):
        # bisect.bisect_left(a, x) returns the insertion point of x in the sorted list a,
        # i.e. the position of the first element >= x; everything to its left is < x.
        # Here it maps the sentence length to its bucket index.
        i = bisect.bisect_left(breakpoints, len(item[0]))
        groups[i].append(idx)
    data_groups = [Subset(data, g) for g in groups]
    return data_groups
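For instance, with dummy (sentence, tags) pairs of lengths 5, 15 and 65, one item lands in each of buckets 0, 1 and 6:

data = [([0] * 5, [0] * 5), ([0] * 15, [0] * 15), ([0] * 65, [0] * 65)]
groups = group(data, [10, 20, 30, 40, 50, 60])
print([len(g) for g in groups])  # [1, 1, 0, 0, 0, 0, 1]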
class GroupBatchRandomSampler(object):
def __init__(self, data_groups, batch_size, drop_last):
self.batch_indices = []
for data_group in data_groups:
self.batch_indices.extend(list(BatchSampler(SubsetRandomSampler(data_group.indices),
batch_size, drop_last=drop_last)))
def __iter__(self):
return (self.batch_indices[i] for i in torch.randperm(len(self.batch_indices)))
def __len__(self):
return len(self.batch_indices)
def get_batch(batch_indices, data):
    batch = [data[idx] for idx in batch_indices]
    sorted_batch = sorted(batch, key=lambda x: len(x[0]), reverse=True)
    sentences, tags = zip(*sorted_batch)
    # pad the sentences and tag sequences to the same length within the batch;
    # vocab is the word-level Index (with a "<pad>" entry) used when the sentences were converted to ids (not shown in this excerpt)
    padded_sentences, lengths = pad_packed_sequence(pack_sequence([torch.LongTensor(_) for _ in sentences]),
                                                    batch_first=True, padding_value=vocab["<pad>"])
    padded_tags, _ = pad_packed_sequence(pack_sequence([torch.LongTensor(_) for _ in tags]),
                                         batch_first=True, padding_value=tag_set["O"])
    # lengths stays on the CPU, as pack_padded_sequence expects
    return padded_sentences.to(device), padded_tags.to(device), lengths
def train():
    model.train()
    total_loss = 0
    count = 0
    # sample batches of batch_size sentences, grouped by length
    sampler = GroupBatchRandomSampler(train_data_groups, batch_size, drop_last=False)
    # batch_indices: the indices of one batch chosen by the sampler
    for idx, batch_indices in enumerate(sampler):
        sentences, targets, lengths = get_batch(batch_indices, train_data)
        # reset the gradients
        optimizer.zero_grad()
        # run the sentences through the model
        output = model(sentences)
        output = pack_padded_sequence(output, lengths, batch_first=True).data
        targets = pack_padded_sequence(targets, lengths, batch_first=True).data
        # compute the loss
        loss = criterion(output, targets)
        # backpropagate
        loss.backward()
        # backpropagation through time can suffer from exploding gradients,
        # so the gradient norm is clipped to the threshold `clip`
        if clip > 0:
            nn.utils.clip_grad_norm_(model.parameters(), clip)
        # optimizer.step() applies the parameter update; without it the model never changes
        optimizer.step()
        total_loss += loss.item()
        count += len(targets)
        # report progress
        if (idx + 1) % log_interval == 0:
            cur_loss = total_loss / count
            elapsed = time.time() - start_time
            percent = ((epoch - 1) * len(sampler) + (idx + 1)) / (epochs * len(sampler))
            remaining = elapsed / percent - elapsed
            print("| Epoch {:2d}/{:2d} | Batch {:5d}/{:5d} | Elapsed Time {:s} | Remaining Time {:s} | "
                  "lr {:4.2e} | Loss {:5.3f} |".format(epoch, epochs, idx + 1, len(sampler), time_display(elapsed),
                                                       time_display(remaining), lr, cur_loss))
            total_loss = 0
            count = 0
# compute the loss, precision, recall and F1 score on a data split
def evaluate(data_groups, data):
    model.eval()
    total_loss = 0
    count = 0
    TP = 0
    TP_FP = 0
    TP_FN = 0
    with torch.no_grad():
        for batch_indices in GroupBatchRandomSampler(data_groups, batch_size, drop_last=False):
            sentences, targets, lengths = get_batch(batch_indices, data)
            output = model(sentences)
            tp, tp_fp, tp_fn = measure(output, targets, lengths)
            TP += tp
            TP_FP += tp_fp
            TP_FN += tp_fn
            output = pack_padded_sequence(output, lengths, batch_first=True).data
            targets = pack_padded_sequence(targets, lengths, batch_first=True).data
            loss = criterion(output, targets)
            total_loss += loss.item()
            count += len(targets)
    return total_loss / count, TP / TP_FP, TP / TP_FN, 2 * TP / (TP_FP + TP_FN)
def measure(output, targets, lengths):
assert output.size(0) == targets.size(0) and targets.size(0) == lengths.size(0)
tp = 0
tp_fp = 0
tp_fn = 0
batch_size = output.size(0)
output = torch.argmax(output, dim=-1)
for i in range(batch_size):
length = lengths[i]
out = output[i][:length].tolist()
target = targets[i][:length].tolist()
out_triplets = get_triplets(out)
tp_fp += len(out_triplets)
target_triplets = get_triplets(target)
tp_fn += len(target_triplets)
for target_triplet in target_triplets:
for out_triplet in out_triplets:
if out_triplet == target_triplet:
tp += 1
return tp, tp_fp, tp_fn
def get_triplets(tags):
temp = {}
triplets = []
for idx, tag in enumerate(tags):
if tag == tag_set["O"]:
continue
pos, relation_label, role = tag_set[tag].split("-")
if pos == "B" or pos == "S":
if relation_label not in temp:
temp[relation_label] = [[], []]
temp[relation_label][int(role) - 1].append(idx)
for relation_label in temp:
role1, role2 = temp[relation_label]
if role1 and role2:
len1, len2 = len(role1), len(role2)
if len1 > len2:
for e2 in role2:
idx = np.argmin([abs(e2 - e1) for e1 in role1])
e1 = role1[idx]
triplets.append((e1, relation_label, e2))
del role1[idx]
else:
for e1 in role1:
idx = np.argmin([abs(e2 - e1) for e2 in role2])
e2 = role2[idx]
triplets.append((e1, relation_label, e2))
del role2[idx]
return triplets
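To see how a predicted tag sequence becomes triplets, here is a toy run (it reuses the Index and make_tag_set helpers; "CP" is again a hypothetical hyphen-free relation label, and since get_triplets reads the module-level tag_set, the toy tag_set stands in for the real one; role-1 and role-2 positions of the same relation are paired by nearest distance):

tag_set = Index()
tag_set.add("O")
make_tag_set(tag_set, "CP")
tags = [tag_set[t] for t in ["B-CP-1", "E-CP-1", "O", "S-CP-2"]]
print(get_triplets(tags))  # [(0, 'CP', 3)]: entity 1 starts at position 0, entity 2 at position 3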
Model evaluation
# Load the best saved model.
with open("model.pt", 'rb') as f:
model = torch.load(f)
# Run on test data
test_loss, precision, recall, f1 = evaluate(test_data_groups, test_data)
print("=" * 118)
print("| End of Training | Test Loss {:5.3f} | Precision {:5.3f} "
"| Recall {:5.3f} | F1 {:5.3f} |".format(test_loss, precision, recall, f1))
print("=" * 118)
with open("record.tsv", "wt", encoding="utf-8") as f:
for idx in range(len(all_val_loss)):
f.write("{:d}\t{:5.3f}\t{:5.3f}\t{:5.3f}\t{:5.3f}\n"
.format(idx+1, all_val_loss[idx], all_precision[idx], all_recall[idx], all_f1[idx]))
f.write("\n{:5.3f}\t{:5.3f}\t{:5.3f}\t{:5.3f}\n".format(test_loss, precision, recall, f1))
The training record record.tsv is shown in the figure:
The final test precision, recall and F1 score are 0.697, 0.878 and 0.777 respectively.