【LSTM】Reproducing the paper "Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme"

Paper overview

Paper: Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme (ACL 2017 Outstanding Paper)
Link: https://www.aclweb.org/anthology/P17-1113.pdf

Data processing (word2vec.py)

Dataset

The paper uses the public NYT dataset produced by distant supervision (Ren et al., 2017). The training data contains 353k triplets in total, the test set contains 3,880 triplets, and the relation set has 24 relation types. The training and test sets used here are data/demo/train.json and data/demo/test.json.
Contents of train.json (screenshot omitted): each line is one JSON object describing a single sentence and its annotations.
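
The preprocessing below only relies on three fields of each record. An illustrative sketch of one line (the values are made up and real records contain additional fields):

    {"sentText": "Trump is the president of the United States .",
     "entityMentions": [{"label": "PERSON", "text": "Trump"}, {"label": "LOCATION", "text": "United States"}],
     "relationMentions": [{"em1Text": "United States", "em2Text": "Trump", "label": "Country-President"}]}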

Building the corpus

The "sentText" field of every line in train.json and test.json is extracted and written to corpus.txt.
Extraction code (the make_corpus function):

import json


# Write the "sentText" field of every JSON line in fin to fout, one sentence per line.
def func(fin, fout):
    for line in fin:
        line = line.strip()
        if not line:
            continue
        sentence = json.loads(line)
        sentence = sentence["sentText"].strip().strip('"').lower()
        fout.write(sentence + '\n')


def make_corpus():
    with open('data/demo/corpus.txt', 'wt', encoding='utf-8') as fout:
        with open('data/demo/train.json', 'rt', encoding='utf-8') as fin:
            func(fin, fout)
        with open('data/demo/test.json', 'rt', encoding='utf-8') as fin:
            func(fin, fout)

Word embeddings with word2vec

The word embeddings are trained with LineSentence and Word2Vec from gensim.models.word2vec:

    # LineSentence turns the raw corpus into an iterator over sentences;
    # each iteration yields one sentence as a list of (utf-8) words.
    sentences = LineSentence('data/demo/corpus.txt')
    # Word2Vec parameters (gensim 4.x names, matching the call below):
    #   sentences: an iterable of tokenized sentences; for large corpora use
    #       BrownCorpus, Text8Corpus or LineSentence.
    #   sg: training algorithm, 0 (default) = CBOW, 1 = skip-gram.
    #   vector_size: dimensionality of the word vectors (default 100); larger vectors need
    #       more training data, typical values range from tens to a few hundred.
    #   window: maximum distance between the current word and the predicted word within a sentence.
    #   alpha: initial learning rate.
    #   seed: seed for the random number generator used to initialize the vectors.
    #   min_count: words occurring fewer than min_count times are dropped (default 5).
    #   max_vocab_size: RAM limit while building the vocabulary; if there are more unique words,
    #       the least frequent ones are pruned (roughly 1 GB of RAM per 10 million word types);
    #       None means no limit.
    #   sample: threshold for random down-sampling of high-frequency words (default 1e-3).
    #   workers: number of worker threads used for training.
    #   hs: 1 = hierarchical softmax; 0 (default) = negative sampling.
    #   negative: if > 0, negative sampling is used with this many noise words.
    #   cbow_mean: 0 = use the sum of the context word vectors, 1 (default) = use their mean;
    #       only relevant for CBOW.
    #   hashfxn: hash function used to initialize the weights (default: Python's built-in hash).
    #   epochs: number of training epochs (default 5).
    #   trim_rule: vocabulary pruning rule; either None (fall back to min_count) or a function
    #       returning utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
    #   sorted_vocab: if 1 (default), sort the vocabulary by descending frequency before
    #       assigning word indices.
    #   batch_words: number of words passed to the worker threads per batch (default 10000).
    model = Word2Vec(sentences, sg=1, vector_size=300, workers=4, epochs=8, negative=8)
    # model.wv holds the trained word vectors (KeyedVectors);
    # model.wv.vectors is the underlying 2-D NumPy matrix
    word_vectors = model.wv
    word_vectors.save('data/demo/word2vec')
    word_vectors.save_word2vec_format('data/demo/word2vec.txt', fvocab='data/demo/vocab.txt')

vocab.txt stores the frequency of every word in the vocabulary.
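
To sanity-check the result, the saved KeyedVectors can be reloaded with gensim (a small sketch; the query word is arbitrary and has to be present in the vocabulary):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load('data/demo/word2vec')      # also reads the word2vec.vectors.npy file saved next to it
    print(wv['president'].shape)                      # (300,): one 300-dimensional vector per word
    print(wv.most_similar('president', topn=3))       # nearest neighbours in the embedding space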

Tagging the data (data.py)

The Index class (utils.py)

The Index class holds two data structures, a dict key2idx and a list idx2key, and provides the following operations:

  1. add: store a key and return its index
  2. __getitem__: map a key (str) to its index, or an index (int) back to its key
  3. __len__: return the number of stored keys
  4. save: write all keys and their indices to a file
  5. load: read the keys back from a file

Implementation:

class Index(object):
    def __init__(self):
        self.key2idx = {}
        self.idx2key = []

    # store a key and return its index
    def add(self, key):
        if key not in self.key2idx:
            self.key2idx[key] = len(self.idx2key)
            self.idx2key.append(key)
        return self.key2idx[key]

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.key2idx[key]
        if isinstance(key, int):
            return self.idx2key[key]

    def __len__(self):
        return len(self.idx2key)

    def save(self, f):
        with open(f, 'wt', encoding='utf-8') as fout:
            for index, key in enumerate(self.idx2key):
                fout.write(key + '\t' + str(index) + '\n')

    def load(self, f):
        with open(f, 'rt', encoding='utf-8') as fin:
            for line in fin:
                line = line.strip()
                if not line:
                    continue
                key = line.split()[0]
                self.add(key)
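
A quick usage example of the class:

    labels = Index()
    print(labels.add("Country-President"))   # 0  (new key, appended to idx2key)
    print(labels.add("Company-Founder"))     # 1
    print(labels.add("Country-President"))   # 0  (already stored, existing index is returned)
    print(labels["Company-Founder"])         # 1  (str -> index)
    print(labels[0])                         # 'Country-President'  (int -> key)
    print(len(labels))                       # 2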

The entity labels, relation labels and tags are each stored in an Index object, which simplifies the later processing:

    relation_labels = Index()
    entity_labels = Index()
    tag_set = Index()
    tag_set.add("O")

Taking the paper's example sentence (see the tagging scheme below) as an example:

  • relation_labels stores "Country-President" and "Company-Founder"
  • entity_labels stores "Country", "Person", etc. ("Trump" has type "Person" and "United States" has type "Country")
  • tag_set stores "O", "B-CP-1", "E-CP-1", "B-CF-1", "E-CF-1", "B-CF-2", "I-CF-2", "E-CF-2", and so on

Tagging scheme

The paper's tagging scheme is illustrated with an annotated example sentence (figure omitted).
Every word is assigned a tag that contributes to the extraction result. The tag "O" means "Other": the word is not part of any extracted triplet. Every other tag consists of three parts: the word's position inside the entity, the relation type, and the relation role. The position is encoded with the "BIES" scheme (Begin, Inside, End, Single), the relation type comes from a predefined relation set, and the relation role is either "1" or "2". An extraction result is a triplet (Entity1, RelationType, Entity2): role "1" means the word belongs to the first entity of the triplet, role "2" means it belongs to the second entity of that relation.

The annotated example illustrates the tagging method. The input sentence contains two triplets: {United States, Country-President, Trump} and {Apple Inc, Company-Founder, Steven Paul Jobs}, where "Country-President" and "Company-Founder" are predefined relation types. The words "United", "States", "Trump", "Apple", "Inc", "Steven", "Paul" and "Jobs" are all part of the extraction result, so they receive the special tags. For example, "United" is the first word of the entity "United States" and participates in the "Country-President" relation, so its tag is "B-CP-1". The other entity of that relation, "Trump", is a single word and is tagged "S-CP-2". All remaining words are tagged "O". The resulting word/tag pairs are listed below.
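
In code form, the non-"O" word/tag pairs of this sentence are (CP = Country-President, CF = Company-Founder; every other word is tagged "O"):

    tagged = [("United", "B-CP-1"), ("States", "E-CP-1"), ("Trump", "S-CP-2"),
              ("Apple", "B-CF-1"), ("Inc", "E-CF-1"),
              ("Steven", "B-CF-2"), ("Paul", "I-CF-2"), ("Jobs", "E-CF-2")]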

Implementation

Main routine

    relation_labels = Index()
    entity_labels = Index()
    tag_set = Index()
    tag_set.add("O")

    with open("overlap.txt", "wt", encoding="utf-8") as fout:
        train = []
        with open('data/demo/train.json', 'rt', encoding='utf-8') as fin:
            # 输入
            # fin:输入文件流,这里指打开train.json文件。
            # relation_label:自定义Index类型,存放出现过的关系标签,如“Country-President”、“Company-Founder”等
            # entity_labels:存放出现过的实体类型标签,如如PERSON,LOCATION等
            # tag_set:存放句子中出现的标记
            # train:前面定义的列表,用于存放句子对应的标记结果
            # fout:文件输出流,将有重叠的句子写入overlap.txt中
            # 输出
            # res:fin中有重叠句子的个数
            res = prepare_data_set(fin, relation_labels, entity_labels, tag_set, train, fout)
            print("# of overlaps in train data: {}".format(res))
        # 将train列表保存为train.pk文件,方面后续模型训练使用
        save(train, 'data/demo/train.pk')

        # 处理方式与上述train一致
        test = []
        with open('data/demo/test.json', 'rt', encoding='utf-8') as fin:
            res = prepare_data_set(fin, relation_labels, entity_labels, tag_set, test, fout)
            print("# of overlaps in test data: {}".format(res))
        save(test, 'data/demo/test.pk')

    # 将relation_labels、entity_labels、tag_set分别保存在txt文件中
    # 调用Index类中的save函数
    relation_labels.save('data/demo/relation_labels.txt')
    entity_labels.save('data/demo/entity_labels.txt')
    tag_set.save("data/demo/tag2id.txt")

Running the script prints:

# of overlaps in train data: 42924
# of overlaps in test data: 18

prepare_data_set implements the main tagging logic:

def prepare_data_set(fin, relation_labels, entity_labels, tag_set, dataset, fout):
    num_overlap = 0
    # read fin line by line
    for line in fin:
        overlap = False
        line = line.strip()
        if not line:
            continue
        # parse the JSON line (str) into a dict
        sentence = json.loads(line)
        # In train.json and test.json the triplet annotations live under "relationMentions":
        # a list of all entity1-relation-entity2 pairs of this sentence.
        for relation_mention in sentence["relationMentions"]:
            # record the relation label in relation_labels
            relation_labels.add(relation_mention["label"])
            # generate all possible tags for this relation type and store them in tag_set
            make_tag_set(tag_set, relation_mention["label"])

        # "entityMentions" is likewise a list holding, for every entity in the sentence,
        # its start position, entity label and entity text.
        for entity_mention in sentence["entityMentions"]:
            # record the entity label (PERSON, LOCATION, ...) in entity_labels
            entity_labels.add(entity_mention["label"])

        # strip() without arguments removes leading and trailing whitespace
        sentence_text = sentence["sentText"].strip().strip('"').split()
        # skip sentences longer than MAX_SENT_LENGTH words
        length_sent = len(sentence_text)
        if length_sent > MAX_SENT_LENGTH:
            continue

        # initialize every word's tag to "O"
        tags_idx = [tag_set["O"]] * length_sent
        # iterate over the relation mentions
        for relation_mention in sentence["relationMentions"]:
            # skip relation mentions whose label is None
            if relation_mention["label"] == "None":
                continue
            em1_text = relation_mention["em1Text"].split()
            # update the tag sequence tags_idx for entity 1 and entity 2;
            # res1/res2 are True when tagging that entity hits a word that already carries a tag (an overlap)
            res1 = update_tag_seq(em1_text, sentence_text, relation_mention["label"], 1, tag_set, tags_idx)
            em2_text = relation_mention["em2Text"].split()
            res2 = update_tag_seq(em2_text, sentence_text, relation_mention["label"], 2, tag_set, tags_idx)
            if res1 or res2:
                num_overlap += 1
                overlap = True
        # sentence_idx: vocabulary ids of the words in the sentence
        #               (obtained by looking every word up in the word vocabulary; omitted in this excerpt)
        # tags_idx: tag ids of the words in the sentence
        dataset.append((sentence_idx, tags_idx))
        # write overlapping sentences to overlap.txt
        if overlap:
            fout.write(line+"\n")
    return num_overlap

make_tag_set generates the possible tags for a relation type:

def make_tag_set(tag_set, relation_label):
    if relation_label == "None":
        return
    # for a relation such as "Country-President", generate every combination of
    # entity position (B/I/E/S) and entity role (1/2)
    for pos in "BIES":
        for role in "12":
            # store pos-relation_label-role in tag_set
            tag_set.add("-".join([pos, relation_label, role]))

update_tag_seq updates the tag sequence for one entity mention:

def update_tag_seq(em_text, sentence_text, relation_label, role, tag_set, tags_idx):
    overlap = False
    # find where the entity em_text first occurs in the sentence sentence_text
    start = search(em_text, sentence_text)
    # build the "Single" tag first
    tag = "-".join(["S", relation_label, str(role)])
    # the entity consists of a single word, e.g. "Asia"
    if len(em_text) == 1:
        # if the tag at this position is not "O", another entity already claimed it: an overlap
        if tags_idx[start] != tag_set["O"]:
            overlap = True
        tags_idx[start] = tag_set[tag]
    # the entity consists of several words, e.g. "Omar Vizquel"
    else:
        # tag the first word of the entity
        tag = "B" + tag[1:]
        if tags_idx[start] != tag_set["O"]:
            overlap = True
        # look up the tag id in tag_set and assign it at position start
        tags_idx[start] = tag_set[tag]
        # tag the last word of the entity
        tag = "E" + tag[1:]
        end = start + len(em_text) - 1
        if tags_idx[end] != tag_set["O"]:
            overlap = True
        tags_idx[end] = tag_set[tag]
        # tag the words in the middle of the entity
        tag = "I" + tag[1:]
        for index in range(start + 1, end):
            if tags_idx[index] != tag_set["O"]:
                overlap = True
            tags_idx[index] = tag_set[tag]
    # report whether an overlap occurred while tagging this entity
    return overlap
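
A toy walk-through of make_tag_set and update_tag_seq (hypothetical sentence, using the functions defined above):

    tag_set = Index()
    tag_set.add("O")
    make_tag_set(tag_set, "Company-Founder")     # adds B/I/E/S-Company-Founder-1/2 to tag_set

    sentence_text = "steven paul jobs founded apple inc".split()
    tags_idx = [tag_set["O"]] * len(sentence_text)
    # entity 1 of the triplet (Apple Inc, Company-Founder, Steven Paul Jobs)
    update_tag_seq("apple inc".split(), sentence_text, "Company-Founder", 1, tag_set, tags_idx)
    # entity 2 of the triplet
    update_tag_seq("steven paul jobs".split(), sentence_text, "Company-Founder", 2, tag_set, tags_idx)

    print([tag_set[t] for t in tags_idx])
    # ['B-Company-Founder-2', 'I-Company-Founder-2', 'E-Company-Founder-2',
    #  'O', 'B-Company-Founder-1', 'E-Company-Founder-1']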

search finds the first occurrence of a word sequence within a sentence:

def search(pat, txt):
    # brute-force matching: return the index where the word list pat first occurs in txt, or -1 if absent
    i, N = 0, len(txt)
    j, M = 0, len(pat)
    while i < N and j < M:
        if txt[i] == pat[j]:
            j = j + 1
        else:
            i -= j
            j = 0
        i = i + 1
    if j == M:
        return i - M
    else:
        return -1
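
For example:

    search("united states".split(), "the united states president trump".split())    # -> 1
    search(["obama"], "the united states president trump".split())                  # -> -1 (not found)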

Output files

The script writes three label files (screenshots omitted):

  • relation_labels.txt: the relation labels and their ids
  • entity_labels.txt: the entity type labels and their ids
  • tag2id.txt: all tags and their ids

Seq2Seq model (model.py)

The network is built with PyTorch; the official PyTorch documentation has detailed descriptions and small examples of every function used here.

Network architecture

The end-to-end model proposed in the paper (architecture figure omitted) processes each word of the input sentence as follows:

  1. The word is first mapped to an embedding vector W in the Embedding Layer.
  2. The embedding W enters the Encoding Layer, a BiLSTM, which outputs the hidden vector h.
  3. h enters the Decoding Layer, an LSTM variant, whose hidden vector h' is passed through tanh to give T.
  4. T goes through a softmax to produce the predicted tag.

Encoding layer

class Encoder(nn.Module):
    # The encoder consists of an embedding layer, a dropout layer and a BiLSTM,
    # all built from standard torch.nn modules.
    # embed_size: dimension of the embedding vector W, i.e. the BiLSTM input size
    # weight: pretrained word vector matrix (trained in word2vec.py)
    # dropout: dropout rate
    # hidden_size: dimension of the BiLSTM hidden vector h per direction
    def __init__(self, embed_size, weight, dropout, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        # initialize with the pretrained vectors; freeze=False means they are fine-tuned during training
        self.embed = nn.Embedding.from_pretrained(weight, freeze=False)
        self.drop = nn.Dropout(dropout)
        self.biLSTM = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=True)

    # forward pass; X has shape (batch_size, seq_len)
    def forward(self, X):
        batch_size = X.size(0)
        seq_len = X.size(1)
        # (batch_size, seq_len)
        # torch.Size([32, 30])

        # pass through the layers one by one
        embeddings = self.embed(X)
        # (batch_size, seq_len, embedding_size)
        # torch.Size([32, 30, 300])
        embeddings = self.drop(embeddings)
        # randomly initialized hidden state and cell state,
        # shape [num_layers(=1) * num_directions(=2), batch_size, hidden_size],
        # created on the input's device so the code also runs on the GPU
        hidden_state = torch.randn(1*2, batch_size, self.hidden_size, device=X.device)
        cell_state = torch.randn(1*2, batch_size, self.hidden_size, device=X.device)

        # flatten_parameters() stores the LSTM parameters in one contiguous chunk of memory
        # (similar in spirit to tensor.contiguous), which improves memory usage and speed
        self.biLSTM.flatten_parameters()
        outputs, (hc, c) = self.biLSTM(embeddings, (hidden_state, cell_state))

        # outputs holds the concatenated forward/backward hidden states of every time step,
        # while hc only holds the final hidden state of each direction
        # outputs: [batch_size, seq_len, hidden_size * 2]
        # torch.Size([32, 30, 600])
        return outputs
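
A quick shape check of the encoder with random inputs (the vocabulary size and sentence length below are arbitrary):

    import torch

    vocab_size = 1000
    weight = torch.randn(vocab_size, 300)            # stand-in for the pretrained word vectors
    encoder = Encoder(embed_size=300, weight=weight, dropout=0.5, hidden_size=300)
    X = torch.randint(0, vocab_size, (32, 30))       # a batch of 32 sentences of 30 words each
    print(encoder(X).shape)                          # torch.Size([32, 30, 600])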

Decoding layer

The paper says the decoding layer is an LSTM, but the formulas it gives differ from a standard LSTM cell, so the decoder is implemented by hand, gate by gate (see a separate blog post on hand-written LSTM cells for background).
The gate equations (the paper's formula images are not reproduced here; the form below is what the code implements, with w_t the encoder output at step t, T_{t-1} the previous tag vector and h_{t-1} the previous decoder hidden state):

    i_t = σ(W_wi w_t + W_ti T_{t-1} + W_hi h_{t-1} + b_i)
    f_t = σ(W_wf w_t + W_tf T_{t-1} + W_hf h_{t-1} + b_f)
    g_t = tanh(W_wg w_t + W_tg T_{t-1} + W_hg h_{t-1} + b_g)
    o_t = σ(W_wo w_t + W_to T_{t-1} + W_ho h_{t-1} + b_o)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h_t = o_t ⊙ tanh(c_t)
    T_t = tanh(h_t)

Implementation:

class Decoder(nn.Module):
    # input_size: dimension of the hidden vectors coming from the encoder
    # hidden_size: dimension of the decoder hidden vector h'
    def __init__(self,input_size,hidden_size):
        super(Decoder,self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size

        # weight matrices w and biases b of the gates;
        # the names follow the notation used in the paper
        # input gate
        self.w_wi = Parameter(Tensor(hidden_size, input_size))
        self.w_hi = Parameter(Tensor(hidden_size, hidden_size))
        self.w_ti = Parameter(Tensor(hidden_size, input_size))
        self.b_i = Parameter(Tensor(hidden_size, 1))
 
        # forget gate
        self.w_wf = Parameter(Tensor(hidden_size, input_size))
        self.w_hf = Parameter(Tensor(hidden_size, hidden_size))
        self.w_tf = Parameter(Tensor(hidden_size, input_size))
        self.b_f = Parameter(Tensor(hidden_size, 1))
 
        # output gate
        self.w_wo = Parameter(Tensor(hidden_size, input_size))
        self.w_ho = Parameter(Tensor(hidden_size, hidden_size))
        self.w_to = Parameter(Tensor(hidden_size, input_size))
        self.b_o = Parameter(Tensor(hidden_size, 1))
 
        # cell
        self.w_wg = Parameter(Tensor(hidden_size, input_size))
        self.w_hg = Parameter(Tensor(hidden_size, hidden_size))
        self.w_tg = Parameter(Tensor(hidden_size, input_size))
        self.b_g = Parameter(Tensor(hidden_size, 1))
 
        self.reset_weigths()
    
    # initialize all weights uniformly in [-1/sqrt(hidden_size), +1/sqrt(hidden_size)]
    def reset_weigths(self):
        """reset weights
        """
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            init.uniform_(weight, -stdv, stdv)

    # forward pass; X has shape (batch_size, seq_len, input_size)
    def forward(self,X):
        batch_size = X.size(0)
        seq_len = X.size(1)

        # previous decoder hidden state h', cell state and tag vector T, all initialized
        # to zeros and created on the input's device so the code also runs on the GPU
        h_t = torch.zeros(1, self.hidden_size, device=X.device).t()
        c_t = torch.zeros(1, self.hidden_size, device=X.device).t()
        T = torch.zeros(batch_size, self.hidden_size, device=X.device).t()

        T_seq = []
        # seq_len is the sentence length
        for t in range(seq_len):
            # encoder output of the current step, transposed to (input_size, batch_size)
            x = X[:, t, :].t()
            # torch.Size([600, 32])

            # the gate equations from the paper
            # input gate
            i = torch.sigmoid(self.w_wi @ x + self.w_ti @ T + self.w_hi @ h_t +
                              self.b_i)
            # forget gate
            f = torch.sigmoid(self.w_wf @ x + self.w_tf @ T + self.w_hf @ h_t +
                              self.b_f)
            # cell candidate
            g = torch.tanh(self.w_wg @ x + self.w_tg @ T + self.w_hg @ h_t
                           + self.b_g)
            # output gate
            o = torch.sigmoid(self.w_wo @ x + self.w_to @ T + self.w_ho @ h_t +
                              self.b_o)

            c_next = f * c_t + i * g
            h_next = o * torch.tanh(c_next)
            # torch.Size([600, 32])
            T_next = torch.tanh(h_next)
            # torch.Size([600, 32])
            # carry T, h and c over to the next step (T_{t-1}, h_{t-1}, c_{t-1} in the gate equations)
            T = T_next
            h_t = h_next
            c_t = c_next
            # collect T_t for this step, shape (1, batch_size, hidden_size)
            h_next_t = T_next.t().unsqueeze(0)
            T_seq.append(h_next_t)

        T_seq = torch.cat(T_seq, dim=0)
        # (seq_len, batch_size, hidden_size) -> (batch_size, seq_len, hidden_size)
        return T_seq.transpose(0, 1)

Full model

class Model(nn.Module):
    # tagnum: number of tags, i.e. the size of the softmax output layer
    # The model chains the Encoder and Decoder defined above and maps the decoder
    # output to tag scores with a linear layer.
    def __init__(self,embed_size,weight,dropout,en_hidden_size,de_input_size,de_hidden_size,tagnum):
        super(Model,self).__init__()
        self.encoder = Encoder(embed_size=embed_size,weight=weight,dropout=dropout,hidden_size=en_hidden_size)
        self.decoder = Decoder(input_size=de_input_size,hidden_size=de_hidden_size)
        self.tagnum = tagnum
        self.hidden2tag = nn.Linear(de_hidden_size,self.tagnum)

    # forward pass
    def forward(self,X):
        encoder_output = self.encoder(X)
        output = self.decoder(encoder_output)
        y = self.hidden2tag(output)
        # torch.Size([32, 40, 193])
        # log-softmax over dim 2, the tag dimension
        return F.log_softmax(y,dim=2)
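
Chaining everything together, a shape check of the full model (193 tags = 24 relation types × 8 position/role combinations + "O"; the sizes match the hyperparameters used for training below):

    vocab_size = 1000
    weight = torch.randn(vocab_size, 300)
    net = Model(embed_size=300, weight=weight, dropout=0.5,
                en_hidden_size=300, de_input_size=600, de_hidden_size=600, tagnum=193)
    X = torch.randint(0, vocab_size, (32, 40))
    print(net(X).shape)        # torch.Size([32, 40, 193]): log-probabilities over the 193 tags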

Model training (train.py)

Loading the data

    # Seed the (CPU) random number generator; with a fixed seed
    # the sequence of generated random numbers is reproducible.
    random_seed = 1111
    torch.manual_seed(random_seed)
    # device object that tensors are allocated on (CPU, or a specific GPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # load the data prepared earlier
    tag_set = Index()
    tag_set.load("data/demo/tag2id.txt")
    relation_labels = Index()
    relation_labels.load('data/demo/relation_labels.txt')

    train_data = load('data/demo/train.pk')
    test_data = load('data/demo/test.pk')
    val_size = int(0.01 * len(train_data))
    # randomly split off a validation set
    train_data, val_data = random_split(train_data, [len(train_data)-val_size, val_size])

    # group the data into sentence-length buckets
    train_data_groups = group(train_data, [10, 20, 30, 40, 50, 60])
    val_data_groups = group(val_data, [10, 20, 30, 40, 50, 60])
    test_data_groups = group(test_data, [10, 20, 30, 40, 50, 60])

    # load the pretrained word embeddings
    word_embeddings = torch.tensor(np.load("data/demo/word2vec.vectors.npy"))
    word_embedding_size = word_embeddings.size(1)

Training hyperparameters

  • Word embedding dimension: 300
  • Dropout rate: 0.5
  • BiLSTM hidden size: 300
  • Decoder LSTM input size: 600
  • Decoder LSTM hidden size: 600
  • Learning rate: 1
  • Epochs: 30
  • Batch size: 32
  • Gradient clipping threshold: 0.35

Training loop

The paper optimizes the following bias objective (the formula image is not reproduced here; the expression below is a reconstruction from the surrounding description):

    L = max Σ_{j=1}^{|D|} Σ_{t=1}^{L_j} [ log(p_t^(j) = y_t^(j) | x_j, Θ) · I(O) + α · log(p_t^(j) = y_t^(j) | x_j, Θ) · (1 − I(O)) ]

Here |D| is the size of the training set, L_j is the length of sentence x_j, y_t^(j) is the tag of word t in sentence x_j, and p_t^(j) is the normalized tag probability produced by the softmax. I(O) is a switching function that separates the loss on tag 'O' from the loss on the relational tags that indicate extraction results, and α is a bias weight that strengthens the influence of the relational tags. In the code below this bias is realised by giving NLLLoss a weight of 10 for every relational tag and 1 for 'O'.

    if os.path.exists("model.pt"):
        model = torch.load('model.pt')
    else:
        model = Model(embed_size=300, weight=word_embeddings, dropout=0.5, en_hidden_size=300, de_input_size=600, de_hidden_size=600, tagnum=len(tag_set)).to(device)

    # give "O" and the other tags different loss weights:
    # weight 1 for "O" and weight 10 for every other tag
    weight = [10.0] * len(tag_set)
    weight[tag_set["O"]] = 1
    weight = torch.tensor(weight).to(device)
    # loss function, see https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html
    # (reduction='sum' is the current spelling of the deprecated size_average=False)
    criterion = nn.NLLLoss(weight, reduction='sum')
    optimizer = optim.SGD(model.parameters(), lr=4)
    clip = 0.35
    log_interval = 100
    epochs = 30
    batch_size = 32
    best_val_loss = None
    lr = 4
    all_val_loss = []
    all_precision = []
    all_recall = []
    all_f1 = []

    # At any point you can hit Ctrl + C to break out of training early.
    try:
        start_time = time.time()
        print("-" * 118)
        # start training
        for epoch in range(1, epochs+1):
            train()
            val_loss, precision, recall, f1 = evaluate(val_data_groups, val_data)

            elapsed = time.time() - start_time
            print("-" * 118)
            print("| End of Epoch {:2d} | Elapsed Time {:s} | Validation Loss {:5.3f} | Precision {:5.3f} "
                  "| Recall {:5.3f} | F1 {:5.3f} |".format(epoch, time_display(elapsed),
                                                           val_loss, precision, recall, f1))
            print("-" * 118)

            # Save the model if the validation loss is the best we've seen so far.
            if not best_val_loss or val_loss < best_val_loss:
                with open("model.pt", 'wb') as f:
                    torch.save(model, f)
                best_val_loss = val_loss
            else:
                # Anneal the learning rate if no improvement has been seen in the validation dataset.
                lr = lr / 4.0
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr
            all_val_loss.append(val_loss)
            all_precision.append(precision)
            all_recall.append(recall)
            all_f1.append(f1)

    except KeyboardInterrupt:
        print('-' * 118)
        print('Exiting from training early')

The helper functions used above are defined next. group buckets a dataset by sentence length so that a batch never mixes very short and very long sentences:

def group(data, breakpoints):
    groups = [[] for _ in range(len(breakpoints)+1)]
    # enumerate() pairs every item with its position in the dataset
    for idx, item in enumerate(data):
        # bisect.bisect_left(a, x) returns the leftmost position at which x could be inserted
        # into the sorted list a: everything to the left is < x and everything to the right is >= x.
        # Here it picks the length bucket for the sentence item[0].
        i = bisect.bisect_left(breakpoints, len(item[0]))
        groups[i].append(idx)
    data_groups = [Subset(data, g) for g in groups]
    return data_groups
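
For example, with the breakpoints [10, 20, 30, 40, 50, 60] a dataset is split into seven buckets (≤10, 11–20, …, >60 words). A toy illustration:

    toy = [([1] * 5,  [0] * 5),     # 5-word sentence  -> bucket 0
           ([1] * 25, [0] * 25),    # 25-word sentence -> bucket 2
           ([1] * 70, [0] * 70)]    # 70-word sentence -> bucket 6
    print([len(g) for g in group(toy, [10, 20, 30, 40, 50, 60])])
    # [1, 0, 1, 0, 0, 0, 1]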

class GroupBatchRandomSampler(object):
    # Form batches inside each length group (so a batch never mixes length buckets)
    # and yield all batches in random order.
    def __init__(self, data_groups, batch_size, drop_last):
        self.batch_indices = []
        for data_group in data_groups:
            self.batch_indices.extend(list(BatchSampler(SubsetRandomSampler(data_group.indices),
                                                        batch_size, drop_last=drop_last)))

    def __iter__(self):
        return (self.batch_indices[i] for i in torch.randperm(len(self.batch_indices)))

    def __len__(self):
        return len(self.batch_indices)


def get_batch(batch_indices, data):
    batch = [data[idx] for idx in batch_indices]
    # sort by sentence length (descending), as pack_sequence expects
    sorted_batch = sorted(batch, key=lambda x: len(x[0]), reverse=True)
    sentences, tags = zip(*sorted_batch)

    # pad the sentences and tag sequences of the batch to a common length
    padded_sentences, lengths = pad_packed_sequence(pack_sequence([torch.LongTensor(_) for _ in sentences]),
                                                    batch_first=True, padding_value=vocab["<pad>"])
    padded_tags, _ = pad_packed_sequence(pack_sequence([torch.LongTensor(_) for _ in tags]),
                                         batch_first=True, padding_value=tag_set["O"])

    # lengths stays on the CPU because pack_padded_sequence expects CPU lengths
    return padded_sentences.to(device), padded_tags.to(device), lengths

def train():
    model.train()
    total_loss = 0
    count = 0
    # batches of indices, sampled within the sentence-length groups
    sampler = GroupBatchRandomSampler(train_data_groups, batch_size, drop_last=False)
    # batch_indices: the indices chosen by the sampler for one batch
    for idx, batch_indices in enumerate(sampler):
        sentences, targets, lengths = get_batch(batch_indices, train_data)
        # reset the gradients
        optimizer.zero_grad()
        # run the model on the batch
        output = model(sentences)
        output = pack_padded_sequence(output, lengths, batch_first=True).data
        targets = pack_padded_sequence(targets, lengths, batch_first=True).data
        # compute the loss
        loss = criterion(output, targets)
        # backpropagation
        loss.backward()
        # gradients can explode during backpropagation through time;
        # clip_grad_norm_ rescales them so that their overall norm does not exceed clip
        if clip > 0:
            nn.utils.clip_grad_norm_(model.parameters(), clip)
        # optimizer.step() applies the parameter update, once per mini-batch
        optimizer.step()

        total_loss += loss.item()
        count += len(targets)
        # progress report
        if (idx+1) % log_interval == 0:
            cur_loss = total_loss / count
            elapsed = time.time() - start_time
            percent = ((epoch-1)*len(sampler)+(idx+1))/(epochs*len(sampler))
            remaining = elapsed / percent - elapsed
            print("| Epoch {:2d}/{:2d} | Batch {:5d}/{:5d} | Elapsed Time {:s} | Remaining Time {:s} | "
                  "lr {:4.2e} | Loss {:5.3f} |".format(epoch, epochs, idx+1, len(sampler), time_display(elapsed),
                                                       time_display(remaining), lr, cur_loss))
            total_loss = 0
            count = 0

# compute loss, precision, recall and F1 on a dataset
# (data_groups are the length buckets of that dataset, data is the underlying dataset itself)
def evaluate(data_groups, data):
    model.eval()
    total_loss = 0
    count = 0
    TP = 0
    TP_FP = 0
    TP_FN = 0
    with torch.no_grad():
        for batch_indices in GroupBatchRandomSampler(data_groups, batch_size, drop_last=False):
            sentences, targets, lengths = get_batch(batch_indices, data)
            output = model(sentences)
            tp, tp_fp, tp_fn = measure(output, targets, lengths)
            TP += tp
            TP_FP += tp_fp
            TP_FN += tp_fn
            output = pack_padded_sequence(output, lengths, batch_first=True).data
            targets = pack_padded_sequence(targets, lengths, batch_first=True).data
            loss = criterion(output, targets)
            total_loss += loss.item()
            count += len(targets)
    # micro-averaged precision, recall and F1 over the extracted triplets
    return total_loss / count, TP/TP_FP, TP/TP_FN, 2*TP/(TP_FP+TP_FN)

# count, for one batch, the correctly predicted triplets (tp), all predicted
# triplets (tp_fp) and all gold triplets (tp_fn)
def measure(output, targets, lengths):
    assert output.size(0) == targets.size(0) and targets.size(0) == lengths.size(0)
    tp = 0
    tp_fp = 0
    tp_fn = 0
    batch_size = output.size(0)
    # pick the most probable tag at every position
    output = torch.argmax(output, dim=-1)
    for i in range(batch_size):
        length = lengths[i]
        out = output[i][:length].tolist()
        target = targets[i][:length].tolist()
        out_triplets = get_triplets(out)
        tp_fp += len(out_triplets)
        target_triplets = get_triplets(target)
        tp_fn += len(target_triplets)
        for target_triplet in target_triplets:
            for out_triplet in out_triplets:
                if out_triplet == target_triplet:
                    tp += 1
    return tp, tp_fp, tp_fn
    
# decode a tag sequence into triplets (entity-1 position, relation label, entity-2 position)
def get_triplets(tags):
    # temp[relation_label] = [positions of role-1 entities, positions of role-2 entities]
    temp = {}
    triplets = []
    for idx, tag in enumerate(tags):
        if tag == tag_set["O"]:
            continue
        # a tag looks like "B-<relation>-1"; this assumes the relation label itself contains no "-"
        pos, relation_label, role = tag_set[tag].split("-")
        # only the first word of an entity (a B or S tag) marks an entity occurrence
        if pos == "B" or pos == "S":
            if relation_label not in temp:
                temp[relation_label] = [[], []]
            temp[relation_label][int(role) - 1].append(idx)
    for relation_label in temp:
        role1, role2 = temp[relation_label]
        if role1 and role2:
            # pair every entity of the smaller role set with the nearest entity of the other role
            len1, len2 = len(role1), len(role2)
            if len1 > len2:
                for e2 in role2:
                    idx = np.argmin([abs(e2 - e1) for e1 in role1])
                    e1 = role1[idx]
                    triplets.append((e1, relation_label, e2))
                    del role1[idx]
            else:
                for e1 in role1:
                    idx = np.argmin([abs(e2 - e1) for e2 in role2])
                    e2 = role2[idx]
                    triplets.append((e1, relation_label, e2))
                    del role2[idx]
    return triplets

Model evaluation

# Load the best saved model.
    with open("model.pt", 'rb') as f:
        model = torch.load(f)

    # Run on test data
    test_loss, precision, recall, f1 = evaluate(test_data_groups, test_data)
    print("=" * 118)
    print("| End of Training | Test Loss {:5.3f} | Precision {:5.3f} "
          "| Recall {:5.3f} | F1 {:5.3f} |".format(test_loss, precision, recall, f1))
    print("=" * 118)

    with open("record.tsv", "wt", encoding="utf-8") as f:
        for idx in range(len(all_val_loss)):
            f.write("{:d}\t{:5.3f}\t{:5.3f}\t{:5.3f}\t{:5.3f}\n"
                    .format(idx+1, all_val_loss[idx], all_precision[idx], all_recall[idx], all_f1[idx]))
        f.write("\n{:5.3f}\t{:5.3f}\t{:5.3f}\t{:5.3f}\n".format(test_loss, precision, recall, f1))

The training record record.tsv (screenshot omitted) stores the validation loss, precision, recall and F1 of every epoch together with the final test metrics.
The test precision, recall and F1 score are 0.697, 0.878 and 0.777, respectively.
