Generating Sentence Vectors with BERT (TensorFlow)

Latest recommended article published on 2024-06-27 16:04:06

huangcy_0518

Tags: tensorflow, natural language processing

Article link: https://blog.csdn.net/weixin_30034903/article/details/113523977

BERT sentence vectors

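BERT was released in two sizes: a 12-layer Transformer and a 24-layer Transformer. The official Chinese model is the 12-layer one (chinese_L-12_H-768_A-12), so the rest of this post is based on the 12-layer model.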

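In principle, the output of any Transformer layer can be used as a sentence vector. So which layer should you take? According to hanxiao's experiments, the second-to-last layer gives the best results: the last layer is too close to the training target, while the earlier layers may not yet have learned enough semantics.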

The rest of this post walks through the code.

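Start with args.py and its most important parameters. The main one is layer_indexes, which selects the layer whose output is used as the sentence vector; -2 means the second-to-last layer. max_seq_len is the maximum sequence length. Because input length varies, a fixed maximum is needed so that every output has the same dimensions. If the maximum is 20, an input shorter than 20 is padded with 0, and an input longer than 20 is truncated to its first 20 tokens. This value is usually set to the average length of the corpus plus 2; the extra 2 accounts for the two special tokens [CLS] (start of the sequence) and [SEP] (end of the sequence). Here, to better distinguish between sentences, the length of the longest sentence in the input list is used as max_seq_len.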
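To make the padding and truncation rule concrete, here is a minimal sketch in plain Python. The helper name `pad_or_truncate` and the sample tokens are illustrative assumptions, not part of the original project. It follows the same order as the full code later in this post: pad the body with `[PAD]` first, then wrap it in `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(tokens, max_seq_len):
    """Return exactly max_seq_len + 2 tokens: [CLS] + body + [SEP].

    Hypothetical helper for illustration. The body is cut to max_seq_len
    tokens, or right-padded with [PAD], before the two markers are added.
    """
    body = list(tokens[:max_seq_len])              # keep the first max_seq_len tokens
    body += ["[PAD]"] * (max_seq_len - len(body))  # pad inputs that are too short
    return ["[CLS]"] + body + ["[SEP]"]


short = pad_or_truncate(["今", "天"], 4)
long_ = pad_or_truncate(["今", "天", "天", "气", "不", "错"], 4)
print(short)  # ['[CLS]', '今', '天', '[PAD]', '[PAD]', '[SEP]']
print(long_)  # ['[CLS]', '今', '天', '天', '气', '[SEP]']
```

Both results have the same length (max_seq_len + 2), which is what lets them be fed to the model as tensors of identical shape. Note that the official BERT preprocessing scripts place [SEP] right after the last real token and pad afterwards; the sketch keeps this post's ordering instead.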

# -*- coding: utf-8 -*- 
# @Time : 2021/1/21 14:55 
# @Author : hcy
# @File : args.py

# Configuration file
import os

root_path = os.path.dirname(__file__)

model_dir = os.path.join(root_path, 'model/bert/chinese_L-12_H-768_A-12/')
bert_config = os.path.join(model_dir, 'bert_config.json')
bert_ckpt = os.path.join(model_dir, 'bert_model.ckpt')
bert_vocab_file = os.path.join(model_dir, 'vocab.txt')

output_dir = os.path.join(root_path, 'output/')
data_dir = os.path.join(root_path, 'data/')

num_train_epochs = 10
batch_size = 128
learning_rate = 0.00005

# Fraction of GPU memory to use
gpu_memory_fraction = 0.8

# By default, use the output of the second-to-last layer as the sentence vector
layer_indexes = [-2]

# # Maximum sequence length; set at runtime to the length of the longest sentence in the list
# max_seq_len = 128

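Define three placeholders for the token ids, the mask, and the segment ids of the input text. input_ids holds each token's index in the vocabulary. input_mask marks whether a position holds real content. For example, if the sequence length is 20 and only 10 characters are real, then together with the [CLS] and [SEP] tokens 12 positions are used and 8 are empty; those 8 positions are set to 0 and all others to 1. segment_ids indicates which sentence a token belongs to: BERT uses 0 for the first sentence and 1 for the second. Because this project only ever feeds a single sentence, every value is 0, which is what the code below uses.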

input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')
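As a rough illustration of what gets fed into these three placeholders, the sketch below builds the mask and segment ids for one already-padded id list. The helper name `build_inputs` and the sample ids are made up for this example; the only assumption taken from BERT itself is that `[PAD]` has id 0 in the released vocabularies.

```python
def build_inputs(token_ids, pad_id=0):
    """Build (input_ids, input_mask, segment_ids) for one single-sentence example.

    Hypothetical helper. pad_id is assumed to be the vocabulary id of [PAD].
    """
    input_mask = [0 if i == pad_id else 1 for i in token_ids]  # 1 = real token, 0 = padding
    segment_ids = [0] * len(token_ids)                         # single sentence, so all 0
    return token_ids, input_mask, segment_ids


# Made-up ids for illustration: [CLS], two characters, two [PAD], [SEP]
ids = [101, 791, 1921, 0, 0, 102]
_, mask, segments = build_inputs(ids)
print(mask)      # [1, 1, 1, 0, 0, 1]
print(segments)  # [0, 0, 0, 0, 0, 0]
```

Each of the three lists would then be wrapped in an outer list (a batch of one) before being passed through `feed_dict`, because the placeholders have shape `[None, None]`.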

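Using the three placeholders defined above as the input tensors, instantiate a model object, which is the pre-trained BERT model, and then initialize its weights from the checkpoint file.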

input_tensors = [input_ids, input_mask, segment_ids]

# Initialize BERT
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False
)

# Load the pre-trained BERT weights from the checkpoint
tf_vars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tf_vars, args.bert_ckpt)
tf.train.init_from_checkpoint(args.bert_ckpt, assignment)

# Get the last layer and the second-to-last layer
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

# Build the tokenizer from the vocabulary file
token = tokenization.FullTokenizer(vocab_file=args.bert_vocab_file)

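Next, take the layer selected by args.layer_indexes. The slice last2[:, 0, :], that is, the hidden state at position 0 (the [CLS] token), is the sentence vector.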

# Get the last layer and the layer selected by layer_indexes (second-to-last by default)
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[args.layer_indexes[0]]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
    # last2 shape: (batch_size, max_seq_len + 2, 768); batch_size is 1 here
    text_embeddings = last2[:, 0, :]
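The slicing step can be hard to picture, so here is a toy sketch that uses nested Python lists in place of the real encoder output. The sizes (2 sentences, 3 tokens, hidden size 4) and the variable names are illustrative assumptions; the real output has hidden size 768.

```python
# A toy stand-in for one encoder layer's output, indexed as
# [sentence][token position][hidden unit].
batch_size, seq_len, hidden = 2, 3, 4
layer_output = [
    [[float(b * 100 + t * 10 + h) for h in range(hidden)] for t in range(seq_len)]
    for b in range(batch_size)
]

# Equivalent of NumPy's layer_output[:, 0, :]:
# take position 0 (the [CLS] token) of every sentence in the batch.
cls_vectors = [sentence[0] for sentence in layer_output]

print(len(cls_vectors), len(cls_vectors[0]))  # 2 4
print(cls_vectors[1])                         # [100.0, 101.0, 102.0, 103.0]
```

The result has one vector per sentence, so a batch of one sentence yields an array of shape (1, 768) in the real code.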

Complete code

args.py

# -*- coding: utf-8 -*- 
# @Time : 2021/1/21 14:55 
# @Author : hcy
# @File : args.py

# Configuration file
import os

root_path = os.path.dirname(__file__)

model_dir = os.path.join(root_path, 'model/bert/chinese_L-12_H-768_A-12/')
bert_config = os.path.join(model_dir, 'bert_config.json')
bert_ckpt = os.path.join(model_dir, 'bert_model.ckpt')
bert_vocab_file = os.path.join(model_dir, 'vocab.txt')

output_dir = os.path.join(root_path, 'output/')
data_dir = os.path.join(root_path, 'data/')

num_train_epochs = 10
batch_size = 128
learning_rate = 0.00005

# Fraction of GPU memory to use
gpu_memory_fraction = 0.8

# By default, use the output of the second-to-last layer as the sentence vector
layer_indexes = [-2]

# # Maximum sequence length; set at runtime to the length of the longest sentence in the list
# max_seq_len = 128

extract_features.py

# -*- coding: utf-8 -*- 
# @Time : 2021/1/21 15:19 
# @Author : hcy
# @File : sentences_features.py
import modeling
import tokenization
import numpy as np
from scipy.spatial.distance import cosine
import tensorflow as tf
import args


bert_config = modeling.BertConfig.from_json_file(args.bert_config)
# graph
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')


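def get_data(sentences):
    """Build the model inputs (ids, mask, segment ids) for one sentence."""
    # `sentences` is the padded id list of a single sentence.
    # [PAD] has id 0 in the BERT vocabulary, so padding positions get mask 0.
    word_mask = [[0 if word_id == 0 else 1 for word_id in sentences]]
    word_segment_ids = [[0] * len(sentences)]
    return [sentences], word_mask, word_segment_ids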

def read_input(sentences):
    # `sentences` is a list of str, one input text per element;
    # convert each one into a list of vocabulary ids
    word_id_list = []
    max_len = max([len(single) for single in sentences])  # length of the longest sentence
    args.max_seq_len = max_len
    for sentence in sentences:
        split_tokens = token.tokenize(sentence)
        # Truncate samples longer than max_seq_len tokens to their first
        # max_seq_len tokens, and pad shorter ones with [PAD]
        if len(split_tokens) > args.max_seq_len:
            split_tokens = split_tokens[:args.max_seq_len]
        else:
            while len(split_tokens) < args.max_seq_len:
                split_tokens.append('[PAD]')
        # Add the [CLS] head and the [SEP] tail
        tokens = []
        tokens.append("[CLS]")
        for i_token in split_tokens:
            tokens.append(i_token)
        tokens.append("[SEP]")
        word_ids = token.convert_tokens_to_ids(tokens)
        word_id_list.append(word_ids)
    return word_id_list

# Initialize BERT
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False
)

# Load the pre-trained BERT weights from the checkpoint
tf_vars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tf_vars, args.bert_ckpt)
tf.train.init_from_checkpoint(args.bert_ckpt, assignment)

# Get the last layer and the layer selected by layer_indexes (second-to-last by default)
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[args.layer_indexes[0]]

# Build the tokenizer from the vocabulary file
token = tokenization.FullTokenizer(vocab_file=args.bert_vocab_file)


def extract_features(sentences):
    """Generate one sentence vector per input sentence."""
    embedding_features = []
    input_data = read_input(sentences)
    # Create the session and load the weights once, not once per sample
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for sample in input_data:
            # Build the feed for a batch of one sentence
            word_id, mask, segment = get_data(sample)
            print(word_id)
            feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
            last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
            # last2 shape: (1, max_seq_len + 2, 768); position 0 is [CLS]
            text_embeddings = last2[:, 0, :]
            embedding_features.append(text_embeddings)
    return embedding_features


def similarity(sentences):
    """Compute the cosine similarity between each pair of consecutive sentence vectors."""
    distances = []
    similarity = []
    last_feature = None
    features = extract_features(sentences)
    for feature in features:
        if last_feature is None:
            last_feature = feature
        else:
            # Each feature has shape (1, 768); flatten to 1-D, which scipy's cosine expects
            dis = cosine(np.ravel(feature), np.ravel(last_feature))
            last_feature = feature
            distances.append(dis)
            similarity.append(1-dis)
    return np.array(similarity)


if __name__ == '__main__':
    sentences = ["今天天气不错，适合出行。",
                 "今天是晴天，可以出去玩。"]
    # sentences = ["打开刘红的个人相情愿","分享名片"]
    sims = similarity(sentences)
    print(sims)
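One detail worth spelling out: scipy's `cosine` returns a distance, which is why the code above computes `1 - dis` to get a similarity. The sketch below computes the similarity directly in plain Python for two 1-D vectors. The function name `cosine_similarity` and the sample vectors are illustrative only and do not appear in the original project.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length 1-D vectors (1 minus the cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

A value near 1 means the two sentence vectors point in almost the same direction; the sketch does not guard against zero-length vectors.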

Reference article: "Generating Sentence Vectors with BERT" (使用BERT生成句向量)
