To get better text representations for classification, we replace the previously random embedding vectors with pretrained ones.
Extracting the BERT embedding
from tensorflow.python import pywrap_tensorflow

# ckpt_path should point to the BERT checkpoint prefix (e.g. ".../bert_model.ckpt")
reader = pywrap_tensorflow.NewCheckpointReader(ckpt_path)
param_dict = reader.get_variable_to_shape_map()
emb = reader.get_tensor("bert/embeddings/word_embeddings")

vocab_file = "/mnt/data3/wuchunsheng/code/nlper/NLP_task/word2vec/bert/chineseGLUE/baselines/models/roberta/prev_trained_model/bert/publish/vocab.txt"
vocab = open(vocab_file).read().split("\n")

# Write the matrix in word2vec text format: a "rows dims" header,
# then one "token v1 v2 ... v768" line per vocabulary entry.
out = open("bert_embedding", "w")
out.write(str(emb.shape[0]) + " " + str(emb.shape[1]) + "\n")
for index in range(emb.shape[0]):
    out.write(vocab[index] + " " + " ".join([str(i) for i in emb[index, :]]) + "\n")
out.close()
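The file written above follows the plain-text word2vec format: a header line with the row and dimension counts, then one token per line followed by its vector. A minimal sketch of that round trip, using a made-up 2×3 toy matrix in place of the real vocab and 768-dim BERT weights:

```python
import io

# Toy stand-ins for the real vocab and embedding matrix.
vocab = ["[PAD]", "[UNK]"]
emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Write in the same format as the export loop above.
buf = io.StringIO()
buf.write(str(len(emb)) + " " + str(len(emb[0])) + "\n")  # header: "rows dims"
for token, row in zip(vocab, emb):
    buf.write(token + " " + " ".join(str(v) for v in row) + "\n")

# Parse it back: header gives the shape, each line maps token -> vector.
lines = buf.getvalue().splitlines()
rows, dims = map(int, lines[0].split())
parsed = {line.split(" ")[0]: [float(v) for v in line.split(" ")[1:]]
          for line in lines[1:]}
print(rows, dims)       # 2 3
print(parsed["[UNK]"])  # [0.4, 0.5, 0.6]
```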
Loading the embedding in PyTorch
Using an external embedding takes two steps.
- Pass the pretrained vectors in when building the vocabulary
This step aligns the dataset vocabulary with the pretrained one.
from torchtext.vocab import Vectors

vectors = Vectors(name="E:/study_series/2020_3_data/data/need_bertembedding", cache="./")
vocab_maxsize = 4000
vocab_minfreq = 10
TEXT.build_vocab(train_file, max_size=vocab_maxsize, min_freq=vocab_minfreq, vectors=vectors)
TEXT.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
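Conceptually, the alignment copies, for each index of the newly built vocabulary, the pretrained row of the same token (tokens without a pretrained vector keep a default, typically zeros). A small numpy sketch of that mapping, with made-up tokens and 3-dim vectors:

```python
import numpy as np

# Hypothetical pretrained vectors: token -> row index, plus the matrix itself.
pretrained_stoi = {"我": 0, "你": 1}
pretrained_vectors = np.array([[1.0, 1.0, 1.0],
                               [2.0, 2.0, 2.0]])
dim = pretrained_vectors.shape[1]

# Hypothetical dataset vocabulary with its own ordering (as build_vocab produces).
itos = ["<unk>", "<pad>", "你", "我"]

# Row i of the aligned matrix holds the pretrained vector of itos[i], else zeros.
aligned = np.zeros((len(itos), dim))
for i, token in enumerate(itos):
    if token in pretrained_stoi:
        aligned[i] = pretrained_vectors[pretrained_stoi[token]]

print(aligned[2])  # row for "你" -> [2. 2. 2.]
```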
- Load the pretrained values when initializing the embedding layer
There are two main ways to do this:
- PyTorch provides a convenient shortcut, but the whole TEXT field has to be passed in, since its vectors are needed.
self.embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
- Extract the vector values yourself and copy them into the embedding layer's weights
import codecs
import numpy as np

path = "E:/study_series/2020_3_data/data/need_bertembedding"
lines = codecs.open(path, encoding="utf-8")
# Drop the "rows dims" header line and the trailing empty line.
res = [line.replace("\n", "") for line in lines][1:-1]
emb_dim = 768
# Rows that fail to parse below keep this random initialization.
embeddings = np.random.rand(len(res), emb_dim)
for index, line in enumerate(res):
    line_seg = line.split(" ")
    try:
        embeddings[index] = [float(one) for one in line_seg[1:]]
    except ValueError:
        # Malformed line (e.g. a token containing spaces): keep the random row.
        pass
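Once the matrix is assembled, the remaining step in this second approach is copying it into the layer's weight tensor. A minimal sketch, assuming a toy 4×3 matrix in place of the real 768-dim one:

```python
import numpy as np
import torch
import torch.nn as nn

embeddings = np.random.rand(4, 3)  # stand-in for the matrix parsed above

emb_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1])
# Overwrite the randomly initialized weights with the pretrained values.
emb_layer.weight.data.copy_(torch.from_numpy(embeddings))
emb_layer.weight.requires_grad = True  # the equivalent of freeze=False
```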
This completes the embedding import.
References:
https://discuss.pytorch.org/t/aligning-torchtext-vocab-index-to-loaded-embedding-pre-trained-weights/20878/2