The simplest way to get ELMo word vectors

IndexFziQ

Category: Pretrained word vectors. Tags: elmo, word vectors

Article link: https://blog.csdn.net/sinat_34611224/article/details/83147812

Introduction

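The goal of this post is to get the word representations produced by ELMo in the simplest possible way. I read a number of other write-ups, but in the end this is all that was useful to me: I only want the word vectors it generates.
A quick introduction to ELMo: it comes from AllenNLP's paper "Deep contextualized word representations", which won Best Paper at NAACL 2018. Plugging ELMo into existing models improved their results on tasks such as NLI.
So, straight to how to get ELMo vectors. There are TensorFlow, PyTorch, and Keras versions around. This post uses the official ELMo snippet approach: instead of adding ELMo to a model, it returns the word vectors directly as a Tensor. I only want the word vectors, and training the model myself would cost both time and hardware.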

Environment

First, create a new conda environment:

conda create -n allennlp python=3.6

Next, install allennlp (make sure gcc works on your machine, because the build needs a C++ toolchain):

pip install allennlp

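Just stay online while it installs; it pulls in a lot, including the whole PyTorch stack. Note that the `allennlp.commands.elmo` module imported below belongs to the older 0.x releases, so if that import fails on a newer release, pinning an older one (for example `pip install allennlp==0.9.0`) is worth trying.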
Then download the pretrained options file and weights file that allennlp provides.
URL:

Keeping local copies makes them easy to reuse.
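
Before loading anything, it can help to confirm that both downloads are where the code expects them. The following is only a minimal sketch: `missing_files` is a hypothetical helper, not part of allennlp, and the paths are the ones used in the Method snippet of this post, so adjust them to wherever you saved the files.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths (as strings) that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Paths taken from the Method snippet in this post; change them to your own location.
options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

missing = missing_files([options_file, weight_file])
if missing:
    print("Download these first:", missing)
else:
    print("Both ELMo files are in place.")
```

Failing early like this avoids waiting for the library to load only to hit a file-not-found error.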

Method

Here is how to get the word vectors from these two files:

from allennlp.commands.elmo import ElmoEmbedder

options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)

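# Each sentence is a list of tokens; batch_to_embeddings converts them to character ids internally
context_tokens = [['I', 'love', 'you', '.'], ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]
# Returns the embeddings, shape (batch_size, 3, max_len, 1024), and a padding mask, shape (batch_size, max_len)
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)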

print(elmo_embedding)
print(elmo_mask)

Result

Embedding:
tensor([[[[ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-1.1051, -0.4092, -0.4365,  ..., -0.6326,  0.4735, -0.2577],
          [ 0.0899, -0.4828, -0.5596,  ...,  0.4372,  0.3840, -0.7343],
          [-0.5538, -0.1473, -0.2441,  ...,  0.2551,  0.0873,  0.2774],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-3.2634, -0.9448, -0.3199,  ..., -1.2070,  0.6930, -0.2016],
          [-0.3688, -0.7632, -0.0715,  ...,  0.6294,  1.6869, -0.6655],
          [-1.0870, -1.4243, -0.2445,  ...,  0.0825,  0.5020,  0.2765],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]],


        [[[ 0.5042, -0.6629, -0.0231,  ..., -0.3084, -0.9741, -0.7230],
          [ 0.1131,  0.1575,  0.1414,  ...,  0.3718, -0.1432, -0.0248],
          [ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          ...,
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          [-0.8872, -0.2004, -1.0601,  ..., -0.2655,  0.2115,  0.1977]],

         [[ 0.1221, -0.7032,  0.0169,  ..., -0.3249, -0.4935, -0.4965],
          [ 0.3399, -0.4682,  0.1888,  ..., -0.0565,  0.1001, -0.0416],
          [-0.8135, -0.8491, -0.3264,  ..., -0.5674,  0.2638,  0.2006],
          ...,
          [ 0.4460, -0.4475, -0.1583,  ...,  0.4372,  0.3840, -0.7343],
          [-0.1287,  0.0161,  0.0315,  ...,  0.2551,  0.0873,  0.2774],
          [-1.2373, -0.3373,  0.1098,  ..., -0.0276, -0.0181,  0.0602]],

         [[-0.0830, -1.5891, -0.2576,  ..., -1.2944,  0.1082,  0.6745],
          [-0.0724, -0.7200,  0.1463,  ...,  0.6919,  0.9144, -0.1260],
          [-2.3460, -1.1714, -0.7065,  ..., -1.2885,  0.4679,  0.3800],
          ...,
          [ 0.1246, -0.6929,  0.6330,  ...,  0.6294,  1.6869, -0.6655],
          [-0.5757, -1.0845,  0.5794,  ...,  0.0825,  0.5020,  0.2765],
          [-1.2392, -0.6155, -0.9032,  ...,  0.0524, -0.0852,  0.0805]]]])
Mask:  
 tensor([[1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])
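
If you want a single vector per token rather than three layers, one simple option is to average over the layer dimension and then cut off the padded positions using the mask. The sketch below is an illustration only: it uses a random tensor as a stand-in with the same layout as the output above, so it runs without the weights, and averaging is my own choice here (the ELMo paper learns task-specific layer weights instead).

```python
import torch

# Stand-in for the real output: random values laid out as
# (batch_size, layers, max_len, dim) = (2, 3, 8, 1024).
elmo_embedding = torch.randn(2, 3, 8, 1024)
elmo_mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0],
                          [1, 1, 1, 1, 1, 1, 1, 1]])

# Average the three layers to get one 1024-d vector per token position.
token_vectors = elmo_embedding.mean(dim=1)  # shape (2, 8, 1024)

# The number of real tokens per sentence is the row sum of the mask.
lengths = elmo_mask.sum(dim=1).tolist()  # [4, 8]

# Drop the padded positions so each sentence keeps only its own tokens.
sentences = [token_vectors[i, :n] for i, n in enumerate(lengths)]

print(token_vectors.shape)                  # torch.Size([2, 8, 1024])
print([tuple(s.shape) for s in sentences])  # [(4, 1024), (8, 1024)]
```

With the real `elmo_embedding` and `elmo_mask` from `batch_to_embeddings`, the same lines should apply unchanged, since only the shapes matter here.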

Tips

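The output is a word-embedding tensor of shape 2 × 3 × 8 × 1024. Here 2 and 8 come from the input, while 3 and 1024 are fixed by the pretrained model.
2 is the batch_size; 3 is the two biLM layer outputs plus one layer from the CNN that encodes characters; 8 is the length of the longest token list (shorter lists are padded to match); 1024 is the dimension of each layer's output.
The mask has shape 2 × 8: 2 is the batch_size and 8 is the length of the longest token list. The first list has 4 tokens and the second has 8, so those positions are 1 and the padded positions are 0.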
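
The shape and mask described above can be reproduced from the token lists alone. This is a plain-Python illustration, not how allennlp computes them internally; `padding_mask` is a hypothetical helper, and the constants 3 and 1024 are taken from the explanation in this post.

```python
def padding_mask(batch):
    """Build a 0/1 mask with one row per sentence, padded to the longest sentence."""
    max_len = max(len(tokens) for tokens in batch)
    return [[1] * len(tokens) + [0] * (max_len - len(tokens)) for tokens in batch]


context_tokens = [['I', 'love', 'you', '.'],
                  ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]

mask = padding_mask(context_tokens)
for row in mask:
    print(row)
# [1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 1, 1, 1, 1, 1, 1]

# Embedding shape: (batch_size, layers, longest sentence, dimension per layer)
shape = (len(context_tokens), 3, len(mask[0]), 1024)
print(shape)  # (2, 3, 8, 1024)
```

The zeros in the first row line up with the all-zero rows at the end of the first sentence in the embedding printed in the Result section.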