Token Embedding and Positional Embedding in NLP

Example 1

Token embedding is a common way of converting text characters (tokens) into vectors.

import tensorflow as tf

# Load the pre-built Portuguese/English tokenizers exported as a SavedModel.
model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)

sentence = "este é um problema que temos que resolver."
sentence = tf.constant(sentence)
sentence = sentence[tf.newaxis]                          # add a batch dimension
sentence = tokenizers.pt.tokenize(sentence).to_tensor()  # subword IDs as a dense tensor
print(sentence.shape)
print(sentence)

(1, 11)
tf.Tensor([[  2 125  44  85 231  84 130  84 742  16   3]], shape=(1, 11), dtype=int64)

# Tokenizing an empty string yields only the reserved [START] and [END] IDs.
start_end = tokenizers.en.tokenize([''])[0]
print(start_end)
start = start_end[0][tf.newaxis]
print(start)
end = start_end[1][tf.newaxis]
print(end)

tf.Tensor([2 3], shape=(2,), dtype=int64)
tf.Tensor([2], shape=(1,), dtype=int64)
tf.Tensor([3], shape=(1,), dtype=int64)

The word "token" carries the sense of occupying a slot: each ID (and its vector) is occupied by one particular word or subword.
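
To make this occupancy concrete, the IDs above can be mapped back to the subwords that own them. A minimal sketch, assuming the lookup method exported by the ted_hrlr_translate_pt_en_converter SavedModel (the same method Example 2 calls on the English side) also accepts the dense tensor produced above:

# Map each Portuguese token ID back to the subword that occupies it.
# Assumption: tokenizers.pt.lookup accepts the dense (1, 11) int64 tensor from above.
subwords = tokenizers.pt.lookup(sentence)
print(subwords)
# IDs 2 and 3 should come back as the reserved [START] and [END] markers.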

Example 2

Like Example 1, this is a Portuguese-to-English translation example.

import logging
import tensorflow_datasets as tfds
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings
import tensorflow as tf

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# Print the first batch of 3 Portuguese/English sentence pairs.
for pt_examples, en_examples in train_examples.batch(3).take(1):
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  for en in en_examples.numpy():
    print(en.decode('utf-8'))

model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)
encoded = tokenizers.en.tokenize(en_examples)   # text -> ragged tensor of subword IDs

for row in encoded.to_list():
  print(row)

round_trip = tokenizers.en.detokenize(encoded)  # subword IDs -> text
for line in round_trip.numpy():
  print(line.decode('utf-8'))

e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .

tokens = tokenizers.en.lookup(encoded)   # subword IDs -> subword strings
print(tokens)

<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]'], [b'[START]', b'but', b'what', b'if', b'it', b'were', b'active', b'?', b'[END]'], [b'[START]', b'but', b'they', b'did', b'n', b"'", b't', b'test', b'for', b'curiosity', b'.', b'[END]']]>
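
The lookup output shows that IDs 2 and 3 are the reserved [START] and [END] markers. The short sketch below prints the reserved-token list, assuming the converter also exports get_reserved_tokens() as in the official TensorFlow subword-tokenizer tutorial that builds this SavedModel; it also shows that converting the ragged batch to a dense tensor pads shorter rows with ID 0, i.e. [PAD]:

# Assumption: get_reserved_tokens() is exported by this SavedModel, as in the
# official subword-tokenizer tutorial. Expected order: [PAD], [UNK], [START], [END].
print(tokenizers.en.get_reserved_tokens())

# RaggedTensor.to_tensor() pads shorter rows with 0, the [PAD] ID.
padded = encoded.to_tensor()
print(padded.shape)
print(padded[1])   # the second (shortest) sentence, right-padded with zeros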

Example 3

Embedding, literally "to embed", can be understood as embedding low-dimensional information into a higher-dimensional space.

import tensorflow as tf

model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)

d_model = 128
input_vocab_size = tokenizers.pt.get_vocab_size().numpy()

# Map each token ID to a trainable d_model-dimensional vector.
embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)

x = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])
x = embedding(x)

print(input_vocab_size)
print(x.shape)
print(x)

7765
(1, 12, 128)
tf.Tensor(
[[[-0.02317628  0.04599813 -0.0104699  ... -0.03233253 -0.02013252
    0.00171118]
  [-0.02195768  0.0341222   0.00689759 ... -0.00260416  0.02308804
    0.03915772]
  [-0.00282265  0.03714179 -0.03591241 ... -0.03974506 -0.04376533
    0.03113948]
  ...
  [-0.0277048  -0.03750116 -0.03355522 ... -0.00703954 -0.02855991
    0.00357056]
  [-0.0277048  -0.03750116 -0.03355522 ... -0.00703954 -0.02855991
    0.00357056]
  [ 0.04611469  0.04663144  0.02595479 ... -0.03400488 -0.00206001
   -0.03282105]]], shape=(1, 12, 128), dtype=float32)

This example embeds a sequence of 12 token IDs into a higher-dimensional 12×128 representation. Note that the identical rows in the printed output correspond to the repeated padding ID 0: the same ID always maps to the same embedding vector.
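
Because the padding ID 0 carries no content, Transformer implementations normally also build a padding mask from the raw IDs so that attention can ignore those positions. This is not part of the example above, just the standard recipe; a minimal sketch:

import tensorflow as tf

# Standard Transformer padding mask: 1.0 where the ID is 0 ([PAD]), 0.0 elsewhere.
def create_padding_mask(seq):
  mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
  # Extra dimensions so the mask broadcasts over the attention logits:
  # (batch_size, 1, 1, seq_len)
  return mask[:, tf.newaxis, tf.newaxis, :]

ids = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])
print(create_padding_mask(ids))   # ones at the three padded positions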

Example 4

Positional embedding in the Transformer. In practice, the positional encodings are usually precomputed for a fixed number of positions (here 1000) from the model depth d_model, and at run time the table is sliced to the actual input length.

import numpy as np
import tensorflow as tf

d_model = 128
position = 1000   # number of precomputed positions

def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)
  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
  pos_encoding = angle_rads[np.newaxis, ...]
  return tf.cast(pos_encoding, dtype=tf.float32)

x = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30,0, 0, 0, 3]])
seq_len = tf.shape(x)[1]
print(seq_len)
pos_encoding = positional_encoding(position, d_model)
print(pos_encoding.shape)
pe = pos_encoding[:, :seq_len, :]
print(pe.shape)

tf.Tensor(12, shape=(), dtype=int32)
(1, 1000, 128)
(1, 12, 128)
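
In the Transformer encoder the two embeddings of this article are combined: the token embedding output is scaled by sqrt(d_model) and the positional encoding, sliced to the actual length, is added. A minimal self-contained sketch that repeats the ingredients of Examples 3 and 4 (with a randomly initialized embedding and the vocabulary size 7765 from Example 3 used only as a placeholder):

import numpy as np
import tensorflow as tf

d_model = 128
vocab_size = 7765        # vocabulary size printed in Example 3
max_position = 1000

# Token embedding, randomly initialized as in Example 3.
embedding = tf.keras.layers.Embedding(vocab_size, d_model)

# Positional encoding table, same formula as Example 4.
pos = np.arange(max_position)[:, np.newaxis]
i = np.arange(d_model)[np.newaxis, :]
angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
pos_encoding = tf.cast(angle_rads[np.newaxis, ...], tf.float32)   # (1, 1000, 128)

x = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])
seq_len = tf.shape(x)[1]

x = embedding(x)                                  # (1, 12, 128) token embeddings
x *= tf.math.sqrt(tf.cast(d_model, tf.float32))   # scale as in "Attention Is All You Need"
x += pos_encoding[:, :seq_len, :]                 # add positions for the actual length

print(x.shape)    # (1, 12, 128)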
