Learning transformers, part 1: quickstart

https://github.com/huggingface/transformers

1 BERT example

BertTokenizer.from_pretrained: instantiates a `transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer, i.e. it creates the tokenizer instance.

tokenizer.tokenize(text): converts a string into a sequence of tokens (strings), using the tokenizer. It splits into words for word-based vocabularies, or into sub-words for sub-word vocabularies (BPE/SentencePiece/WordPiece). For the uncased BERT model this means lower-casing, WordPiece splitting, and so on.

tokenizer.convert_tokens_to_ids: converts tokens to vocabulary indices.

BertModel.from_pretrained: loads the pre-trained model (weights). The model produces the encodings of all tokens, the pooler output, etc.

BertForMaskedLM: used to predict the masked-out tokens. The token_type_ids argument marks the sentence segmentation: 0 for the first sentence, 1 for the second.
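As a side note (an addition to these notes, not part of the original quickstart), the token_type_ids list built by hand further below can also be produced automatically. A minimal sketch, assuming the installed transformers version provides encode_plus (available since the 2.x releases):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# encode_plus inserts [CLS]/[SEP] and builds token_type_ids in one call
encoded = tokenizer.encode_plus(
    "Who was Jim Henson ?",          # sentence A
    "Jim Henson was a puppeteer",    # sentence B
    add_special_tokens=True,
    return_tensors='pt',
)
# encoded['input_ids']      -> token ids including [CLS] and [SEP]
# encoded['token_type_ids'] -> 0 for sentence A positions, 1 for sentence B positions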

import warnings
warnings.filterwarnings("ignore")
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

'''
Let’s start by preparing a tokenized input (a list of token embeddings indices to be fed to Bert) 
from a text string using BertTokenizer
'''
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
# Instantiate a PreTrainedTokenizer (or a derived class) from a predefined tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # the downloaded files are cached locally

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"

# Lower-cases the text and applies WordPiece splitting, etc.
tokenized_text = tokenizer.tokenize(text)
# print(type(tokenized_text))  # <class 'list'>
# print(tokenized_text)  # ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# print(indexed_tokens)  # [101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 103, 2001, 1037, 13997, 11510, 102]

# Define sentence A and B indices associated with the 1st and 2nd sentences (see the paper)
# 0 marks the first sentence, 1 marks the second sentence
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
# print(tokens_tensor.shape)  # torch.Size([1, 14])
segments_tensors = torch.tensor([segments_ids])
# print(segments_tensors.shape)  # torch.Size([1, 14])


# Let’s see how we can use BertModel to encode our inputs in hidden-states:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # print(type(outputs))  # <class 'tuple'>
    # print(len(outputs))  # 2: one element is the encoding of every token, the other is the pooler output used by the "next sentence prediction" head (Tanh activation)
    # print(outputs[0].shape)  # torch.Size([1, 14, 768])
    # print(outputs[1].shape)  # torch.Size([1, 768])

    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]  # torch.Size([1, 14, 768])
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)


# And how to use BertForMaskedLM to predict a masked token:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    print(type(outputs))  # <class 'tuple'>
    print(len(outputs))  # 1
    print(outputs[0].shape)  # torch.Size([1, 14, 30522])
    predictions = outputs[0]

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)  # henson
assert predicted_token == 'henson'
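Beyond the single argmax, it can be instructive to look at the highest-scoring candidates for the masked position. A small sketch (an addition to the original code, reusing predictions, masked_index and tokenizer from above):

# Top-k candidates for the [MASK] position (k=5 is an arbitrary choice)
probs = torch.softmax(predictions[0, masked_index], dim=-1)
top_probs, top_ids = torch.topk(probs, k=5)
for p, idx in zip(top_probs.tolist(), top_ids.tolist()):
    print(tokenizer.convert_ids_to_tokens([idx])[0], round(p, 4))
# 'henson' should appear at the top of this list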

2 OpenAI GPT-2

tokenizer.encode: converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.

GPT2LMHeadModel.from_pretrained: loads the pre-trained model (weights).

tokenizer.decode: converts a sequence of ids (integers) back into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
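A small aside (an addition to these notes): encode is roughly tokenize followed by convert_tokens_to_ids. GPT-2 uses a byte-level BPE vocabulary in which a leading space is marked with the 'Ġ' character, so the pieces look slightly unusual. A minimal sketch (the example pieces are illustrative):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

pieces = tokenizer.tokenize("Who was Jim Henson ?")
# e.g. ['Who', 'Ġwas', 'ĠJim', 'ĠH', 'enson', 'Ġ?']  -- 'Ġ' marks a preceding space
ids = tokenizer.convert_tokens_to_ids(pieces)

# GPT-2 adds no special tokens by default, so this should match encode()
assert ids == tokenizer.encode("Who was Jim Henson ?")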

import warnings
warnings.filterwarnings("ignore")
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "Who was Jim Henson ? Jim Henson was a"

# Converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
indexed_tokens = tokenizer.encode(text)
# print(indexed_tokens)  # [8241, 373, 5395, 367, 19069, 5633, 5395, 367, 19069, 373, 257]

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
# print(tokens_tensor.shape)  # torch.Size([1, 11])

# use GPT2LMHeadModel to generate the next token following our text:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)  # <class 'tuple'> of length 2
    print(outputs[0].shape)  # torch.Size([1, 11, 50257])
    print(type(outputs[1]))  # <class 'tuple'>

    print(outputs[1][0].shape)  # outputs[1] is a tuple of length 12 (one per layer); each element has shape torch.Size([2, 1, 12, 11, 64])
    # i.e. (2, batch_size, num_heads, sequence_length, embed_size_per_head)
    # Contains pre-computed hidden-states (key and values in the attention blocks).
    # Can be used to speed up sequential decoding.

    predictions = outputs[0]  # torch.Size([1, 11, 50257])

# get the predicted next sub-word (in our case, the word 'man')
predicted_index = torch.argmax(predictions[0, -1, :]).item()  # 582

'''
decode
    Converts a sequence of ids (integers) back into a string, using the tokenizer and vocabulary,
    with options to remove special tokens and clean up tokenization spaces.
'''
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'

3 Using the past

GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a past or mems attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.

This works like the Transformer decoder: each newly generated token acts as the query Q, while past supplies the cached keys K and values V.

import warnings
warnings.filterwarnings("ignore")

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')

generated = tokenizer.encode("The Manhattan bridge")  # [464, 13458, 7696]
context = torch.tensor([generated])
past = None

for i in range(100):
    print(i)
    output, past = model(context, past=past)
    # print(output.shape)  # torch.Size([1, 3, 50257])
    # print(type(past))  # <class 'tuple'>
    print(past[0].shape)  # torch.Size([2, 1, 12, 3, 64]) on the first step, growing to torch.Size([2, 1, 12, 102, 64]) on the last
    token = torch.argmax(output[..., -1, :])  # tensor(318)

    generated += [token.tolist()]
    context = token.unsqueeze(0).unsqueeze(0)   # shape is torch.Size([1, 1])

sequence = tokenizer.decode(generated)

print(sequence)
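To make the speed-up claim concrete, here is a rough sketch (an addition to these notes, reusing the same model and tokenizer and the same tuple-returning API with the past argument as in the loop above) that runs the same greedy loop with and without past. Timings are machine-dependent; since decoding is greedy and dropout is off, both runs should produce the same ids:

import time

model.eval()  # from_pretrained already sets eval mode, but be explicit: dropout off -> deterministic

def generate_greedy(prompt, steps=20, use_past=True):
    ids = tokenizer.encode(prompt)
    context = torch.tensor([ids])
    past = None
    with torch.no_grad():
        for _ in range(steps):
            if use_past:
                # full prompt on the first step, then only the newest token
                output, past = model(context, past=past)
            else:
                # re-encode the whole prefix at every step
                output, _ = model(torch.tensor([ids]))
            token = torch.argmax(output[..., -1, :])
            ids.append(token.item())
            context = token.unsqueeze(0).unsqueeze(0)
    return ids

start = time.time()
with_past = generate_greedy("The Manhattan bridge", use_past=True)
print("with past:    %.2fs" % (time.time() - start))

start = time.time()
without_past = generate_greedy("The Manhattan bridge", use_past=False)
print("without past: %.2fs" % (time.time() - start))

print(with_past == without_past)  # expected True: greedy decoding yields the same tokens either way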

 
