Seq2Seq -- Attention -- Transformer


1. Introduction

Transformer is a model proposed by Google in the 2017 paper Attention Is All You Need. It handles sequence problems such as machine translation very well. Before it, machine translation generally used a CNN or RNN as the basis of an encoder-decoder model, for example the RNN-based Seq2Seq model.

The development of machine translation models proceeded roughly as follows [3]:

  • RNNs work well on sequence problems, and the earliest approach was an RNN-based Seq2Seq model for machine translation; however, its sequential recurrence makes training very slow.
  • Facebook replaced the RNN in Seq2Seq with a CNN, stacking multiple CNN layers to enlarge the context; this set new records on two translation tasks and greatly sped up training.
  • Transformer is built entirely on the Attention mechanism, with no CNN or RNN structure; it is highly parallelizable, trains quickly, and achieves high accuracy.

This article briefly introduces the Seq2Seq and Attention models, then uses them to lead into and focus on the Transformer model, in order to deepen my understanding of each.

2. The Seq2Seq Model

Seq2Seq, short for Sequence-to-Sequence, maps an input sequence to an output sequence.

A typical Seq2Seq model consists of two parts: an Encoder and a Decoder.

The Encoder is an RNN/LSTM model that encodes the input sentence into a context vector, i.e.

$$C = F(x_1, x_2, \dots, x_m) \tag{1}$$
The Decoder is the inverse process of the Encoder: each output state is determined by the previous states and the context vector, i.e.

$$y_i = G(C, y_1, y_2, \dots, y_{i-1}) \tag{2}$$
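As a rough illustration of equations (1) and (2), here is a minimal GRU-based encoder-decoder sketch in PyTorch; the module names, layer sizes, and the choice of GRU are illustrative assumptions rather than the exact model from the literature:

```python
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    """Encodes x_1..x_m into a single context vector C (equation 1)."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        _, h_n = self.rnn(self.embedding(src))     # h_n: (1, batch, hidden)
        return h_n                                 # the context vector C

class Seq2SeqDecoder(nn.Module):
    """Generates y_i from C and the previously generated tokens (equation 2)."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):         # prev_token: (batch, 1)
        output, hidden = self.rnn(self.embedding(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden  # logits over vocab, new state
```

In practice the decoder is unrolled one step at a time, starting from the context vector C as its initial hidden state and feeding in the previously generated token (or the ground-truth token when using teacher forcing during training).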
In this model the entire input is compressed into a single vector, which leads to two problems:

  • It cannot fully express the sequence information
  • When the sentence is long, information is easily lost

3. The Attention Model

3.1 Overview

The Attention model, proposed in 2015, uses multiple context vectors and effectively solves the difficulty the Seq2Seq model has with long sentences.

The Attention Mechanism originates from research on human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on part of the available information while ignoring the rest; this behaviour is what we call attention.

In machine translation, the attention mechanism measures how strongly an output word is related to each input word; more strongly related input words receive larger weights, so each output word can focus on the semantics it corresponds to. For example, when translating "I eat an apple", the output word corresponding to "eat" should attend mainly to "eat", i.e. "eat" should receive a higher weight than the other input words.

3.2 Model Architecture

The earliest Attention model [5] has the following architecture:

  • Input: the sentence to be translated

  • Encoder: a bidirectional RNN or LSTM that computes a hidden state for each input position, denoted simply $h_i$ below

  • Decoder: for the current output position $i$, combine the previous hidden state $s_{i-1}$ with the Encoder outputs as follows (a short code sketch of these two steps follows the list):

    • Measure how well output position $i$ matches input position $j$, where $a$ can be a dot product or another scoring function:

      $$e_{ij} = a(s_{i-1}, h_j) \tag{3}$$

    • Normalize the scores with a softmax to obtain the attention weights:

      $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})} \tag{4}$$
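To make equations (3) and (4) concrete, here is a minimal sketch of one decoder step of additive (Bahdanau-style) attention; the class name and layer sizes are my own illustrative choices, and the final weighted sum of the encoder states is the per-step context vector used in [5]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Computes e_ij = a(s_{i-1}, h_j) and alpha_ij = softmax_j(e_ij)."""
    def __init__(self, dec_hidden, enc_hidden, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_hidden, attn_dim)   # projects s_{i-1}
        self.W_h = nn.Linear(enc_hidden, attn_dim)   # projects h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)  # reduces each score to a scalar

    def forward(self, s_prev, enc_states):
        # s_prev:     (batch, dec_hidden)          previous decoder state s_{i-1}
        # enc_states: (batch, src_len, enc_hidden) encoder hidden states h_1..h_m
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)
        )).squeeze(-1)                               # e_ij, equation (3)
        alpha = F.softmax(scores, dim=-1)            # alpha_ij, equation (4)
        # Weighted sum of the encoder states: the context vector for this step
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        return context, alpha
```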

4. Example Code: Self-Attention and Transformer in PyTorch

Below is example PyTorch code implementing Self-Attention and a Transformer encoder-decoder.

4.1 Self-Attention

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"

        # Per-head linear projections for values, keys and queries
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, queries, mask):
        # Get number of training examples
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Split embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # Project values, keys and queries per head
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Transpose to get dimensions batch_size * self.heads * seq_len * self.head_dim
        values = values.permute(0, 2, 1, 3)
        keys = keys.permute(0, 2, 1, 3)
        queries = queries.permute(0, 2, 1, 3)

        # Calculate energy (raw attention scores) between every query and key position
        energy = torch.matmul(queries, keys.permute(0, 1, 3, 2))
        if mask is not None:
            # Positions where mask == 0 are blocked out
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k) and apply softmax to get attention weights
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=-1)

        # Weighted sum of the values
        out = torch.matmul(attention, values)

        # Concatenate the heads and apply the output projection
        out = out.permute(0, 2, 1, 3).reshape(N, query_len, self.heads * self.head_dim)
        out = self.fc_out(out)
        return out
```

4.2 Transformer

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        # batch_first=True so inputs are (batch, seq_len, embed_size)
        self.attention = nn.MultiheadAttention(embed_dim=embed_size, num_heads=heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        # Multi-head attention, then residual connection and layer norm
        # (mask follows nn.MultiheadAttention's attn_mask convention)
        attention_output, _ = self.attention(query, key, value, attn_mask=mask)
        x = self.dropout(self.norm1(attention_output + query))
        # Position-wise feed-forward network, again with residual + norm
        forward_output = self.feed_forward(x)
        out = self.dropout(self.norm2(forward_output + x))
        return out


class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device,
                 forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)  # learned positional embeddings
        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout, forward_expansion)
            for _ in range(num_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            # Self-attention: queries, keys and values all come from the encoder input
            out = layer(out, out, out, mask)
        return out


class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.norm = nn.LayerNorm(embed_size)
        self.attention = nn.MultiheadAttention(embed_dim=embed_size, num_heads=heads, batch_first=True)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Masked self-attention over the target sequence
        attention_output, _ = self.attention(x, x, x, attn_mask=trg_mask)
        query = self.dropout(self.norm(attention_output + x))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        out = self.transformer_block(value, key, query, src_mask)
        return out


class Decoder(nn.Module):
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads,
                 forward_expansion, dropout, device, max_length):
        super(Decoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList([
            DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
            for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)
        out = self.fc_out(x)
        return out
```

This code can be used to implement Transformer and Self-Attention models, but it is only an example; you will need to adjust the hyperparameters and structure for your own data and task.
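For reference, here is a small usage sketch of the Encoder and Decoder defined above; the vocabulary sizes, hyperparameters, and mask construction below are illustrative assumptions only (batch_first=True requires PyTorch 1.9 or later):

```python
import torch

# Illustrative hyperparameters (assumptions, not from the original post)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src_vocab_size, trg_vocab_size = 1000, 1000
embed_size, num_layers, heads = 256, 2, 8
forward_expansion, dropout, max_length = 4, 0.1, 100

encoder = Encoder(src_vocab_size, embed_size, num_layers, heads, device,
                  forward_expansion, dropout, max_length).to(device)
decoder = Decoder(trg_vocab_size, embed_size, num_layers, heads,
                  forward_expansion, dropout, device, max_length).to(device)

src = torch.randint(0, src_vocab_size, (2, 10)).to(device)   # (batch, src_len)
trg = torch.randint(0, trg_vocab_size, (2, 7)).to(device)    # (batch, trg_len)

# Causal mask for decoder self-attention: True marks positions that may not be attended
trg_mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1).to(device)

enc_out = encoder(src, mask=None)                  # (2, 10, embed_size)
logits = decoder(trg, enc_out, None, trg_mask)     # (2, 7, trg_vocab_size)
print(logits.shape)
```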
