Transformer is a show

The Transformer model is powerful but abstract, so here I try to make it colorful by interpreting it with a metaphor: Encoder = Stage Show, Decoder = Oscar Award Assessment.

In this new perspective, we see the Encoder process as performing a stage show, and the Decoder process as the Oscar Award Committee assessing the show.


Below is the diagram and its interpretation:

[Figure: Transformer architecture diagram]

Part 1: Encoder = Stage Show

Phases:

P1. Ideas -> Conceiving_Story -> Script

  • The performance is based on the script, the script is inspired by the story, and the story starts with ideas.
  • Likewise, at the beginning of the Transformer we have a raw sequence, which we preprocess to fit the shape the model expects (a minimal sketch follows below).
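As a rough sketch in TensorFlow/Keras (the sizes and helper names here are hypothetical), P1's preprocessing chain of Embedding -> Scale -> Pos_encoding -> Dropout might look like this, using the sinusoidal positional encoding from the original paper:

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), cos for odd dimensions.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return tf.constant(pe[None, ...], dtype=tf.float32)  # (1, max_len, d_model)

d_model, vocab_size, max_len = 128, 8000, 512  # hypothetical sizes
embedding = tf.keras.layers.Embedding(vocab_size, d_model)
dropout = tf.keras.layers.Dropout(0.1)

def preprocess(input_x, training=False):
    seq_len = tf.shape(input_x)[1]
    x = embedding(input_x)                                   # Embedding
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))          # Scale
    x += positional_encoding(max_len, d_model)[:, :seq_len]  # Pos_encoding
    return dropout(x, training=training)                     # Dropout -> x
```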

P2. Script Interpretation -> Rehearsal -> Adjustment

  • Script Interpretation: x is the performing script, which can be interpreted from 3 different aspects:

    • Q = plots, K = roles, V = characteristics
    • A script consists of a series of plots; each plot is performed by different roles, and each role has its own characteristics.
    • In other words, different roles perform different characteristics according to the plot, just as different keys match different queries.
    • Further, the same role may perform a different characteristic in a different plot, just as the same word may have a different meaning at a different position in the sequence.
    • If Q=K=V=x, i.e. plots=roles=characteristics=script, it is a solo performance, and the actor can perform according to his own will to express the theme of the story. In the Keras MultiHeadAttention API this is called "self-attention" (see the sketch after this list).
  • Rehearsal: Multi-Head Attention is interpreted as the rehearsal, since it processes the script interpretation (Q, K, V), just as actors rehearse after understanding the script.

  • Adjustment: Dropout -> ResAdd -> LayerNorm

    • Dropout: randomly cut some plots of the story, to avoid depending too heavily on certain actors' personal performances or betting mainly on the climax
    • ResAdd: keep the connection between plots, rather than isolating them
    • LayerNorm: normalize the actors' performances, in case actors bring in too many personal quirks
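As a concrete note on the "solo performance" case above, here is a minimal self-attention call with the Keras MultiHeadAttention layer; the shapes are made up for illustration:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
x = tf.random.normal((2, 10, 128))  # (batch, seq_len, d_model): the "script"

# Q = K = V = x: the actor rehearses alone with his own script (self-attention).
out = mha(query=x, value=x, key=x)
print(out.shape)  # (2, 10, 128): same shape as the query
```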

P3. Crew Discussion -> Adjustment

  • Crew Discussion: the FullyConnected layer is interpreted as a crew discussion that abstracts the performance's features and flaws.
  • Adjustment: same as in P2 above.

P4. Repeat P2 - P3 to practice N rounds

  • After these repeated practice rounds, the real show (enc_output) goes live on stage.

P5. Real Show Performing

  • The assessment members will watch and record this stage show, just as enc_output is passed to the Decoder.

Pseudocode:

  • P1: input_x -> preprocess(Embedding -> Scale -> Pos_encoding -> Dropout) -> x

  • P2: Query=x, Value=x, Key=x, enc_padding_mask -> Multi_Head_Attention -> adjust(Dropout -> ResAdd -> LayerNorm) -> out1

  • P3: out1 -> FC -> adjust(Dropout -> ResAdd -> LayerNorm) -> out2

  • P4: loop(P2 - P3) -> update(out2)

  • P5: enc_output = out2
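Putting P2 and P3 together, a minimal sketch of one encoder layer could look as follows (assuming TensorFlow/Keras; the hyperparameters are hypothetical). The Dropout -> ResAdd -> LayerNorm chain is the "adjustment" described above:

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model=128, num_heads=8, dff=512, rate=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.fc = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),  # Crew Discussion
            tf.keras.layers.Dense(d_model),
        ])
        self.drop1 = tf.keras.layers.Dropout(rate)
        self.drop2 = tf.keras.layers.Dropout(rate)
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, enc_padding_mask=None, training=False):
        # P2: Rehearsal (self-attention), then adjust.
        attn = self.mha(query=x, value=x, key=x, attention_mask=enc_padding_mask)
        out1 = self.norm1(x + self.drop1(attn, training=training))  # Dropout -> ResAdd -> LayerNorm
        # P3: Crew Discussion (FC), then adjust.
        out2 = self.norm2(out1 + self.drop2(self.fc(out1), training=training))
        return out2
```

Stacking this layer N times (P4) and taking the final out2 gives enc_output (P5).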


Part 2: Decoder = Oscar Award Assessment

Phases:

P1. Rumors -> Dig into the story

  • At the very beginning, the show is not yet on, but rumors have already spread around, attracting people's attention and preparing for the premiere.
  • When the assessment members hear these rumors, they start to dig into the story by reading introductions, comments, etc.
  • Likewise, at the beginning of the Decoder we have no real target input, only a sign of it (the SOS token), which we then preprocess.

P2. Ask Questions -> Adjustment

  • Ask Questions:
    • The first Multi-Head Attention outputs a Query, which is like people thinking up and asking questions about what they have heard and read.
    • So, in code, the input of the first Multi-Head Attention (MHA) is: target, target, target. When curiosity is first triggered, all you are thinking is: more, more, more on the topic.
  • Adjustment: see below P3.

P3. Watch the Show & Answer the Questions -> Adjustment

  • Watch the Show & Answer the Questions:
    • Accordingly, the second MHA answers the questions asked in the first MHA by watching the show.
    • So, in code, the Query from the first MHA and the enc_output from the Encoder are passed to the second MHA as input. Since both the Key and the Value information come from the show, we set Key=enc_output, Value=enc_output (see the sketch after this list).
  • Adjustment: adjustment in the assessment is interpreted differently from adjustment in the performance
    • Dropout: randomly delete some opinions of the jury, to guard against manipulation by authority
    • ResAdd: evaluate the show comprehensively together with previous assessments, rather than viewing it in isolation
    • LayerNorm: normalize the assessment to a standard criterion
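A minimal sketch of this question-and-answer wiring with the Keras MultiHeadAttention layer (shapes and values are made up for illustration):

```python
import tensorflow as tf

mha1 = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
mha2 = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

target = tf.random.normal((2, 12, 128))      # preprocessed target (the "rumors")
enc_output = tf.random.normal((2, 10, 128))  # the recorded show, from the Encoder

# P2: Ask Questions -- all inputs are the target itself (plus a look_ahead_mask in practice).
questions = mha1(query=target, value=target, key=target)
# P3: Watch the Show & Answer -- Key and Value both come from the show.
answers = mha2(query=questions, value=enc_output, key=enc_output)
```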

P4. Assessment Discussion -> Adjustment

  • Assessment Discussion: this FullyConnected layer is interpreted as a discussion about the evaluation and the criteria.
  • Adjustment: same as in P3 above.

P5. Repeat P2 - P4 to assess N rounds

  • Assess several rounds, ensuring the show is well and fairly understood and evaluated.

P6. Voting -> Oscar Awards Ceremony

  • Softmax is like voting, disclosing the final winner of the Oscar Award (a toy example follows below).
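As a toy illustration of the voting (the scores are made up): softmax turns raw scores into a probability distribution, and the highest probability reveals the winner:

```python
import tensorflow as tf

votes = tf.constant([[2.0, 0.5, 1.0]])  # raw scores for three "nominees"
probs = tf.nn.softmax(votes)            # ~[0.63, 0.14, 0.23]
winner = tf.argmax(probs, axis=-1)      # nominee 0 takes the award
```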

Pseudocode:

  • P1: SOS -> preprocess(Embedding -> Pos_encoding) -> target

  • P2: Query=target, Value=target, Key=target, look_ahead_mask -> Multi-Head Attention -> adjust(Dropout -> ResAdd -> LayerNorm) -> out1

  • P3: Query=out1, Value=enc_output, Key=enc_output, dec_padding_mask -> Multi-Head Attention -> adjust(Dropout -> ResAdd -> LayerNorm) -> out2

  • P4: out2 -> FC -> adjust(Dropout -> ResAdd -> LayerNorm) -> out3

  • P5: loop(P2 - P4) -> update(out3) -> dec_output = out3

  • P6: dec_output -> Dense('softmax') -> ŷ
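Mirroring the encoder sketch, a minimal decoder layer plus the final voting head might look like this (assuming TensorFlow/Keras; hyperparameters are hypothetical):

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model=128, num_heads=8, dff=512, rate=0.1):
        super().__init__()
        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.fc = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),  # Assessment Discussion
            tf.keras.layers.Dense(d_model),
        ])
        self.drops = [tf.keras.layers.Dropout(rate) for _ in range(3)]
        self.norms = [tf.keras.layers.LayerNormalization(epsilon=1e-6) for _ in range(3)]

    def call(self, target, enc_output, look_ahead_mask=None,
             dec_padding_mask=None, training=False):
        # P2: Ask Questions (masked self-attention), then adjust.
        q = self.mha1(target, target, target, attention_mask=look_ahead_mask)
        out1 = self.norms[0](target + self.drops[0](q, training=training))
        # P3: Watch the Show & Answer (attend to enc_output), then adjust.
        a = self.mha2(out1, enc_output, enc_output, attention_mask=dec_padding_mask)
        out2 = self.norms[1](out1 + self.drops[1](a, training=training))
        # P4: Assessment Discussion (FC), then adjust.
        out3 = self.norms[2](out2 + self.drops[2](self.fc(out2), training=training))
        return out3

# P6: Voting -- project dec_output to the vocabulary; softmax discloses the winner.
final_layer = tf.keras.layers.Dense(8000, activation='softmax')  # hypothetical vocab_size
```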


Part 3: Summary

What makes the Transformer special compared to other models is just like what makes a stage show different from a film:

For a stage show, all acts can be performed together at the same time, as long as the imagination and the stage are big enough, whereas a film is a fixed time sequence that can only play one scene at a time. Likewise, the Transformer attends to all positions of a sequence in parallel, while a recurrent model must step through it one token at a time.

Transformer is like a show, attention is all you need.
