1. CNN
CNN by itself cannot handle scaling and rotation (→ data augmentation)
→ spatial transformer layer
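A minimal sketch of the spatial-transformer idea, assuming PyTorch (`F.affine_grid` / `F.grid_sample`); the tiny localisation network below is a hypothetical toy, not the architecture from the paper:

```python
# Sketch: a localisation net predicts an affine transform, which is applied
# to the input so the downstream CNN sees a more canonical view.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        # Hypothetical localisation network: predicts the 6 affine parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),
        )
        # Initialise to the identity transform so training starts "doing nothing".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)   # (N, 2, 3) affine matrices
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

y = SpatialTransformer(3)(torch.randn(1, 3, 28, 28))  # same shape out as in
```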
2. Self-Attention
Self-attention layers can be stacked:
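A minimal sketch of stacking, using PyTorch's built-in `nn.MultiheadAttention`; the exact block structure here (attention followed by a fully connected layer) is an illustrative assumption:

```python
# Sketch: two self-attention blocks stacked back to back.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model: int = 64, num_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        out, _ = self.attn(x, x, x)        # self-attention: q = k = v = x
        return torch.relu(self.fc(out))

model = nn.Sequential(AttentionBlock(), AttentionBlock())  # stacked blocks
x = torch.randn(2, 10, 64)
print(model(x).shape)                      # torch.Size([2, 10, 64])
```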
2.1. How self-attention works
Compute the relevance α between two vectors:
Softmax is not mandatory here; other activation functions work too, e.g. ReLU.
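A minimal sketch of the dot-product score, assuming PyTorch; `W_q`, `W_k` and the random inputs are hypothetical placeholders:

```python
# Sketch: α between one query vector and every input vector.
import torch

d_in, d_k = 4, 3
W_q = torch.randn(d_in, d_k)          # query projection (random placeholder)
W_k = torch.randn(d_in, d_k)          # key projection

a = torch.randn(4, d_in)              # four input vectors a_1 ... a_4
q1 = a[0] @ W_q                       # query from a_1
ks = a @ W_k                          # keys for every vector
scores = ks @ q1                      # α_{1,i} = q_1 · k_i (dot-product score)

alpha = torch.softmax(scores, dim=0)  # conventional normalisation
alpha_relu = torch.relu(scores)       # ReLU also works, as noted above
```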
2.2. Matrix perspective
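Stacking the n input vectors row-wise into a matrix X, the whole layer is four matrix products: Q = X W^q, K = X W^k, V = X W^v, then A' = softmax(Q Kᵀ) and O = A' V. A minimal sketch of exactly those products, assuming PyTorch with random placeholder weights:

```python
# Sketch: the entire self-attention layer as matrix multiplications.
import torch

n, d_in, d_k, d_v = 4, 6, 3, 3
X = torch.randn(n, d_in)              # n input vectors stacked into a matrix
W_q, W_k = torch.randn(d_in, d_k), torch.randn(d_in, d_k)
W_v = torch.randn(d_in, d_v)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = torch.softmax(Q @ K.T, dim=-1)    # attention matrix, one row per query
O = A @ V                             # output vectors, one row per input
```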
2.3. Multi-head Self-attention
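A minimal sketch, assuming PyTorch: split the projections into h heads, attend within each head independently, then concatenate and apply an output projection W^o (all weights below are random placeholders):

```python
# Sketch: multi-head self-attention with h = 2 heads.
import torch

n, d_model, h = 4, 8, 2
d_head = d_model // h
X = torch.randn(n, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_o = torch.randn(d_model, d_model)   # output projection (placeholder)

def split_heads(M):                   # (n, d_model) -> (h, n, d_head)
    return M.view(n, h, d_head).transpose(0, 1)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
A = torch.softmax(Q @ K.transpose(-2, -1), dim=-1)   # one attention matrix per head
heads = A @ V                                        # (h, n, d_head)
O = heads.transpose(0, 1).reshape(n, d_model) @ W_o  # concat heads, then project
```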
2.4. Positional Encoding (?)
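Since self-attention itself has no notion of position, a positional vector is added to each input. A minimal sketch of one common hand-crafted choice, the sinusoidal encoding from "Attention Is All You Need" (the function and sizes below are illustrative):

```python
# Sketch: sinusoidal positional encoding, added element-wise to the inputs.
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dims
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

x = torch.randn(10, 64)              # ten input vectors
x = x + sinusoidal_pe(10, 64)        # each position gets a unique offset
```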
2.5. Self-Attention for Images
CNN is a subset of self-attention; self-attention is the more flexible model:
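A minimal sketch, assuming PyTorch, of why an image fits self-attention at all: take each pixel's channel values as one vector, so the image becomes a set of vectors:

```python
# Sketch: reshape an image into a sequence of per-pixel vectors.
import torch

img = torch.randn(3, 5, 10)          # (channels, height, width)
seq = img.flatten(1).T               # (H*W, C): 50 vectors of dimension 3
print(seq.shape)                     # torch.Size([50, 3])
# Each pixel can now attend over the whole image; a CNN restricts each pixel
# to a fixed receptive field, which is why CNN is the constrained special case.
```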