Continuing from the previous post.
P24, P25
The MAE Encoder
Our encoder is a ViT but applied only on visible, unmasked patches.
Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
However, our encoder only operates on a small subset (e.g., 25%) of the full set.
Masked patches are removed; no mask tokens are used.
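To make the "removed, not replaced" step concrete, here is a minimal PyTorch sketch of per-sample random masking under the paper's 25%-visible setting. The function name `random_masking` and its exact implementation are my illustration, not the official code:

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Drop a random 75% of patch tokens per sample; keep the rest.

    x: [N, L, D] patch embeddings. Returns the visible tokens, a binary
    mask in the original patch order (0 = visible, 1 = masked), and the
    indices needed to restore the original order later.
    """
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L, device=x.device)        # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    # Keep the first len_keep patches of the permutation; the others are
    # simply removed -- no mask tokens enter the encoder.
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    # Record which positions were masked, back in the original order,
    # so the loss can later be computed on masked patches only.
    mask = torch.ones(N, L, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return x_visible, mask, ids_restore
```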
This allows us to train very large encoders with only a fraction of compute and memory.
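As a rough picture of the whole encoder, here is a sketch reusing `random_masking` from above. The class name `MAEEncoder`, the hyperparameters, and the use of `nn.TransformerEncoderLayer` in place of real ViT blocks are all my simplifications:

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    """Sketch of the encoder: a plain ViT run only on visible patches."""
    def __init__(self, num_patches=196, patch_dim=768, dim=1024, depth=24, heads=16):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)  # linear patch embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Stand-in for the paper's ViT blocks.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, patches, mask_ratio=0.75):
        # patches: [N, num_patches, patch_dim] flattened image patches
        x = self.proj(patches) + self.pos_embed             # embed + positions
        x, mask, ids_restore = random_masking(x, mask_ratio)
        for blk in self.blocks:                             # Transformer blocks see
            x = blk(x)                                      # only ~25% of the tokens
        return x, mask, ids_restore
```

Because the heavy Transformer stack runs on only about a quarter of the tokens, the compute and memory savings fall out of the code directly.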
The full set is handled by a lightweight decoder, described next.
The MAE Decoder
The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens.
Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.
We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.
The decoder has another series of Transformer blocks.
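Putting the decoder description together, a minimal sketch in the same style, reusing the imports and the `ids_restore` convention from the encoder sketches. `MAEDecoder` and its sizes are illustrative, deliberately narrower and shallower than the encoder:

```python
class MAEDecoder(nn.Module):
    """Sketch of the lightweight decoder over the full token set."""
    def __init__(self, num_patches=196, enc_dim=1024, dim=512, depth=8, heads=16):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)  # encoder width -> decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # one shared, learned vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, x_visible, ids_restore):
        x = self.embed(x_visible)                  # [N, len_keep, dim]
        N, len_keep, D = x.shape
        L = ids_restore.shape[1]
        # One copy of the shared mask token per missing patch, then
        # unshuffle so every token returns to its original position.
        mask_tokens = self.mask_token.expand(N, L - len_keep, -1)
        x = torch.cat([x, mask_tokens], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, D))
        # Positional embeddings go on *all* tokens: this is what tells a
        # mask token where it sits in the image.
        x = x + self.pos_embed
        for blk in self.blocks:                    # the decoder's own blocks
            x = blk(x)
        return x
```

A prediction head mapping these tokens back to pixel patches would follow, but that belongs to the reconstruction target, beyond this section.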
To be continued in the next post...