An Image Is Worth 16x16 Words (ViT)
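1. Network
1.1 Embedding layer
Each image is split into N patches, and each patch is flattened into a 1-D vector.
A linear layer projects each flattened patch up to D dimensions; this layer is called the patch embeddings.
A cls token is added at the start of the sequence, and pos_emb is added to every embedding.
E is the linear projection.
1.2 The rest
MSA is multiheaded self-attention.
LN is layer norm.
The MLP's activation function is GELU.
Every layer has a residual connection.
1.3 wo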
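The embedding steps in 1.1 and the block in 1.2 can be sketched in NumPy as below. This is only a minimal illustration under stated assumptions, not the reference implementation: the function names (`patchify`, `msa`, `encoder_block`), the `params` dictionary, the toy sizes and the random weights are all made up for the example. Biases and the learnable scale/shift of layer norm are left out, GELU uses the common tanh approximation, and LN is applied before MSA and before the MLP, with the residual added afterwards, as described in the ViT paper.

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = (H/P)*(W/P) flattened patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    x = img.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)          # (N, P*P*C)

def layer_norm(x, eps=1e-6):
    # Simplification: no learnable scale or shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wqkv, Wo, heads):
    """Multiheaded self-attention over a (T, D) token sequence."""
    T, D = x.shape
    d = D // heads
    qkv = (x @ Wqkv).reshape(T, 3, heads, d).transpose(1, 2, 0, 3)
    q, k, v = qkv                                            # each (heads, T, d)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))    # (heads, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(T, D)        # merge heads
    return out @ Wo

def encoder_block(x, p, heads):
    x = x + msa(layer_norm(x), p["Wqkv"], p["Wo"], heads)    # LN -> MSA -> residual
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]          # LN -> MLP(GELU) -> residual
    return x

# Toy sizes (assumptions): 32x32 RGB image, 16x16 patches, D = 8, 2 heads.
rng = np.random.default_rng(0)
H = W = 32
C, P, D, heads = 3, 16, 8, 2

img = rng.normal(size=(H, W, C))
patches = patchify(img, P)                                   # (N, P*P*C) with N = 4
E = rng.normal(size=(P * P * C, D)) * 0.02                   # linear projection E
cls = np.zeros((1, D))                                       # cls token
pos_emb = rng.normal(size=(patches.shape[0] + 1, D)) * 0.02  # one per token, cls included
z0 = np.concatenate([cls, patches @ E], axis=0) + pos_emb    # (N + 1, D)

params = {
    "Wqkv": rng.normal(size=(D, 3 * D)) * 0.02,
    "Wo": rng.normal(size=(D, D)) * 0.02,
    "W1": rng.normal(size=(D, 4 * D)) * 0.02,
    "W2": rng.normal(size=(4 * D, D)) * 0.02,
}
z1 = encoder_block(z0, params, heads)
print(patches.shape, z0.shape, z1.shape)
```

The sequence length grows from N to N + 1 because of the cls token, and the block keeps the shape (N + 1, D) unchanged, which is what allows several such blocks to be stacked.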
Original
2022-05-01 00:04:13