ViT-B: layers=12, hidden_size=768, MLP_size=3072, heads=12, params=86M, image_size=384
Reference: https://blog.csdn.net/weixin_43922901/article/details/102602557
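As a quick sanity check on params=86M, here is a back-of-the-envelope count under the config above (a sketch only: it keeps the dominant per-block weight matrices and ignores biases, embeddings, LayerNorms, and the classification head):

```python
# Rough ViT-B parameter count from the config above.
# Only the dominant per-block weight matrices are counted.
layers, hidden, mlp = 12, 768, 3072
attn = 4 * hidden * hidden    # w_q, w_k, w_v plus the output projection w_o
ffn = 2 * hidden * mlp        # the two FFN linear layers
print(layers * (attn + ffn))  # 84934656 ~= 85M; embeddings/biases bring it to ~86M
```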
1 Patch embedding
patch_dim = 16*16*3 = 768, dim = hidden_size = 768
Each flattened 16*16*3 patch is projected to the hidden size by a single linear layer, so the parameter count is 768*768 (bias ignored).
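A minimal PyTorch sketch of this projection (the name patch_embed is mine; bias is disabled to match the 768*768 count):

```python
import torch.nn as nn

# One linear layer maps each flattened 16x16x3 patch (patch_dim=768)
# to the hidden size (768).
patch_embed = nn.Linear(16 * 16 * 3, 768, bias=False)
print(sum(p.numel() for p in patch_embed.parameters()))  # 589824 = 768*768
```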
2 Transformer block (attention + FFN)
attention: the parameters here come mainly from the linear projections that map x to q, k, v, i.e. w_q, w_k, w_v, each of shape 768*768.
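A sketch of counting these attention weights in PyTorch; the output projection w_o is not mentioned above but is part of standard multi-head attention, so it is included here as an assumption:

```python
import torch.nn as nn

d = 768  # hidden_size

# q, k, v projections from the text, plus the standard output projection w_o.
w_q, w_k, w_v, w_o = (nn.Linear(d, d, bias=False) for _ in range(4))
attn_params = sum(p.numel() for m in (w_q, w_k, w_v, w_o) for p in m.parameters())
print(attn_params)  # 4 * 768 * 768 = 2359296, i.e. ~2.36M per block
```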