Understanding the Paper "Scaling Vision Transformers"


Robo-网络矿产提炼工


This blog post is the author's own work. Reproduction without commercial authorization is strictly prohibited.

Article link: https://blog.csdn.net/u013537270/article/details/127244538


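This article examines the scaling properties of the Vision Transformer (ViT). By scaling models and data both up and down, it characterizes the relationship between model performance, data, and compute. The refined ViT model reaches 90.45% top-1 accuracy on ImageNet and performs well in few-shot transfer learning. The study notes that staying on the performance frontier requires scaling compute and model size together, and that larger models are more sample-efficient and do better at few-shot learning.

(Summary generated automatically by CSDN.)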


1. Abstract
2. Summary of the main conclusions
- 2.1 Few-shot transfer learning
- 2.2 Pareto front
3. Discussion
- 3.1 Limitations
- 3.2 Societal impact
4. Conclusions
References

1. Abstract

Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model’s scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reachin


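In other words: attention-based neural networks such as the Vision Transformer (ViT) have recently set new state-of-the-art results on many computer vision benchmarks. A model's scale (its size) is a main factor in obtaining excellent results, so understanding how a model scales is key to designing future generations effectively. The scaling laws of Transformer language models have already been studied, but it is not yet known how Vision Transformers scale. To address this, the authors scale ViT models and data both up and down, and describe the relationships between error rate, data, and compute. Along the way, they refine the ViT architecture and training, reducing memory consumption and improving the accuracy of the resulting models.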
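To make "describing the relationship between error rate and compute" a little more concrete, here is a minimal sketch of how such a relationship might be fitted. It assumes a plain power law, error ≈ a · compute^(-b), which is a common choice for scaling studies but is not taken from the excerpt above. The function name `fit_power_law` and all data points are hypothetical: the points are synthetic values generated from the assumed law, not results from the paper.

```python
import numpy as np


def fit_power_law(compute, error):
    """Fit error ≈ a * compute**(-b) by least squares in log-log space.

    Hypothetical helper for illustration only. The power-law form is an
    assumption, not the functional form reported by the paper.
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_e = np.log(np.asarray(error, dtype=float))
    # In log-log space the law is a straight line:
    # log(error) = log(a) - b * log(compute)
    slope, intercept = np.polyfit(log_c, log_e, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic points generated from a known law (a=0.5, b=0.3).
# These are NOT measurements from the paper.
compute = np.array([1.0, 10.0, 100.0, 1000.0])
error = 0.5 * compute ** -0.3

a, b = fit_power_law(compute, error)
print(f"a={a:.3f}, b={b:.3f}")
```

A straight-line fit in log-log space is the simplest option. Real error curves tend to flatten at both ends of the compute range, so a form with extra offset terms would likely fit measured data better.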