Published: Oct 2024
Paper link: https://readpaper.com/pdf-annotate/note?pdfId=2536104770520270336&noteId=2609148347801754368
Affiliation: Carnegie Mellon University
Motivation: In recent years, roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high-capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements (Transformer architectures and generative diffusion models) has proven surprisingly difficult, since there is no clear and well-understood process for making the important design choices.
Approach: In this paper, we identify, study, and improve key architectural design decisions for high-capacity diffusion transformer policies.
Our contributions are:
- Scalable Attention Blocks: We propose a key improvement, inspired by Peebles et al.: stabilizing training by adding adaptive layer normalization (adaLN) blocks to the diffusion transformer policy layers. This simple trick improves performance by 30%+ on long-horizon, dexterous, real-world manipulation tasks involving over 1,000 decisions (impressive!). (RDT likewise uses QKNorm & RMSNorm to stabilize computation.)
- Efficient Observation Tokenization: We compare several methods to tokenize multiple camera observations, such as Vision Transformer [15] and ResNet [16] encoders. We find that a relatively simple implementation (ResNet image tokenizer + Transformer policy) provides a substantial (40%+) performance boost over competing strategies (a minimal sketch follows this list).
- DiT-Block Policy: We integrate the best-performing components into a unified framework, coined the DiT-Block Policy.
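To make the tokenization concrete, below is a minimal PyTorch sketch of a FiLM-conditioned ResNet image tokenizer feeding a transformer policy. The module name `FiLMResNetTokenizer`, the `resnet18` backbone, and all dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a ResNet image tokenizer with FiLM language conditioning.
# Assumptions (not the paper's exact config): resnet18 backbone, 512-d features,
# FiLM applied once to the final feature map.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FiLMResNetTokenizer(nn.Module):
    """Encode one camera image into a grid of tokens, modulated by a text embedding."""
    def __init__(self, text_dim=512, token_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop global pooling and the classifier head; keep the conv trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 512, h, w)
        # FiLM: regress per-channel scale (gamma) and shift (beta) from the text embedding.
        self.film = nn.Linear(text_dim, 2 * 512)
        self.proj = nn.Linear(512, token_dim)

    def forward(self, image, text_emb):
        feat = self.trunk(image)                               # (B, 512, h, w)
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)     # each (B, 512)
        feat = feat * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]
        tokens = feat.flatten(2).transpose(1, 2)               # (B, h*w, 512)
        return self.proj(tokens)                               # (B, h*w, token_dim)
```

Tokens from each camera (plus proprioception tokens) would then be concatenated and fed to the transformer policy; this is the "ResNet image tokenizer + Transformer policy" combination referred to above.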
Implementation:
- The text instruction is fused into the visual tokens via FiLM.
- The observation tokens are passed into an encoder-decoder transformer network (middle), which is responsible for predicting the noise epsilon (ε) used for diffusion.
- For stable training, the decoder blocks use a customized adaLN-Zero architecture (right), enabling the transformer to scalably optimize the diffusion objective (a minimal sketch of such a block follows this list).
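As a concrete reference, here is a minimal PyTorch sketch of an adaLN-Zero transformer block in the style of Peebles et al. (DiT): layer norms without learned affine parameters, shift/scale/gate vectors regressed from a conditioning vector (e.g., the diffusion timestep embedding), and the modulation layer zero-initialized so every residual branch starts as the identity. For brevity it uses self-attention only; the actual decoder blocks additionally cross-attend to the observation tokens. All shapes are assumptions.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Transformer block with adaptive layer norm (adaLN-Zero) conditioning."""
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))
        # Regress six modulation vectors (shift/scale/gate for attn and MLP) from the condition.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # Zero-init ("-Zero"): each block starts as the identity, which stabilizes training.
        nn.init.zeros_(self.modulation[-1].weight)
        nn.init.zeros_(self.modulation[-1].bias)

    def forward(self, x, c):
        # x: (B, T, dim) noised action tokens; c: (B, dim) condition (e.g. timestep embedding).
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.modulation(c).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)
```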
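The diffusion objective these blocks optimize is standard ε-prediction; a minimal DDPM-style training step might look as follows. The linear beta schedule, `T=100`, and the `model(noisy, t, obs_tokens)` signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, actions, obs_tokens, T=100):
    """One ε-prediction training step (DDPM objective); minimal sketch."""
    B = actions.shape[0]
    betas = torch.linspace(1e-4, 2e-2, T, device=actions.device)  # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)                 # cumulative signal level
    t = torch.randint(0, T, (B,), device=actions.device)          # random timestep per sample
    eps = torch.randn_like(actions)                               # noise to inject and regress
    ab = alpha_bar[t].view(B, *([1] * (actions.dim() - 1)))
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps
    # The transformer conditions on observations and timestep and predicts ε.
    return F.mse_loss(model(noisy, t, obs_tokens), eps)
```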
Experiments: Our first task suite considers the bimanual, low-cost ALOHA robot [8], which lets us study challenging scenarios requiring highly dexterous, precise behavior. Three tasks are designed, ordered from easy to hard.
Finally, note that the D.P. Transformer baseline fails to solve any of our tasks, because its unstable training leads to noisy/unsafe action predictions. We therefore conclude that the DiT-Block Policy learns diffusion policy transformers more stably than the baselines.
Conclusion: We hope this work opens the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling and the scalability of large transformer architectures.
In addition, our observation-tokenizer ablations show that using separate ResNet CNNs for image encoding yields stronger performance than using transformers alone; even scaling up the transformers is not enough to close this gap.