The Ingredients for Robotic Diffusion Transformers

发表时间:Oct 2024

论文链接:https://readpaper.com/pdf-annotate/note?pdfId=2536104770520270336&noteId=2609148347801754368

作者单位:Carnegie Mellon University

Motivation:In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal(Transformer network architectures and generative diffusion models) improvements has proven surprisingly difficult, since there is no clear and well understood process for making important design choices.

解决方法:In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies.

Our contributions are:

  1. Scalable Attention Blocks:我们提出了一个关键改进(受 Peebles 等人的启发。通过在diffusion transformer policy layers中添加自适应层范数(adaLN)块来稳定训练。这个简单的技巧在包含超过 1000 个决策的长范围、灵巧的、真实世界的操作任务上将性能提高了 30%+(牛逼)!RDT中也使用QKNorm & RMSNorm(稳定计算)。

  2. Efficient Observation Tokenization: we compare several methods to tokenize multiple camera observations, such as Vision Transformers [15] and ResNet [16] encoders. We find that a relatively simple implementation (ResNet image tokenizer + Transformer policy) can provide a substantial (40%+) performance boost over competing strategies.

  3. DiT-Block Policy: We integrate the best performing components in a unified framework, coined DiTBlock Policy.

实现方式

The observation tokens are passed into a encoder-decoder transformer network (middle), which is responsible for predicting the noise epsilon (ε) used for diffusion.

  • 文本用Film融入视觉token。

  • The observation tokens are passed into an encoder-decoder transformer network (middle), which is responsible for predicting the noise epsilon (ε) used for diffusion.

  • 对于稳定的训练,解码器块利用定制的 adaLN-Zero 架构(右),使转换器能够可扩展地优化扩散目标。

实验:我们的第一个任务集考虑了双手动、低成本的 ALOHA 机器人 [8],这使我们能够研究具有高度灵巧、精确行为的挑战性场景。设计了三种任务,由易到难。

最后,请注意 D.P.Transformer 基线无法解决我们的任何任务,因为不稳定的训练会导致嘈杂的/不安全的动作预测。因此,我们得出结论,DiT-Block 策略比基线更稳定地学习扩散策略转换器。

结论我们希望这项工作将为未来的机器人学习技术打开大门,这些技术利用生成扩散建模的效率和大规模transformer架构的可扩展性

此外,我们的观察tokenizer的消融表明,使用单独的 ResNet CNN 进行图像编码比单独使用transformers提供更强的性能。即使缩放transformers也不足以弥补这种差异。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Ming_Chens

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值