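CogACT at a Glance, with a Source-Code Walkthrough: Replacing OpenVLA's Discretized Action Prediction with DiT to Move Closer to π0 (Including the DiT Implementation)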

Original article · first published 2025-01-12 12:10:35 · last modified 2025-01-12 22:45:04

License: CC 4.0 BY-SA

Tags: #CogACT, #CogACT源码剖析 (CogACT source-code walkthrough)

Column: A History of Robot-Arm VLAs: RT2/OpenVLA/3D VLA

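Preface

As mentioned in the previous article, I have been thinking about which model to build on in order to approach π0. Compared with OpenVLA itself, CogACT, which replaces OpenVLA's discretized action prediction with a DiT, is closer to π0

So, following on from the previous article, this one focuses on CogACT. The source code of this embodied VLA model is cleanly structured, which helps, since anyone who wants to modify a model's architecture first has to read its source code thoroughly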

Part 1 CogACT: a VLM as the Cognitive Foundation, DiT as the Action Module

1.1 Background and Related Work

1.1.1 CogACT: Replacing OpenVLA's Action Prediction with DiT to Approach π0

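In recent years, robot control models equipped with visual capabilities have attracted broad interest, for example [7, 8, 15, 30, 34, 45, 48, 58, 60, 62, 67, 69]:

7-RT-1
8-RT-2: decomposes 7D actions into discrete tokens and uses the VLM PaLI-X [13] to predict them autoregressively, like language tokens
15-Diffusion Policy
30-OpenVLA: takes an approach similar to RT-2, tokenizing actions, and trains the Prismatic VLM on the Open X-Embodiment dataset [48]
34-Vision-language foundation models as effective robot imitators
45-R3M
48-Open X-Embodiment
58-Perceiver-Actor
60-Open-world object manipulation using pre-trained vision-language models
62-Octo
67-TinyVLA
69-Unleashing large-scale video generative pre-training for visual robot manipulation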

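Among these, the development of large-scale vision-language-action (VLA) models [8, 30, 32-RoboFlamingo, which extends OpenFlamingo [3] with a head network for action prediction optimized with an MSE loss] is especially notable. Such models let robots carry out complex tasks from natural-language instructions and potentially handle objects or environments that deviate from the training distribution

In addition, they adapt quickly to new tasks and new embodiments through fine-tuning

The strong generalization of VLAs is attributed to their large model size and to the powerful vision-language models (VLMs) they build on [13-PaLI-X, 28-Prismatic VLMs, 35-Visual instruction tuning]
These VLMs are typically pretrained on internet-scale image-text pairs, which plays a key role in improving VLA generalization to new objects and semantically diverse instructions [8-RT-2]
Existing large VLAs usually adapt the VLM for action prediction in a simple way, which leads to several problems that hurt task performance. For example, works such as [8-RT-2, 30-OpenVLA] follow the VLM's next-token prediction scheme and directly quantize the continuous range of robot actions into discrete bins
However, this simple quantization, unlike the sophisticated tokenizers designed for images [65-Neural discrete representation learning, 72-Language model beats diffusion–tokenizer is key to visual generation] and audio [19, 73], makes action learning harder and limits action precision
[32-RoboFlamingo] introduces an additional action head, such as an LSTM, to convert VLM outputs into actions. However, switching to a regression-based learning scheme ignores the probabilistic and multimodal nature of actions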
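To make the precision limit of simple quantization concrete, here is a minimal sketch of uniform action binning. The 256 bins and the [-1, 1] range are illustrative assumptions, not values stated in this article, and `quantize` / `dequantize` are hypothetical helper names rather than functions from RT-2 or OpenVLA.

```python
def quantize(value, low=-1.0, high=1.0, num_bins=256):
    """Map a continuous value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    index = int((clipped - low) / width)
    return min(index, num_bins - 1)  # value == high falls into the last bin


def dequantize(index, low=-1.0, high=1.0, num_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# One 7-D action (assumed to be already normalized to [-1, 1]).
action = [0.0123, -0.4567, 0.9, 0.0, 0.25, -0.001, 1.0]
tokens = [quantize(a) for a in action]        # discrete "action tokens"
recovered = [dequantize(i) for i in tokens]   # what the robot would execute
max_error = max(abs(a - r) for a, r in zip(action, recovered))
half_bin = (1.0 - (-1.0)) / 256 / 2           # worst-case rounding error
```

Under these assumptions the round-trip error can reach half a bin width per dimension, no matter how well the model predicts the token. A continuous action head does not have this floor.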

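At the end of November 2024, researchers from Tsinghua University, Microsoft Research Asia, USTC, and the Institute of Microelectronics of the Chinese Academy of Sciences proposed CogACT, a foundational vision-language-action model (paper, project page, GitHub repository), for synergizing cognition and action in robotic manipulation

Rather than repurposing a pretrained VLM for action prediction, the authors use the cognitive information extracted by the VLM to guide the action prediction of a dedicated action module
To handle the inherent characteristics of action signals (continuous, multimodal, temporally correlated, and requiring high precision), the authors adopt advanced diffusion-based transformers (i.e. DiT, 51-William Peebles and Saining Xie. Scalable diffusion models with transformers; see the article "Diffusion Transformer (DiT): Replacing the U-Net in Diffusion with a ViT, Recently Used Widely for Video Generation and Robot Action Prediction (with a Detailed Look at Tsinghua's PAD)") as the action module, conditioned on the VLM output via the attention mechanism
"To handle the inherent characteristics of action signals – continuous, multimodal, temporally correlated, and requiring high precision – we employ advanced diffusion-based transformers (DiT) [51] as our action modules, preconditioned on VLM output via the attention mechanism"
The intuition behind the design is to decouple the "cognition" and "action" capabilities. Although large VLMs accumulate broad visual and semantic knowledge from vast amounts of text and images, there is a fundamental gap between their cognitive capability with language-modality output and dense robot actions
The authors therefore advocate componentized VLAs with a dedicated action module, rather than directly repurposing the VLM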

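1.1.2 Related Work: GR-2, RDT, Octo

First, a group of methods similar to GR-2 [5-Gen2Act, 11-GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 68-Unleashing large-scale video generative pretraining for visual robot manipulation] uses large-scale video generative pretraining to enhance visual robot manipulation learning without relying on a pretrained VLM, and has shown promising results

Second, several recent attempts concurrent with CogACT explore large action models for generalist robots, for example

[24-Diffusion transformer policy] trains a Diffusion Transformer with 221M parameters
[38-RDT-1B: a diffusion foundation model for bimanual manipulation] further scales the action model up to 1B

Both works use separate vision and language encoders, pretrained and frozen, to process language instructions and images, and train the action model to integrate these inputs and predict actions using VLA data

Unlike CogACT, works such as RDT cannot draw on the generalization and instruction-following abilities of powerful VLMs pretrained on internet-scale vision-language aligned data; their goal is closer to a large action model

Finally, there are diffusion-based robot policies

Recent studies (15-Diffusion Policy, see the article "Diffusion Policy, the Action Prediction Policy Used by Stanford's UMI Robot: From Principles to Implementation (with a Detailed Look at Diff-Control)"; 50-Imitating human behaviour with diffusion models; 53-Goal-conditioned imitation learning using score-based diffusion policies) introduced diffusion models as a novel way to model robot actions. These diffusion policies show a strong ability to capture the multimodal nature of robot action distributions and to model the various feasible trajectories a robot can take to complete a given task [15]
Inspired by diffusion policies, Octo [62] adds a compact diffusion head with 3M parameters on top of a transformer-based backbone to adapt action outputs across different robots
However, a small diffusion head falls short in capturing precise action distributions, and the overall approach does not benefit from strong vision-language models pretrained on web-scale data

By contrast, CogACT studies a large, dedicated action module (rather than a small "head" as in RoboFlamingo or Octo), built on the Diffusion Transformer (DiT) architecture

Moreover, unlike [15-Diffusion Policy, 50-Imitating human behaviour with diffusion models, 53-Goal-conditioned imitation learning using score-based diffusion policies], CogACT is concerned with large VLAs derived from VLM foundation models with strong generalization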

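1.2 Method

CogACT aims to develop a VLA model that enables different robots to perform diverse tasks given visual observations and language instructions

Formally, given a language instruction $l$ and a visual observation $o_t$ at time $t$, the model $\pi$ predicts a temporal action sequence $(a_{t}, a_{t+1}, \ldots, a_{t+N})$ for executing the desired task (Equation 1)

$\boldsymbol{\pi}:\left(\boldsymbol{l}, \boldsymbol{o}_{t}\right) \rightarrow\left(a_{t}, a_{t+1}, \ldots, a_{t+N}\right)$

Although in general $a_t$ can describe various robot actions with different control modes and end effectors, this work considers the action space of a gripper with 7 degrees of freedom (DoF) (Equation 2):

$\boldsymbol{a}_{t}=[\Delta x, \Delta y, \Delta z, \Delta \phi, \Delta \theta, \Delta \psi, g]$

where ∆x, ∆y, ∆z are the relative translation offsets of the end effector, ∆ϕ, ∆θ, ∆ψ denote the rotation changes, and g∈{0,1} indicates the open/close state of the gripper

To process complex visual observations and language instructions effectively and jointly translate them into precise actions, the authors split the model π into three parts: a vision module, a language module, and an action module, as shown in Figure 2

The three parts are:

Vision module: encodes the information in the current image observation into visual tokens
Language module: combines the visual tokens with the language instruction and produces a cognition feature that determines the desired action to execute
Diffusion action module: predicts a multi-step action sequence from the cognition feature. At inference, an adaptive ensemble strategy is applied for trajectory ensembling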

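1.2.1 Vision and Language Modules (similar to OpenVLA): DINOv2 + SigLIP + Llama2

CogACT's vision and language modules are adapted from an existing vision-language model (VLM) [28-Prismatic VLMs: Investigating the design space of visually-conditioned language models], with about 7B parameters in total, similar to [30-OpenVLA]

Vision module: composed of DINOv2 [49] and SigLIP [74], similar to the vision setup in OpenVLA
The vision module processes raw image input into a set of perceptual tokens. It consists of the powerful vision transformers DINOv2 [49] and SigLIP [74], which are pretrained on internet-scale image data to capture rich visual features and a comprehensive semantic understanding of the observations
At each timestep $t$, the image observation $\boldsymbol{o}_{t}$ is fed into both models, producing two downsampled feature maps $\boldsymbol{f}_{t}^{\text {DINO }}$ and $\boldsymbol{f}_{t}^{\mathrm{Sig}}$. These feature maps are concatenated along the channel dimension, passed through a linear projection layer, and serialized into a set of visual perceptual tokens $\mathcal{V}=\left\{\boldsymbol{v}_{1}, \boldsymbol{v}_{2}, \ldots, \boldsymbol{v}_{N_{\mathcal{V}}}\right\}$ of length $N_{\mathcal{V}}$ (256 by default)
Language module: LLAMA-2 as the backbone, again similar to OpenVLA
The language module integrates the visual information with the language instruction and performs cognitive reasoning. Here, the LLAMA-2 model [64] is used as the backbone
The language instruction $l$ is converted into a set of language tokens $\mathcal{T}=\left\{\boldsymbol{l}_{1}, \boldsymbol{l}_{2}, \ldots, \boldsymbol{l}_{N_{\mathcal{T}}}\right\}$ using the LLAMA-2 tokenizer

1) These tokens are then concatenated with the visual tokens $\mathcal{V}$ and an additional learnable cognition token $c$, and processed by the model with causal attention ("These tokens are then concatenated with the visual tokens V and an additional learnable cognition token c, and processed by the model using a causal attention mechanism")
2) The resulting output feature $f_{t}^{c}$, corresponding to the cognition token, encodes the integrated information that determines the action to execute for the current task. It serves as the condition from which the subsequent action module interprets and derives the desired actions ("The resulting output feature $f_t^c$, corresponding to the cognition token, encodes integrated information that determines the action to be executed for the current task. This serves as a condition for the subsequent action module to interpret and derive the desired actions")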
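The role of causal attention here can be illustrated with a small sketch. It assumes the tokens are ordered visual, then language, then the cognition token last; that ordering, the 12-token instruction length, and the helper names `build_sequence` and `causal_mask` are assumptions made for illustration, not details confirmed by the text above.

```python
def build_sequence(num_visual, num_language):
    """Token labels in the assumed order: visual, language, then cognition."""
    return ["vis"] * num_visual + ["lang"] * num_language + ["cog"]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = build_sequence(num_visual=256, num_language=12)
mask = causal_mask(len(tokens))

cog_index = len(tokens) - 1              # the cognition token sits at the end
visible_to_cog = sum(mask[cog_index])    # how many tokens it can attend to
visible_to_first = sum(mask[0])          # the first visual token sees only itself
```

With this layout, the cognition token is the only position that attends to every visual and language token, which is consistent with using its output feature as the single condition passed to the action module.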

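1.2.2 Diffusion Action Module: Based on the Diffusion Transformer (DiT)

The action module takes the cognition feature as its input condition and generates a sequence of actions, as defined in Equations 1 and 2

Since real-world physical actions are continuous and often multimodal, the authors predict them with a diffusion modeling process [47-Improved denoising diffusion probabilistic models]

To model complex, temporally correlated actions, they use the Diffusion Transformer (DiT) [51-Scalable diffusion models with transformers; see Section 2.4, "DiT (with U-ViT): replacing the U-Net in diffusion with a ViT (2D image generation with text-condition fusion)", of the article "A Comprehensive Analysis of the Video Generation Model Sora: From AI Painting and ViT to ViViT, TECO, DiT, VDT, NaViT, and More"] as a powerful backbone for action decoding
Specifically, CogACT's action module takes the cognition feature $f_{t}^{c}$ and a series of noisy actions $\left(\boldsymbol{a}_{t}^{i}, \boldsymbol{a}_{t+1}^{i}, \ldots, \boldsymbol{a}_{t+N}^{i}\right)$ as input, where $i$ denotes the current denoising step. It predicts the final actions $\left(\boldsymbol{a}_{t}, \boldsymbol{a}_{t+1}, \ldots, \boldsymbol{a}_{t+N}\right)$ through multiple denoising steps
"Specifically, our action module takes the cognition feature $f_t^c$ along with a series of noisy actions $(a_t^i, a_{t+1}^i, \ldots, a_{t+N}^i)$ as input, where $i$ denotes the current denoising step. It predicts the final actions $(a_t, a_{t+1}, \ldots, a_{t+N})$ through multiple denoising steps."
1) The cognition feature and the noisy actions are passed to the transformer blocks as input tokens, while the step information $i$ is added to the cognition feature through a sinusoidal positional encoding ("The cognition feature and the noisy actions serve as input tokens to the transformer blocks, while the step information i is added to the cognition feature with a sinusoidal positional encoding")
2) The authors force the action model to predict not only the current action $\boldsymbol{a}_{t}$ but also several future actions $\left(\boldsymbol{a}_{t+1}, \ldots, \boldsymbol{a}_{t+N}\right)$. This improves the overall smoothness of the predicted actions at each timestep and raises the final task success rate, in line with observations in earlier work [15-Diffusion Policy, 75-ALOHA ACT]

In practice, the number of predicted future actions is set to a small value (N = 15 by default), giving the action module a context length of N + 2 = 17 (one condition token, the current action, and N future actions). This keeps the diffusion process efficient and adds little computational cost to the overall framework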

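1.2.3 Training Objective: Minimizing the MSE Between Predicted and Ground-Truth Noise

The vision, language, and action modules above are trained or fine-tuned end to end by minimizing the mean squared error (MSE) between the noise predicted by the action module and the ground-truth noise. The loss function is defined as

$\mathcal{L}_{\mathrm{MSE}}=\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(0,1), i}\left\|\hat{\boldsymbol{\epsilon}}^{i}-\boldsymbol{\epsilon}\right\|_{2}$

where $\hat{\boldsymbol{\epsilon}}^{i}$ is the noise predicted for the noisy action sequence $\left(\boldsymbol{a}_{t}^{i}, \boldsymbol{a}_{t+1}^{i}, \ldots, \boldsymbol{a}_{t+N}^{i}\right)$ at the $i$-th denoising step, and $\boldsymbol{\epsilon}$ is the corresponding ground truth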
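The objective can be sketched without any model: noise a clean action chunk, then score a noise prediction against the noise that was actually added. The linear beta schedule, its 100 steps, and the helper names (`alpha_bars`, `add_noise`, `mse`) are assumptions for illustration; the article does not state which schedule CogACT uses.

```python
import math
import random


def alpha_bars(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear beta schedule (assumed)."""
    bars, running = [], 1.0
    for k in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * k / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def add_noise(actions, noise, alpha_bar):
    """Forward diffusion: a^i = sqrt(alpha_bar) * a + sqrt(1 - alpha_bar) * eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[s * a + n * e for a, e in zip(row_a, row_e)]
            for row_a, row_e in zip(actions, noise)]


def mse(pred, target):
    """Mean squared error over all elements of two equally shaped sequences."""
    diffs = [(p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return sum(diffs) / len(diffs)


random.seed(0)
N, action_dim = 15, 7
clean = [[0.0] * action_dim for _ in range(N + 1)]       # (a_t, ..., a_{t+N})
eps = [[random.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(N + 1)]
bars = alpha_bars()
i = random.randrange(len(bars))                          # sampled denoising step
noisy = add_noise(clean, eps, bars[i])                   # input to the action module

perfect_loss = mse(eps, eps)                             # oracle noise predictor
zero_loss = mse([[0.0] * action_dim for _ in eps], eps)  # predictor that outputs 0
```

In real training, the prediction would come from the DiT given `noisy`, the step `i`, and the cognition feature, and the gradient of this loss would flow back through all three modules.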

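1.2.4 Adaptive Action Ensemble

During inference, CogACT predicts actions for multiple timesteps

A simple strategy is to execute these actions consecutively based on the current observation $o_{t}$ (for example, action chunking [75-ALOHA ACT])
However, this does not make full use of the visual information available at each timestep and may produce jerky motion, as discussed in [75]. Alternatively, executing only the action for the current timestep (i.e. $a_{t}$) also leads to a less smooth trajectory and lower performance

To address this, [75-ALOHA ACT] introduced a temporal ensemble strategy that combines the action predicted at the current timestep with past predictions, using preset aggregation weights. However, the feasible actions for a task may belong to different modes [15-Diffusion Policy], and simply aggregating them can yield an action that matches none of the modes, which is suboptimal

The authors therefore propose an adaptive ensemble strategy (illustrated in Figure 3 of the paper) that takes into account the similarity between the actions being aggregated. This avoids unreasonable aggregation of actions from different modes

Specifically, let $\boldsymbol{a}_{t} \mid \boldsymbol{o}_{t}$ denote the action prediction for the current timestep $t$ given observation $\boldsymbol{o}_{t}$, and let $\left\{\boldsymbol{a}_{t} \mid \boldsymbol{o}_{t-K}, \ldots, \boldsymbol{a}_{t} \mid \boldsymbol{o}_{t-1}\right\}$ denote the corresponding action predictions based on the historical observations $\left\{\boldsymbol{o}_{t-K}, \ldots, \boldsymbol{o}_{t-1}\right\}$. The final action $\hat{\boldsymbol{a}}_{t}$ to execute at timestep $t$ is

$\hat{\boldsymbol{a}}_{t}=\sum_{k=0}^{K} w_{k}^{\mathrm{ada}} \cdot \boldsymbol{a}_{t} \mid \boldsymbol{o}_{t-k}$

where $w_{k}^{\text {ada }}$ is an adaptive weighting scalar that assigns greater weight to past predictions that are more similar to the current prediction $\boldsymbol{a}_{t} \mid \boldsymbol{o}_{t}$ ("$w_k^{\text{ada}}$ is an adaptive weighting scalar that assigns greater importance to past predictions that are more similar to the current prediction $a_t \mid o_t$"):

$w_{k}^{\text {ada }}=\exp \left(\alpha \cdot\left\langle\boldsymbol{a}_{t} \mid \boldsymbol{o}_{t}, \boldsymbol{a}_{t} \mid \boldsymbol{o}_{t-k}\right\rangle\right)$

where $\langle\cdot, \cdot\rangle$ computes the cosine similarity between two actions, and $\alpha$ is a hyperparameter set to 0.1 in practice

Empirically, this adaptive action ensemble strategy improves task success rates while adding very little inference cost, because past predictions can simply be cached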
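The two equations above can be sketched in a few lines. The example below uses toy 3-D translation vectors, and it normalizes the weights so that they sum to 1; that normalization is an assumption made here so the result is a weighted average, and `adaptive_ensemble` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two action vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-7)  # small epsilon avoids division by zero


def adaptive_ensemble(predictions, alpha=0.1):
    """Fuse predictions of a_t, where predictions[k] is a_t | o_{t-k}.

    predictions[0] is the current prediction a_t | o_t. Each weight is
    exp(alpha * cosine similarity to the current prediction), then the
    weights are normalized to sum to 1 (an assumption of this sketch).
    """
    current = predictions[0]
    weights = [math.exp(alpha * cosine_similarity(current, p)) for p in predictions]
    total = sum(weights)
    weights = [w / total for w in weights]
    fused = [sum(w * p[d] for w, p in zip(weights, predictions))
             for d in range(len(current))]
    return fused, weights


preds = [
    [0.10, 0.00, 0.02],    # a_t | o_t, the current prediction
    [0.11, 0.01, 0.02],    # a_t | o_{t-1}, same mode as the current one
    [-0.10, 0.00, -0.02],  # a_t | o_{t-2}, pointing the opposite way
]
fused, weights = adaptive_ensemble(preds, alpha=0.1)
```

Predictions that agree with the current one receive larger weights than the one pointing the other way. A larger `alpha` would sharpen that preference.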

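1.3 Experiments

1.3.1 Pretraining Data

For training data, the authors use the Open X-Embodiment (OXE) [48] dataset as the primary training set. It contains over 1 million real-world robot trajectories from 60 datasets, covering 22 different robot embodiments. They train on an OXE subset similar to those used by Octo [62] and OpenVLA [30], which contains 22.5 million frames

More specifically, as with Octo [62] and OpenVLA [30], training is restricted to datasets with single-arm end-effector control and at least one third-person camera view. The data mixture mostly follows [30, 62], except that the Language Table [41] and Droid [29] datasets are not used at any point in training, because their distributions differ markedly from the rest of the data. The detailed data mixture is listed in Table I of the paper. In total, 0.4 million robot trajectories containing 22.5 million frames are used as training data

1.3.2 Implementation Details

The model is trained with a batch size of 256 and 8 diffusion steps per sample, and is initialized with the pretrained vision and language module weights from [30-OpenVLA]

The vision module (DINOv2 and SigLIP), the language module (LLAMA-2), and the action module are all trained end to end, with a constant learning rate of 2e-5 for over 135K iterations. Training runs on 16 NVIDIA A100 GPUs with PyTorch's Fully Sharded Data Parallel (FSDP) framework and takes about 5 days. By default, DiT-Base is used as the action model

At inference, DDIM [59] sampling is used with 10 sampling steps and a classifier-free guidance (CFG) [23] coefficient of 1.5
The ensemble window K is set inversely proportional to the per-frame movement distance, which can be inferred from the robot's motion speed and observation frequency. In practice, the standard deviation of the actions in the training set is used to determine K, for example 2 for the RT-1 dataset on the Google Robot and 7 for BridgeDataV2 on the WidowX Robot

The authors also compare CogACT with Octo-Base [62] and OpenVLA [30]. For a fair evaluation, all models are pretrained on the OXE dataset and then fine-tuned on demonstration data collected by the CogACT authors. CogACT achieves a significant improvement, with a success rate 59.1% higher than OpenVLA's

See the original paper for more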

Part 2 CogACT Source-Code Walkthrough

2.1 The VLM Part

// To be updated

2.2 Action Prediction

2.2.1 The Diffusion Transformer (DiT) Implementation: action_model/models.py

2.2.1.1 The TimestepEmbedder and LabelEmbedder Classes

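The architecture of the diffusion transformer (DiT) is covered in the article "Diffusion Transformer (DiT): Replacing the U-Net in Diffusion with a ViT, Recently Used Widely for Video Generation and Robot Action Prediction (with a Detailed Look at Tsinghua's PAD)" and illustrated in the DiT architecture figure

The left side of that figure shows the training of conditional latent DiT models: the input latent is decomposed into patches and processed by several DiT blocks (The input latent is decomposed into patches and processed by several DiT blocks)

In essence, the predicted noise is subtracted from the noisy image to restore it step by step

For example, a 256x256x3 input image corresponds to a 32x32x4 noised latent (at inference, the input is directly 32x32x4 noise). The latent is split into patches, each patch is projected into a token, and the tokens are combined with the current timestep t and label y as input
After N DiT blocks (transformer-based), the final layer decodes the tokens into a noise prediction and the corresponding diagonal covariance $\Sigma$ (After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction). Finally, after T sampling steps, the result is a denoised 32x32x4 latent

The TimestepEmbedder class embeds scalar timesteps into vector representations

Its constructor initializes a multilayer perceptron (MLP) consisting of two linear layers and a SiLU activation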

    def __init__(self, hidden_size, frequency_embedding_size=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size, bias=True),
        )
        self.frequency_embedding_size = frequency_embedding_size

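The static method timestep_embedding generates sinusoidal timestep embeddings: it computes a set of frequencies, multiplies them by the timestep, and applies cos and sin to produce the embedding vector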

    @staticmethod
    def timestep_embedding(t, dim, max_period=10000):
        """
        Create sinusoidal timestep embeddings.
        :param t: a 1-D Tensor of N indices, one per batch element.
                          These may be fractional.
        :param dim: the dimension of the output.
        :param max_period: controls the minimum frequency of the embeddings.
        :return: an (N, D) Tensor of positional embeddings.
        """
        # https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=t.device)
        args = t[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
        return embedding
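To see what this embedding looks like numerically, here is a dependency-free re-implementation for a single scalar timestep. It mirrors the tensor code above with plain lists; the list-based version is a sketch for illustration and is not part of the CogACT source.

```python
import math


def timestep_embedding(t, dim, max_period=10000):
    """Sinusoidal embedding of one scalar timestep t as a list of length dim."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    args = [t * f for f in freqs]
    embedding = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:  # pad odd dimensions with a single zero
        embedding.append(0.0)
    return embedding


emb_t0 = timestep_embedding(0, dim=4)  # cos(0) = 1 and sin(0) = 0 for every frequency
emb_t5 = timestep_embedding(5, dim=5)  # odd dim: last entry is the zero pad
```

At `t = 0` the cosine half is all ones and the sine half all zeros. Other timesteps give distinct patterns across the frequencies, which is what lets the MLP tell denoising steps apart.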

The forward method calls timestep_embedding to generate the timestep embedding and passes it through the MLP

    def forward(self, t):
        t_freq = self.timestep_embedding(t, self.frequency_embedding_size).to(next(self.mlp.parameters()).dtype)
        t_emb = self.mlp(t_freq)
        return t_emb

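The LabelEmbedder class embeds the condition into a vector representation and also handles label dropout for classifier-free guidance

Its constructor initializes a linear layer and, when dropout is enabled, a learnable unconditional parameter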

class LabelEmbedder(nn.Module):
    """
    Embeds conditions into vector representations. Also handles label dropout for classifier-free guidance.
    """
    def __init__(self, in_size, hidden_size, dropout_prob=0.1, conditions_shape=(1, 1, 4096)):
        super().__init__()
        self.linear = nn.Linear(in_size, hidden_size)
        self.dropout_prob = dropout_prob
        if dropout_prob > 0:
            self.uncondition = nn.Parameter(torch.empty(conditions_shape[1:]))

The token_drop method drops conditions at random according to the dropout probability, or deterministically according to force_drop_ids

    def token_drop(self, conditions, force_drop_ids=None):
        """
        Drops conditions to enable classifier-free guidance.
        """
        if force_drop_ids is None:
            drop_ids = torch.rand(conditions.shape[0], device=conditions.device) < self.dropout_prob
        else:
            drop_ids = force_drop_ids == 1
        conditions = torch.where(drop_ids.unsqueeze(1).unsqueeze(1).expand(conditions.shape[0], *self.uncondition.shape), self.uncondition, conditions)
        return conditions

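The forward method decides whether to drop conditions based on the training mode and force_drop_ids, then passes the conditions through the linear layer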

    def forward(self, conditions, train, force_drop_ids=None):
        use_dropout = self.dropout_prob > 0
        if (train and use_dropout) or (force_drop_ids is not None):
            conditions = self.token_drop(conditions, force_drop_ids)
        embeddings = self.linear(conditions)
        return embeddings
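The dropout logic is easier to follow on plain lists. The sketch below replaces whole per-sample conditions with a shared "unconditional" vector; it is a list-based illustration of the same idea, and this stand-alone `token_drop` with its `rng` parameter is a hypothetical helper, not the class method above.

```python
import random


def token_drop(conditions, uncondition, dropout_prob, force_drop_ids=None, rng=random):
    """Replace some per-sample conditions with the shared unconditional vector.

    With force_drop_ids=None each sample is dropped with probability
    dropout_prob; otherwise samples whose id equals 1 are dropped.
    """
    if force_drop_ids is None:
        drop = [rng.random() < dropout_prob for _ in conditions]
    else:
        drop = [flag == 1 for flag in force_drop_ids]
    return [uncondition if d else c for c, d in zip(conditions, drop)]


conditions = [[0.5, 0.5], [0.9, 0.1]]  # two samples, toy 2-D cognition features
uncondition = [0.0, 0.0]               # stands in for the learned parameter

forced = token_drop(conditions, uncondition, 0.1, force_drop_ids=[1, 0])
never = token_drop(conditions, uncondition, 0.0)
always = token_drop(conditions, uncondition, 1.0)
```

Training with random drops teaches the network both a conditional and an unconditional prediction, which is what classifier-free guidance combines at inference.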

2.2.1.2 The ActionEmbedder Class

As covered in Section 1.2.2 above (see also Section 5.2.2, on the DiT-based diffusion action module, of the article "From Octo and OpenVLA to TinyVLA and CogACT: the Continued Evolution of Vision-Language-Action (VLA) Models (with a Detailed Look at Prismatic VLM)"), the action module takes the cognition feature $f_{t}^{c}$ and a sequence of noisy actions as input tokens and predicts the final actions through multiple denoising steps. With N = 15 by default, its context length is N + 2 = 17

The ActionEmbedder class embeds actions into vector representations: its constructor initializes a linear layer, and forward passes the input through that layer

class ActionEmbedder(nn.Module):
    def __init__(self, action_size, hidden_size):
        super().__init__()
        self.linear = nn.Linear(action_size, hidden_size)

    def forward(self, x):
        x = self.linear(x)
        return x

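2.2.1.3 The DiTBlock, FinalLayer, and DiT Classes

First, the DiTBlock class represents a DiT block with self-attention conditioning. Its constructor initializes two LayerNorm layers, a self-attention layer, and an MLP; the forward method runs the input through self-attention and the MLP, each with a residual connection

Specifically, the constructor takes three main parameters, hidden_size, num_heads, and mlp_ratio, and accepts other optional arguments through `**block_kwargs`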
```
    def __init__(self, hidden_size, num_heads, mlp_ratio=4.0, **block_kwargs):
        super().__init__()
```
The constructor initializes two LayerNorm layers, a self-attention layer, and an MLP
```
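        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        self.attn = Attention(hidden_size, num_heads=num_heads, qkv_bias=True, **block_kwargs)
        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        # MLP lines as in the reference DiT implementation; they match the description of self.mlp
        mlp_hidden_dim = int(hidden_size * mlp_ratio)
        approx_gelu = lambda: nn.GELU(approximate="tanh")
        self.mlp = Mlp(in_features=hidden_size, hidden_features=mlp_hidden_dim, act_layer=approx_gelu, drop=0)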
```
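self.norm1 and self.norm2 are LayerNorm layers that normalize the input
self.attn is a multi-head self-attention layer with bias enabled for the query, key, and value projections
self.mlp is a multilayer perceptron with one hidden layer whose size is mlp_ratio times hidden_size, using an approximate GELU activation
In the forward method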
```
    def forward(self, x):
```
the input x first goes through the first LayerNorm layer and then the self-attention layer, and the result is added to the original input
```
        x = x + self.attn(self.norm1(x))
```
Next, the result goes through the second LayerNorm layer and the MLP, and is again added to its input
```
        x = x + self.mlp(self.norm2(x))
```
Finally, the processed result is returned
```
        return x
```
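This structure means every block applies both self-attention and an MLP while keeping a residual connection around each of them

Next, the FinalLayer class represents the final layer of DiT

Its constructor initializes a LayerNorm layer and a linear layer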

    def __init__(self, hidden_size, out_channels):
        super().__init__()
        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        self.linear = nn.Linear(hidden_size, out_channels, bias=True)

The forward method passes the input through these layers

    def forward(self, x):
        x = self.norm_final(x)
        x = self.linear(x)
        return x

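Finally, the DiT class is a diffusion model with a Transformer backbone

Its constructor takes a number of parameters: the number of input channels, the hidden size, the depth, the number of attention heads, the MLP ratio, the class dropout probability, the token size, the future action window size, the past action window size, and whether to learn sigma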

    def __init__(
        self,
        in_channels=7,
        hidden_size=1152,
        depth=28,
        num_heads=16,
        mlp_ratio=4.0,
        class_dropout_prob=0.1,
        token_size=4096,
        future_action_window_size=1,
        past_action_window_size=0,
        learn_sigma=False,
    ):

$\rightarrow$ The constructor first stores its configuration

        super().__init__()

        assert past_action_window_size == 0, "Error: action_history is not used now"

        self.learn_sigma = learn_sigma
        self.in_channels = in_channels
        self.out_channels = in_channels * 2 if learn_sigma else in_channels
        self.class_dropout_prob = class_dropout_prob
        self.num_heads = num_heads
        self.past_action_window_size = past_action_window_size
        self.future_action_window_size = future_action_window_size

$\rightarrow$ It then initializes the embedders, the positional embedding, the DiT blocks, and the final layer

        # Action history is not used now.
        self.history_embedder = HistoryEmbedder(action_size=in_channels, hidden_size=hidden_size)
        

        # Instantiate the three embedder classes defined above
        self.x_embedder = ActionEmbedder(action_size=in_channels, hidden_size=hidden_size)
        self.t_embedder = TimestepEmbedder(hidden_size)
        self.z_embedder = LabelEmbedder(in_size=token_size, hidden_size=hidden_size, dropout_prob=class_dropout_prob)
        scale = hidden_size ** -0.5

        # Learnable positional embeddings
        # +2, one for the conditional token, and one for the current action prediction
        self.positional_embedding = nn.Parameter(
                scale * torch.randn(future_action_window_size + past_action_window_size + 2, hidden_size))

        self.blocks = nn.ModuleList([
            # Build the DiTBlock defined above, once per layer
            DiTBlock(hidden_size, num_heads, mlp_ratio=mlp_ratio) for _ in range(depth)
        ])

        # Build the FinalLayer defined above
        self.final_layer = FinalLayer(hidden_size, self.out_channels)

$\rightarrow$ Finally, it calls the initialize_weights method to initialize the weights


        # initialize_weights is described right after this snippet
        self.initialize_weights()

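The initialize_weights method defines an inner function, _basic_init, that initializes the weights and biases of linear layers
It applies this function to every linear layer in the model. It also initializes the weights and biases of the token embedding, the action history embedding, the label embedding, and the timestep embedding
The forward method runs the model's forward pass
First, the input actions go through the action embedder

Then the timestep goes through the timestep embedder and the condition goes through the label embedder
Next, the timestep embedding and the condition embedding are added together and concatenated with the action embeddings
Finally, the positional embedding is added to the concatenated sequence, which then passes through the DiT blocks and the final layer to produce the output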
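The forward pass described above can be traced for a single sample with plain lists. Everything here is a stand-in: `embed_action`, `identity_block`, and `project` are hypothetical placeholders for ActionEmbedder, DiTBlock, and FinalLayer. Dropping the first output slot so that the output lines up with the noisy actions is an assumption of this sketch, since the article does not show the body of forward.

```python
def add(u, v):
    """Elementwise sum of two equally long vectors."""
    return [a + b for a, b in zip(u, v)]


def dit_forward_sketch(noisy_actions, t_emb, z_emb, pos_emb,
                       embed_action, blocks, final_layer):
    """Trace the token flow of the forward pass for one sample (no batch dim)."""
    x = [embed_action(a) for a in noisy_actions]  # one token per action step
    c = add(t_emb, z_emb)                         # condition token = timestep + cognition
    seq = [c] + x                                 # concatenate condition and action tokens
    seq = [add(tok, pos) for tok, pos in zip(seq, pos_emb)]
    for block in blocks:
        seq = block(seq)
    out = [final_layer(tok) for tok in seq]
    return out[1:]                                # assumed: drop the condition slot


hidden, action_dim, future_window = 4, 2, 1


def embed_action(a):
    """Stand-in for the linear ActionEmbedder: zero-pad to the hidden size."""
    return a + [0.0] * (hidden - len(a))


def identity_block(seq):
    """Stand-in for a DiTBlock."""
    return seq


def project(tok):
    """Stand-in for FinalLayer: keep the first action_dim channels."""
    return tok[:action_dim]


noisy = [[0.1, -0.2], [0.3, 0.4]]                               # current + 1 future action
pos_emb = [[0.0] * hidden for _ in range(future_window + 2)]    # same "+2" as in the constructor
pred = dit_forward_sketch(noisy, [0.0] * hidden, [0.0] * hidden, pos_emb,
                          embed_action, [identity_block] * 2, project)
```

The sequence length inside the blocks is `future_window + 2`, matching the size of the learnable positional embedding, while the returned prediction has one entry per noisy action.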

The forward_with_cfg method implements the diffusion forward pass with classifier-free guidance

    def forward_with_cfg(self, x, t, z, cfg_scale):
        """
        Forward pass of Diffusion, but also batches the unconditional forward pass for classifier-free guidance.
        """

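First, it takes the first half of the input batch and duplicates it, so that both halves of the batch hold the same noisy input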

        # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
        half = x[: len(x) // 2]
        combined = torch.cat([half, half], dim=0).to(next(self.x_embedder.parameters()).dtype)

Then it runs the model's forward pass to produce the output

        model_out = self.forward(combined, t, z)

Next, it splits the output into a conditional part and an unconditional part and computes their weighted combination

        eps, rest = model_out[:, :self.in_channels], model_out[:, self.in_channels:]
        cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
        half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)

Finally, it concatenates the guided result with the remaining part to produce the final output

        eps = torch.cat([half_eps, half_eps], dim=0)
        return torch.cat([eps, rest], dim=1)
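The guidance step itself is a one-line formula, shown here on toy numbers. `cfg_combine` is a hypothetical helper name; the formula is the same `uncond_eps + cfg_scale * (cond_eps - uncond_eps)` used in the code above, and 1.5 is the CFG coefficient quoted in Section 1.3.2.

```python
def cfg_combine(cond_eps, uncond_eps, cfg_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_eps, uncond_eps)]


cond = [1.0, 0.0, -2.0]    # noise predicted with the cognition feature
uncond = [0.0, 0.0, -1.0]  # noise predicted with the unconditional embedding

guided = cfg_combine(cond, uncond, cfg_scale=1.5)
same_as_cond = cfg_combine(cond, uncond, cfg_scale=1.0)
same_as_uncond = cfg_combine(cond, uncond, cfg_scale=0.0)
```

A scale of 1 reproduces the conditional prediction and 0 the unconditional one. Values above 1 extrapolate past the conditional prediction, pushing samples further towards the condition.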

// To be updated