A Review of the Transformer Encoder-Decoder Architecture

For a detailed introduction to Transformer, BERT, and their many variants, please see my other blog post: an overview of the most popular pre-trained language models (BERT, ALBERT, and XLNet explained).

Based on my reading of the T5 paper, this post revisits common concepts such as auto-encoder and auto-regressive language models, as well as the decoder structure of Transformer-based models.

[Figure]

Image source: https://twitter.com/nash_su/status/1639915613727641608

1. Auto-encoder & Auto-regressive Language Models

1.1 Auto-encoder

Encoder-only language models such as BERT, ALBERT, and RoBERTa.

Advantages:

  • Can attend to both left and right context at the same time (bidirectional encoding).

Disadvantages:

  • A conditional-independence assumption between the masked tokens, which goes against the intuition of natural language generation.
  • With an encoder-only architecture, the pre-training objective does not match many downstream generation tasks.

Considering the pre-training objective, masked language models (MLM) like BERT can also be called denoising auto-encoders.

1.2 Auto-regressive

Sequential (left-to-right) LMs and decoder-only LMs such as ELMo and GPT.

Traditional sequential language models such as RNNs and ELMo cannot, strictly speaking, be called decoders. It is only because a large number of Transformer-based auto-regressive LMs (e.g., GPT) appeared later, all of which directly decode the output for text generation, that in the "Transformer era" auto-regressive is now often simply equated with decoder-only. The rough idea can be understood from the figure below:
[Figure]

Advantages:

  • No conditional-independence assumption.
  • Pre-training can be framed directly as generation, matching the objective of downstream generation tasks.

Disadvantages:

  • Cannot encode bidirectional context at the same time (models like ELMo are only "pseudo-bidirectional", and naive bidirectionality easily "leaks the answer").

Most auto-regressive architectures in NLP today are Transformer-based; traditional sequential LMs such as ELMo have almost been forgotten.

2. Overview of Transformer-based Model Structures

As mentioned above, our current understanding of LMs (auto-encoder and auto-regressive) is mostly framed in terms of Transformer-based structures, so this section does not discuss traditional sequential LMs such as ELMo.

The T5 paper gave an overview of the classic Transformer architectures; together with the encoder-only variant described above, they fall into the following categories:

[Figure]

  1. Encoder-only Language Model: i.e., the auto-encoder. As described in Section 1, its defining feature is bidirectional encoding of the full context. Representatives: BERT, RoBERTa, etc.
  2. Decoder-only Language Model (middle of the figure above): i.e., auto-regressive (excluding traditional sequential models); it can be simply understood as having only a decoder. Its defining feature is that it only sees preceding tokens (because it is decoder-only). Representatives: GPT, etc.
  3. Encoder-Decoder (left of the figure above): i.e., auto-encoder + auto-regressive. This is the original Transformer structure: both the encoder and the decoder have self-attention, and the decoder additionally has cross-attention to incorporate the encoder's output. Its defining feature is that the encoder sees the full bidirectional context while the decoder only sees preceding tokens. Representatives: BART, T5, etc. This also makes these models particularly well suited to generation tasks.
  4. Prefix LM (right of the figure above): can be simply understood as a variant of the Encoder-Decoder structure. Its defining feature is that one part of the input behaves like an encoder and sees the full context, while the rest behaves like a decoder and only sees past tokens. Representatives: UniLM, etc.

For a Transformer to realize "seeing the full context" versus "seeing only the preceding tokens", it must rely on masked attention: self-attention by default attends over the whole sequence, so making part of it invisible requires a mask. This is different from traditional sequential LMs such as RNNs, where the next token depends only on the hidden state of the preceding tokens and therefore naturally sees only the past.
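As a concrete illustration of these visibility patterns, here is a minimal PyTorch sketch (my own, not taken from the T5 code) that builds the three kinds of masks; `prefix_len` is a made-up split point, and the convention follows this post: 0 = visible, 1 = masked.

```python
import torch

seq_len, prefix_len = 5, 2   # prefix_len is a hypothetical split point for the Prefix LM case

# Encoder-only / encoder side: every token sees the whole sequence.
full_mask = torch.zeros(seq_len, seq_len)

# Decoder-only (auto-regressive): everything above the diagonal is masked,
# so each token sees only itself and earlier tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

# Prefix LM: the prefix columns are visible to every position (bidirectional
# within the prefix), the rest stays causal.
prefix_mask = causal_mask.clone()
prefix_mask[:, :prefix_len] = 0

print(full_mask, causal_mask, prefix_mask, sep="\n\n")
```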

Below, using the Encoder-Decoder LM as an example, we look at how attention is used so that the encoder sees the whole sequence while the decoder only sees the preceding tokens. Along the way we cover the concrete operational differences between the encoder and the decoder, since this attention mechanism is one of those differences.

3. Differences Between Encoder and Decoder & the Masked Attention Mechanism

We use T5's code as the example here. Masks in the Transformer actually come in two kinds: 1) the padding mask; 2) the sequence mask.

3.1 Padding mask vs. Sequence mask

The padding mask is simple: within a batch, the shorter samples are padded up to the length of the longest sample. Since these padded positions carry no meaning, the attention mechanism should not place any attention on them, so attention is masked at the padding positions. This is used in both the encoder and the decoder and is just a simple batched tensor operation.

The sequence mask is what restricts the decoder to seeing only the preceding tokens: everything after the current step is masked out. This operation is used only in the decoder.

In short, of these two kinds of masks, the sequence mask is the key to controlling whether the model can see subsequent tokens.
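As a small illustration (a PyTorch sketch, not the actual T5 code; `pad_token_id = 0` and the toy ids are assumptions), the padding mask can be derived directly from the padded batch and broadcast to the shape in which it is added to the attention scores:

```python
import torch

pad_token_id = 0                  # assumed pad id for this toy example
batch = torch.tensor([
    [101, 456, 789, 12],          # full-length sample
    [101, 456,   0,  0],          # shorter sample, padded with 0
])

# Convention as in this post: 0 = attend, 1 = ignore (padding positions).
padding_mask = (batch == pad_token_id).long()           # [batch_size, seq_len]

# Broadcast to the shape applied to attention scores: [batch_size, 1, 1, seq_len]
extended_padding_mask = padding_mask[:, None, None, :]
print(extended_padding_mask.shape)                       # torch.Size([2, 1, 1, 4])
```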

3.2 Encoder

The encoder uses the full context, so its sequence mask is all zeros, with shape [batch_size, 1, 1, seq_len]:
[Figure]

3.3 Decoder

The decoder uses only the preceding tokens, so its attention mask is a matrix (seq_len × seq_len) whose lower triangle is all zeros, forcing the model to see only the preceding part of the input. As shown in the figure below, the actual shape of the sequence mask is [batch_size, 1, seq_len, seq_len].

[Figure]
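A minimal sketch (again PyTorch, not the T5 source) of building this [batch_size, 1, seq_len, seq_len] sequence mask and applying it before the softmax; masked positions receive a large negative score so they get zero attention weight:

```python
import torch

batch_size, seq_len = 2, 4
causal = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)          # 1 above the diagonal
sequence_mask = causal[None, None, :, :].expand(batch_size, 1, seq_len, seq_len)

scores = torch.randn(batch_size, 1, seq_len, seq_len)                  # toy attention scores
scores = scores.masked_fill(sequence_mask.bool(), float("-inf"))       # hide future positions
attn = scores.softmax(dim=-1)   # each row only attends to itself and earlier positions
```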

3.4 Differences Between the Encoder and the Decoder

Finally, a brief summary of the differences between the encoder and the decoder in an Encoder-Decoder LM.

First, the decoder has three sub-layers: sub-layer[1] computes self-attention, sub-layer[2] computes cross-attention, and sub-layer[3] is a linear layer that maps the final hidden states to logits of size vocab_size.

sub_layer_1 (self-attention):
	- inputs:
		- [at training time]: decoder_input_ids + sequence mask
		- [at test time]:     previously predicted tokens
	- outputs: hidden_states_1

sub_layer_2 (cross-attention):
	- inputs: hidden_states_1 (query) and encoder_hidden_states (key & value)
	- outputs: hidden_states_2

sub_layer_3 (linear + softmax):
	- inputs: hidden_states_2
	- outputs: logits (vocabulary probabilities)

So compared with the encoder, the decoder's structure and computation are mostly the same, apart from the following three extra parts (a rough code sketch follows the list):

  1. Sequence mask: when the decoder computes self-attention in sub-layer[1], the sequence mask mechanism ensures it only sees preceding tokens; the encoder has no such mask and sees the full sequence by default when computing self-attention. (This refers to the training stage; during inference the decoder has no sequence mask either, see Section 4 of this post.)
  2. Cross-attention: the decoder's sub-layer[2] computes cross-attention. Concretely, this layer uses the encoder's output hidden states as k and v, and the output of sub-layer[1]'s self-attention as q, to perform the attention computation. This lets the decoder fully incorporate the encoder side's entire context, hence the name "cross-attention".
  3. Linear + softmax: finally, sub-layer[3] has a projection layer that outputs a vocabulary-probability prediction for each token.
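To make the three sub-layers concrete, here is a rough, heavily simplified PyTorch sketch. It is not T5's actual implementation: residual connections, layer norms and the feed-forward block are omitted, only a single decoder layer is shown, and in real models the linear lm_head sits after the whole decoder stack rather than inside each layer. All module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class SketchDecoderLayer(nn.Module):
    """Illustrative only: one decoder layer plus a vocabulary projection."""
    def __init__(self, d_model=512, n_heads=8, vocab_size=32128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_states, encoder_states, sequence_mask):
        # sub_layer_1: masked self-attention over the decoder inputs.
        # sequence_mask is a [tgt_len, tgt_len] bool matrix, True above the diagonal,
        # so each position only attends to itself and earlier positions.
        h1, _ = self.self_attn(decoder_states, decoder_states, decoder_states,
                               attn_mask=sequence_mask)
        # sub_layer_2: cross-attention, q = decoder states, k/v = encoder outputs.
        h2, _ = self.cross_attn(h1, encoder_states, encoder_states)
        # sub_layer_3: project hidden states to vocab_size logits.
        return self.lm_head(h2)

# Toy usage with random tensors:
layer = SketchDecoderLayer()
enc = torch.randn(2, 6, 512)   # encoder outputs: [batch, src_len, d_model]
dec = torch.randn(2, 4, 512)   # decoder inputs:  [batch, tgt_len, d_model]
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
logits = layer(dec, enc, mask)  # [batch, tgt_len, vocab_size]
```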

4. Differences in the Decoder Between Training and Inference

The encoder's computation does not change between training and testing.

The decoder, however, works somewhat differently at inference time than it does during training.

For ease of understanding, we first describe the simpler inference process and then explain how training differs.

4.1 Inference

At test time, an auto-regressive LM predicts the next token based on all of its previous predictions.

To generate the very first token, a special token has to be fed in first; we call it the "start letter". Different models use different start letters (e.g., for T5 it is the padding token with id 0; many other models use a BOS-style token instead).
[Figure]
As shown in the figure above, once the model predicts the vocabulary distribution for the first token, we take the highest-probability word as the prediction (i.e., greedy decoding; multinomial sampling, beam search, and other schemes can also be used).

Say this is the word "Ich".

In the second step, we feed the two tokens start letter + "Ich" as the new input and have the model predict the next token.

[Figure]
Likewise, once the next token is predicted, we repeat the step above: all previously predicted tokens (start letter + predicted tokens) become the new input, and the model predicts again, until it emits the "ending letter" (e.g., EOS), which ends decoding.

In short, as shown in the figure below, the decoder on the right uses the previously generated tokens as new input to predict the next token, and the probability of the next token is the output at the previous position. For example, the word "Ich" in the figure below is determined by the probability distribution "I1" output at the BOS position.

[Figure]

Because the decoder's goal is to use the tokens it already has to predict the next token, there is an "offset" between the decoder's input and output (as in the figure above, the output corresponding to y1 == "Ich" is actually "I2"), which takes a moment to get used to.
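To make the loop and the offset concrete, here is a minimal greedy-decoding sketch using the plain forward() interface of the Hugging Face transformers library with T5 ("t5-small" is just a placeholder checkpoint, and the translation prompt and length limit are arbitrary):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")          # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

input_ids = tokenizer("translate English to German: I love you.",
                      return_tensors="pt").input_ids

# The "start letter": for T5 this is decoder_start_token_id (the pad token, id 0).
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(20):                                         # arbitrary length limit
        logits = model(input_ids=input_ids,
                       decoder_input_ids=decoder_input_ids).logits
        # The logits at the *last* decoder position predict the *next* token
        # (the input/output offset described above).
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:      # the "ending letter"
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```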

Regarding the transformers framework: if you implement inference with the regular forward() interface, the extra thing to watch out for is the start letter; it must be present, and it must be the correct one for the model.
Therefore, when running inference with a decoder model, the official recommendation is to call the unified generate() interface, which automatically prepends the model's start letter to your input.
See the official documentation, Text Generation, for more of generate()'s parameters and sampling strategies.
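For comparison, a hedged usage example of generate(), reusing the model, tokenizer and input_ids from the sketch above; the parameter values are arbitrary:

```python
output_ids = model.generate(
    input_ids,
    max_length=40,        # arbitrary cap on output length
    num_beams=1,          # 1 = greedy; >1 switches to beam search
    do_sample=False,      # True would enable multinomial sampling
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```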

4.2 Training

During training, the classic teacher forcing technique is used.

Its core idea: for an auto-regressive LM, predicting the next token depends on all previously predicted tokens; if during training the model had to predict the next token from its own predictions, it would rarely get it right (error propagation), so the loss on later tokens would grow larger and larger. In short, this would make optimization difficult.

So during training we feed the ground-truth answer to the decoder as its input, so that only the token currently being predicted is penalized and the model's earlier prediction errors are not carried forward. For more details on teacher forcing, see this blog post: Teacher Forcing.

4.2.1 Implementation Details

In terms of the transformers framework's implementation, unlike at inference time, the decoder at training time takes an extra input called decoder_input_ids. This is exactly what was mentioned above: we feed the correct answer to the model as input.

It looks roughly like this:

[Figure]
As shown in the figure above, the contents of decoder_input_ids and the ground-truth labels are almost identical.

The only difference is that labels is tokens + ending letter, while decoder_input_ids is start letter + tokens. This mirrors the decoder input/output "offset" mentioned earlier.

One might object: if all the correct tokens are given as input from the start, doesn't that leak the answers to the model? This is where the sequence mask mentioned earlier comes in: when predicting the next token, only the preceding part of the sequence is visible, and everything after it (including the token to be predicted) is masked out of the input.

For example:

  • step1
decoder_input_ids = [0,10747,7,15]
sequence_mask     = [0,1,1,1]
labels            = [10747,7,15,1]

With this, the decoder's self-attention sees only the first token, i.e., 0 (the start letter). The goal is to predict the probability of the next token; this probability is then compared against the first token of labels, i.e., 10747, with a cross-entropy loss.

  • step2
decoder_input_ids = [0,10747,7,15]
sequence_mask     = [0,0,1,1]      ## sequence mask changed
labels            = [10747,7,15,1]

Next, the mask changes, and the decoder's self-attention sees two tokens: one is still 0 (the start letter), the other is the correct answer 10747 that we provided.

This also echoes Section 3 and why the sequence mask is a matrix whose lower triangle is all zeros: each next-token prediction corresponds to one row of the sequence mask matrix.

Notice that this is the same process as inference; the only difference is that instead of feeding the model its own previous predictions, we feed it the corresponding answers, and the sequence mask ensures the later answers are not leaked.

  • step n
decoder_input_ids = [0,10747,7,15]
sequence_mask     = [0,0,0,0]      ## all decoder_input_ids can be seen
labels            = [10747,7,15,1]

The above repeats until the maximum length is reached (in the spirit of teacher forcing, we already know the final length of the answer; we do not wait for the model to predict the ending letter to stop).

At this point the model can see all of decoder_input_ids, and the goal is to predict the probability of the last token; this probability is compared against the last token of labels, i.e., 1 (the ending letter), to compute the loss.
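A sketch of what this looks like with the transformers API (T5 assumed, same placeholder checkpoint and toy sentences as before): passing labels lets the model build decoder_input_ids by shifting the labels to the right and compute the token-level cross-entropy internally, with all positions handled in one forward pass under the causal mask (each "step" above corresponds to one row of that mask).

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("translate English to German: I love you.",
                      return_tensors="pt").input_ids
labels = tokenizer("Ich liebe dich.", return_tensors="pt").input_ids   # tokens + ending letter

# With labels given and decoder_input_ids omitted, the model shifts labels right
# (teacher forcing) and returns the cross-entropy loss over all positions.
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
```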

Models with a decoder generally come with a method called prepare_decoder_input_ids_from_labels that builds decoder_input_ids from labels (as described above, it is really just a shift that offsets labels by one position).

[Figure]

See the official documentation: https://huggingface.co/docs/transformers/v4.23.1/en/model_doc/plbart#transformers.PLBartTokenizer.build_inputs_with_special_tokens
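A minimal sketch of that shift, reusing the toy ids from the steps above (0 is T5's decoder start / pad token); real implementations, e.g. T5's _shift_right, do essentially the same thing:

```python
import torch

labels           = torch.tensor([[10747, 7, 15, 1]])   # tokens + ending letter (EOS = 1)
decoder_start_id = 0                                    # T5's start letter (the pad token)

decoder_input_ids = torch.full_like(labels, decoder_start_id)
decoder_input_ids[:, 1:] = labels[:, :-1]               # start letter + tokens
print(decoder_input_ids)                                # tensor([[    0, 10747,     7,    15]])
```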


As an extra recommendation, here is a survey on large language models that also contains many summaries of language-model encoders and decoders:


In addition, here are some discussions from Twitter (both people's points are actually correct, just described from different angles; comparatively, Yi Tay's description is clearer):

[Figure]

