Cosmos Tokenizer

EAI工程笔记

License: CC 4.0 BY-SA

Category: AI Open-Source Projects. Tags: Cosmos Tokenizer, world model, Checkpoints, nvidia

Article link: https://blog.csdn.net/lovechris00/article/details/145394581

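I. About Cosmos Tokenizer

NVIDIA Cosmos Tokenizer is a suite of image and video tokenizers that advances the state of the art in visual tokenization, paving the way for scalable, robust, and efficient development of large autoregressive transformers (such as LLMs) or diffusion generators.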

GitHub: https://github.com/NVIDIA/Cosmos-Tokenizer
Website | Cosmos | NVIDIA News | NVIDIA Blog | Hugging Face | YouTube | TokenBench
Web Demo
- Image Tokenization : https://colab.research.google.com/github/nvidia/Cosmos-Tokenizer/blob/main/notebook/Image_Tokenization.ipynb
- Video Tokenization : https://colab.research.google.com/github/nvidia/Cosmos-Tokenizer/blob/main/notebook/Video_Tokenization.ipynb

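Cosmos Tokenizer is a core component of NVIDIA Cosmos, a developer-first video foundation model platform designed to help Physical AI developers build their Physical AI systems better and faster. See the demo video.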

	Continuous (C)	Discrete (D)
Images (I)	Cosmos-Tokenizer-CI	Cosmos-Tokenizer-DI
Videos (V)	Cosmos-Tokenizer-CV	Cosmos-Tokenizer-DV

cosmos-tokenizer.mp4

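Given an image or a video, Cosmos Tokenizer outputs either continuous latents or discrete tokens.

Cosmos Tokenizer achieves spatial compression rates of 8x or 16x and temporal compression factors of 4x or 8x, for a total compression factor of up to 2048x (= 8 x 16 x 16).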
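The total factor quoted above is simply the temporal factor multiplied by the spatial factor once for height and once for width. A minimal sketch of that arithmetic follows; the function name is illustrative, and the three configurations are taken from the video model names that appear later in this article.

```python
def total_compression(temporal: int, spatial: int) -> int:
    """Total factor = temporal x spatial (height) x spatial (width)."""
    return temporal * spatial * spatial


# (temporal, spatial) pairs matching the 4x8x8, 8x8x8 and 8x16x16 video models.
VIDEO_CONFIGS = [(4, 8), (8, 8), (8, 16)]

for temporal, spatial in VIDEO_CONFIGS:
    factor = total_compression(temporal, spatial)
    print(f"{temporal}x{spatial}x{spatial} -> {factor}x total compression")
```

The largest configuration, 8x16x16, gives the 2048x figure.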

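Cosmos Tokenizer delivers 8x more total compression than state-of-the-art (SOTA) methods while maintaining higher image quality, and it runs up to 12x faster than the best available SOTA tokenizers.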

License

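The models are licensed under the NVIDIA Open Model License. Under this license:
- The models are commercially usable.
- You are free to create and distribute derivative models.
- NVIDIA does not claim ownership of any outputs generated using the models or derivative models.
GitHub code: this repository is licensed under the Apache 2.0 license.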

II. Installation

Clone the source code

git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
cd Cosmos-Tokenizer

Install via pip

apt-get install -y ffmpeg
pip3 install -e .

Preferably, build a Docker image using the provided Dockerfile

docker build -t cosmos-docker -f Dockerfile .

# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
    --workdir ${PWD} cosmos-docker /bin/bash

III. Downloading the pre-trained checkpoints from Hugging Face

We host 12 Cosmos-Tokenizer models on Hugging Face, with the model names listed below. You can download them with this snippet:

from huggingface_hub import login, snapshot_download
import os

login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
model_names = [
        "Cosmos-0.1-Tokenizer-CI8x8",
        "Cosmos-0.1-Tokenizer-CI16x16",
        "Cosmos-0.1-Tokenizer-CV4x8x8",
        "Cosmos-0.1-Tokenizer-CV8x8x8",
        "Cosmos-0.1-Tokenizer-CV8x16x16",
        "Cosmos-0.1-Tokenizer-DI8x8",
        "Cosmos-0.1-Tokenizer-DI16x16",
        "Cosmos-0.1-Tokenizer-DV4x8x8",
        "Cosmos-0.1-Tokenizer-DV8x8x8",
        "Cosmos-0.1-Tokenizer-DV8x16x16",
        "Cosmos-1.0-Tokenizer-CV8x8x8",
        "Cosmos-1.0-Tokenizer-DV8x16x16",
]
for model_name in model_names:
    hf_repo = "nvidia/" + model_name
    local_dir = "pretrained_ckpts/" + model_name
    os.makedirs(local_dir, exist_ok=True)
    print(f"downloading {model_name}...")
    snapshot_download(repo_id=hf_repo, local_dir=local_dir)

Under each checkpoint directory pretrained_ckpts/{model_name}, we provide the encoder, the decoder, and the full autoencoder as JIT models.

├── Cosmos-0.1-Tokenizer-DV4x8x8/
│   ├── encoder.jit
│   ├── decoder.jit
│   ├── autoencoder.jit
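Before running any of the examples, it can help to confirm that the three files are where the layout above says they should be. The helper below is a small sketch, not part of the Cosmos Tokenizer package: `checkpoint_paths` is a hypothetical name, and the dictionary keys simply mirror the `checkpoint_enc`, `checkpoint_dec`, and `checkpoint` arguments used in this article.

```python
from pathlib import Path


def checkpoint_paths(model_name: str, root: str = "pretrained_ckpts") -> dict:
    """Map each checkpoint argument name to its expected JIT file (hypothetical helper)."""
    base = Path(root) / model_name
    return {
        "checkpoint_enc": base / "encoder.jit",
        "checkpoint_dec": base / "decoder.jit",
        "checkpoint": base / "autoencoder.jit",
    }


paths = checkpoint_paths("Cosmos-0.1-Tokenizer-DV4x8x8")
missing = [str(path) for path in paths.values() if not path.exists()]
print("missing files:", missing or "none")
```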

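IV. Running the code

You can use the following example commands to encode and decode images or videos.
For each, the same command works for both continuous and discrete tokenization. Simply pass the appropriate JIT-compiled checkpoint to checkpoint_enc or checkpoint_dec, or pass the full autoencoder checkpoint to checkpoint.

1. Encoding into a continuous latent space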

import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-0.1-Tokenizer-CV4x8x8"  # must match the directory created by the download snippet
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
(latent,) = encoder.encode(input_tensor)
torch.testing.assert_close(latent.shape, (1, 16, 3, 64, 64))

# The input tensor can be reconstructed by the decoder as:
decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(latent)
torch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)

The latent will have shape (1, 16, 3, 64, 64), where the first of the three latents represents the first frame, and C=16 is the number of latent channels.
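The shape above can be reproduced by hand. The sketch below assumes the causal layout implied by the example: the first frame gets its own latent, and the remaining T-1 frames are compressed by the temporal factor, while height and width are divided by the spatial factor. The function name and the divisibility checks are illustrative and not taken from the library.

```python
def latent_shape(input_shape, temporal, spatial, latent_channels=16):
    """Expected latent shape for a [B, C, T, H, W] input (assumed causal layout)."""
    batch, _, frames, height, width = input_shape
    if (frames - 1) % temporal != 0:
        raise ValueError("T - 1 must be divisible by the temporal factor")
    if height % spatial != 0 or width % spatial != 0:
        raise ValueError("H and W must be divisible by the spatial factor")
    return (
        batch,
        latent_channels,
        1 + (frames - 1) // temporal,  # first frame + compressed remainder
        height // spatial,
        width // spatial,
    )


# 9 frames at 512x512 with a 4x8x8 tokenizer: 1 + 8/4 = 3 latent frames, 512/8 = 64.
print(latent_shape((1, 3, 9, 512, 512), temporal=4, spatial=8))
```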

2. Encoding into discrete tokens

import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-0.1-Tokenizer-DV4x8x8"  # must match the directory created by the download snippet
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
(indices, codes) = encoder.encode(input_tensor)
torch.testing.assert_close(indices.shape, (1, 3, 64, 64))
torch.testing.assert_close(codes.shape, (1, 6, 3, 64, 64))

# The input tensor can be reconstructed by the decoder as:
decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(indices)
torch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)

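The indices will have shape (1, 3, 64, 64) and contain integral values in the range [1..64K], where the first of the three integral maps represents the first frame. The codes will contain the pre-quantization continuous latent with shape (1, 6, 3, 64, 64), where C=6 is the number of FSQ levels.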
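To see how six FSQ dimensions can yield a vocabulary of about 64K, consider the sketch below. It is only an illustration of finite scalar quantization indexing: the per-dimension levels (8, 8, 8, 5, 5, 5) are an assumption chosen because their product is 64,000, since this article does not state them, and the indices here are 0-based whereas the text above quotes the range as [1..64K].

```python
import math

# Assumed per-dimension FSQ levels (hypothetical; not stated in this article).
FSQ_LEVELS = (8, 8, 8, 5, 5, 5)


def vocabulary_size(levels=FSQ_LEVELS):
    """Number of distinct codes: the product of the per-dimension levels."""
    return math.prod(levels)


def code_to_index(digits, levels=FSQ_LEVELS):
    """Mixed-radix encoding of one quantized code into a single integer index."""
    index, basis = 0, 1
    for digit, level in zip(digits, levels):
        if not 0 <= digit < level:
            raise ValueError(f"digit {digit} out of range for level {level}")
        index += digit * basis
        basis *= level
    return index


def index_to_code(index, levels=FSQ_LEVELS):
    """Inverse of code_to_index: recover the per-dimension digits."""
    digits = []
    for level in levels:
        digits.append(index % level)
        index //= level
    return tuple(digits)


print(vocabulary_size())                    # size of the token vocabulary
print(code_to_index((7, 7, 7, 4, 4, 4)))    # largest index under these levels
```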

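V. TorchScript (PyTorch JIT) Inference API

The following instructions run the various tokenizers on the example images and videos provided in test_data/.

Autoencoding images. Accepts an input image and outputs a reconstruction of the image obtained by decoding the encoded latents.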

# Autoencoding images using `Cosmos-CI` with a compression rate of 8x8.
model_name="Cosmos-0.1-Tokenizer-CI8x8"
python3 -m cosmos_tokenizer.image_cli \
    --image_pattern 'test_data/image.png' \
    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \
    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit

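If --output_dir is not specified, you can find the reconstructed image at test_data/reconstructions/image.png.

Autoencoding videos. Accepts an input video and outputs a reconstruction of the video obtained by decoding the encoded latents.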

# Autoencoding videos using `Cosmos-DV` with a compression rate of 4x8x8.
model_name="Cosmos-0.1-Tokenizer-DV4x8x8"
python3 -m cosmos_tokenizer.video_cli \
    --video_pattern 'test_data/video.mp4' \
    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \
    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit

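If --output_dir is not specified, you can find the reconstructed video at test_data/reconstructions/video.mp4.

VI. PyTorch Inference API

To run the tokenizers in native PyTorch, append --mode=torch to your commands.
In PyTorch mode, the model is constructed from the native network definition scripts, which requires additional arguments to configure the model for instantiation.

For example, to instantiate Cosmos-DI with a spatial compression factor of 8, append the following command-line arguments: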

--mode=torch
--tokenizer_type=DI
--spatial_compression=8

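Note that --checkpoint_enc, --checkpoint_dec, and --checkpoint should still refer to the JIT files.
The necessary state_dict will be extracted from the loaded JIT models to initialize the weights of the constructed native PyTorch model.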
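The hand-off from a JIT file to a native model can be pictured with plain PyTorch calls. The sketch below is not the Cosmos Tokenizer loader: `TinyNet` is a hypothetical stand-in for the real network definition, and an in-memory buffer stands in for a file such as encoder.jit. It only shows that a state_dict read from a loaded TorchScript module can initialize a natively constructed module with matching parameter names.

```python
import io

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a native network definition."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 2)

    def forward(self, x):
        return self.proj(x)


# Stand-in for a JIT checkpoint on disk: script a model and save it to a buffer.
buffer = io.BytesIO()
torch.jit.save(torch.jit.script(TinyNet()), buffer)
buffer.seek(0)

# Load the JIT model, then copy its weights into a freshly built native model.
jit_model = torch.jit.load(buffer, map_location="cpu")
native_model = TinyNet()
native_model.load_state_dict(jit_model.state_dict())

print(sorted(jit_model.state_dict().keys()))
```

This works only when the native module's parameter names match those stored in the JIT file, which is why PyTorch mode needs the extra arguments that select the right architecture.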

# Autoencoding images using `Cosmos-DI` with a compression rate of 8x8.
model_name="Cosmos-0.1-Tokenizer-DI8x8"
python3 -m cosmos_tokenizer.image_cli \
    --image_pattern 'test_data/*.png' \
    --mode=torch \
    --tokenizer_type=DI \
    --spatial_compression=8 \
    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \
    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit

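To instantiate Cosmos-CV with a temporal compression factor of 8 and a spatial compression factor of 8, append the following command-line arguments: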

--mode=torch
--tokenizer_type=CV
--temporal_compression=8
--spatial_compression=8

# Autoencoding videos using `Cosmos-CV` with a compression rate of 8x8x8.
model_name="Cosmos-0.1-Tokenizer-CV8x8x8"
python3 -m cosmos_tokenizer.video_cli \
    --video_pattern 'test_data/*.mp4' \
    --mode=torch \
    --tokenizer_type=CV \
    --temporal_compression=8 \
    --spatial_compression=8 \
    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \
    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit
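Because the model names encode the tokenizer type and compression factors, the extra PyTorch-mode arguments can be derived from the name rather than typed by hand. The helper below is a sketch under that naming assumption; `cli_args_from_model_name` is a hypothetical function, not part of the package, and it only emits the flags shown in this section.

```python
import re


def cli_args_from_model_name(model_name: str) -> list:
    """Derive the PyTorch-mode flags from a name ending in e.g. 'DI8x8' or 'CV8x8x8'."""
    match = re.search(r"-([CD])([IV])(\d+(?:x\d+)+)$", model_name)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_name}")
    tokenizer_type = match.group(1) + match.group(2)
    factors = [int(part) for part in match.group(3).split("x")]
    is_video = match.group(2) == "V"
    if len(factors) != (3 if is_video else 2):
        raise ValueError(f"unexpected number of factors in: {model_name}")

    args = ["--mode=torch", f"--tokenizer_type={tokenizer_type}"]
    if is_video:
        # Video names are ordered temporal x height x width.
        args.append(f"--temporal_compression={factors[0]}")
    args.append(f"--spatial_compression={factors[-1]}")
    return args


print(" ".join(cli_args_from_model_name("Cosmos-Tokenizer-CV8x8x8")))
print(" ".join(cli_args_from_model_name("Cosmos-Tokenizer-DI8x8")))
```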

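VII. Inference and dataset tokenization with NeMo (JIT/TensorRT)

TensorRT inference is coming soon and will be available in the Cosmos Tokenizer README within the NeMo repository.

1. JIT inference

Follow the NeMo installation instructions to install NeMo from the main branch on GitHub.

Run the following code to tokenize a video: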

import torch
from nemo.collections.common.video_tokenizers.cosmos_vision_tokenizer import CausalVideoTokenizer
model_name = "Cosmos-Tokenizer-CV4x8x8"
model = CausalVideoTokenizer.from_pretrained(model_name)
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)
(latent, ) = model.encode(input_tensor)

2. Dataset tokenization and multimodal model training

See the Cosmos Tokenizer README in the NeMo repository for additional examples of creating multimodal training datasets with Cosmos Tokenizer.

VIII. Evaluation

Performance

TokenBench

https://github.com/NVlabs/TokenBench

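TokenBench is a comprehensive benchmark that we curated to standardize the evaluation of Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos. It consists of high-resolution, long-duration videos and is designed to benchmark video tokenizers. We have made TokenBench publicly available at github.com/NVlabs/TokenBench.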

2025-01-29 (Wed)