Series Articles
Fine-tuning Embodied-AI VLA Models (Single Arm) (completed 2024-12-26)
That article covered converting real-robot data and fine-tuning and deploying the openVLA and RDT models.
Preface
Building on the previous post and on some of my recent project work, this article walks through data collection and processing on a simulation platform and model fine-tuning. Everything below is based on the simulation platform:
RoboTwin
which I use for data collection, conversion, and training. (I am not one of the 1.0 contributors; I joined after 1.0 was released. Also, a quick heads-up: 2.0 is coming soon~)
Since the beta version has not been released yet, all data shown here was collected with 1.0~
I will then convert the collected data into the formats supported by RDT and openpi and fine-tune both models.
I. RoboTwin Deployment
Deploying RoboTwin only requires following the official RoboTwin INSTALLATION.md.
Note that, in order to deploy RDT, please apply my changes below:
- Set up the RDT environment (python 3.10 && torch 2.1.0)
RoboticsDiffusionTransformer
conda create -n RoboTwin python=3.10.0
conda activate RoboTwin
# Install pytorch
# Look up https://pytorch.org/get-started/previous-versions/ with your cuda version for a correct command
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install packaging==24.0
# Install flash-attn
pip install flash-attn --no-build-isolation
# Install other prerequisites
pip install -r requirements.txt
- Set up RoboTwin
pip install sapien==3.0.0b1 scipy==1.10.1 mplib==0.1.1 gymnasium==0.29.1 trimesh==4.4.3 open3d==0.18.0 imageio==2.34.2 pydantic zarr openai huggingface_hub==0.25.0
Compared with RoboTwin's INSTALLATION.md, the command above drops torch 2.4.1 and uses RDT's torch 2.1.0 instead; for the remaining steps, just follow INSTALLATION.md directly.
- Test whether the setup succeeded
python scripts/test_render.py
If it prints render ok, the setup succeeded.
II. Collecting Data with RoboTwin
Collecting data
Collecting data with RoboTwin is very simple. In the RoboTwin project root directory:
# e.g. bash run_task.sh shoe_place 0
# all task_name values can be found in envs
bash run_task.sh ${task_name} ${gpu_id}
If you want to watch the rendering of each collected action (a desktop is required), edit the render_freq parameter in the corresponding ${task_name}.yml under task_config. Recommended values are 5, 10, or 15; the higher the value, the slower the generation. For large-batch collection, set it to 0 (disable the visual rendering).
Some parameters you can modify in ${task_name}.yml (a combined example snippet follows the list):
use_seed: false
(RoboTwin collection is a two-step process: 1. collect seeds that can complete the task successfully; 2. render those seeds and save the results. If you have already collected successful seeds, a json file is saved under the specified path and you can render directly from those seeds without collecting them again.)
head_camera_type: L515
(We use D435 for everything ourselves, because our real robot uses a D435.)
episode_num: 100
(How many episodes to collect; adjust as needed.)
depth: true
(Recommended: set this to false. None of the VLAs use depth for now, and it slows down generation.)
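For reference, a minimal excerpt of such a task config might look like the sketch below. The exact set of keys depends on your RoboTwin version, and the file name here is just an example; the values are the settings discussed above.

```yaml
# task_config/shoe_place.yml (illustrative excerpt; keys depend on your RoboTwin version)
render_freq: 0          # 0 = no live rendering, fastest for batch collection
use_seed: false         # collect successful seeds first, then render them
head_camera_type: D435  # we use D435 because our real robot uses a D435
episode_num: 100        # number of episodes to collect
depth: false            # the VLAs below do not use depth; disabling it speeds up generation
```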
If the collection succeeds, you will find a /${task_name} folder under data containing episode_num episode folders; each episode contains a number of {%d}.pkl files, one per frame.
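If you want to check what a single frame contains before converting anything, a quick peek could look like this (the path is an assumed example; adjust it to your task and RoboTwin version, while the keys match what the conversion script below reads):

```python
# Inspect one collected frame from RoboTwin
import pickle
import numpy as np

with open("data/shoe_place/episode0/0.pkl", "rb") as f:  # assumed example path
    frame = pickle.load(f)

print(np.array(frame["joint_action"]).shape)             # per-frame joint state (both arms + grippers)
print(frame["observation"].keys())                       # head_camera / left_camera / right_camera
print(frame["observation"]["head_camera"]["rgb"].shape)  # head-camera RGB frame
```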
Converting the data to HDF5
In the previous post I gave a Python script for converting .npy to .hdf5; this time I will give a script that batch-converts RoboTwin data into the HDF5 format supported by RDT.
Since the 2.0 version has not finished testing yet, you need to git clone the RDT code under ./policy:
cd policy
git clone https://github.com/thu-ml/RoboticsDiffusionTransformer.git
mv RoboticsDiffusionTransformer RDT
mkdir RDT/processed_data
cd ..
Then, in the RoboTwin environment, you can run the following Python script, pkl2hdf5_rdt.py:
import sys
sys.path.append('./policy/RDT/')
import os
import h5py
import numpy as np
import pickle
import cv2
import argparse
# from scripts.encode_lang_batch_tpp import encode_lang


def images_encoding(imgs):
    encode_data = []
    padded_data = []
    max_len = 0
    for i in range(len(imgs)):
        success, encoded_image = cv2.imencode('.jpg', imgs[i])
        jpeg_data = encoded_image.tobytes()
        encode_data.append(jpeg_data)
        max_len = max(max_len, len(jpeg_data))
    # padding: pad every encoded frame to the same length
    for i in range(len(imgs)):
        padded_data.append(encode_data[i].ljust(max_len, b'\0'))
    return encode_data, max_len


def data_transform(path, episode_num, save_path):
    begin = 0
    folders = os.listdir(path)
    assert episode_num <= len(folders), "data num not enough"
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    for i in range(episode_num):  # iterate over all episodes
        subfolder_name = f"episode{i}"
        subfolder_path = os.path.join(path, subfolder_name)
        # buffers for the data to be written into the hdf5 file
        qpos = []
        actions = []
        cam_high = []
        cam_right_wrist = []
        cam_left_wrist = []
        if os.path.isdir(subfolder_path):  # make sure it is a folder
            episode = []
            pkl_files = [f for f in os.listdir(subfolder_path) if f.endswith('.pkl')]  # collect all .pkl files
            last_state = None
            for j in range(0, len(pkl_files)):
                pkl_file_path = os.path.join(subfolder_path, f'{j}.pkl')
                with open(pkl_file_path, 'rb') as pkl_f:
                    data = pickle.load(pkl_f)
                state = np.array(data['joint_action'])  # joint angles
                state = state.astype(np.float32)
                state[6] /= 0.045   # normalize the left gripper
                state[13] /= 0.045  # normalize the right gripper
                qpos.append(state)
                action = state
                actions.append(action)
                # if j == 0:
                #     pass
                # elif j == len(pkl_files) - 1:
                #     action = state - last_state
                #     actions.append(action)
                #     actions.append(action)  # the last frame has nothing to predict, so reuse the last action
                # else:
                #     action = state - last_state
                #     actions.append(action)
                camera_high = data['observation']['head_camera']['rgb']
                camera_high = camera_high[:, :, ::-1]  # RGB -> BGR for OpenCV
                camera_high_resized = cv2.resize(camera_high, (640, 480))
                cam_high.append(camera_high_resized)
                camera_right_wrist = data['observation']['right_camera']['rgb']
                camera_right_wrist = camera_right_wrist[:, :, ::-1]
                camera_right_wrist_resized = cv2.resize(camera_right_wrist, (640, 480))
                cam_right_wrist.append(camera_right_wrist_resized)
                camera_left_wrist = data['observation']['left_camera']['rgb']
                camera_left_wrist = camera_left_wrist[:, :, ::-1]
                camera_left_wrist_resized = cv2.resize(camera_left_wrist, (640, 480))
                cam_left_wrist.append(camera_left_wrist_resized)
                # last_state = state
            hdf5path = os.path.join(save_path, f'episode_{i}.hdf5')
            with h5py.File(hdf5path, 'w') as f:
                f.create_dataset('action', data=np.array(actions))
                obs = f.create_group('observations')
                obs.create_dataset('qpos', data=np.array(qpos))
                image = obs.create_group('images')
                # encode the images as JPEG and store them in order
                cam_high_enc, len_high = images_encoding(cam_high)
                cam_right_wrist_enc, len_right = images_encoding(cam_right_wrist)
                cam_left_wrist_enc, len_left = images_encoding(cam_left_wrist)
                image.create_dataset('cam_high', data=cam_high_enc, dtype=f'S{len_high}')
                image.create_dataset('cam_right_wrist', data=cam_right_wrist_enc, dtype=f'S{len_right}')
                image.create_dataset('cam_left_wrist', data=cam_left_wrist_enc, dtype=f'S{len_left}')
            begin += 1
            print(f"process {i} success!")
    return begin


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process some episodes.')
    parser.add_argument('task_name', type=str, default='block_hammer_beat',
                        help='The name of the task (e.g., block_hammer_beat)')
    parser.add_argument('setting', type=str)
    parser.add_argument('expert_data_num', type=int, default=50,
                        help='Number of episodes to process (e.g., 50)')
    args = parser.parse_args()

    task_name = args.task_name
    num = args.expert_data_num
    setting = args.setting
    data_path_name = task_name + '_' + setting
    begin = 0
    print(f'read data from path: {os.path.join("data/", data_path_name)}')
    begin = data_transform(os.path.join("data/", data_path_name), num,
                           f"./policy/RDT/processed_data/{task_name}_{setting}_{num}")
Arguments:
- task_name: the task name, e.g. shoe_place
- setting: I don't remember whether the 1.0 version still has this parameter; if it doesn't, just edit the `__main__` block and delete everything related to setting, i.e. use `data_path_name = task_name` and `begin = data_transform(os.path.join("data/", data_path_name), num, f"./policy/RDT/processed_data/{task_name}_{num}")`
- expert_data_num: how many episodes you want to convert

Run the script with:
python pkl2hdf5_rdt.py ${task_name} ${setting} ${expert_data_num}
If everything goes well, you will find a {task_name}_{setting}_{num} folder under ./policy/RDT/processed_data, containing the corresponding episode_{%d}.hdf5 files.
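To sanity-check a converted episode, a quick script like the one below works; the folder name is an assumed example, so substitute your own task/setting/num, while the dataset keys match what pkl2hdf5_rdt.py writes:

```python
# Verify one converted episode
import h5py
import cv2
import numpy as np

path = "./policy/RDT/processed_data/shoe_place_demo_50/episode_0.hdf5"  # example path
with h5py.File(path, "r") as f:
    print(f["action"].shape)             # (T, 14) per-frame joint actions
    print(f["observations/qpos"].shape)  # (T, 14) per-frame joint states
    jpeg = f["observations/images/cam_high"][0]                  # one JPEG-encoded frame
    img = cv2.imdecode(np.frombuffer(jpeg, np.uint8), cv2.IMREAD_COLOR)
    print(img.shape)                     # (480, 640, 3) decoded head-camera frame
```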
III. Model Training
This article trains RDT and openpi, two currently well-recognized open-source VLA models, on RoboTwin data.
RDT model training
The latest version of RoboTwin already integrates RDT; the following is policy/RDT/README.md:
Deploy RDT on RoboTwin
1. Environment Setup
The conda environment for RDT with RoboTwin is identical to the official RDT environment. Please follow the RDT official documentation to install the environment, directly overwriting the RoboTwin environment from INSTALLATION.md.
# Make sure python version == 3.10
conda activate RoboTwin
# Install pytorch
# Look up https://pytorch.org/get-started/previous-versions/ with your cuda version for a correct command
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
# Install packaging
pip install packaging==24.0
pip install ninja
# Verify Ninja --> should return exit code "0"
ninja --version; echo $?
# Install flash-attn
pip install flash-attn==2.7.2.post1 --no-build-isolation
# Install other prerequisites
pip install -r requirements.txt
# If you are using a PyPI mirror, you may encounter issues when downloading tfds-nightly and tensorflow.
# Please use the official source to download these packages.
# pip install tfds-nightly==4.9.4.dev202402070044 -i https://pypi.org/simple
# pip install tensorflow==2.15.0.post1 -i https://pypi.org/simple
2. Download Model
# In the RoboTwin/policy directory
cd ../weights
mkdir RDT && cd RDT
# Download the models used by RDT
huggingface-cli download google/t5-v1_1-xxl --local-dir t5-v1_1-xxl
huggingface-cli download google/siglip-so400m-patch14-384 --local-dir siglip-so400m-patch14-384
huggingface-cli download robotics-diffusion-transformer/rdt-1b --local-dir rdt-1b
3. Generate HDF5 Data
First, create the processed_data and training_data folders in the policy/RDT directory:
mkdir processed_data && mkdir training_data
To generate the data for converting to HDF5, you need to run the following command in the RoboTwin/ directory:
cd ../..
bash run_task.sh ${task_name} ${gpu_id}
The data will be saved by default in the RoboTwin/data/${task_name}_${camera_type}_pkl directory.
Then, run the following in the RoboTwin/policy/RDT directory:
cd policy/RDT
# task_name: the already generated data, default located in data/${task_name}
# head_camera_type: default to D435
# expert_data_num: the number of data to be converted to hdf5
# gpu_id: used to run the language encoding, default to 0
# After running, the data will be saved to policy/RDT/processed_data by default
bash process_data_rdt.sh $task_name $head_camera_type $expert_data_num $gpu_id
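For example, a concrete run with the example task from earlier, the default D435 camera, 100 episodes, and GPU 0 would be:

```bash
bash process_data_rdt.sh shoe_place D435 100 0
```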
If successful, you will find the ${task_name}_${expert_data_num} folder under policy/RDT/processed_data, with the following data structure:
`processed_data/${task_name}_${expert_data_num}:`
`instructions/lang_embed_{%d}.pt`
`episode_{%d}.hdf5`
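If you want to peek at one of the precomputed language embeddings, something like the following should work; the path is illustrative, and I am assuming each .pt file holds a single tensor, which may differ depending on the encoding script version:

```python
# Load one instruction embedding produced by the language-encoding step
import torch

emb = torch.load("policy/RDT/processed_data/shoe_place_100/instructions/lang_embed_0.pt")
print(type(emb))
if hasattr(emb, "shape"):
    print(emb.shape)  # token-level T5 embeddings for the task instruction
```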
4. Generate Configuration File
cd policy/RDT
# model_name: the name you want to save your model as, it is recommended to use ${task_name_1}_${num_1}_${task_name_2}_${num_2}... for easy record-keeping
bash generate.sh ${model_name}
This will create a folder named ${model_name} under training_data and a configuration file ${model_name}.yml under model_config.
Move all the data you wish to use for training into training_data/${model_name}. If you have multiple tasks with different data, simply move them in the same way.
Example folder structure:
`training_data/${model_name}:`
`${task_1}/episode_{%d}.hdf5`
`${task_1}/instructions/lang_embed_{%d}.pt`
`${task_2}/episode_{%d}.hdf5`
`${task_2}/instructions/lang_embed_{%d}.pt`
`...`
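For example, if you processed the shoe_place data earlier and named your model shoe_place_100, the layout above can be produced with the commands below (all names are illustrative):

```bash
# move one task's processed data under the model's training folder
mkdir -p training_data/shoe_place_100
mv processed_data/shoe_place_100 training_data/shoe_place_100/
```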
In model_config/${model_name}.yml, you need to manually set the GPU to be used. For a single GPU, set it to 0.
5. Finetune model
Once the training parameters are set, you can start training with:
bash finetune.sh ${model_name}
6. Eval on RoboTwin
Once the model fine-tuning is complete, you can test your model's performance on the RoboTwin simulation platform. RoboTwin offers more than 20 tasks to choose from, and you can find them in the RoboTwin/task_config directory.
bash eval.sh $task_name $head_camera_type $model_name $checkpoint_id $seed $gpu_id
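A concrete call might look like the following; all values are placeholders, and checkpoint_id refers to whichever saved checkpoint you want to evaluate:

```bash
bash eval.sh shoe_place D435 shoe_place_100 10000 0 0
```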
openpi model training
The latest version of RoboTwin already integrates openpi; the following is policy/openpi/README.md:
OpenPI on RoboTwin Usage
1. Environment Setup
Follow the official OpenPI website to configure the environment. The OpenPI + RoboTwin environment has already been pre-configured in a file, so no additional setup is needed.
GIT_LFS_SKIP_SMUDGE=1 uv sync
install pytorch3d:
conda deactivate
source .venv/bin/activate
# At this point, you should be in the (openpi) environment
pip install portalocker tabulate yacs iopath fvcore
cd ../../third_party/pytorch3d_simplified/
pip install .
# if the above errors, try:
python setup.py install
pip uninstall pytorch3d
pip install .
cd ../../policy/openpi/
bash
Note that the uv environment will only take effect when the current directory is set as the root directory.
Or you can use the following command:
source .venv/bin/activate
Next, locate mplib within the (openpi) environment:
uv run where_is_package.py
Then, based on the printed output, modify the corresponding mplib as needed:
Modification Reference
2. Generate Data
We have already generated HDF5 data in the conda environment; you can refer to the section on generating HDF5 data in RoboTwin/policy/RDT/README.md.
After generating the HDF5 data, we can directly generate the LerobotDataset format data for OpenPI.
Unlike the data generation process in RDT, we need to manually move the /data/instructions/${task_name}.json file into the corresponding ${task_%d}/ directory and rename it instructions.json.
# hdf5_path: The path to the generated HDF5 data (e.g., ./training_data/empty_cup_place_500_hdf5/)
# dataset_name: The name of the dataset (e.g., empty_cup_place_500)
bash generate.sh ${hdf5_path} ${dataset_name}
training_data/${hdf5_path}:
${task_1}/episode_{%d}.hdf5
${task_1}/instructions.json
${task_2}/episode_{%d}.hdf5
${task_2}/instructions.json
...
Here, the instructions.json corresponds to the task instructions, located in RoboTwin/data/instructions/ as ${task_name}.json.
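Putting it together, a concrete invocation using the example values from the comments above would be:

```bash
bash generate.sh ./training_data/empty_cup_place_500_hdf5/ empty_cup_place_500
```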
Generating the dataset can take some time (about half an hour for 100 episodes), so feel free to take a break.
Note!
If you don't have enough disk space under the ~/.cache path, please use the following command to set a different cache directory with sufficient space:
export LEROBOT_HOME=/path/to/your/cache
This is because generating the lerobotdataset requires a large amount of space, and the datasets will be written to $LEROBOT_HOME.
3. Write the Corresponding train_config
In src/openpi/training/config.py, there is a dictionary called _CONFIGS. You can modify the pre-configured PI0 configurations I have written:
pi0_base_aloha_robotwin_lora
pi0_fast_aloha_robotwin_lora
pi0_base_aloha_robotwin_full
pi0_fast_aloha_robotwin_full
You only need to set repo_id to your dataset. If you want to change the name in TrainConfig, please include fast in it when you choose the pi0_fast base model.
4. Finetune model
Simply modify the repo_id to fine-tune the model:
# train_config_name: The name corresponding to the config in _CONFIGS, such as pi0_base_aloha_full
# model_name: You can choose any name for your model
# gpu_use: if not using fsdp_devices, set it to a single gpu_id like 0; otherwise set it like 0,1,2,3
bash finetune.sh ${train_config_name} ${model_name} ${gpu_use}
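For example, fine-tuning the LoRA config listed above under an arbitrary model name on GPU 0 (values are illustrative):

```bash
bash finetune.sh pi0_base_aloha_robotwin_lora pi0_robotwin_test 0
```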
| Training mode | Memory Required | Example GPU |
| --- | --- | --- |
| Fine-Tuning (LoRA) | > 48 GB | A6000 (48G) |
| Fine-Tuning (Full) | > 100 GB | A100 (80GB) / H100 |
If your GPU memory is insufficient, please set the fsdp_devices parameter according to the following GPU memory reference, or reduce the batch_size parameter. The default batch_size in the table below is 32.
| GPU memory | Model type | GPU num | fsdp_devices | Example GPU |
| --- | --- | --- | --- | --- |
| 24G | lora | 2 | 2 | 4090 (24G) |
| 40G | lora | 2 | 2 | A100 (40G) |
| 48G | lora | 1 | 1 | A6000 (48G) |
| 40G | full | 2 | 4 | 4090 (24G) |
| 80G | full | 2 | 2 | 4090 (24G) |
5. Eval on RoboTwin
Once the model fine-tuning is complete, you can test your model's performance on the RoboTwin simulation platform. RoboTwin offers more than 20 tasks to choose from, and you can find them in the RoboTwin/task_config directory.
bash eval.sh $task_name $head_camera_type $train_config_name $model_name $checkpoint_id $seed $gpu_id
IV. RoboTwin Simulation Testing
First, here are two openpi deployment demos~
The RoboTwin-with-VLA branch is being checked and will be released soon~
pi0_robotwin_empty_cup_place_success_demo
pi0_robotwin_empty_cup_place_success_demo
Notes on simulated data collection and testing
Enabling video saving
In RoboTwin/task_config/{task_name}.yml you can enable eval_video_log; this saves frames at a fixed interval and uses ffmpeg to compose them into a video under RoboTwin/eval_video.
This parameter lets you check on a server why an eval failed or succeeded.
Live visualization
In RoboTwin/task_config/{task_name}.yml you can set the render_freq value to watch the arm motion in real time during data generation and evaluation.
Note that this cannot be enabled on a server, because the rendering goes directly to the screen.
How do I train with my own robot arm?
Training your own arm
Assuming you have already converted your own robot data to the lerobot format following the libero (single-arm) / aloha (dual-arm) layout (the only thing to watch out for is that your arm's DoF may differ from libero's), remember to change the LiberoOutput output dimension to [:8]; then you can set up your own train_config.
Caveats
- If you have a single arm, fill wrist_image into the left-arm slot and fill the state into the right-arm slot [:action_dims]; fine-tuning works better this way. (Because the action space is shared, pad it this way no matter whether your single arm is physically the left or the right one, and do not pad zeros at the front.)
- Remember to modify the following in config.py:
delta_action_mask = _transforms.make_bool_mask(action_dims-1, -1, action_dims-1, -1)
- If you are using aloha_policy, please set
adapt_to_pi=false