Reproducing the RDT Embodied-AI Model on a lerobot Robot Arm

Author: 持续学习的程序员+1

Last modified: 2025-04-21 15:19:00

First published: 2025-04-21 15:17:28

Article link: https://blog.csdn.net/weixin_43915081/article/details/147395117

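A while ago I collected data, trained, and ran the default ACT model from the lerobot codebase, and the results were quite good. From data collection to training to testing I mostly used what was already there, since lerobot has it all integrated. But real understanding comes from practice, so I spent a few days fine-tuning the RDT model I had studied earlier and porting it into the lerobot environment. Overall it works well too: in grasp success rate RDT > ACT, but in motion stability ACT is still smoother. Details follow.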

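How I got the ACT model running is described in the previous article:

Reproducing the ACT Embodied-AI Model on a lerobot Robot Arm

For background on the RDT model, see the column:

RDT-1B: A Multimodal Diffusion Model for Embodied AI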

1. Collecting data

For this RDT fine-tune I collected 101 episodes, following the experiments in the RDT paper, many of which used 100+ demonstrations. The command:

python lerobot/scripts/control_robot.py   --robot.type=so100   --control.type=record   --control.fps=30   --control.single_task="Put the yellow toy block in a stainless steel bowl."   --control.repo_id=hxdoso/so100_top_side_view   --control.tags='["so100","tutorial"]'   --control.warmup_time_s=5   --control.episode_time_s=300   --control.reset_time_s=30   --control.num_episodes=1 --control.resume=true

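See 10_use_so100.md in the lerobot codebase for details. During collection I deliberately included approaches to the target object from different directions. Also, not every episode starts from the arm's initial position; one batch starts with the arm already moved to an intermediate position. As for object placement, 30+ episodes use fixed positions and the rest are random.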

2. Converting the data to hdf5

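The RDT codebase fine-tunes on hdf5 data by default, so to stay consistent with it I convert the data collected with lerobot to hdf5. I have pushed the conversion code to my lerobot fork at git@github.com:hxdoit/lerobot.git, in the file lerobot/scripts/lerobot_dataset_2_hdf5.py. The command is: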

python lerobot/scripts/lerobot_dataset_2_hdf5.py --dataset.repo_id=hxdoso/so100_top_side_view  --policy.type=act   --output_dir=outputs/train/so100_top_side_view_33episodes1   --job_name=act_so100_test   --policy.device=cuda

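Only the dataset.repo_id argument is actually needed; ignore the others (the framework requires them, so do not delete them). The conversion is CPU-intensive. When it finishes, each episode has been converted into one hdf5 file.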
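
To illustrate the "one episode, one file" layout, here is a minimal sketch that groups flat frame records by episode and derives one output path per episode. It is not the code in lerobot_dataset_2_hdf5.py. The `state` field, the `episode_{i}.hdf5` naming scheme, and both helper names are assumptions made for illustration, and the actual hdf5 writing is left out.

```python
from collections import defaultdict
from pathlib import Path


def group_frames_by_episode(frames):
    """Group flat frame records into per-episode lists ordered by frame_index."""
    episodes = defaultdict(list)
    for frame in frames:
        episodes[frame["episode_index"]].append(frame)
    return {
        episode: sorted(items, key=lambda f: f["frame_index"])
        for episode, items in sorted(episodes.items())
    }


def episode_output_path(output_dir, episode_index):
    """Hypothetical naming scheme: one .hdf5 file per episode."""
    return Path(output_dir) / f"episode_{episode_index}.hdf5"


# Made-up records; a real dataset would carry images and actions as well.
frames = [
    {"episode_index": 1, "frame_index": 0, "state": [10.0] * 6},
    {"episode_index": 0, "frame_index": 1, "state": [2.0] * 6},
    {"episode_index": 0, "frame_index": 0, "state": [1.0] * 6},
]
grouped = group_frames_by_episode(frames)
for episode, items in grouped.items():
    print(episode_output_path("outputs/hdf5", episode), len(items))
```

Grouping first and writing afterwards keeps each file self-contained, which matches how the fine-tuning dataset class reads one episode file at a time.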

3. Fine-tuning

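I did not port the training code into lerobot; training still runs in the official RDT codebase, and you can simply follow its README. The data conversion code stays in the lerobot codebase because it depends on some lerobot libraries, as described in the previous section.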

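The core change to the training code is adapting data/hdf5_vla_dataset.py to your own data, as shown below. The comments explaining my changes (written in Chinese in the original commit) were added by me, for reference. All of my changes are committed to git@github.com:hxdoit/RoboticsDiffusionTransformer.git, which you can pull.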

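One more note: scripts/print_hdf5.py in that repository can visualize an hdf5 file. If the visualization looks fine (the video is continuous, the colors are normal, and so on), the data is most likely fine.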

 # Rescale gripper to [0, 1]
             qpos = qpos / np.array(
             # Convert lerobot's six joint values from degrees to the [-1, 1] range, since RDT's pretraining data mostly lies in [-1, 1]
-               [[1, 1, 1, 1, 1, 1, 4.7908, 1, 1, 1, 1, 1, 1, 4.7888]] 
+               [[180, 180, 180, 180, 180, 180]]
             )
             target_qpos = f['action'][step_id:step_id+self.CHUNK_SIZE] / np.array(
-               [[1, 1, 1, 1, 1, 1, 11.8997, 1, 1, 1, 1, 1, 1, 13.9231]] 
+               [[180, 180, 180, 180, 180, 180]]
             )
             
             # Parse the state and action
@@ -180,14 +180,8 @@ class HDF5VLADataset:
             def fill_in_state(values):
                 # Target indices corresponding to your state space
                 # In this example: 6 joints + 1 gripper for each arm
                 # We have a single arm, so its data is filled into the right_arm slots
-                UNI_STATE_INDICES = [
-                    STATE_VEC_IDX_MAPPING[f"left_arm_joint_{i}_pos"] for i in range(6)
-                ] + [
-                    STATE_VEC_IDX_MAPPING["left_gripper_open"]
-                ] + [
+                UNI_STATE_INDICES =  [
                     STATE_VEC_IDX_MAPPING[f"right_arm_joint_{i}_pos"] for i in range(6)
-                ] + [
-                    STATE_VEC_IDX_MAPPING["right_gripper_open"]
                 ]
                 uni_vec = np.zeros(values.shape[:-1] + (self.STATE_DIM,))
                 uni_vec[..., UNI_STATE_INDICES] = values
@@ -222,8 +216,8 @@ class HDF5VLADataset:
             cam_high_mask = np.array(
                 [False] * (self.IMG_HISORY_SIZE - valid_len) + [True] * valid_len
             )
             # RDT's pretraining data uses images from 3 views (front, left hand, right hand); missing views are set to zero and filled with background_image later in training
-            cam_left_wrist = parse_img('cam_left_wrist')
-            cam_left_wrist_mask = cam_high_mask.copy()
+            cam_left_wrist = np.zeros((self.IMG_HISORY_SIZE, 0, 0, 0))#parse_img('cam_right_wrist')
+            cam_left_wrist_mask = [False, False]#cam_high_mask.copy()
             cam_right_wrist = parse_img('cam_right_wrist')
             cam_right_wrist_mask = cam_high_mask.copy()
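
The two main ideas in the diff above, scaling joint angles by 180 and placing six values into a larger unified state vector, can be sketched in plain Python. This is only an illustration, not the repository code: it assumes joint readings are in degrees within [-180, 180], a unified vector of size 128, and placeholder slot indices 0 to 5 (the real indices come from STATE_VEC_IDX_MAPPING in the RDT repository).

```python
STATE_DIM = 128  # assumed size of RDT's unified state vector

# Hypothetical slot indices for the six right-arm joint positions;
# the real values come from STATE_VEC_IDX_MAPPING in the RDT repository.
RIGHT_ARM_JOINT_INDICES = list(range(6))


def normalize_degrees(joint_degrees, scale=180.0):
    """Map joint angles in degrees (assumed within [-180, 180]) to [-1, 1]."""
    return [angle / scale for angle in joint_degrees]


def fill_in_state(values, indices=RIGHT_ARM_JOINT_INDICES, state_dim=STATE_DIM):
    """Place values into their slots of a zero vector; also return a 0/1 mask."""
    if len(values) != len(indices):
        raise ValueError("values and indices must have the same length")
    vec = [0.0] * state_dim
    mask = [0] * state_dim
    for idx, value in zip(indices, values):
        vec[idx] = value
        mask[idx] = 1
    return vec, mask


# Made-up joint readings in degrees, for illustration only.
qpos_degrees = [90.0, -45.0, 180.0, 0.0, -180.0, 36.0]
qpos_normalized = normalize_degrees(qpos_degrees)
state_vec, state_mask = fill_in_state(qpos_normalized)
print(qpos_normalized)
print(sum(state_mask), len(state_vec))
```

The mask records which slots carry real data, so the unused left-arm and gripper slots are not mistaken for genuine zeros.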

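Running source finetune.sh starts training. RDT is large (1.2B parameters), so a single GPU with 24 GB of memory is not enough. In a quick test, two 4090D cards were sufficient; to shorten the run I used three 4090D cards on the autodl platform with batch size 120 and trained for 4000 iterations. It took 10 hours at 6 yuan per hour, so the training cost was about 60 yuan.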

{'mylerobot_sample_mse': 0.0002, 'mylerobot_sample_l2err': 0.0514, 'overall_avg_sample_mse': 0.0002, 'overall_avg_sample_l2err': 0.0437}

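During training, the RDT authors recommend watching the overall_avg_sample_mse metric printed at sampling time. It had settled at 0.0002 by the end. I expect further training would improve things a little more, but to save budget I stopped it with Ctrl+C. I also kept testing checkpoints along the way: before 3200 iterations the behavior was poor, while at 3200 and 4000 iterations it was clearly better.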
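
If you want to track that metric from a saved training log, a small stdlib sketch like the following could pull it out of lines shaped like the dict printed above. The helper names and the non-dict example line are assumptions for illustration; the RDT training script may log in other ways as well.

```python
import ast

METRIC = "overall_avg_sample_mse"


def parse_sample_metrics(line):
    """Parse one printed metrics dict; return None if the line is not a dict literal."""
    try:
        value = ast.literal_eval(line.strip())
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, dict) else None


def metric_series(lines, metric=METRIC):
    """Collect the metric from every line that parses to a dict containing it."""
    parsed = (parse_sample_metrics(line) for line in lines)
    return [m[metric] for m in parsed if m is not None and metric in m]


log_lines = [
    "some unrelated progress output",  # hypothetical non-metric line
    "{'mylerobot_sample_mse': 0.0002, 'mylerobot_sample_l2err': 0.0514, "
    "'overall_avg_sample_mse': 0.0002, 'overall_avg_sample_l2err': 0.0437}",
]
series = metric_series(log_lines)
print(series)
```

`ast.literal_eval` is used instead of `eval` so that arbitrary log text is never executed.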

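Compared with ACT, reaching similar performance took much longer training: 3 hours was enough for ACT. This is probably related to model size, with ACT at about 50 million parameters and RDT at 1.2 billion.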

4. Testing and evaluation

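I also integrated the forward-inference code into my lerobot fork and connected it to the arm, so it can send actions directly. File: lerobot/scripts/control_robot_rdt.py; repository: git@github.com:hxdoit/lerobot.git. The first argument, robot.type, is required; ignore the others (the framework requires them, so do not delete them).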

python lerobot/scripts/control_robot_rdt.py --robot.type=so100   --control.type=record   --control.fps=30   --control.single_task="Put the yellow toy block in a stainless steel bowl."   --control.repo_id=hxdoso/so100_top_side_view    --control.warmup_time_s=5   --control.episode_time_s=300   --control.reset_time_s=30   --control.num_episodes=1 --control.resume=true

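It runs without problems on my local 3090. Each inference produces 64 actions, which go into a queue; one action is taken per control step, so at 30 Hz one chunk lasts about 2 seconds. Once the queue is empty, the model runs inference again.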
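
The queueing scheme described above can be sketched as follows. This is a minimal illustration, not the code in control_robot_rdt.py: `ChunkedActionQueue` and `fake_predict_chunk` are hypothetical names, and the fake policy simply returns dummy 6-joint actions in place of a real RDT inference call.

```python
from collections import deque

CHUNK_SIZE = 64  # actions produced per inference, as described above
CONTROL_HZ = 30  # control frequency


class ChunkedActionQueue:
    """Serve one action per control step; run inference only when the queue is empty."""

    def __init__(self, predict_chunk):
        self._predict_chunk = predict_chunk
        self._queue = deque()
        self.inference_calls = 0

    def next_action(self, observation):
        if not self._queue:
            self._queue.extend(self._predict_chunk(observation))
            self.inference_calls += 1
        return self._queue.popleft()


def fake_predict_chunk(observation):
    """Stand-in for the policy: returns CHUNK_SIZE dummy 6-joint actions."""
    return [[float(observation)] * 6 for _ in range(CHUNK_SIZE)]


policy = ChunkedActionQueue(fake_predict_chunk)
actions = [policy.next_action(step) for step in range(130)]
print(policy.inference_calls)  # 3 inferences cover 130 control steps
print(round(CHUNK_SIZE / CONTROL_HZ, 2))  # seconds of motion per chunk: 2.13
```

One consequence of this design is that the arm executes each chunk open loop: observations taken during those roughly 2 seconds do not influence the actions until the next inference.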

5. Comparing results

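So far I have two models running, and I ran several groups of comparison experiments, shown below. Each experiment consists of 5 trials, so 4 successes means an 80% success rate; success means the object ends up in the dish. Overall, ACT is better in motion stability and execution quality. RDT's motion is less stable: looking at the action values, they are indeed not continuous and show jumps. But its final success rate is higher and it generalizes better, probably because it builds on the authors' pretraining with a large amount of data.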

[Image: experiment comparison results]
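
The two quantities discussed in this section, success rate over a set of trials and how jumpy an action sequence is, can be computed with a short sketch like this. The jump metric (largest change of any joint between consecutive actions) is my own choice for illustration, and the action values are made up, not measurements from these experiments.

```python
def success_rate(outcomes):
    """Fraction of successful trials, e.g. 4 successes out of 5 gives 0.8."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def max_step_jump(actions):
    """Largest absolute change of any joint between consecutive actions."""
    return max(
        (
            abs(cur_value - prev_value)
            for prev, cur in zip(actions, actions[1:])
            for prev_value, cur_value in zip(prev, cur)
        ),
        default=0.0,
    )


trials = [True, True, True, True, False]
print(success_rate(trials))

# Made-up single-joint sequences: one smooth, one with a visible jump.
smooth = [[0.0], [0.25], [0.5], [0.75]]
jumpy = [[0.0], [0.5], [0.25], [1.0]]
print(max_step_jump(smooth), max_step_jump(jumpy))
```

Running the same metric over actions logged from both models would give a numeric counterpart to the visual impression that one model moves more smoothly than the other.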