Foundation_pose在自己的物体上复现指南：实现任意物体6D位姿检测（利用realsense_D435i和iphone_16pro手机）

神筆＆『馬良』

已于 2025-05-22 15:16:52 修改

阅读量2k

点赞数 29

文章标签： ubuntu 机器人目标检测视觉检测

于 2025-03-17 21:35:10 首次发布

本文链接：https://blog.csdn.net/qq_53186375/article/details/146321017

版权

参考文章：https://blog.csdn.net/m0_56661101/article/details/137921028

1. 从github上克隆源码

# Download repo
git clone https://github.com/NVlabs/FoundationPose.git

在下面的网站下载权重，下载好后放在weights/目录下

https://drive.google.com/drive/folders/1DFezOAD0oD1BblsXVxqDsl8fj0qzB82i

在下面的网站下载测试数据，下载好后解压到demo_data/目录下

https://drive.google.com/drive/folders/1pRyFmxYXmAnpku7nGRioZaKrVJtIsroP

环境配置：

（1）安装eigen 3.4.0到系统

cd $HOME && wget -q https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.tar.gz && \
tar -xzf eigen-3.4.0.tar.gz && \
cd eigen-3.4.0 && mkdir build && cd build
cmake .. -Wno-dev -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-std=c++14 ..
sudo make install
cd $HOME && rm -rf eigen-3.4.0 eigen-3.4.0.tar.gz

（2）创建Conda环境

# Create conda environment
conda create -n foundationpose python=3.9
# Activate conda environment
conda activate foundationpose

# Install dependencies
python -m pip install -r requirements.txt

# Install NVDiffRast
python -m pip install --quiet --no-cache-dir git+https://github.com/NVlabs/nvdiffrast.git

# Install Kaolin (Optional, needed if running model-free setup)
python -m pip install --quiet --no-cache-dir kaolin==0.15.0 -f https://nvidia-kaolin.s3.us-east-2.amazonaws.com/torch-2.0.0_cu118.html

# Install PyTorch3D
python -m pip install --quiet --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu118_pyt200/download.html

# Build and install extensions in repo
CMAKE_PREFIX_PATH=$CONDA_PREFIX/lib/python3.9/site-packages/pybind11/share/cmake/pybind11 bash build_all_conda.sh

（3）运行测试

python run_demo.py

效果：

可以看到包围盒围住物体测试文件夹在demo_data/mustard0文件中

2. 分析数据集

对于数据集必须认真分析并反复推敲，后续如果效果不理想一定是数据集的制作有问题。模型的后续调参并不重要，重要的是数据集是否符合要求。

在mustard0文件夹中共有5个文件

（1）cam_K.txt 保存了相机的内参矩阵

注意是realsense的彩色视频流的相机内参，如果错误使用了深度相机流的相机，会导致物体的包围盒特别小且只覆盖物体的一部分。

（2）depth 深度图片文件夹，用于保存相机采集到的深度图片

注意是保存深度图与彩色图对齐后的深度图片（可以在realsense启动文件中进行设置），否则会出现foundation_pose运行时物体的包围盒乱飞或者跟踪不到物体的情况。

（3）masks 彩色相机第一帧mask图片，是划分出彩色图片中物体的mask掩码位置，本文后续使用Grounded-segment-anything模型对第一帧彩图进行分割，制作自己的数据集

注意：第一帧mask的大小一定要是640*480的（与彩图大小对应），因为经过一些处理后（例如Grounded-segment-anything分割大模型图片分辨率会变得很大，需要进行裁剪）否则会出现foundation_pose在一开始就定位不到物体

（4）mesh 第一关键的文件夹：包含三个关键的文件：

textured_simple.obj :物体的模型文件

textured_simple.obj.mtl :物体的模型文件与纹理文件之间的调用桥梁

texture_map.png :物体的纹理文件保存了物体的皮肤特征

（5）rgb :彩色图片文件夹，用于保存相机读取的彩色图片数据集

3.换成自己的数据集：制作上面所需要的物体数据集

（1）cam_K.txt 启动realsense，读取camera_info话题中的K矩阵，这里我没有使用rs_camera.launch作为启动文件（由于它启动后保存的图片单张超过1MB，笔记本之后运行Foundation_pose会报显存不足的报错）我魔改了点云的启动文件，并添加参数：

<launch>
  <arg name="serial_no"             default=""/>
  <arg name="json_file_path"        default=""/>
  <arg name="camera"                default="camera"/>

  <group ns="$(arg camera)">
    <include file="$(find realsense2_camera)/launch/includes/nodelet.launch.xml">
      <arg name="serial_no"         value="$(arg serial_no)"/>
      <arg name="json_file_path"    value="$(arg json_file_path)"/>
      <arg name="depth_width"       value="640"/>
      <arg name="depth_height"      value="480"/>
      <arg name="depth_fps"         value="30"/>
      <arg name="color_width"       value="640"/>
      <arg name="color_height"      value="480"/>
      <arg name="color_fps"         value="30"/>
      <arg name="enable_depth"      value="true"/>
      <arg name="enable_color"      value="true"/>
      <arg name="enable_infra1"     value="false"/>
      <arg name="enable_infra2"     value="false"/>
      <arg name="enable_fisheye"    value="false"/>
      <arg name="enable_gyro"       value="false"/>
      <arg name="enable_accel"      value="false"/>
      <arg name="enable_pointcloud" value="false"/>
      <arg name="enable_sync"       value="true"/>
      <arg name="align_depth"       value="true"/>
      <arg name="tf_prefix"         value="$(arg camera)"/>
    </include>

    <node name="rviz" pkg="rviz" type="rviz" args="-d $(find realsense2_camera)/rviz/pointcloud.rviz" required="true" />
  </group>
</launch>

启动后运行：

rostopic echo /camera/aligned_depth_to_color/camera_info

读取相机的K矩阵为：

6.092415771484375000e+02 0.000000000000000000e+00 3.261578674316406000e+02
0.000000000000000000e+00 6.092608642578125000e+02 2.484097442626953000e+02
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00

（2）和（5）depth 深度图片和彩色图片文件夹，用于保存相机采集到的深度图片和彩色图片

保存代码：分开保存为depth和rgb两个文件夹

#!/home/robot/anaconda3/envs/mytorch/bin/python
import rospy
import cv2
import os
import shutil
import message_filters
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

def create_folder(folder):
    if not os.path.exists(folder):
        os.makedirs(folder)
    else:
        # 清空文件夹内容
        for filename in os.listdir(folder):
            file_path = os.path.join(folder, filename)
            try:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)  # 删除文件或符号链接
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)  # 递归删除子文件夹
            except Exception as e:
                rospy.logerr(f"Failed to delete {file_path}: {e}")

class ImageSaver:
    def __init__(self):
        rospy.init_node('image_saver', anonymous=True)
        
        # 订阅 RGB 和深度图像话题
        self.bridge = CvBridge()
        rgb_sub = message_filters.Subscriber('/camera/color/image_raw', Image)
        depth_sub = message_filters.Subscriber('/camera/depth/image_rect_raw', Image)
        
        # 时间同步
        ts = message_filters.ApproximateTimeSynchronizer([rgb_sub, depth_sub], queue_size=10, slop=0.1)
        ts.registerCallback(self.callback)
        
        # 创建存储目录（每次运行时清空原来的图片）
        self.rgb_folder = "rgb"
        self.depth_folder = "depth"
        create_folder(self.rgb_folder)
        create_folder(self.depth_folder)
        # # 创建存储目录（每次运行时清空原来的图片）
        # base_folder = os.path.abspath(os.path.join(os.getcwd(), "../../"))  # 上上级目录
        # self.rgb_folder = os.path.join(base_folder, "rgb")
        # self.depth_folder = os.path.join(base_folder, "depth")
        # create_folder(self.rgb_folder)
        # create_folder(self.depth_folder)

    def callback(self, rgb_msg, depth_msg):
        timestamp = str(rgb_msg.header.stamp.to_nsec())  # 以纳秒时间戳命名文件
        
        try:
            # 转换 RGB 图像
            rgb_image = self.bridge.imgmsg_to_cv2(rgb_msg, "bgr8")
            rgb_path = os.path.join(self.rgb_folder, f"{timestamp}.png")
            cv2.imwrite(rgb_path, rgb_image)
            
            # 转换深度图像
            depth_image = self.bridge.imgmsg_to_cv2(depth_msg, "16UC1")  # 16-bit 单通道
            depth_path = os.path.join(self.depth_folder, f"{timestamp}.png")
            cv2.imwrite(depth_path, depth_image)
            
            rospy.loginfo(f"Saved images: {rgb_path} and {depth_path}")
        except Exception as e:
            rospy.logerr(f"Error saving images: {e}")

if __name__ == '__main__':
    try:
        ImageSaver()
        rospy.spin()
    except rospy.ROSInterruptException:
        pass

（3）masks 彩色相机第一帧mask图片：调用Grounded模型分割第一帧彩色图像我使用的是一个木头物块，长方体，输入提示词为：最小的木块

关于Grounded可以看一下这篇文章注意模型的输出不能直接使用需要进行裁剪和灰度化

Grounded-Segment-Anything本地部署-CSDN博客

首先运行SAM模型得到物体mask.jpg

import argparse
import os
import sys
import numpy as np
import json
import torch
from PIL import Image
sys.path.append(os.path.join(os.getcwd(), "GroundingDINO"))
sys.path.append(os.path.join(os.getcwd(), "segment_anything"))
# Grounding DINO
import GroundingDINO.groundingdino.datasets.transforms as T
from GroundingDINO.groundingdino.models import build_model
from GroundingDINO.groundingdino.util.slconfig import SLConfig
from GroundingDINO.groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
# segment anything
from segment_anything import (
    sam_model_registry,
    sam_hq_model_registry,
    SamPredictor
)
import cv2
import matplotlib.pyplot as plt

# 加载图片
def load_image(image_path):
    # load image
    image_pil = Image.open(image_path).convert("RGB")  # load image

    transform = T.Compose(
        [
            T.RandomResize([800], max_size=1333),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image, _ = transform(image_pil, None)  # 3, h, w
    return image_pil, image
# 加载模型
def load_model(model_config_path, model_checkpoint_path, bert_base_uncased_path, device):
    args = SLConfig.fromfile(model_config_path)
    args.device = device
    args.bert_base_uncased_path = bert_base_uncased_path
    model = build_model(args)
    checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
    load_res = model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    print(load_res)
    _ = model.eval()
    return model
# 得到Grounding_DINO模型的检测框输出
def get_grounding_output(model, image, caption, box_threshold, text_threshold, with_logits=True, device="cpu"):
    caption = caption.lower()
    caption = caption.strip()
    if not caption.endswith("."):
        caption = caption + "."
    model = model.to(device)
    image = image.to(device)
    with torch.no_grad():
        outputs = model(image[None], captions=[caption])
    logits = outputs["pred_logits"].cpu().sigmoid()[0]  # (nq, 256)
    boxes = outputs["pred_boxes"].cpu()[0]  # (nq, 4)
    logits.shape[0]

    # filter output
    logits_filt = logits.clone()
    boxes_filt = boxes.clone()
    filt_mask = logits_filt.max(dim=1)[0] > box_threshold
    logits_filt = logits_filt[filt_mask]  # num_filt, 256
    boxes_filt = boxes_filt[filt_mask]  # num_filt, 4
    logits_filt.shape[0]

    # get phrase
    tokenlizer = model.tokenizer
    tokenized = tokenlizer(caption)
    # build pred
    pred_phrases = []
    for logit, box in zip(logits_filt, boxes_filt):
        pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
        if with_logits:
            pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
        else:
            pred_phrases.append(pred_phrase)

    return boxes_filt, pred_phrases
# 展示mask
def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
# 展示检索框
def show_box(box, ax, label):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))
    ax.text(x0, y0, label)
# 保存mask数据
def save_mask_data(output_dir, mask_list, box_list, label_list):
    value = 0  # 0 for background

    mask_img = torch.zeros(mask_list.shape[-2:])
    for idx, mask in enumerate(mask_list):
        mask_img[mask.cpu().numpy()[0] == True] = value + idx + 1
    plt.figure(figsize=(10, 10))
    plt.imshow(mask_img.numpy())
    plt.axis('off')
    plt.savefig(os.path.join(output_dir, 'mask.jpg'), bbox_inches="tight", dpi=300, pad_inches=0.0)

    json_data = [{
        'value': value,
        'label': 'background'
    }]
    for label, box in zip(label_list, box_list):
        value += 1
        name, logit = label.split('(')
        logit = logit[:-1] # the last is ')'
        json_data.append({
            'value': value,
            'label': name,
            'logit': float(logit),
            'box': box.numpy().tolist(),
        })
    with open(os.path.join(output_dir, 'mask.json'), 'w') as f:
        json.dump(json_data, f)

def main():
    # 直接初始化参数，不使用 argparse 解析命令行
    config_file = "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
    grounded_checkpoint = "groundingdino_swint_ogc.pth"
    sam_version = "vit_b"
    sam_checkpoint = None
    sam_hq_checkpoint = "sam_hq_vit_b.pth"  # 如果有 sam-hq，填入相应路径
    use_sam_hq = True  # 是否使用 sam-hq
    image_path = "assets/demo22.png"
    text_prompt = "smallest wooden block" # largest "smallest wooden block"
    output_dir = "outputs"
    box_threshold = 0.3
    text_threshold = 0.25
    device = "cuda"
    bert_base_uncased_path = None  # 如果有 BERT 相关路径，填入相应路径

    # 创建输出目录
    os.makedirs(output_dir, exist_ok=True)
    # 加载图像
    image_pil, image = load_image(image_path)
    image_pil.save(os.path.join(output_dir, "raw_image.jpg"))
    # 加载 Grounding DINO 模型
    model = load_model(config_file, grounded_checkpoint, bert_base_uncased_path, device=device)
    # 运行 Grounding DINO 进行目标检测 先看能否运行到这步 如果不能再启用修改浮点数
    # 可以考虑预先保存分割框的位置
    image = image.half()
    model = model.half()
    boxes_filt, pred_phrases = get_grounding_output(model, image, text_prompt, box_threshold, text_threshold,
                                                    device=device)
    # 初始化 SAM-HQ 预测器
    print(f"Using SAM-HQ model: {sam_hq_checkpoint}")
    predictor = SamPredictor(sam_hq_model_registry[sam_version](checkpoint=sam_hq_checkpoint).to(device))
    # 读取图像
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    predictor.set_image(image)
    # 变换 box 坐标
    size = image_pil.size
    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
        boxes_filt[i][2:] += boxes_filt[i][:2]

    boxes_filt = boxes_filt.cpu()
    transformed_boxes = predictor.transform.apply_boxes_torch(boxes_filt, image.shape[:2]).to(device)

    # 进行分割
    masks, _, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=transformed_boxes.to(device),
        multimask_output=False,
    )
    # **调整 mask 大小**
    target_size = (480, 640)  # 目标 mask 分辨率
    masks_resized = torch.nn.functional.interpolate(masks.float(), size=target_size, mode="nearest")

    # 绘制分割结果
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    for mask in masks_resized:
        show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
    for box, label in zip(boxes_filt, pred_phrases):
        show_box(box.numpy(), plt.gca(), label)

    plt.axis('off')
    plt.savefig(
        os.path.join(output_dir, "grounded_sam_output_wood_block.jpg"),
        bbox_inches="tight", dpi=300, pad_inches=0.0
    )
    print("成功保存分割后的数据 Saved grounded sam output")

    # 保存分割数据
    save_mask_data(output_dir, masks_resized, boxes_filt, pred_phrases)


if __name__ == '__main__':
    main()

得到下面的mask.jpg文件

再经过一次裁剪和灰度化得到：

import cv2
import os
import numpy as np
from matplotlib import pyplot as plt

def resize_and_convert_to_png(input_path, output_path, width=640, height=480):
    # 读取图片
    image = cv2.imread(input_path)

    # 检查图片是否正确读取
    if image is None:
        print(f"Error: 无法读取图片 {input_path}")
        return

    # 调整大小
    resized_image = cv2.resize(image, (width, height), interpolation=cv2.INTER_AREA)

    # 确保输出路径是 PNG 格式
    output_path = os.path.splitext(output_path)[0] + ".png"

    # 保存图片为 PNG 格式
    cv2.imwrite(output_path, resized_image)
    print(f"已保存调整后的 PNG 图片: {output_path}")


# 示例用法
input_image_path = "outputs/mask.jpg"  # 替换为你的输入图片路径
output_image_path = "outputs/mask_out.png"  # 替换为你的输出文件路径（会自动转换为 PNG）
resize_and_convert_to_png(input_image_path, output_image_path)

# 读取图片
image_path = "outputs/mask_out.png"
image = cv2.imread(image_path)

# 转换为HSV颜色空间，以便提取黄色区域
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

# 定义黄色的HSV范围
lower_yellow = np.array([20, 100, 100])
upper_yellow = np.array([30, 255, 255])

# 创建掩码，提取黄色区域
mask = cv2.inRange(hsv, lower_yellow, upper_yellow)

# 创建黑白图像：黄色区域变白色，其余变黑色
result = np.zeros_like(image)  # 先创建黑色背景
result[mask > 0] = [255, 255, 255]  # 将黄色区域变为白色

# 转换为灰度图
result_gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)

# 保存处理后的图片
output_path = "outputs/1742097201894254684.png"
cv2.imwrite(output_path, result_gray)

# 显示处理后的图片
plt.imshow(result_gray, cmap='gray')
plt.axis('off')
plt.show()

最终得到的mask是：

（4）mesh 第一关键的文件夹：这个是用iphone16Pro得到的hh，目前没有什么别的好方法。使用Apple商店里的AR code应用程序（AR code Object Capture 3D scan）扫描物体得到一个usdz文件，之后导入blender进行提取纹理和obj文件。具体细节后续有时间我再补充，大家可以先问Chatgpt怎么做。注意这里使用的iphone只有12以上，并且为Pro版本才可以使用这个软件（具备深感相机）。

先将usdz文件导入blender，需要下载读取usd文件的插件，导入后可以看到物体的模型，会有一个立方体的包围盒在外面，导出的时候要忽略它，只导出你建模的这一个物体

导出的时候选择仅选中项，可以得到obj和mtl文件

导出纹理文件：选择UV编辑，在图片标志的下拉框中有几个选项可以选择，其中两个png就是纹理文件，norm文件应该用不到，也可以不用导出，关键是这个tex0，带有物体材质的图片信息。

再点击图像，会有另存为这个选项，保存就可以了。最终得到三个最关键的文件。

textured_simple.obj :物体的模型文件

textured_simple.obj.mtl :物体的模型文件与纹理文件之间的调用桥梁

texture_map.png :物体的纹理文件保存了物体的皮肤特征

4. 编写测试代码进行测试

将代码中的输入替换为我的数据集：中间有一些我用来debug的注释大家可以直接删掉

# 此文件用于测试自己的数据集
# 识别时修改裁剪比例 如果将裁剪比例设置大 则包围盒聚焦于物块的上方
# 如果将裁剪比例设置小 则包围盒聚焦于物块的前方 且方向旋转对应不起来
from estimater import *
from datareader import *
import os
import logging
import trimesh
import numpy as np
import cv2
import imageio
import open3d as o3d


def main():
  code_dir = os.path.dirname(os.path.realpath(__file__))

  # 直接定义参数
  # 要处理的 3D 物体的 .obj 文件路径 包含默认路径/demo_data/mustard0/mesh/textured_simple.obj 另一个文件夹是 kinect_driller_seq
  # # 理论上这两个数据集都可以跑通 但是内存溢出了 应该如何修改呢
  # mesh_file = f'{code_dir}/demo_data/my_data/mesh/textured_simple.obj'
  # # 测试场景的文件夹路径，包含 RGB + 深度图像数据 /demo_data/mustard0
  # test_scene_dir = f'{code_dir}/demo_data/my_data'
  # 姿态跟踪的结果保存在 my_data_debug/ob_in_cam 文件夹下
  mesh_file = f'{code_dir}/demo_data/my_data4/mesh/untitled.obj'  # 或者 untitled
  test_scene_dir = f'{code_dir}/demo_data/my_data4'

  # 初始姿态估计的优化迭代次数，默认为 5 次
  est_refine_iter = 5
  # 姿态跟踪的优化迭代次数，默认为 2 次
  track_refine_iter = 2
  # 调试级别，控制可视化输出 默认为 1 级
  debug = 1

  # 调试信息存储目录
  debug_dir = f'{code_dir}/my_data_debug'

  set_logging_format()  # 设置日志格式
  set_seed(0)

  mesh = trimesh.load(mesh_file)  # 读取 3D 物体网格文件

  # 实验证明 网格拼接对于姿态识别分割没有作用
  # # 拆分成多个子网格
  # sub_meshes = mesh.split(only_watertight=False)
  # # 计算所有子网格的 AABB
  # for i, sub_mesh in enumerate(sub_meshes):
  #   print(f"Sub-mesh {i + 1} AABB size:", sub_mesh.bounding_box.extents)
  # full_mesh = trimesh.util.concatenate(sub_meshes)
  # # to_origin, extents = trimesh.bounds.oriented_bounds(full_mesh)

  # 如果扩大mesh 姿态检测为错误 物体为站立状态 改变这个因数物体姿态不会发生变化
  scale_factor = 1  # 放大 10%

  print(type(mesh))  # 确认是 Trimesh 还是 Scene
  # 如果 mesh 是一个 Scene，提取所有子模型
  if isinstance(mesh, trimesh.Scene):
    print("检测到多个子模型：")
    for name, submesh in mesh.geometry.items():
      print(f"名称: {name}, 顶点数: {len(submesh.vertices)}")
  # if isinstance(mesh, trimesh.Scene):
  #   mesh = list(mesh.geometry.values())[0]  # 获取场景中的第一个Trimesh对象

  # if isinstance(mesh, trimesh.Scene):
  #   # 按体积排序，选择最小的
  #   smallest_mesh = min(mesh.geometry.values(), key=lambda m: m.bounding_box.volume)
  #   mesh = smallest_mesh

  os.system(f'rm -rf {debug_dir}/* && mkdir -p {debug_dir}/track_vis {debug_dir}/ob_in_cam')

  # 计算 3D 物体的定向包围盒，返回 to_origin 变换矩阵和 extents 长宽高信息
  to_origin, extents = trimesh.bounds.oriented_bounds(mesh)
  # extents *= scale_factor
  print('物体大小',extents)

  # 计算物体的3D包围盒 表示 3D 物体的边界
  bbox = np.stack([-extents / 2, extents / 2], axis=0).reshape(2, 3)
  print("Number of vertices:", len(mesh.vertices))
  print("Number of faces:", len(mesh.faces))
  # mesh.fill_holes()  # 填补孔洞
  # mesh.remove_unreferenced_vertices()  # 移除未被面引用的顶点
  # mesh.remove_degenerate_faces()  # 移除坏的三角面
  # print("修正后 Number of vertices:", len(mesh.vertices))
  # print("修正后 Number of faces:", len(mesh.faces))

  # 初始化姿态估计器
  scorer = ScorePredictor()  # 评分模型，可能用于评估姿态估计的质量
  refiner = PoseRefinePredictor()  # 姿态优化模型，用于优化初始姿态估计结果
  glctx = dr.RasterizeCudaContext()  # 可能是一个 CUDA 计算环境，用于高效渲染

  # 创建姿态估计器 est，传入 3D 物体模型、评分器、优化器、调试参数等
  est = FoundationPose(
    model_pts=mesh.vertices, model_normals=mesh.vertex_normals, mesh=mesh,
    scorer=scorer, refiner=refiner, debug_dir=debug_dir, debug=debug, glctx=glctx
  )
  logging.info("Estimator initialization done")

  # 用户定义的数据读取类，应该用于读取 test_scene_dir 目录中的 RGB + 深度数据
  reader = YcbineoatReader(video_dir=test_scene_dir, shorter_side=None, zfar=np.inf)

  # 实时视频处理
  # 遍历 test_scene_dir 目录中的所有 RGB 帧，并读取对应的深度图
  for i in range(len(reader.color_files)):
    logging.info(f'i:{i}')

    color = reader.get_color(i)  # 读取第 i 帧的 RGB 图像
    depth = reader.get_depth(i)  # 读取第 i 帧的深度图

    if i == 0:
      # 从数据集中获取mask数据
      mask = reader.get_mask(0).astype(bool)
      # 进行 初始姿态估计，输入相机内参 (K)、RGB 图像、深度图和物体掩码 并进行 est_refine_iter 轮优化
      pose = est.register(K=reader.K, rgb=color, depth=depth, ob_mask=mask, iteration=est_refine_iter)

      # 只有当 debug 级别 大于等于 3 时，才会执行下面的代码
      if debug >= 3:  # debug为1级 最基本的可视化 debug为2级 保存中间结果track_vis中的图像 debug为3级 更详细的可视化，例如导出变换后的 3D 物体模型和场景点云
        m = mesh.copy()
        m.apply_transform(pose)
        m.export(f'{debug_dir}/model_tf.obj')
        xyz_map = depth2xyzmap(depth, reader.K)
        valid = depth >= 0.001
        pcd = toOpen3dCloud(xyz_map[valid], color[valid])
        o3d.io.write_point_cloud(f'{debug_dir}/scene_complete.ply', pcd)
    else:
      # 进行姿态跟踪，从前一帧的姿态开始，优化track_refine_iter轮
      pose = est.track_one(rgb=color, depth=depth, K=reader.K, iteration=track_refine_iter)

    # 保存物体在相机坐标系下的姿态矩阵
    os.makedirs(f'{debug_dir}/ob_in_cam', exist_ok=True)
    np.savetxt(f'{debug_dir}/ob_in_cam/{reader.id_strs[i]}.txt', pose.reshape(4, 4))

    # 在 RGB 图像上绘制 3D 包围盒和坐标轴，并显示出来
    if debug >= 1:
      center_pose = pose @ np.linalg.inv(to_origin)
      vis = draw_posed_3d_box(reader.K, img=color, ob_in_cam=center_pose, bbox=bbox)
      vis = draw_xyz_axis(color, ob_in_cam=center_pose, scale=0.1, K=reader.K, thickness=3, transparency=0,
                          is_input_rgb=True)
      cv2.imshow('1', vis[..., ::-1])
      cv2.waitKey(1)

    # 如果 debug >= 2，保存可视化的跟踪结果图片
    if debug >= 2:
      os.makedirs(f'{debug_dir}/track_vis', exist_ok=True)
      imageio.imwrite(f'{debug_dir}/track_vis/{reader.id_strs[i]}.png', vis)


if __name__ == '__main__':
  main()

最终输出：跟踪稳定的物块移动，weights文件夹中的权重参数全部使用默认的参数。