Using the SAM2 Video Module (translated from video_predictor_example.ipynb)

Video segmentation with SAM 2
This notebook shows how to use SAM 2 for interactive segmentation in videos. It covers the following:

adding clicks on a frame to get and refine masklets (spatio-temporal masks)
propagating clicks to get masklets throughout the video
segmenting and tracking multiple objects at the same time
We use segment or mask to refer to the model prediction for an object on a single frame, and masklet to refer to the spatio-temporal mask across the entire video.

If you are running locally with Jupyter, first install segment-anything-2 in your environment using the installation instructions in the repository.

Import libraries

import os
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
# use bfloat16 for the entire notebook
torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()

# turn on tf32 for Ampere GPUs (CUDA compute capability 8 or above)
# see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices for details
if torch.cuda.get_device_properties(0).major >= 8:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

Load the SAM 2 video predictor

from sam2.build_sam import build_sam2_video_predictor

# path to the SAM 2 model checkpoint
sam2_checkpoint = "../checkpoints/sam2_hiera_large.pt"

# path to the model config file
model_cfg = "sam2_hiera_l.yaml"

# build the video predictor from the specified config and checkpoint
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)

Change sam2_checkpoint to the location of your own checkpoint file.

Change model_cfg to the config that matches your checkpoint; the configs usually live in the sam2_configs folder. An example pairing is shown below.
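For reference, here is a rough sketch of how the officially released checkpoints pair with their configs; the file names are assumed from the repository layout at the time of writing, so verify them against your own checkpoints and sam2_configs folders:

# illustrative pairing of official SAM 2 checkpoints and their configs
# (assumed from the repository layout; verify against your local files)
checkpoint_to_config = {
    "sam2_hiera_tiny.pt":      "sam2_hiera_t.yaml",
    "sam2_hiera_small.pt":     "sam2_hiera_s.yaml",
    "sam2_hiera_base_plus.pt": "sam2_hiera_b+.yaml",
    "sam2_hiera_large.pt":     "sam2_hiera_l.yaml",
}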

Select an example video
We assume the video is stored as a list of JPEG frames with filenames like <frame_index>.jpg.

For a custom video, you can extract its JPEG frames with ffmpeg (https://ffmpeg.org/) as follows:

ffmpeg -i <your_video>.mp4 -q:v 2 -start_number 0 <output_dir>/'%05d.jpg'
where -q:v generates high-quality JPEG frames and -start_number 0 asks ffmpeg to start the JPEG filenames from 00000.jpg.
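If ffmpeg is not available, a minimal Python sketch using OpenCV (not a SAM 2 dependency, and not used elsewhere in this tutorial) can produce frames with the same naming scheme:

import os
import cv2  # pip install opencv-python

def extract_frames(video_path, output_dir):
    """Save every frame of video_path as <frame_index>.jpg (00000.jpg, 00001.jpg, ...)."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(output_dir, f"{idx:05d}.jpg"), frame,
                    [cv2.IMWRITE_JPEG_QUALITY, 95])
        idx += 1
    cap.release()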

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
video_dir = "./videos/bedroom"

# scan all the JPEG frame names in this directory
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

# take a look at the first video frame
frame_idx = 0
plt.figure(figsize=(12, 8))
plt.title(f"frame {frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[frame_idx])))
plt.show()  # needed when running as a script, otherwise no figure is displayed

Note: remember to call plt.show(), otherwise there may be no output.

(For the figure, see video_predictor_example.ipynb in the SAM2 repo; it is not reproduced here.)

Initialize the inference state
SAM 2 requires stateful inference for interactive video segmentation, so we need to initialize an inference state on this video.

During initialization, it loads all the JPEG frames in video_path and stores their pixels in inference_state (as shown in the progress bar below).

# initialize the inference state
inference_state = predictor.init_state(video_path=video_dir)

Example 1: Segment and track one object
Note: if you have run any previous tracking with this inference_state, reset it first via reset_state.

(The cell below is just for illustration; calling reset_state is not needed here, because this inference_state has only just been initialized.)

# reset the inference state
predictor.reset_state(inference_state)

Step 1: Add a first click on a frame
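The idea in this step is to place a positive click on the target object in one frame and let the model predict a mask for it. The sketch below assumes the API of the official SAM 2 video predictor and builds on the predictor, inference_state, video_dir and frame_names defined above; the frame index, object id and click coordinates are illustrative, and depending on your SAM 2 version the method may be named add_new_points or add_new_points_or_box.

# an illustrative frame index and object id (any integer id works)
ann_frame_idx = 0  # the frame we interact with
ann_obj_id = 1     # a unique id for the object being clicked

# one positive click at (x, y); label 1 = positive click, label 0 = negative click
points = np.array([[210, 350]], dtype=np.float32)
labels = np.array([1], np.int32)

# ask the model for a mask on this frame (method name may differ by SAM 2 version)
_, out_obj_ids, out_mask_logits = predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_id=ann_obj_id,
    points=points,
    labels=labels,
)

# overlay the predicted mask on the clicked frame
plt.figure(figsize=(12, 8))
plt.title(f"frame {ann_frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
plt.imshow((out_mask_logits[0] > 0.0).cpu().numpy().squeeze(), alpha=0.5)
plt.show()

In the official example, further positive or negative clicks can be added in the same way to refine the mask on this frame before propagating it through the video.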
