[Week 23] SAM (Segment Anything)

Latest recommended article published on 2025-02-27 10:52:46

L-含光承影

Tags: computer vision

Article link: https://blog.csdn.net/m0_59510256/article/details/145665210

SAM

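Abstract
Paper information
Research background
- 1. The missing "foundation model" in computer vision
- 2. Inherent challenges of image segmentation
SAM's task
SAM's architecture
Data engine
Experimental results (zero-shot transfer)
Innovations and limitations
Conclusion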

Abstract

This post introduces the Segment Anything Model (SAM), a general-purpose foundation model for image segmentation proposed by the Meta AI team in 2023. It aims at zero-shot segmentation of arbitrary targets from interactive user prompts such as points, boxes, and text. SAM's core idea is to recast segmentation as a prompt-driven generation problem and to obtain cross-domain generalization through large-scale pre-training and an efficient architecture. To address the reliance of traditional segmentation models on large amounts of annotated data, their lack of generality, and the difficulty of real-time interaction, SAM introduces the promptable segmentation task and a three-module architecture: a ViT image encoder pre-trained with MAE, a prompt encoder that supports multiple prompt modalities, and a lightweight Transformer mask decoder. By building the SA-1B dataset with an efficient data engine, SAM greatly increases the amount of training data and improves generalization. However, SAM still has limitations in fine-grained segmentation, computational cost, and dependence on prompt quality. Directions for future improvement include lightweight design, multimodal enhancement, extension to dynamic scenes, and domain adaptation.

Paper information

Title: Segment Anything
Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick
Source: https://arxiv.org/abs/2304.02643

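Research background

1. The missing "foundation model" in computer vision

In natural language processing (NLP), foundation models such as BERT and GPT achieve strong generality through large-scale pre-training and can be adapted quickly to downstream tasks by fine-tuning or prompting. Computer vision, and image segmentation in particular, lacks a comparable general-purpose model. Traditional segmentation methods such as Mask R-CNN and U-Net must be trained separately for each task, depend on large amounts of annotated data, and generalize poorly.

2. Inherent challenges of image segmentation

High annotation cost: pixel-level annotation (as in the COCO and Cityscapes datasets) takes a great deal of human labor.
Task diversity: instance segmentation, semantic segmentation, interactive segmentation, and other tasks each need their own model design and are hard to unify.
Poor handling of ambiguity: existing models struggle with ambiguous user prompts (for example, a single point may correspond to several objects).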

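SAM's task

Foundation models in NLP can usually perform zero-shot and few-shot learning on new datasets and tasks through prompting. Inspired by this, the authors propose the promptable segmentation task, whose goal is to return a valid segmentation mask given any segmentation prompt (see the figure below).
[Figure: the promptable segmentation task]
The promptable segmentation task serves both as the pre-training objective and as a way to solve general downstream segmentation tasks through prompt engineering.
A prompt can be a mask, points, a box, or text. A mask prompt is embedded with convolutions. Points and boxes are represented by positional encodings combined with a learned embedding for each prompt type. Text is embedded with a text encoder such as the one in CLIP. The backbone (the image encoder) is a ViT, chosen for its stronger feature extraction.
Requiring a valid output mask means that even when a prompt is ambiguous and could refer to several objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects, as shown in the figure below:
[Figure: valid masks for an ambiguous point prompt]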
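The prompt representations described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not SAM's actual code: the class name `ToyPromptEncoder`, the layer sizes, and the two-layer convolution stack are all assumptions chosen to keep the example short. It only shows the idea that point prompts become a positional encoding plus a learned label embedding, while mask prompts pass through convolutions.

```python
import math
import torch
import torch.nn as nn


class ToyPromptEncoder(nn.Module):
    """Illustrative prompt encoder; names and sizes are assumptions, not SAM's code."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Fixed random projection used for the positional encoding of (x, y).
        self.register_buffer("freqs", torch.randn(2, embed_dim // 2))
        # One learned embedding per point label: 0 = background, 1 = foreground.
        self.label_embed = nn.Embedding(2, embed_dim)
        # Dense (mask) prompts are embedded with strided convolutions.
        self.mask_conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=2, stride=2),
        )

    def encode_points(self, coords, labels):
        # coords: [B, N, 2] in [0, 1]; labels: [B, N] with values 0 or 1.
        proj = 2 * math.pi * ((2 * coords - 1) @ self.freqs)
        pos = torch.cat([proj.sin(), proj.cos()], dim=-1)  # [B, N, embed_dim]
        return pos + self.label_embed(labels)

    def encode_mask(self, mask):
        # mask: [B, 1, H, W] -> [B, embed_dim, H/4, W/4]
        return self.mask_conv(mask)


enc = ToyPromptEncoder(embed_dim=256)
coords = torch.tensor([[[0.25, 0.50], [0.80, 0.10]]])  # one image, two clicks
labels = torch.tensor([[1, 0]])                         # foreground, background
sparse = enc.encode_points(coords, labels)
dense = enc.encode_mask(torch.zeros(1, 1, 256, 256))
print(sparse.shape, dense.shape)
# torch.Size([1, 2, 256]) torch.Size([1, 256, 64, 64])
```

A box prompt could be handled the same way by encoding its two corner points, each with its own learned embedding; that variant is left out here for brevity.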

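SAM's architecture

SAM consists of three main components: an image encoder, a prompt encoder, and a mask decoder.
SAM first feeds the image and the prompt into the image encoder and the prompt encoder respectively, then combines the two sources of information in a lightweight mask decoder to predict the segmentation mask.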
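The data flow just described can be sketched with stand-in modules. Everything in `ToySAM` below is hypothetical and far smaller than the real model (a single convolution replaces the ViT, a linear layer replaces the prompt encoder, one attention layer replaces the decoder). The point is only the structure: the image is encoded once, and the resulting embedding is reused for every new prompt, which is what makes a lightweight decoder useful for interactive work.

```python
import torch
import torch.nn as nn


class ToySAM(nn.Module):
    """Data-flow sketch only; every layer is a stand-in, not SAM's architecture."""

    def __init__(self, dim: int = 32, patch: int = 16):
        super().__init__()
        # Stand-in for the ViT image encoder: one patchifying convolution.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Stand-in for the prompt encoder: a linear map from (x, y) to a token.
        self.prompt_encoder = nn.Linear(2, dim)
        # Stand-in for the mask decoder: prompt tokens attend to image tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def encode_image(self, image):
        return self.image_encoder(image)                    # [B, dim, h, w]

    def decode(self, image_emb, points):
        img_tokens = image_emb.flatten(2).transpose(1, 2)   # [B, h*w, dim]
        prompt_tokens = self.prompt_encoder(points)         # [B, N, dim]
        fused, _ = self.attn(prompt_tokens, img_tokens, img_tokens)
        query = fused.mean(dim=1)                           # [B, dim]
        # Dot product of the fused query with every image location -> mask logits.
        logits = torch.einsum("bc,bchw->bhw", query, image_emb)
        return logits.unsqueeze(1)                          # [B, 1, h, w]


model = ToySAM().eval()
image = torch.rand(1, 3, 128, 128)
with torch.no_grad():
    image_emb = model.encode_image(image)        # expensive step, once per image
    for click in ([[0.2, 0.3]], [[0.7, 0.6]]):   # cheap step, once per prompt
        mask_logits = model.decode(image_emb, torch.tensor([click]))
        print(mask_logits.shape)                 # torch.Size([1, 1, 8, 8])
```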
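To let SAM handle ambiguity, the authors design it to predict multiple masks for a single prompt (three in the paper), so that an ambiguous prompt can be answered with several plausible masks.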
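One way to picture multi-mask prediction is sketched below, under stated assumptions. The paper trains by backpropagating only the smallest loss among the candidate masks and ranks candidates with a predicted IoU score. The function names `min_over_masks_loss` and `pick_best_mask` are hypothetical, and plain binary cross-entropy stands in for the paper's loss to keep the example short.

```python
import torch
import torch.nn.functional as F


def min_over_masks_loss(mask_logits, target):
    """mask_logits: [B, K, H, W] candidate masks; target: [B, H, W] in {0, 1}.

    Returns the batch mean of the smallest per-candidate loss, so only the
    best-matching candidate receives gradient for each example.
    """
    k = mask_logits.shape[1]
    target_k = target.unsqueeze(1).expand(-1, k, -1, -1)
    per_pixel = F.binary_cross_entropy_with_logits(
        mask_logits, target_k, reduction="none"
    )
    per_mask = per_pixel.mean(dim=(2, 3))          # [B, K]
    return per_mask.min(dim=1).values.mean()


def pick_best_mask(mask_logits, iou_pred):
    """Select, per example, the candidate with the highest predicted IoU."""
    best = iou_pred.argmax(dim=1)                  # [B]
    batch = torch.arange(mask_logits.shape[0])
    return mask_logits[batch, best] > 0            # [B, H, W] boolean mask


# Toy example: K = 3 candidates on a 4x4 grid; candidate 1 matches the target.
target = torch.zeros(1, 4, 4)
target[:, :2] = 1.0
logits = torch.full((1, 3, 4, 4), -5.0)
logits[0, 1, :2] = 5.0
iou_pred = torch.tensor([[0.30, 0.90, 0.50]])

loss = min_over_masks_loss(logits, target)
mask = pick_best_mask(logits, iou_pred)
print(float(loss), mask[0].int().tolist())
```

In this toy case the loss is close to zero because candidate 1 already matches the target, and the selected mask covers the top two rows of the grid.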
Next, we walk through building SAM:

Building the auxiliary modules: positional encoding and a lightweight Transformer

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import Resize
from transformers import CLIPTextModel, CLIPTokenizer

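# ----------------------
# Auxiliary module 1: random positional encoding (for point/box prompts)
# ----------------------
class PositionEmbeddingRandom(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        if embed_dim % 2 != 0:
            raise ValueError("embed_dim must be even (half sine, half cosine)")
        self.embed_dim = embed_dim
        # Fixed (non-learnable) random Gaussian projection, stored as a buffer
        self.register_buffer("positional_encoding", torch.randn(2, embed_dim // 2))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        """
        Input:  coords [B, N, 2] (coordinates normalized to [0, 1])
        Output: positional embeddings [B, N, embed_dim]
        """
        # Map coordinates from [0, 1] to [-1, 1]
        coords = 2 * coords - 1
        # Project onto embed_dim // 2 random frequencies and scale by 2*pi
        # (a simplified version of the random Fourier-feature encoding)
        coords = 2 * torch.pi * (coords @ self.positional_encoding)
        # Concatenate sine and cosine features -> [B, N, embed_dim]
        return torch.cat([torch.sin(coords), torch.cos(coords)], dim=-1)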

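# ----------------------
# Auxiliary module 2: lightweight Transformer layers (for the image encoder)
# ----------------------
class LightweightTransformer(nn.Module):
    def __init__(self, dim=256, num_heads=8, depth=2):
        super().__init__()
        # NOTE: the source listing is cut off after "nn.TransformerEncoderLayer";
        # the arguments and forward() below are an assumed completion.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(depth)
        ])

    def forward(self, x):  # x: [B, L, dim]
        for layer in self.layers:
            x = layer(x)
        return x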