SAM (Segment Anything) Reading Notes

小兑对对

Published 2024-08-01 17:59:33

Tags: artificial intelligence, computer vision, notes

Article link: https://blog.csdn.net/m0_67832368/article/details/140853196

SAM (Segment Anything, 2023, Meta AI / Facebook)

🔗arxiv.org/abs/2304.02643

(I) Key contributions

(1) It proposes a foundation model for segmentation.

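As a representative large vision model, SAM aims to cover all downstream segmentation tasks (one to billion tasks) by returning a valid mask for any segmentation prompt. "Valid" here means that even when the prompt is ambiguous, the model still returns masks for the plausible objects it could refer to. For example, when the prompt point lands on a piece of clothing, SAM returns a mask for the clothing and a mask for the person wearing it.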

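(2) Real-time interaction: once the image embedding has been computed, the prompt encoder and mask decoder can run on a CPU.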

“Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ~ 50ms.”

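(3) Using a data collection loop (the data engine), it produced the largest segmentation dataset at the time, SA-1B (over 1 billion masks on 11 million images).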

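(4) Responsible AI: the paper reports that SAM performs similarly across different groups of people, i.e. it shows no obvious bias.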

“SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases.”
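Returning to point (1): the paper handles ambiguous prompts by having the model output several candidate masks for one prompt, each with its own predicted quality (IoU) score. The sketch below is only a toy illustration of that idea, not SAM code; `Candidate`, `pick_best`, `area`, the tiny masks, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str    # human-readable name, for illustration only
    mask: list    # binary mask as rows of 0/1
    score: float  # the model's own quality estimate (hypothetical values)


def pick_best(candidates):
    """Return the candidate with the highest predicted quality score."""
    return max(candidates, key=lambda c: c.score)


def area(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


# One click on a shirt could mean the shirt or the whole person,
# so both masks are kept as candidates instead of averaging them.
candidates = [
    Candidate("shirt", [[0, 1, 0], [0, 1, 0], [0, 0, 0]], 0.91),
    Candidate("person", [[0, 1, 0], [1, 1, 1], [0, 1, 0]], 0.88),
]
best = pick_best(candidates)
```

A caller that wants every interpretation keeps the whole list; a caller that needs a single answer takes `best`.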

(II) Main content of the paper

(1) Data collection loop (data engine)

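This is another major innovation: data is used to generate data. A SAM-based model produces annotations, those annotations are fed back to retrain and strengthen the model, and the stronger model then produces more and better data.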

“Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic.”

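The data engine has three stages: an assisted-manual stage, in which annotators label the initial masks with help from an early SAM (the paper reports about 4.3M masks from 120k images); a semi-automatic stage, in which the model already has some data behind it, so it pre-fills confident masks while annotators add and correct the rest to enrich the mask types (about 5.9M more masks from 180k images); and a fully automatic stage, which produces the final dataset with roughly 400× more masks than any earlier segmentation dataset.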

(2) The promptable model: SAM

The model is pre-trained on a large dataset so that it can perform zero-shot segmentation on downstream tasks. It has three components: an image encoder, a fast prompt encoder, and a lightweight mask decoder.

“pre-train it on a broad dataset using a task that enables powerful generalization”
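The three-part split is what makes interactive use cheap: the heavy image encoder runs once per image, and only the light prompt encoder and mask decoder run per prompt. Below is a minimal sketch of that "encode once, prompt many times" pattern. `CachedSegmenter`, `toy_encoder`, and `toy_decoder` are hypothetical stand-ins so the example runs without model weights; the `set_image`/`predict` naming follows the split used by `SamPredictor` in the official `segment_anything` package, but nothing here calls that library.

```python
class CachedSegmenter:
    """Runs the expensive encoder once per image, then answers many prompts."""

    def __init__(self, encode_image, decode_mask):
        self.encode_image = encode_image  # heavy: image -> embedding
        self.decode_mask = decode_mask    # light: (embedding, prompt) -> mask
        self.embedding = None
        self.encoder_calls = 0

    def set_image(self, image):
        self.embedding = self.encode_image(image)
        self.encoder_calls += 1

    def predict(self, prompt):
        if self.embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self.decode_mask(self.embedding, prompt)


def toy_encoder(image):
    # Stand-in for the ViT: just rescales pixel values to [0, 1].
    return [[pixel / 255 for pixel in row] for row in image]


def toy_decoder(embedding, prompt):
    # Stand-in for the mask decoder: selects cells similar to the clicked one.
    x, y = prompt
    clicked = embedding[y][x]
    return [[1 if abs(v - clicked) < 0.1 else 0 for v in row] for row in embedding]


segmenter = CachedSegmenter(toy_encoder, toy_decoder)
segmenter.set_image([[0, 0, 255], [0, 255, 255]])
masks = [segmenter.predict(p) for p in [(0, 0), (2, 0), (1, 1)]]
```

Three prompts are answered here, but the encoder runs only once, which is the property the 50 ms figure in the paper relies on.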

1.image encoder

This part does not introduce a new network architecture. It reuses existing techniques and encodes the image with a ViT pre-trained with MAE.

“use an MAE pre-trained Vision Transformer (ViT) with minimal adaptations to process high resolution inputs”
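To get a feel for what "high resolution inputs" means for a ViT, here is the shape arithmetic as a small sketch. The default values (1024×1024 input, 16×16 patches, 256 output channels) are taken from the paper rather than from these notes, and `embedding_shape` is a hypothetical helper.

```python
def embedding_shape(image_size=1024, patch_size=16, out_channels=256):
    """Shape (C, H, W) of the image embedding for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # one token per patch along each axis
    return (out_channels, side, side)


shape = embedding_shape()
num_tokens = shape[1] * shape[2]  # number of image tokens the decoder attends to
```

With these defaults the embedding is 16× smaller than the input along each side, which is why the decoder later has to upsample.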

2.fast prompt encoder

a.fast: it enables real-time interaction. Given an image embedding, a mask can be predicted from a prompt in about 50 ms.

“Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in 50ms in a web browser.”

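b.SAM accepts several kinds of prompt, which fall into two groups: sparse and dense. Sparse prompts are points, boxes, and text; the dense prompt is a mask. After encoding, all of them become 256-dimensional embeddings.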

c.The model handles prompts with existing techniques: CLIP for text, positional encodings for points and boxes, and convolutions for masks.
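As an illustration of how a point prompt can become a 256-dimensional vector, here is a minimal sketch of a random Fourier-feature positional encoding, the family of encoding the paper uses for points and boxes. The function names, the seed, and the omission of the learned per-type embedding (foreground/background point, box corner) that the real model adds are simplifications assumed for this example.

```python
import math
import random

EMBED_DIM = 256


def make_frequencies(num_feats=EMBED_DIM // 2, seed=0):
    """Fixed random 2D frequencies, one (fx, fy) pair per sin/cos feature pair."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_feats)]


def encode_point(x, y, width, height, freqs):
    """Map a pixel coordinate to a vector of len(freqs) sines and cosines."""
    # Normalize pixel coordinates to [-1, 1].
    nx = 2 * (x / width) - 1
    ny = 2 * (y / height) - 1
    # Project onto each random frequency, then take sin and cos.
    proj = [2 * math.pi * (fx * nx + fy * ny) for fx, fy in freqs]
    return [math.sin(p) for p in proj] + [math.cos(p) for p in proj]


freqs = make_frequencies()
vec = encode_point(512, 256, 1024, 1024, freqs)
```

Because the frequencies are fixed, the same click always maps to the same vector, and nearby clicks map to nearby vectors for the low-frequency components.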

3.a lightweight mask decoder

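a.A two-layer decoder: only two layers are used, which suggests that the image features extracted upstream are already very mature (the rabbit already looks a lot like a rabbit). This also indirectly reflects how much data went into the earlier training.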

b.The prompt is added back to the input of every layer (a bit like "say important things three times", haha)

c.Self-attention and cross-attention (key point!!)

“uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings.”

Self-attention lets each token tell the other tokens what it is responsible for, so that work is neither duplicated nor missed;

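Cross-attention (in both directions) lets different tokens look for different targets in the image, while the image features in turn adjust themselves according to the tokens. It is somewhat like the cross-attention between s and q in the Adaptive Attention Module of "Few-Shot Classification via Adaptive Attention".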
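The order of operations in one decoder layer can be sketched in plain Python. This is a deliberately stripped-down version: single-head attention with no learned projections, no layer norm, no MLP, and no positional encodings, all of which the real decoder has. `attend`, `add`, and `two_way_block` are hypothetical names.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual connection between two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def two_way_block(tokens, image):
    tokens = add(tokens, attend(tokens, tokens, tokens))  # self-attention among tokens
    tokens = add(tokens, attend(tokens, image, image))    # tokens query the image
    image = add(image, attend(image, tokens, tokens))     # image queries the tokens
    return tokens, image


tokens = [[1.0, 0.0], [0.0, 1.0]]                           # 2 prompt/output tokens
image = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0], [0.0, 2.0]]  # 4 image positions
new_tokens, new_image = two_way_block(tokens, image)
```

Both sequences keep their length and width; what changes is that each side now carries information from the other, which is the "update all embeddings" part of the quote.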

The figure below shows the two-way attention mechanism between the tokens and the image.

Finally, the image embedding is upsampled by 4× (expanding the features to add detail), and the individual masks are output.
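A toy version of this last step is sketched below. Two things are assumptions of the sketch: nearest-neighbour repetition stands in for the learned transposed convolutions the paper uses for upsampling, and the mask is read out as a per-pixel dot product between an output token and the upsampled features, thresholded at zero. `upsample_nearest` and `predict_mask` are hypothetical names.

```python
def upsample_nearest(feature_map, factor=4):
    """feature_map[h][w] is a channel vector; each cell is repeated factor x factor."""
    out = []
    for row in feature_map:
        wide = [cell for cell in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def predict_mask(feature_map, mask_token, threshold=0.0):
    """Binary mask from the dot product of every pixel feature with the token."""
    return [[1 if sum(a * b for a, b in zip(cell, mask_token)) > threshold else 0
             for cell in row]
            for row in feature_map]


features = [[[1.0, 0.0], [0.0, 1.0]],
            [[-1.0, 0.0], [1.0, 1.0]]]  # 2 x 2 grid, 2 channels per cell
mask_token = [1.0, 0.0]                 # picks out cells with a positive first channel
mask = predict_mask(upsample_nearest(features), mask_token)
```

The 2×2 feature grid becomes an 8×8 mask; with several output tokens, the same upsampled features yield several masks.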

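As a classic representative of large segmentation models and a foundation model in its own right, SAM (Segment Anything) has had a significant formative influence on image segmentation and on large vision models in general. The paper keeps popular, simple existing components and adapts them with modest innovations, and it proposes the data collection loop to realize the idea of "data generating data", an important contribution to large-data, large-model research.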