斜照124 - CSDN Blog

[Original] Paper Notes: ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

(2) For regions with ground-truth pixel-level mask annotations, we sample from the following 8 visual prompts to annotate them: rectangle, ellipse, point, triangle, mask (filled), mask contour, arrow, and scribble generated with Bézier curves (as shown in the figure below). However, these models typically adopt a fixed visual referring format, which is not intuitive for users.
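To make the sampling step above concrete, here is a minimal sketch and not the paper's actual code. It picks one of the eight prompt types and generates scribble points along a cubic Bézier curve. The uniform sampling weights, the choice of a cubic curve, and the helper names `sample_prompt_type`, `cubic_bezier`, and `random_scribble` are all assumptions made for illustration.

```python
import random

# The eight prompt types listed in the note above.
PROMPT_TYPES = [
    "rectangle", "ellipse", "point", "triangle",
    "mask", "mask_contour", "arrow", "scribble",
]


def sample_prompt_type(rng):
    """Pick one prompt type; uniform weighting is an assumption."""
    return rng.choice(PROMPT_TYPES)


def cubic_bezier(p0, p1, p2, p3, num_points=50):
    """Return num_points (x, y) samples along a cubic Bezier curve."""
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)
        s = 1.0 - t
        # Bernstein basis weights for a cubic curve.
        b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
        x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
        y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
        points.append((x, y))
    return points


def random_scribble(bbox, rng, num_points=50):
    """Hypothetical helper: a scribble inside bbox = (x_min, y_min, x_max, y_max).

    Four control points are drawn uniformly inside the box, so the curve
    stays inside it (a Bezier curve lies in the convex hull of its
    control points).
    """
    x_min, y_min, x_max, y_max = bbox
    ctrl = [
        (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        for _ in range(4)
    ]
    return cubic_bezier(ctrl[0], ctrl[1], ctrl[2], ctrl[3], num_points)


if __name__ == "__main__":
    rng = random.Random(0)
    kind = sample_prompt_type(rng)
    print("sampled prompt type:", kind)
    if kind == "scribble":
        print(random_scribble((10, 20, 110, 80), rng, num_points=5))
```

The returned points could then be drawn onto the image as a polyline with any drawing library; that rendering step is left out here.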

2025-06-07 16:39:42 655 1

[Original] Paper Notes: End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

A personal summary of the paper "End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames".

2025-01-11 20:18:07 732 3
