[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉导航

专属领域论文订阅

VX 扫码关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持
如果你觉得对你有帮助,可以扫码关注,每日准时为你推送最新论文


分类:LLM;CLIP @ Visual transformers @ VLM @ visual model;diffusion policy @ diffusion formulation @ diffusion model;Visual Navigation @ Visual Exploration @ VSLAM

== LLM ==

标题: LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models

作者: Ahmad Faiz, Sotaro Kaneda, Ruhan Wang

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2309.14393v2

[GitHub:]https://github.com/SotaroKaneda/MLCarbon|

中文摘要: 与大型语言模型(LLMs)相关的碳足迹是一个重要问题,涵盖其训练、推理、实验和存储过程中的排放,包括运营碳排放和嵌入碳排放。一个关键方面是在新兴LLM训练之前就准确估计其碳影响,而这在很大程度上取决于GPU的使用。现有研究报告了LLM训练的碳足迹,但只有一种工具mlco2可以在实际训练前预测新神经网络的碳足迹。然而,mlco2存在几个严重的局限:它无法将估计扩展到密集(dense)或混合专家(MoE)LLMs,忽略关键的架构参数,只关注GPU,并且无法对嵌入碳足迹进行建模。为弥补这些不足,我们提出了LLMCarbon,一个面向密集和MoE LLMs设计的端到端碳足迹预测模型。与mlco2相比,LLMCarbon显著提高了各种LLMs碳足迹估计的准确性。源代码发布于 https://github.com/SotaroKaneda/MLCarbon 。

摘要: The carbon footprint associated with large language models (LLMs) is a significant concern, encompassing emissions from their training, inference, experimentation, and storage processes, including operational and embodied carbon emissions. An essential aspect is accurately estimating the carbon impact of emerging LLMs even before their training, which heavily relies on GPU usage. Existing studies have reported the carbon footprint of LLM training, but only one tool, mlco2, can predict the carbon footprint of new neural networks prior to physical training. However, mlco2 has several serious limitations. It cannot extend its estimation to dense or mixture-of-experts (MoE) LLMs, disregards critical architectural parameters, focuses solely on GPUs, and cannot model embodied carbon footprints. Addressing these gaps, we introduce LLMCarbon, an end-to-end carbon footprint projection model designed for both dense and MoE LLMs. Compared to mlco2, LLMCarbon significantly enhances the accuracy of carbon footprint estimations for various LLMs. The source code is released at https://github.com/SotaroKaneda/MLCarbon.
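
下面是一个极简的Python示意片段(并非论文LLMCarbon的官方实现,函数与参数均为假设的示例值),仅用于说明按“GPU小时 × 单卡功耗 × PUE × 电网碳强度”粗略估算运营碳排放的通用思路:

```python
def operational_carbon_kg(gpu_hours: float,
                          gpu_power_w: float = 300.0,    # 假设:单卡平均功耗(瓦)
                          pue: float = 1.1,              # 假设:数据中心PUE
                          carbon_intensity: float = 0.4  # 假设:电网碳强度(kgCO2e/kWh)
                          ) -> float:
    """用 GPU小时 x 功耗 x PUE x 碳强度 粗略估算运营碳排放(kgCO2e),仅作概念演示。"""
    energy_kwh = gpu_hours * gpu_power_w / 1000.0 * pue
    return energy_kwh * carbon_intensity


# 用法示例:假设某次训练共消耗 1e6 GPU小时
print(operational_carbon_kg(1.0e6))  # 在上述假设参数下约为 1.32e5 kgCO2e
```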


标题: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

作者: Tianle Cai, Yuhong Li, Zhengyang Geng

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10774v1

[GitHub:]https://github.com/FasterDecoding/Medusa|

摘要: The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa introduces only minimal overhead in terms of single-step latency while substantially reducing the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model’s capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
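
下面给出一个极简的PyTorch风格示意(非Medusa官方实现,模块结构与维度均为假设),说明“在冻结主干的最后隐状态上附加K个额外解码头、并行预测后续K个token”的基本思路:

```python
import torch
import torch.nn as nn

class MedusaLikeHeads(nn.Module):
    """示意:在主干LLM最后的隐状态上附加K个额外解码头,分别预测未来第1..K个token。"""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor):
        # last_hidden: [batch, hidden_size],主干在当前位置输出的隐状态(主干参数可保持冻结)
        # 返回K组logits,对应未来K个token的候选分布,可据此构造树状候选再并行验证
        return [head(last_hidden) for head in self.heads]

# 用法示例(假设 hidden=4096, vocab=32000)
heads = MedusaLikeHeads(4096, 32000, num_heads=4)
logits_list = heads(torch.randn(1, 4096))
```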


标题: Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment

作者: Fanqi Wan, Xinting Huang, Leyang Cui

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10768v1

[GitHub:]https://github.com/fanqiwan/KCA|

中文摘要: 虽然大型语言模型(LLMs)在对齐后的各种任务中表现出色,但它们仍然可能自信地产生与上下文或世界知识相矛盾的响应,这种现象被称为“幻觉”。在本文中,我们证明了减少训练数据中封装的外部知识与预训练语料库中继承的内在知识之间的不一致性,可以减轻对齐过程中的幻觉。具体来说,我们引入了一种新的知识一致对齐(KCA)方法,该方法基于外部知识自动构造测验,以评估LLMs对相应知识的理解。对于存在知识不一致的数据,KCA实施了几种简单而有效的处理策略。我们使用不同主干和规模的LLMs,在六个基准测试上展示了所提出的KCA方法在减轻幻觉方面的卓越性能。此外,我们证实了知识不一致性与幻觉之间的相关性,表明减少知识不一致性对减轻幻觉是有效的。我们的代码、模型权重和数据公开于 https://github.com/fanqiwan/KCA 。

摘要: While Large Language Models (LLMs) have proven to be exceptional on a variety of tasks after alignment, they may still produce responses that contradict the context or world knowledge confidently, a phenomenon known as "hallucination". In this paper, we demonstrate that reducing the inconsistency between the external knowledge encapsulated in the training data and the intrinsic knowledge inherited in the pretraining corpus could mitigate hallucination in alignment. Specifically, we introduce a novel knowledge consistent alignment (KCA) approach, which involves automatically formulating examinations based on external knowledge for accessing the comprehension of LLMs. For data encompassing knowledge inconsistency, KCA implements several simple yet efficient strategies for processing. We illustrate the superior performance of the proposed KCA approach in mitigating hallucinations across six benchmarks using LLMs of different backbones and scales. Furthermore, we confirm the correlation between knowledge inconsistency and hallucination, signifying the effectiveness of reducing knowledge inconsistency in alleviating hallucinations. Our code, model weights, and data are public at https://github.com/fanqiwan/KCA.
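
下面是一个概念性的Python草图(并非KCA的官方实现,ask_llm为假设的查询接口),示意“基于外部知识构造测验、统计LLM回答与参考答案一致比例”这一知识一致性检查思路:

```python
from typing import Callable, List

def knowledge_consistency_rate(reference: str,
                               questions: List[str],
                               gold_answers: List[str],
                               ask_llm: Callable[[str], str]) -> float:
    """示意:用基于外部知识构造的问题“考察”LLM,统计其回答与参考答案一致的比例。
    ask_llm 是假设的查询接口,由调用方提供;一致性判断这里仅用朴素的子串匹配。"""
    hits = 0
    for question, gold in zip(questions, gold_answers):
        prompt = f"根据以下资料回答问题。\n资料:{reference}\n问题:{question}"
        answer = ask_llm(prompt)
        hits += int(gold.strip().lower() in answer.strip().lower())
    return hits / max(len(questions), 1)
```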


标题: Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10727v1

[GitHub:]https://github.com/Tool-LMM/Tool-LMM.|

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人表现,引发了许多将其用作中央控制器来构建智能体系统的探索。多项研究侧重于将LLMs与外部工具连接,以扩展应用场景。然而,目前LLMs感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解含糊不清。LLMs有望通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了Tool-LMM,一个结合开源LLMs和多模态编码器的系统,使训练得到的LLM能够感知多模态输入指令,并正确选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。该数据集的另一个重要特征是,由于相同功能和同义功能的存在,它还包含同一指令的多个潜在选择,从而为同一查询提供了更多潜在的解决方案。实验表明,我们的LMM能够为多模态指令推荐合适的工具。代码和数据可在 https://github.com/Tool-LMM/Tool-LMM 获得。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose Tool-LMM, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our LMM is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/Tool-LMM/Tool-LMM.
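
下面用一个与论文模型无关的极简numpy示意(指令与工具描述向量假定来自某个文本/多模态编码器)说明“按指令与工具描述的向量相似度做功能匹配选工具”的朴素思路:

```python
import numpy as np
from typing import Dict

def recommend_tool(instruction_vec: np.ndarray, tool_vecs: Dict[str, np.ndarray]) -> str:
    """示意:给定指令向量与各工具描述向量(均为假设输入),按余弦相似度返回最匹配的工具名,
    用以说明“功能匹配选工具”的朴素思路。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(tool_vecs, key=lambda name: cos(instruction_vec, tool_vecs[name]))

# 用法示例(随机向量仅作演示)
rng = np.random.default_rng(0)
tools = {"image-captioning": rng.normal(size=8), "speech-to-text": rng.normal(size=8)}
print(recommend_tool(rng.normal(size=8), tools))
```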


标题: ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

作者: Chen Liang, Yu Wu, Yawei Luo

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2103.10702v4

[Project:]https://ieeexplore.ieee.org/abstract/document/10083244|

中文摘要: 基于文本的视频分割是一项具有挑战性的任务,它要分割出视频中由自然语言指代的对象,本质上需要语义理解和细粒度的视频理解。现有方法以自下而上的方式将语言表示引入分割模型,仅在ConvNets的局部感受野内进行视觉-语言交互。我们认为这种交互并不充分:在只有局部观察的情况下,模型几乎无法构建区域级关系,这与自然语言/指代表达的描述逻辑相悖。事实上,人们通常借助与其他对象的关系来描述目标对象,如果不看完整段视频,这些关系可能不易理解。为了解决这个问题,我们提出了一种新的自上而下方法,模仿人类在语言指导下分割对象的方式:先找出视频中的所有候选对象,再通过解析这些高层对象之间的关系来选出被指代的对象。为了精确理解关系,我们研究了三种对象级关系,即位置关系、文本引导的语义关系和时间关系。在A2D Sentences和J-HMDB Sentences上的大量实验表明,我们的方法大幅优于最先进的方法。定性结果也表明我们的结果更具可解释性。

摘要: Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within local receptive fields of ConvNets. We argue that such interaction is not fulfilled since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language/referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach by imitating how we humans segment an object with language guidance. We first figure out all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show our method outperforms state-of-the-art methods by a large margin. Qualitative results also show our results are more explainable.


标题: Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

作者: Yang Liu, Muzhi Zhu, Hengtao Li

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2305.13310v2

[GitHub:]https://github.com/aim-uofa/Matcher.|

中文摘要: 在大规模预训练的支持下,视觉基础模型(vision foundation models)在开放世界图像理解方面展现出巨大潜力。然而,与擅长直接处理各种语言任务的大型语言模型不同,视觉基础模型需要特定于任务的模型结构,并针对具体任务进行微调。在这项工作中,我们提出了Matcher,一种利用现成视觉基础模型来解决各种感知任务的新型感知范式。Matcher无需训练,仅通过一个上下文示例即可分割任何内容。此外,我们在Matcher框架内设计了三个有效的组件,与这些基础模型协作,在不同的感知任务中释放它们的全部潜力。Matcher在各种分割任务中展示了令人印象深刻的泛化性能,且全程无需训练。例如,它在COCO-20^i上仅用一个示例就实现了52.7%的mIoU,超过最先进的专家模型1.6%;在我们提出的LVIS-92^i单样本语义分割上实现了33.0%的mIoU,比最先进的通才模型高出14.4%。可视化结果进一步展示了Matcher应用于真实场景图像时的开放世界通用性和灵活性。我们的代码可在 https://github.com/aim-uofa/Matcher 找到。

摘要: Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher.
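
下面是一个示意性的numpy片段(非Matcher官方实现,patch特征假定来自某个现成视觉基础模型),说明“用参考图掩码内特征与目标图特征做余弦匹配来粗定位对应区域”的核心思想:

```python
import numpy as np

def match_reference_region(ref_feats, ref_mask, tgt_feats, topk=100):
    """示意:用参考图掩码内的patch特征与目标图逐patch特征的余弦相似度,粗略定位目标图中的对应区域。
    ref_feats/tgt_feats: [N, C] 的patch特征(假设输入);ref_mask: [N] 布尔前景掩码。
    返回目标图每个patch的匹配得分(越高越可能属于被指代的对象)。"""
    ref = ref_feats[ref_mask]
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ ref.T                                     # [N_tgt, N_ref] 余弦相似度
    return np.sort(sim, axis=1)[:, -topk:].mean(axis=1)   # 每个目标patch取top-k平均作为得分
```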


== CLIP @ Visual transformers @ VLM @ visual model ==

标题: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

作者: Lihe Yang, Bingyi Kang, Zilong Huang

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10891v1

[Project:]https://depth-anything.github.io|

[GitHub:]https://github.com/LiheYoung/Depth-Anything.|

中文摘要: 这项工作提出了Depth Anything,一个非常实用的鲁棒单目深度估计解决方案。我们不追求新的技术模块,而是旨在建立一个简单而强大的基础模型,能够在任何情况下处理任何图像。为此,我们设计了一个数据引擎来收集并自动标注大规模未标注数据(约62M),从而扩大数据覆盖范围,进而降低泛化误差。我们研究了两种使数据扩展切实有效的简单策略。首先,利用数据增强工具构造更具挑战性的优化目标,迫使模型主动寻求额外的视觉知识并获得鲁棒的表示。其次,设计了一个辅助监督,促使模型继承预训练编码器中丰富的语义先验。我们广泛评估了它的零样本能力,涵盖六个公共数据集和随机拍摄的照片,展示了令人印象深刻的泛化能力。此外,通过使用来自NYUv2和KITTI的度量深度信息进行微调,我们刷新了SOTA。更好的深度模型也带来了更好的深度条件ControlNet。我们的模型发布于 https://github.com/LiheYoung/Depth-Anything 。

摘要: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
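
下面是一个概念性的PyTorch草图(非Depth Anything官方训练代码,teacher/student/strong_aug均为假设的可调用对象),示意“数据引擎产生伪标签 + 在强增强输入上构造更难优化目标”的训练循环:

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(teacher, student, unlabeled_images, strong_aug, optimizer):
    """示意:教师模型为无标注图像生成伪深度标签,学生在强增强后的输入上拟合该标签。
    teacher/student/strong_aug 均为假设的可调用对象,仅说明“数据引擎 + 更难优化目标”的思路。"""
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_images)      # 伪标签:教师在原始图像上的预测
    pred = student(strong_aug(unlabeled_images))      # 学生在强增强图像上预测
    loss = F.l1_loss(pred, pseudo_depth)              # 更具挑战性的优化目标(示意)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```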


标题: Synthesizing Moving People with 3D Control

作者: Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10889v1

[Project:]https://boyiliee.github.io/3DHM.github.io/.|

中文摘要: 在本文中,我们提出了一个基于扩散模型的框架,用于根据给定的目标3D运动序列,从单张图像中生成人物动画。我们的方法有两个核心组成部分:a)学习关于人体和服装不可见部分的先验知识;b)以合适的服装和纹理渲染新的身体姿势。对于第一部分,我们学习一个补全扩散模型,在给定单张图像的情况下推断出人物不可见的部分。我们在纹理贴图空间上训练这个模型,由于该空间对姿势和视点不变,因而样本效率更高。其次,我们开发了一个由3D人体姿势控制的基于扩散的渲染管线,它能生成人物新姿势的逼真渲染,包括衣服、头发以及对不可见区域的合理补全。这种解耦的方法使我们的方法能够生成一系列图像,在3D姿态上忠实于目标运动,在视觉相似性上忠实于输入图像。除此之外,3D控制还允许使用各种合成相机轨迹来渲染人物。实验表明,与以前的方法相比,我们的方法在生成长时间运动以及各种具有挑战性的复杂姿势方面更加稳健。更多详情请查看我们的网站:https://boyiliee.github.io/3DHM.github.io/ 。

摘要: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and, to the input image in terms of visual similarity. In addition to that, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied challenging and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


标题: ActAnywhere: Subject-Aware Video Background Generation

作者: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10822v1

[Project:]https://actanywhere.github.io.|

中文摘要: 生成与前景主体运动相适应的视频背景,是电影行业和视觉特效社区的一个重要问题。这项任务需要合成与前景主体的运动和外观相一致的背景,同时符合艺术家的创作意图。我们提出了ActAnywhere,一个生成模型,可以将这一传统上需要繁琐手工劳动的过程自动化。我们的模型利用大规模视频扩散模型的能力,并专门针对此任务进行定制。ActAnywhere以前景主体分割序列作为输入,以描述目标场景的图像作为条件,生成具有逼真前景-背景交互的连贯视频,同时与条件帧保持一致。我们在一个大规模人-场景交互视频数据集上训练模型。大量评估证明了模型的卓越性能,明显优于基线。此外,我们还表明ActAnywhere能泛化到多种分布外样本,包括非人类主体。请访问我们的项目网页:https://actanywhere.github.io 。

摘要: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist’s creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.


标题: AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance

作者: Joao P. C. Bertoldo, Dick Ameln, Ashwin Vaidya

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.01984v2

[Project:]https://summerofcode.withgoogle.com/archive/2023/projects/SPMopugd|

[GitHub:]https://github.com/jpcbertoldo/aupimo.|

中文摘要: 视觉异常检测研究的最新进展显示,在MVTec和VisA等公共基准数据集上,AUROC和AUPRO分数正趋于完美召回,给人的印象是这些基准已接近被解决。然而,高AUROC和AUPRO分数并不总能反映定性表现,这限制了这些指标在现实应用中的有效性。我们认为,缺乏足够好的评估指标所造成的人为上限制约了该领域的发展,因此重新审视用于评估算法的指标至关重要。为此,我们引入了每图像重叠度(Per-IMage Overlap, PIMO),一种解决AUROC和AUPRO缺点的新指标。PIMO保留了现有指标基于召回率的性质,但引入了两点区别:曲线(及相应的曲线下面积)按图像逐一计算,且其X轴仅依赖于正常图像。按图像度量召回率简化了实例分数的索引,并且对有噪声的标注更加鲁棒。正如我们所展示的,它还加速了计算,并允许使用统计检验来比较模型。通过对正常图像施加较低的误报容忍度,PIMO提供了增强的模型验证流程,并突出了数据集之间的性能差异。我们的实验表明,PIMO带来了实际优势和细致的性能洞察,重新定义了异常检测基准,尤其是挑战了MVTec AD和VisA数据集已被当代模型解决的看法。可在GitHub上获得:https://github.com/jpcbertoldo/aupimo 。

摘要: Recent advances in visual anomaly detection research have seen AUROC and AUPRO scores on public benchmark datasets such as MVTec and VisA converge towards perfect recall, giving the impression that these benchmarks are near-solved. However, high AUROC and AUPRO scores do not always reflect qualitative performance, which limits the validity of these metrics in real-world applications. We argue that the artificial ceiling imposed by the lack of an adequate evaluation metric restrains progression of the field, and it is crucial that we revisit the evaluation metrics used to rate our algorithms. In response, we introduce Per-IMage Overlap (PIMO), a novel metric that addresses the shortcomings of AUROC and AUPRO. PIMO retains the recall-based nature of the existing metrics but introduces two distinctions: the assignment of curves (and respective area under the curve) is per-image, and its X-axis relies solely on normal images. Measuring recall per image simplifies instance score indexing and is more robust to noisy annotations. As we show, it also accelerates computation and enables the usage of statistical tests to compare models. By imposing low tolerance for false positives on normal images, PIMO provides an enhanced model validation procedure and highlights performance variations across datasets. Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights that redefine anomaly detection benchmarks – notably challenging the perception that MVTec AD and VisA datasets have been solved by contemporary models. Available on GitHub: https://github.com/jpcbertoldo/aupimo.
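
下面给出一个简化的numpy示意(并非AUPIMO的官方实现),说明“阈值仅由正常图像像素分数决定、再逐图像计算异常区域召回率”的基本思路:

```python
import numpy as np

def per_image_recall_at_normal_fpr(anomaly_maps, gt_masks, normal_maps, fpr=0.05):
    """示意:PIMO式评估的简化版本:阈值仅由正常图像的像素分数决定
    (使其像素级假阳性率不超过fpr),再在每张异常图像上分别计算异常像素的召回率。"""
    normal_scores = np.concatenate([m.ravel() for m in normal_maps])
    thr = np.quantile(normal_scores, 1.0 - fpr)       # 正常像素中仅有fpr比例超过该阈值
    recalls = []
    for amap, mask in zip(anomaly_maps, gt_masks):
        tp = np.logical_and(amap >= thr, mask.astype(bool)).sum()
        recalls.append(tp / max(int(mask.sum()), 1))  # 该图像上异常区域的召回率
    return np.array(recalls)
```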


标题: Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10727v1

[GitHub:]https://github.com/Tool-LMM/Tool-LMM.|

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人表现,引发了许多将其用作中央控制器来构建智能体系统的探索。多项研究侧重于将LLMs与外部工具连接,以扩展应用场景。然而,目前LLMs感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解含糊不清。LLMs有望通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了Tool-LMM,一个结合开源LLMs和多模态编码器的系统,使训练得到的LLM能够感知多模态输入指令,并正确选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。该数据集的另一个重要特征是,由于相同功能和同义功能的存在,它还包含同一指令的多个潜在选择,从而为同一查询提供了更多潜在的解决方案。实验表明,我们的LMM能够为多模态指令推荐合适的工具。代码和数据可在 https://github.com/Tool-LMM/Tool-LMM 获得。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose Tool-LMM, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our LMM is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/Tool-LMM/Tool-LMM.


标题: Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

作者: Chen Liang, Yu Wu, Tianfei Zhou

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2106.01061v2

[Project:]https://ieeexplore.ieee.org/abstract/document/10083244|

中文摘要: 指代视频对象分割(RVOS)旨在自然语言指代的引导下分割视频中的对象。以前的方法通常直接在图像网格上对语言指代进行定位来处理RVOS,这种自下而上的策略未能挖掘对象级线索,容易导致较差的结果。在这项工作中,我们提出了一个两阶段、自上而下的RVOS解决方案。首先,通过将从若干采样帧中检测到的对象掩码传播到整个视频,构建一组详尽的对象轨迹(tracklet)。其次,提出了一个基于Transformer的tracklet-语言定位模块,同时高效地建模实例级视觉关系和跨模态交互。我们的模型在CVPR2021 Referring Youtube-VOS挑战赛中排名第一。

摘要: Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. Such bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks first place on CVPR2021 Referring Youtube-VOS challenge.


== diffusion policy@diffusion formulation@diffusion model ==

标题: Synthesizing Moving People with 3D Control

作者: Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10889v1

[Project:]https://boyiliee.github.io/3DHM.github.io/.|

中文摘要: 在本文中,我们提出了一个基于扩散模型的框架,用于根据给定的目标3D运动序列,从单张图像中生成人物动画。我们的方法有两个核心组成部分:a)学习关于人体和服装不可见部分的先验知识;b)以合适的服装和纹理渲染新的身体姿势。对于第一部分,我们学习一个补全扩散模型,在给定单张图像的情况下推断出人物不可见的部分。我们在纹理贴图空间上训练这个模型,由于该空间对姿势和视点不变,因而样本效率更高。其次,我们开发了一个由3D人体姿势控制的基于扩散的渲染管线,它能生成人物新姿势的逼真渲染,包括衣服、头发以及对不可见区域的合理补全。这种解耦的方法使我们的方法能够生成一系列图像,在3D姿态上忠实于目标运动,在视觉相似性上忠实于输入图像。除此之外,3D控制还允许使用各种合成相机轨迹来渲染人物。实验表明,与以前的方法相比,我们的方法在生成长时间运动以及各种具有挑战性的复杂姿势方面更加稳健。更多详情请查看我们的网站:https://boyiliee.github.io/3DHM.github.io/ 。

摘要: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and, to the input image in terms of visual similarity. In addition to that, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied challenging and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


标题: ActAnywhere: Subject-Aware Video Background Generation

作者: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10822v1

[Project:]https://actanywhere.github.io.|

中文摘要: 生成与前景主体运动相适应的视频背景,是电影行业和视觉特效社区的一个重要问题。这项任务需要合成与前景主体的运动和外观相一致的背景,同时符合艺术家的创作意图。我们提出了ActAnywhere,一个生成模型,可以将这一传统上需要繁琐手工劳动的过程自动化。我们的模型利用大规模视频扩散模型的能力,并专门针对此任务进行定制。ActAnywhere以前景主体分割序列作为输入,以描述目标场景的图像作为条件,生成具有逼真前景-背景交互的连贯视频,同时与条件帧保持一致。我们在一个大规模人-场景交互视频数据集上训练模型。大量评估证明了模型的卓越性能,明显优于基线。此外,我们还表明ActAnywhere能泛化到多种分布外样本,包括非人类主体。请访问我们的项目网页:https://actanywhere.github.io 。

摘要: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist’s creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.


标题: Hierarchical Masked 3D Diffusion Model for Video Outpainting

作者: Fanda Fan, Chaoxu Guo, Litong Gong

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2309.02119v3

[Project:]https://fanfanda.github.io/M3DDM/.|

中文摘要: 视频外扩(video outpainting)旨在补全视频帧边缘的缺失区域。与图像外扩相比,它面临一个额外的挑战:模型应保持填充区域的时间一致性。本文提出了一种用于视频外扩的掩码三维扩散模型。我们使用掩码建模技术来训练三维扩散模型,这使我们可以利用多个引导帧来衔接多段视频片段的推理结果,从而确保时间一致性并减少相邻帧之间的抖动。同时,我们提取视频的全局帧作为提示,并通过交叉注意力引导模型获取当前视频片段之外的信息。我们还引入了一种混合的由粗到细推理流水线来缓解伪影累积问题。现有的由粗到细流水线仅采用填充(infilling)策略,由于稀疏帧的时间间隔过大而导致性能下降;我们的流水线受益于掩码建模的双向学习,因此在生成稀疏帧时可以采用填充与插值相结合的混合策略。实验表明,我们的方法在视频外扩任务中取得了最先进的结果。更多结果和代码见我们的项目页面:https://fanfanda.github.io/M3DDM/ 。

摘要: Video outpainting aims to adequately complete missing areas at the edges of video frames. Compared to image outpainting, it presents an additional challenge as the model should maintain the temporal consistency of the filled area. In this paper, we introduce a masked 3D diffusion model for video outpainting. We use the technique of mask modeling to train the 3D diffusion model. This allows us to use multiple guide frames to connect the results of multiple video clip inferences, thus ensuring temporal consistency and reducing jitter between adjacent frames. Meanwhile, we extract the global frames of the video as prompts and guide the model to obtain information other than the current video clip using cross-attention. We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem. The existing coarse-to-fine pipeline only uses the infilling strategy, which brings degradation because the time interval of the sparse frames is too large. Our pipeline benefits from bidirectional learning of the mask modeling and thus can employ a hybrid strategy of infilling and interpolation when generating sparse frames. Experiments show that our method achieves state-of-the-art results in video outpainting tasks. More results and codes are provided at our https://fanfanda.github.io/M3DDM/.
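
下面用一个极简的numpy函数示意“掩码建模训练中随机选取若干引导帧保持可见、其余帧待生成”的帧级掩码构造方式;这只是对掩码建模思路的概念说明,与论文的具体掩码策略无关:

```python
import numpy as np

def sample_guide_frame_mask(num_frames: int, num_guide: int = 2, rng=None) -> np.ndarray:
    """示意:随机选取num_guide个“引导帧”保持可见,其余帧作为待生成内容。
    返回布尔数组,True表示该帧作为条件可见(仅为概念说明)。"""
    rng = rng or np.random.default_rng()
    visible = np.zeros(num_frames, dtype=bool)
    idx = rng.choice(num_frames, size=min(num_guide, num_frames), replace=False)
    visible[idx] = True
    return visible

# 用法示例:16帧片段中随机保留2个引导帧
print(sample_guide_frame_mask(16, num_guide=2))
```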


标题: Diffusion-based Data Augmentation for Nuclei Image Segmentation

作者: Xinyi Yu, Guanbin Li, Wei Lou

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2310.14197v2

[GitHub:]https://github.com/lhaof/Nudiff|

中文摘要: 细胞核分割是组织病理学图像定量分析中一项基础但具有挑战性的任务。虽然基于全监督深度学习的方法已取得重大进展,但要获得出色的分割性能仍需要大量标注图像。考虑到为数据集中所有细胞核实例进行手动标注效率低下,获得大规模人工标注数据集既耗时又费力。因此,仅用少量标注图像来扩充数据集以提高分割性能,具有重要的研究和应用价值。在本文中,我们介绍了第一种基于扩散的细胞核分割数据增强方法,其思路是合成大量带标注的图像以便于训练分割模型。为此,我们提出了一个两步策略。第一步,我们训练一个无条件扩散模型来合成细胞核结构,该结构被定义为像素级语义和距离变换的表示;每个合成的细胞核结构将作为组织病理学图像合成的约束,并被进一步后处理为实例图。第二步,我们训练一个条件扩散模型,基于细胞核结构合成组织病理学图像。与合成实例图配对的合成组织病理学图像被加入真实数据集,用于训练分割模型。实验结果表明,仅用10%的标注真实数据并以合成样本进行增强,即可获得与全监督基线相当的分割结果。代码发布于:https://github.com/lhaof/Nudiff

摘要: Nuclei segmentation is a fundamental but challenging task in the quantitative analysis of histopathology images. Although fully-supervised deep learning-based methods have made significant progress, a large number of labeled images are required to achieve great segmentation performance. Considering that manually labeling all nuclei instances for a dataset is inefficient, obtaining a large-scale human-annotated dataset is time-consuming and labor-intensive. Therefore, augmenting a dataset with only a few labeled images to improve the segmentation performance is of significant research and application value. In this paper, we introduce the first diffusion-based augmentation method for nuclei segmentation. The idea is to synthesize a large number of labeled images to facilitate training the segmentation model. To achieve this, we propose a two-step strategy. In the first step, we train an unconditional diffusion model to synthesize the Nuclei Structure that is defined as the representation of pixel-level semantic and distance transform. Each synthetic nuclei structure will serve as a constraint on histopathology image synthesis and is further post-processed to be an instance map. In the second step, we train a conditioned diffusion model to synthesize histopathology images based on nuclei structures. The synthetic histopathology images paired with synthetic instance maps will be added to the real dataset for training the segmentation model. The experimental results show that by augmenting 10% labeled real dataset with synthetic samples, one can achieve comparable segmentation results with the fully-supervised baseline. The code is released in: https://github.com/lhaof/Nudiff


标题: Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

作者: Xin Yuan, Jinoo Baek, Keyang Xu

[UpdateTime:]2024-01-18

[Downlink:]http://arxiv.org/abs/2401.10404v1

[Project:]https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing|

中文摘要: 我们提出了一种高效的基于扩散的文本到视频超分辨率(SR)调优方法,利用像素级图像扩散模型已经学到的能力来捕获视频生成所需的空间信息。为实现这一目标,我们设计了一个高效的架构,将文本到图像SR模型的权重膨胀(inflate)到我们的视频生成框架中。此外,我们还引入了一个时间适配器,以确保视频帧之间的时间一致性。我们基于膨胀后的架构研究了不同的调优方法,并报告了计算成本与超分辨率质量之间的权衡。在Shutterstock视频数据集上的定量和定性实验评估表明,我们的方法能够以良好的视觉质量和时间一致性完成文本到视频SR生成。为了评估时间一致性,我们还在 https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO 中提供了视频格式的可视化。

摘要: We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of pixel level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weightings of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset, demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format in https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
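
下面是一个示意性的PyTorch函数(非论文官方实现),展示把预训练2D卷积权重“膨胀”为3D卷积、并以中心帧初始化的常见做法,用以说明“权重膨胀”这一概念:

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_conv3d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """示意:把预训练2D卷积的权重“膨胀”为3D卷积,并将2D核置于时间维中心、其余置零。
    这是视频模型复用图像模型权重的常见做法,并非论文的具体实现;
    假定conv2d的stride/padding为元组形式且groups=1。"""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, time_dim // 2] = conv2d.weight  # 2D核放在时间维中心
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# 用法示例
conv3d = inflate_conv2d_to_conv3d(nn.Conv2d(64, 128, kernel_size=3, padding=1))
```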


标题: A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

作者: Wouter Van Gansbeke, Bert De Brabandere

[UpdateTime:]2024-01-18

[Downlink:]http://arxiv.org/abs/2401.10227v1

[GitHub:]https://github.com/segments-ai/latent-diffusion-segmentation|

中文摘要: 全景分割和实例分割网络通常需要借助专门的目标检测模块、复杂的损失函数以及为处理实例掩码排列不变性而设计的后处理步骤来训练。这项工作建立在Stable Diffusion的基础上,提出了一种用于全景分割的潜在扩散方法,从而得到一个省去了上述复杂性的简单架构。我们的训练过程包括两个步骤:(1)训练一个浅层自编码器,将分割掩码投影到潜在空间;(2)训练一个扩散模型,以便在潜在空间中进行以图像为条件的采样。生成模型的使用开启了对掩码补全或修复(inpainting)的探索,这在交互式分割中具有应用价值。实验验证在全景分割和掩码修复两方面都取得了有希望的结果。虽然没有刷新SOTA,但我们模型的简单性、通用性和掩码补全能力是值得肯定的特性。

摘要: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting. While not setting a new state-of-the-art, our model’s simplicity, generality, and mask completion capability are desirable properties.


== Visual Navigation@Visual Exploration @ VSLAM ==

标题: Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10727v1

[GitHub:]https://github.com/Tool-LMM/Tool-LMM.|

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人表现,引发了许多将其用作中央控制器来构建智能体系统的探索。多项研究侧重于将LLMs与外部工具连接,以扩展应用场景。然而,目前LLMs感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解含糊不清。LLMs有望通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了Tool-LMM,一个结合开源LLMs和多模态编码器的系统,使训练得到的LLM能够感知多模态输入指令,并正确选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。该数据集的另一个重要特征是,由于相同功能和同义功能的存在,它还包含同一指令的多个潜在选择,从而为同一查询提供了更多潜在的解决方案。实验表明,我们的LMM能够为多模态指令推荐合适的工具。代码和数据可在 https://github.com/Tool-LMM/Tool-LMM 获得。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose Tool-LMM, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our LMM is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/Tool-LMM/Tool-LMM.


标题: SEINE: Structure Encoding and Interaction Network for Nuclei Instance Segmentation

作者: Ye Zhang, Linghan Cai, Ziyue Wang

[UpdateTime:]2024-01-18

[Downlink:]http://arxiv.org/abs/2401.09773v1

[GitHub:]https://github.com/zhangye-zoe/SEINE|

中文摘要: 组织病理学图像中的细胞核实例分割对于生物学分析和癌症诊断非常重要,但由于两个原因仍然具有挑战性:(1)嫌色细胞核的核内和核外区域视觉表现相似,经常导致欠分割;(2)现有方法缺乏对细胞核结构的挖掘,导致碎片化的实例预测。为了解决这些问题,本文提出了一种结构编码与交互网络,称为SEINE,它建立了细胞核的结构建模方案,并利用细胞核之间的结构相似性来提高每个分割实例的完整性。具体而言,SEINE引入了一种基于轮廓的结构编码(SE),考虑细胞核结构与语义之间的相关性,实现了对细胞核结构的合理表示。在此编码的基础上,我们提出了一种以清晰细胞核为原型的结构引导注意力(SGA),以增强对模糊细胞核的结构学习。为了加强结构学习能力,我们提出了语义特征融合(SFF),以提升语义分支和结构分支的语义一致性。此外,还应用了位置增强(PE)方法来抑制错误的细胞核边界预测。大量实验证明了我们方法的优越性,SEINE在四个数据集上取得了最先进(SOTA)的性能。代码可从 https://github.com/zhangye-zoe/SEINE 获得。

摘要: Nuclei instance segmentation in histopathological images is of great importance for biological analysis and cancer diagnosis but remains challenging for two reasons. (1) Similar visual presentation of intranuclear and extranuclear regions of chromophobe nuclei often causes under-segmentation, and (2) current methods lack the exploration of nuclei structure, resulting in fragmented instance predictions. To address these problems, this paper proposes a structure encoding and interaction network, termed SEINE, which develops the structure modeling scheme of nuclei and exploits the structure similarity between nuclei to improve the integrality of each segmented instance. Concretely, SEINE introduces a contour-based structure encoding (SE) that considers the correlation between nuclei structure and semantics, realizing a reasonable representation of the nuclei structure. Based on the encoding, we propose a structure-guided attention (SGA) that takes the clear nuclei as prototypes to enhance the structure learning for the fuzzy nuclei. To strengthen the structural learning ability, a semantic feature fusion (SFF) is presented to boost the semantic consistency of semantic and structure branches. Furthermore, a position enhancement (PE) method is applied to suppress incorrect nuclei boundary predictions. Extensive experiments demonstrate the superiority of our approaches, and SEINE achieves state-of-the-art (SOTA) performance on four datasets. The code is available at https://github.com/zhangye-zoe/SEINE.


标题: SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

作者: Yang Zhan, Zhitong Xiong, Yuan Yuan

[UpdateTime:]2024-01-18

[Downlink:]http://arxiv.org/abs/2401.09712v1

[GitHub:]https://github.com/ZhanYang-nwpu/SkyEyeGPT.|

中文摘要: 大型语言模型(LLMs)最近已扩展到视觉语言领域,获得了令人印象深刻的通用多模态能力。然而,面向遥感(RS)数据的多模态大语言模型(MLLMs)的探索仍处于起步阶段,性能并不令人满意。在这项工作中,我们介绍了SkyEyeGPT,一个专为遥感视觉语言理解设计的统一多模态大语言模型。为此,我们精心构建了一个遥感多模态指令调优数据集,包括单任务和多任务对话指令;经过人工校验,我们得到了一个包含968k样本的高质量遥感指令跟随数据集。我们的研究表明,凭借简单而有效的设计,SkyEyeGPT无需额外的编码模块即可在差异很大的任务上表现出色。具体来说,在通过对齐层将遥感视觉特征投影到语言域之后,它们与特定任务的指令一起被送入基于LLM的遥感解码器,以预测遥感开放式任务的答案。此外,我们设计了一种两阶段调优方法,以增强不同粒度下的指令跟随和多轮对话能力。在8个遥感视觉语言任务数据集上的实验证明了SkyEyeGPT在图像级和区域级任务(如图像描述和视觉定位)中的优势。特别地,在一些定性测试中,SkyEyeGPT相较于GPT-4V表现出令人鼓舞的结果。在线演示、代码和数据集将在 https://github.com/ZhanYang-nwpu/SkyEyeGPT 发布。

摘要: Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS vision-language understanding. To this end, we meticulously curate an RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT’s superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in some qualitative tests. The online demo, code, and dataset will be released in https://github.com/ZhanYang-nwpu/SkyEyeGPT.
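
下面是一个概念性的PyTorch草图(非SkyEyeGPT官方实现,维度均为假设值),示意“用对齐层把视觉特征投影到LLM词嵌入空间、再与指令嵌入拼接后送入LLM”的做法:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """示意:用一个线性对齐层把视觉编码器特征投影到LLM的词嵌入维度,
    再与指令token嵌入拼接,作为LLM的输入序列(仅为概念性草图)。"""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor):
        # vision_feats: [B, N_patch, vision_dim]; text_embeds: [B, N_text, llm_dim]
        visual_tokens = self.proj(vision_feats)
        return torch.cat([visual_tokens, text_embeds], dim=1)

# 用法示例
proj = VisionToLLMProjector()
out = proj(torch.randn(1, 196, 1024), torch.randn(1, 32, 4096))  # 形状 [1, 228, 4096]
```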


标题: 360ORB-SLAM: A Visual SLAM System for Panoramic Images with Depth Completion Network

作者: Yichen Chen, Yiqi Pan, Ruyu Liu

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10560v1

中文摘要: 视觉同步定位与建图(vSLAM)是计算机视觉和机器人学中的一项基础任务,对提升AR/VR应用以及视觉辅助与检测系统的性能和效果至关重要。然而,传统的vSLAM系统受限于相机较窄的视场,面临特征分布稀疏、缺乏稠密深度信息等挑战。为克服这些限制,本文提出了面向全景图像、结合深度补全网络的360ORB-SLAM系统。该系统从全景图像中提取特征点,利用全景三角化模块生成稀疏深度信息,并借助深度补全网络获得稠密的全景深度图。在基于Carla构建的新型全景数据集上的实验结果表明,与现有单目SLAM方法相比,所提方法实现了更高的尺度精度,并有效解决了特征关联和尺度模糊的难题。深度补全网络的集成增强了系统稳定性,减轻了动态元素对SLAM性能的影响。

摘要: To enhance the performance and effect of AR/VR applications and visual assistance and inspection systems, visual simultaneous localization and mapping (vSLAM) is a fundamental task in computer vision and robotics. However, traditional vSLAM systems are limited by the camera’s narrow field-of-view, resulting in challenges such as sparse feature distribution and lack of dense depth information. To overcome these limitations, this paper proposes a 360ORB-SLAM system for panoramic images that combines with a depth completion network. The system extracts feature points from the panoramic image, utilizes a panoramic triangulation module to generate sparse depth information, and employs a depth completion network to obtain a dense panoramic depth map. Experimental results on our novel panoramic dataset constructed based on Carla demonstrate that the proposed method achieves superior scale accuracy compared to existing monocular SLAM methods and effectively addresses the challenges of feature association and scale ambiguity. The integration of the depth completion network enhances system stability and mitigates the impact of dynamic elements on SLAM performance.
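
下面是一个示意性的numpy函数(与论文的具体实现无关),展示对全景特征点做三角化前常用的预处理:把等距柱状投影(equirectangular)像素坐标转换为单位观测射线方向:

```python
import numpy as np

def equirect_pixel_to_ray(u: float, v: float, width: int, height: int) -> np.ndarray:
    """示意:把等距柱状投影全景图中的像素坐标(u, v)转换为单位方向向量,
    仅为概念性草图,用于说明全景特征点三角化前的几何预处理。"""
    lon = (u / width) * 2.0 * np.pi - np.pi        # 经度,范围约[-pi, pi)
    lat = np.pi / 2.0 - (v / height) * np.pi       # 纬度,范围约(-pi/2, pi/2]
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.array([x, y, z])                     # 单位长度的观测射线方向

# 用法示例:图像中心像素对应的射线应指向正前方(约为 [0, 0, 1])
print(equirect_pixel_to_ray(1024, 512, 2048, 1024))
```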


标题: Cross-Modality Perturbation Synergy Attack for Person Re-identification

作者: Yunpeng Gong, Zhun Zhong, Zhiming Luo

[UpdateTime:]2024-01-19

[Downlink:]http://arxiv.org/abs/2401.10090v2

中文摘要: 近年来,大量研究聚焦于解决基于RGB图像的单模态行人重识别(ReID)系统的安全问题。然而,实际应用中更常见的跨模态场景(例如涉及红外相机拍摄图像的场景)的安全性尚未得到足够重视。跨模态ReID的主要挑战在于有效处理不同模态之间的视觉差异,例如红外图像通常是灰度的,不像可见光图像那样包含颜色信息。现有攻击方法主要关注可见光图像模态的特性,而忽略了其他模态的特征以及不同模态之间数据分布的差异,这可能削弱这些方法在跨模态图像检索中的有效性。本研究首次探索了跨模态ReID模型的安全性,并提出了一种专为跨模态ReID设计的通用扰动攻击。该攻击利用来自不同模态数据的梯度来优化扰动,从而扰乱判别器并加大模态之间的差异。我们在两个广泛使用的跨模态数据集(RegDB和SYSU)上进行了实验,不仅证明了方法的有效性,也为未来提升跨模态ReID系统的鲁棒性提供了见解。

摘要: In recent years, there has been significant research focusing on addressing security concerns in single-modal person re-identification (ReID) systems that are based on RGB images. However, the safety of cross-modality scenarios, which are more commonly encountered in practical applications involving images captured by infrared cameras, has not received adequate attention. The main challenge in cross-modality ReID lies in effectively dealing with visual differences between different modalities. For instance, infrared images are typically grayscale, unlike visible images that contain color information. Existing attack methods have primarily focused on the characteristics of the visible image modality, overlooking the features of other modalities and the variations in data distribution among different modalities. This oversight can potentially undermine the effectiveness of these methods in image retrieval across diverse modalities. This study represents the first exploration into the security of cross-modality ReID models and proposes a universal perturbation attack specifically designed for cross-modality ReID. This attack optimizes perturbations by leveraging gradients from diverse modality data, thereby disrupting the discriminator and reinforcing the differences between modalities. We conducted experiments on two widely used cross-modality datasets, namely RegDB and SYSU, which not only demonstrated the effectiveness of our method but also provided insights for future enhancements in the robustness of cross-modality ReID systems.
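
下面是一个概念性的PyTorch草图(并非论文方法的实现,损失定义仅为示意,model假定输出[B, D]特征向量且参数已冻结),用于说明“同时利用两种模态样本的梯度更新一个通用扰动”的一般做法:

```python
import torch
import torch.nn.functional as F

def universal_perturbation_step(model, rgb_batch, ir_batch, delta,
                                step_size=1e-2, eps=8 / 255):
    """示意:利用可见光与红外两种模态样本的梯度共同更新通用扰动delta,
    使两种模态的特征差异增大;目标函数仅为概念演示,与论文的具体设计无关。"""
    delta.requires_grad_(True)
    feat_rgb = model(torch.clamp(rgb_batch + delta, 0, 1))
    feat_ir = model(torch.clamp(ir_batch + delta, 0, 1))
    # 对该损失做梯度上升,即最小化两模态特征的余弦相似度(加大模态差异)
    loss = -F.cosine_similarity(feat_rgb, feat_ir, dim=1).mean()
    loss.backward()
    with torch.no_grad():
        delta += step_size * delta.grad.sign()
        delta.clamp_(-eps, eps)   # 约束扰动幅度
        delta.grad = None
    return delta.detach()
```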


标题: A Semantic Approach for Big Data Exploration in Industry 4.0

作者: Idoia Berges, Víctor Julio Ramírez-Durán, Arantza Illarramendi

[UpdateTime:]2024-01-18

[Downlink:]http://arxiv.org/abs/2401.09789v1

中文摘要: 自动化、物联网、大数据和云计算技术的增长趋势引发了第四次工业革命(工业4.0),使人们能够可视化并识别模式与洞见,从而更好地理解数据并改进制造过程。然而,数据探索任务对制造专家来说往往很困难,因为他们可能还想分析预先设计的可视化中没有出现的数据,因此不得不求助于信息技术专家。在本文中,我们提出了一个方案,并在一个真实的工业4.0场景中将其实现为基于语义的可视化查询系统,使领域专家能够以友好的方式探索和可视化数据。该系统的主要新颖之处在于:它将先经过语义标注的采集数据,与同样关联了语义描述的机器2D定制数字表示结合使用。这些描述使用本体中的术语来表达,该本体对(除其他外)用于采集工业4.0场景中机器性能指标的传感器进行了建模。此外,这种语义描述还支持:在更高的抽象层次上构造查询,根据数据的格式和性质提供定制化的结果图形可视化,以及下载经过语义丰富的数据以支持进一步的分析。

摘要: The growing trends in automation, Internet of Things, big data and cloud computing technologies have led to the fourth industrial revolution (Industry 4.0), where it is possible to visualize and identify patterns and insights, which results in a better understanding of the data and can improve the manufacturing process. However, many times, the task of data exploration results difficult for manufacturing experts because they might be interested in analyzing also data that does not appear in pre-designed visualizations and therefore they must be assisted by Information Technology experts. In this paper, we present a proposal materialized in a semantic-based visual query system developed for a real Industry 4.0 scenario that allows domain experts to explore and visualize data in a friendly way. The main novelty of the system is the combined use that it makes of captured data that are semantically annotated first, and a 2D customized digital representation of a machine that is also linked with semantic descriptions. Those descriptions are expressed using terms of an ontology, where, among others, the sensors that are used to capture indicators about the performance of a machine that belongs to a Industry 4.0 scenario have been modeled. Moreover, this semantic description allows to: formulate queries at a higher level of abstraction, provide customized graphical visualizations of the results based on the format and nature of the data, and download enriched data enabling further types of analysis.
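
下面是一个与论文系统无关的极简rdflib示例(命名空间与数据均为假设),示意“用本体术语在更高抽象层次上表达查询”的做法:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/industry40#")   # 假设的本体命名空间

g = Graph()
g.add((EX.sensor1, RDF.type, EX.TemperatureSensor))
g.add((EX.sensor1, EX.hasValue, Literal(73.5)))    # 假设的传感器观测值

query = """
PREFIX ex: <http://example.org/industry40#>
SELECT ?sensor ?value WHERE {
  ?sensor a ex:TemperatureSensor ;
          ex:hasValue ?value .
}
"""
for row in g.query(query):
    print(row.sensor, row.value)
```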


专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持,欢迎提供建议
如果你觉得对你有帮助,可以扫码关注,每日准时为你推送最新论文

