[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉导航

专属领域论文订阅

VX关注晓理紫,每日定时更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

晓理紫


== LLM ==

标题: Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models

作者: Jianhui Pang, Fanghua Ye, Longyue Wang

中文摘要: 神经机器翻译(NMT)的发展深受六项核心挑战(Koehn和Knowles,2017)的影响,这些挑战一直是该领域进步的基准。本研究重新审视了这些挑战,探讨它们在先进大型语言模型(LLM)背景下是否依然成立:领域不匹配、并行数据量、罕见词预测、长句翻译、作为词对齐的注意力模型以及次优的波束搜索。我们的实证结果表明,LLM在预训练阶段有效降低了主要语言对并行数据的依赖。此外,基于LLM的翻译系统显著改善了约80词长句的翻译,并展现出翻译多达512词文档的能力。然而,尽管取得了这些显著进步,领域不匹配和罕见词预测的挑战依然存在。虽然与NMT密切相关的词对齐和波束搜索问题可能不再适用于LLM,但我们指出了LLM在翻译任务中面临的三个新挑战:推理效率、预训练阶段低资源语言的翻译,以及与人类偏好对齐的评估。数据集和模型发布于https://github.com/pangjh3/LLM4MT。

摘要: The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings indicate that LLMs effectively lessen the reliance on parallel data for major languages in the pretraining phase. Additionally, the LLM-based translation system significantly enhances the translation of long sentences that contain approximately 80 words and shows the capability to translate documents of up to 512 words. However, despite these significant improvements, the challenges of domain mismatch and prediction of rare words persist. While the challenges of word alignment and beam search, specifically associated with NMT, may not apply to LLMs, we identify three new challenges for LLMs in translation tasks: inference efficiency, translation of low-resource languages in the pretraining phase, and human-aligned evaluation. The datasets and models are released at https://github.com/pangjh3/LLM4MT.

[Downlink:]http://arxiv.org/abs/2401.08350v1

[GitHub:]https://github.com/pangjh3/LLM4MT|


标题: RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

作者: Junjie Ye, Yilong Wu, Songyang Gao

中文摘要: 工具学习作为大型语言模型(LLM)与物理世界交互的重要手段,引起了广泛关注。目前的研究主要强调LLM在结构良好的环境中使用工具的能力,而忽略了它们在面对现实世界不可避免的噪声时的稳定性。为了弥补这一差距,我们提出了RoTBench,一个用于评估LLM在工具学习中鲁棒性的多级基准。具体而言,我们建立了五个外部环境,每个环境具有不同的噪声水平(即Clean、Slight、Medium、Heavy和Union),从而深入分析模型在三个关键阶段的韧性:工具选择、参数识别和内容填充。涉及六个广泛使用的模型的实验凸显了增强LLM在工具学习中鲁棒性的迫切必要性。例如,在人工准确率没有实质性变化的情况下,GPT-4的性能却从80.00显著下降到58.10。更令人惊讶的是,GPT家族固有的噪声纠正能力反而阻碍了其在轻度噪声下的适应性。基于这些发现,我们提出了RoTTuning,一种通过丰富训练环境多样性来增强LLM工具学习鲁棒性的策略。代码和数据见https://github.com/Junjie-Ye/RoTBench。

摘要: Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs’ capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model’s resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.

[Downlink:]http://arxiv.org/abs/2401.08326v1

[GitHub:]https://github.com/Junjie-Ye/RoTBench|
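
下面给出一个极简的Python示意(非RoTBench官方实现),用于说明"按不同等级向工具schema注入噪声"这一基准构建思路;其中的噪声等级划分、扰动强度以及perturb_name、add_noise等函数名均为本示例的假设。

```python
import random

# 按论文思路的最小示意(非官方实现):对工具的名称与参数名注入不同强度的噪声,
# 各噪声等级(Clean/Slight/Medium/Heavy/Union)对应的扰动方式均为示例假设。
NOISE_LEVELS = ["clean", "slight", "medium", "heavy", "union"]

def perturb_name(name: str, strength: float, rng: random.Random) -> str:
    """以给定概率替换字符,模拟真实环境中工具描述里的噪声。"""
    chars = list(name)
    for i in range(len(chars)):
        if rng.random() < strength:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz_")
    return "".join(chars)

def add_noise(tool: dict, level: str, seed: int = 0) -> dict:
    """对单个工具 schema(含 name 与 parameters)按噪声等级生成扰动副本。"""
    strength = {"clean": 0.0, "slight": 0.1, "medium": 0.3,
                "heavy": 0.6, "union": 0.3}[level]   # 强度取值为假设
    rng = random.Random(seed)
    return {
        "name": perturb_name(tool["name"], strength, rng),
        "parameters": {perturb_name(k, strength, rng): v
                       for k, v in tool["parameters"].items()},
    }

if __name__ == "__main__":
    tool = {"name": "get_weather", "parameters": {"city": "string", "date": "string"}}
    for level in NOISE_LEVELS:
        print(level, add_noise(tool, level))
```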


标题: CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models

作者: Zhijing Jin, Yuen Chen, Felix Leeb

中文摘要: 进行因果推理的能力被广泛认为是智能的核心特征。在这项工作中,我们研究大型语言模型(LLM)能否对因果关系进行连贯的推理。自然语言处理(NLP)中的许多现有工作集中在评估LLM的常识性因果推理上,因而无法评估模型是否能够按照一组定义明确的形式化规则进行因果推断。为此,我们提出了一个新的NLP任务——自然语言中的因果推断,其灵感来自Judea Pearl等人提出的"因果推断引擎"。我们构建了一个包含10K个样本的大型数据集CLadder:基于一组因果图和查询(关联、干预和反事实),我们通过一个oracle因果推断引擎获得符号化问题及其真值答案,然后将其翻译成自然语言。我们在该数据集上评估了多个LLM,并提出并评估了一种定制的思维链提示策略CausalCoT。我们表明,该任务对LLM极具挑战性,并通过深入分析进一步了解LLM的因果推理能力。我们的数据开源于https://huggingface.co/datasets/causalNLP/cladder,代码见https://github.com/causalNLP/cladder。

摘要: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the “causal inference engine” postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.

[Downlink:]http://arxiv.org/abs/2312.04350v2

[Project:]https://huggingface.co/datasets/causalNLP/cladder|

[GitHub:]https://github.com/causalNLP/cladder|
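
下面是一个按CausalCoT思路构造思维链提示的简单示意(并非论文的原始提示词);步骤划分与措辞均为假设,query_llm也只是一个假想的调用接口。

```python
# CausalCoT 风格的提示构造示意(非论文原始提示词):
# 先让模型抽取因果图与查询类型,再形式化求解,最后给出答案。
# query_llm 为假设的调用接口,需替换为实际使用的 LLM API。

CAUSAL_COT_TEMPLATE = """Question: {question}

Let's reason step by step:
Step 1. Extract the causal graph (list the variables and directed edges).
Step 2. Identify the query type (associational, interventional, or counterfactual).
Step 3. Formalize the query as a probabilistic expression.
Step 4. Derive the estimand, applying the rules of do-calculus if needed.
Step 5. Plug in the given quantities and compute the numerical answer.
Final answer (yes/no):"""

def build_causal_cot_prompt(question: str) -> str:
    return CAUSAL_COT_TEMPLATE.format(question=question)

if __name__ == "__main__":
    q = ("Smoking affects tar deposits, and tar deposits affect cancer. "
         "Given the observed probabilities ..., does smoking increase cancer risk?")
    print(build_causal_cot_prompt(q))
    # answer = query_llm(build_causal_cot_prompt(q))  # 假设的 LLM 调用
```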


标题: ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents

作者: Shaoguang Mao, Yuzhe Cai, Yan Xia

中文摘要: 本文介绍了Alympics(Olympics for Agents),一个利用大型语言模型(LLM)智能体开展博弈论研究的系统化仿真框架。Alympics为研究复杂的博弈论问题提供了一个通用平台,通过提供可控环境来模拟LLM智能体之间类人的策略互动,弥合了理论博弈论与实证研究之间的差距。在我们的试点案例研究"水资源分配挑战"中,我们通过一个聚焦于稀缺生存资源多轮拍卖的策略游戏来探索Alympics。这项研究展示了该框架定性和定量分析博弈决定因素、策略与结果的能力。此外,我们还对LLM智能体在策略决策场景中的表现进行了全面的人工评估和深入分析。我们的研究结果不仅拓展了对LLM智能体模拟人类策略行为能力的理解,也突显了它们在推进博弈论知识方面的潜力,从而丰富了我们对博弈论的理解,并推动利用LLM智能体进一步研究策略决策领域。代码、提示词和所有相关资源可在https://github.com/microsoft/Alympics获取。

摘要: This paper introduces Alympics (Olympics for Agents), a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research. Alympics creates a versatile platform for studying complex game theory problems, bridging the gap between theoretical game theory and empirical investigations by providing a controlled environment for simulating human-like strategic interactions with LLM agents. In our pilot case study, the “Water Allocation Challenge,” we explore Alympics through a challenging strategic game focused on the multi-round auction on scarce survival resources. This study demonstrates the framework’s ability to qualitatively and quantitatively analyze game determinants, strategies, and outcomes. Additionally, we conduct a comprehensive human assessment and an in-depth evaluation of LLM agents in strategic decision-making scenarios. Our findings not only expand the understanding of LLM agents’ proficiency in emulating human strategic behavior but also highlight their potential in advancing game theory knowledge, thereby enriching our understanding of both game theory and empowering further research into strategic decision-making domains with LLM agents. Codes, prompts, and all related resources are available at https://github.com/microsoft/Alympics.

[Downlink:]http://arxiv.org/abs/2311.03220v4

[GitHub:]https://github.com/microsoft/Alympics|


标题: ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

作者: Shibo Hao, Tianyang Liu, Zhen Wang

中文摘要: 使用外部工具增强大型语言模型(LLM)已成为解决复杂问题的一种很有前途的方法。然而,使用工具演示数据微调LLM的传统方法成本高昂,且局限于预定义的工具集合。最近的上下文学习范式缓解了这些问题,但有限的上下文长度只允许少量演示,导致对工具的理解不够充分。此外,当可供选择的工具很多时,上下文学习可能完全失效。在本文中,我们提出了一种结合两者优点的替代方法ToolkenGPT。我们的方法将每个工具(tool)表示为一个标记(toolken),并为其学习嵌入,使工具调用能够以与生成普通词元相同的方式进行。一旦某个toolken被触发,LLM就会被提示补全该工具执行所需的参数。ToolkenGPT可以通过动态扩展toolken集合灵活地接入任意数量的工具;此外,它允许利用大量演示数据来学习toolken嵌入,从而改进工具的使用。在数值推理、基于知识的问答和具身规划生成等多个领域,我们的方法有效地用工具增强了LLM,并大幅优于各种最新基线。ToolkenGPT展示了在复杂场景中从大规模工具集中调用相关工具的良好能力。

摘要: Augmenting large language models (LLMs) with external tools has emerged as a promising approach to solving complex problems. However, traditional methods, which finetune LLMs with tool demonstration data, can be both costly and restricted to a predefined set of tools. Recent in-context learning paradigm alleviates these issues, but the limited context length only allows for a few shots of demonstrations, leading to suboptimal understandings of the tools. Moreover, when there are numerous tools to choose from, in-context learning could completely fail to work. In this paper, we propose an alternative approach, ToolkenGPT, which combines the benefits of both sides. Our approach represents each tool as a token (toolken) and learns an embedding for it, enabling tool calls in the same way as generating a regular word token. Once a toolken is triggered, the LLM is prompted to complete arguments for the tool to execute. ToolkenGPT offers the flexibility to plug in an arbitrary number of tools by expanding the set of toolkens on the fly. In addition, it improves tool use by allowing extensive demonstration data for learning the toolken embeddings. In diverse domains, including numerical reasoning, knowledge-based question answering, and embodied plan generation, our approach effectively augments LLMs with tools and substantially outperforms various latest baselines. ToolkenGPT demonstrates the promising ability to use relevant tools from a large tool set in complex scenarios.

[Downlink:]http://arxiv.org/abs/2305.11554v4

[GitHub:]https://github.com/Ber666/ToolkenGPT|
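
下面用PyTorch给出toolken嵌入思路的最小示意(非官方实现):冻结词表输出头,仅学习每个工具对应的toolken嵌入,并与词表logits拼接;其中的隐藏维度、词表大小和类名ToolkenHead均为示例假设。

```python
import torch
import torch.nn as nn

# ToolkenGPT 思路的最小示意(非官方实现):冻结语言模型,只学习 toolken 嵌入,
# 将其作为输出头的额外行,与词表 logits 拼接;维度与接口均为示例假设。
class ToolkenHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_tools: int):
        super().__init__()
        # 冻结的词表输出头(这里用随机初始化代替预训练权重)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight.requires_grad_(False)
        # 唯一可训练的参数:每个工具对应一个 toolken 嵌入
        self.toolken_emb = nn.Parameter(torch.randn(num_tools, hidden_size) * 0.02)
        self.vocab_size = vocab_size

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, hidden_size];返回词表 + toolken 的联合 logits
        word_logits = self.lm_head(hidden)              # [batch, vocab]
        tool_logits = hidden @ self.toolken_emb.t()     # [batch, num_tools]
        return torch.cat([word_logits, tool_logits], dim=-1)

if __name__ == "__main__":
    head = ToolkenHead(hidden_size=768, vocab_size=32000, num_tools=4)
    logits = head(torch.randn(1, 768))
    next_id = logits.argmax(dim=-1).item()
    if next_id >= head.vocab_size:
        print(f"触发了第 {next_id - head.vocab_size} 个 toolken,切换到参数补全模式")
    else:
        print("生成普通词元", next_id)
```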


标题: QuIP: 2-Bit Quantization of Large Language Models With Guarantees

作者: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov

中文摘要: 这项工作研究大型语言模型(LLM)的训练后参数量化。我们提出了带非相干处理的量化(QuIP),这是一种新方法,其核心洞见是:量化受益于非相干(incoherent)的权重矩阵和Hessian矩阵,即权重的幅值较为均匀,且需要精确舍入的重要方向与坐标轴不对齐。QuIP包括两个步骤:(1)最小化二次代理目标的自适应舍入过程;(2)高效的前处理和后处理,通过与随机正交矩阵相乘来保证权重和Hessian的非相干性。我们为QuIP补充了针对LLM规模量化算法的首个理论分析,并证明该理论同样适用于现有方法OPTQ。实验上,我们发现非相干预处理能改进多种现有量化算法,并产生了首个仅用每权重2比特就能得到可用结果的LLM量化方法。我们的代码见https://github.com/Cornell-RelaxML/QuIP。

摘要: This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.

[Downlink:]http://arxiv.org/abs/2307.13304v2

[GitHub:]https://github.com/Cornell-RelaxML/QuIP|
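
下面用NumPy给出"非相干处理"这一核心步骤的数值示意(非官方实现):用随机正交矩阵旋转权重后量化,再旋转回原坐标系;其中用最近邻取整代替论文的自适应舍入,函数名与量化等级划分均为示例假设。

```python
import numpy as np

# QuIP 非相干处理思路的数值示意(非官方实现):
# 用随机正交矩阵旋转权重后做 2 比特量化,最后旋转回原坐标系。
# 这里用最近邻取整代替论文中的自适应舍入步骤,仅用于说明流程。

def random_orthogonal(n: int, rng: np.random.Generator) -> np.ndarray:
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_2bit(w: np.ndarray) -> np.ndarray:
    # 简化的 2 比特均匀量化:4 个等级,按张量幅值缩放
    scale = np.abs(w).max() / 1.5
    levels = np.array([-1.5, -0.5, 0.5, 1.5]) * scale
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def quip_like_quantize(W: np.ndarray, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    U = random_orthogonal(W.shape[0], rng)
    V = random_orthogonal(W.shape[1], rng)
    W_inc = U @ W @ V.T          # 非相干化:权重幅值变得更均匀
    W_q = quantize_2bit(W_inc)   # 量化(此处为占位的最近邻舍入)
    return U.T @ W_q @ V         # 还原到原坐标系

if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((64, 64))
    W_hat = quip_like_quantize(W)
    print("相对误差:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```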


== VLM ==

标题: Multi-view Distillation based on Multi-modal Fusion for Few-shot Action Recognition(CLIP-$\mathrm{M^2}$DF)

作者: Fei Guo, YiKang Wang, Han Qi

中文摘要: 近年来,少样本动作识别受到越来越多的关注,其通常采用元学习范式。在该领域,基于有限样本克服类别与离群样本的重叠分布仍是一个具有挑战性的问题。我们认为,多模态与多视图的结合可以凭借信息互补改善这一问题。因此,我们提出了一种基于多模态融合的多视图蒸馏方法。首先,为查询样本构造概率提示选择器,根据支持集提示嵌入与查询视觉嵌入之间的比较得分生成概率提示嵌入。其次,我们构建多视图:在每个视图中,将提示嵌入作为一致性信息,与视觉特征及全局或局部时间上下文相融合,以克服类别与离群样本的重叠分布。第三,我们对多视图进行距离融合,并在视图之间相互蒸馏匹配能力,使模型对分布偏差更加鲁棒。我们的代码位于:https://github.com/cofly2014/MDMF。

摘要: In recent years, few-shot action recognition has attracted increasing attention. It generally adopts the paradigm of meta-learning. In this field, overcoming the overlapping distribution of classes and outliers is still a challenging problem based on limited samples. We believe the combination of Multi-modal and Multi-view can improve this issue depending on information complementarity. Therefore, we propose a method of Multi-view Distillation based on Multi-modal Fusion. Firstly, a Probability Prompt Selector for the query is constructed to generate probability prompt embedding based on the comparison score between the prompt embeddings of the support and the visual embedding of the query. Secondly, we establish a Multi-view. In each view, we fuse the prompt embedding as consistent information with visual and the global or local temporal context to overcome the overlapping distribution of classes and outliers. Thirdly, we perform the distance fusion for the Multi-view and the mutual distillation of matching ability from one to another, enabling the model to be more robust to the distribution bias. Our code is available at: https://github.com/cofly2014/MDMF.

[Downlink:]http://arxiv.org/abs/2401.08345v1

[GitHub:]https://github.com/cofly2014/MDMF|
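
下面用PyTorch给出"概率提示选择器"思路的最小示意(非官方实现):以查询视觉嵌入与各类支持提示嵌入的相似度做softmax加权,得到概率提示嵌入;嵌入维度、温度系数和函数名均为示例假设。

```python
import torch
import torch.nn.functional as F

# 概率提示选择器(Probability Prompt Selector)思路的最小示意(非官方实现):
# 用查询视觉嵌入与各类支持集提示嵌入的相似度得分做 softmax,
# 加权求和得到"概率提示嵌入"。维度与相似度形式均为示例假设。

def probability_prompt(query_visual: torch.Tensor,
                       support_prompts: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """query_visual: [D];support_prompts: [N_class, D];返回 [D]。"""
    q = F.normalize(query_visual, dim=-1)
    p = F.normalize(support_prompts, dim=-1)
    scores = (p @ q) / temperature          # [N_class] 余弦相似度比较得分
    weights = scores.softmax(dim=0)         # 转为类别概率
    return weights @ p                      # 概率加权的提示嵌入

if __name__ == "__main__":
    emb = probability_prompt(torch.randn(512), torch.randn(5, 512))
    print(emb.shape)   # torch.Size([512])
```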


标题: Spatial-Semantic Collaborative Cropping for User Generated Content

作者: Yukun Su, Yiwen Cao, Jingliang Deng

中文摘要: 每天都有大量的用户生成内容(UGC)上传到互联网上,并通过客户端(如手机和PC)广泛地显示给世界各地的人们。这需要裁剪算法在不同的设备上以特定的纵横比产生美观的缩略图。然而,现有的图像裁剪工作主要集中在地标或景观图像上,未能对UGC中具有复杂背景的多对象之间的关系进行建模。此外,以前的方法只考虑裁剪图像的美观性,而忽略了内容的完整性,这对UGC裁剪至关重要。在本文中,我们提出了一个空间语义协作裁剪网络(S2CNet),用于任意用户生成的内容,并附带一个新的裁剪基准。具体来说,我们首先挖掘潜在物体的视觉基因。然后,所提出的自适应注意力图将这项任务重新定义为视觉节点上的信息关联过程。潜在的空间和语义关系最终通过可区分的消息传递集中到候选作物,这有助于我们的网络有效地保持美观和内容完整性。在所提出的UGCrop5K和其他公共数据集上进行的大量实验证明了我们的方法优于最先进的同类方法。我们的项目可在https://github.com/suyukun666/S2CNet.

摘要: A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people world-widely through the client side (e.g., mobile and PC). This requires the cropping algorithms to produce the aesthetic thumbnail within a specific aspect ratio on different devices. However, existing image cropping works mainly focus on landmark or landscape images, which fail to model the relations among the multi-objects with the complex background in UGC. Besides, previous methods merely consider the aesthetics of the cropped images while ignoring the content integrity, which is crucial for UGC cropping. In this paper, we propose a Spatial-Semantic Collaborative cropping network (S2CNet) for arbitrary user generated content accompanied by a new cropping benchmark. Specifically, we first mine the visual genes of the potential objects. Then, the suggested adaptive attention graph recasts this task as a procedure of information association over visual nodes. The underlying spatial and semantic relations are ultimately centralized to the crop candidate through differentiable message passing, which helps our network efficiently to preserve both the aesthetics and the content integrity. Extensive experiments on the proposed UGCrop5K and other public datasets demonstrate the superiority of our approach over state-of-the-art counterparts. Our project is available at https://github.com/suyukun666/S2CNet.

[Downlink:]http://arxiv.org/abs/2401.08086v1

[GitHub:]https://github.com/suyukun666/S2CNet|


标题: Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination

作者: Syeda Nahida Akter, Aman Madaan, Sangwu Lee

中文摘要: 视觉语言模型(VLM)在处理复杂的纯文本问题时,其潜力往往没有得到充分发挥,尤其是当这些问题可以从可视化表示中获益时。与人类解决复杂文本问题的方式相呼应——(1)先根据问题画出示意图,(2)再推断解题所需的步骤——我们提出了Self-Imagine。我们利用单个视觉语言模型(VLM)先用HTML生成问题的结构化表示,再将HTML渲染为图像,最后让同一个VLM结合问题和图像作答。我们的方法不需要任何额外的训练数据或训练。我们使用最先进的VLM在三个数学任务和九个通用推理任务上评估了该方法。该方法在所有数学任务(GSM8K:+4.62%;ASDiv:+4.49%;SVAMP:+9.30%)以及大多数通用推理任务上将VLM的性能提升了0.4%到13.20%,同时在其他任务上取得了相当的性能。代码和数据见https://github.com/snat1505027/self-imagine。

摘要: The potential of Vision-Language Models (VLMs) often remains underutilized in handling complex text-based problems, particularly when these problems could benefit from visual representation. Resonating with humans' ability to solve complex text-based problems by (1) creating a visual diagram from the problem and (2) deducing what steps they need to take to solve it, we propose Self-Imagine. We leverage a single Vision-Language Model (VLM) to generate a structured representation of the question using HTML, then render the HTML as an image, and finally use the same VLM to answer the question using both the question and the image. Our approach does not require any additional training data or training. We evaluate our approach in three mathematics tasks and nine general-purpose reasoning tasks using state-of-the-art VLMs. Our approach boosts the performance of VLMs on all math tasks (GSM8K: +4.62%; ASDiv: +4.49%; SVAMP: +9.30%) and the majority of the general-purpose reasoning tasks by 0.4% to 13.20% while achieving comparable performance in other tasks. Code and data at https://github.com/snat1505027/self-imagine.

[Downlink:]http://arxiv.org/abs/2401.08025v1

[GitHub:]https://github.com/snat1505027/self-imagine|
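
下面是Self-Imagine三步流程的伪代码式示意(非官方实现);call_vlm和render_html_to_image都是假设的占位接口,分别对应实际的VLM调用与HTML渲染工具(如imgkit或浏览器截图)。

```python
from typing import Optional

# Self-Imagine 流程示意(非官方实现):
# 同一个 VLM 先把问题转成 HTML 结构化表示,渲染成图像后再带图作答。
# call_vlm 与 render_html_to_image 均为假设接口,需替换为实际的模型调用与渲染工具。

def call_vlm(prompt: str, image_path: Optional[str] = None) -> str:
    raise NotImplementedError("替换为实际的视觉语言模型调用")

def render_html_to_image(html: str, out_path: str) -> str:
    raise NotImplementedError("替换为实际的 HTML 渲染工具")

def self_imagine(question: str) -> str:
    # 第一步:让 VLM 用 HTML 画出问题的结构化表示(表格、变量等)
    html = call_vlm(
        "Represent the following problem as a self-contained HTML page "
        "with tables/diagrams that capture all quantities:\n" + question
    )
    # 第二步:把 HTML 渲染成图像
    image_path = render_html_to_image(html, "question.png")
    # 第三步:同一个 VLM 结合原问题和渲染图作答
    return call_vlm("Answer the question using the image as a visual aid:\n"
                    + question, image_path=image_path)
```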


标题: 6-DoF Grasp Pose Evaluation and Optimization via Transfer Learning from NeRFs

作者: Gergely Sóti, Xi Huang, Christian Wurll

中文摘要: 我们使用隐式行为克隆来解决机器人对已知和未知物体的抓取问题。我们从少量演示中训练一个抓取评估模型,它为更有可能抓取成功的候选抓取输出更高的分值。该评估模型作为目标函数,我们通过最大化它来确定成功的抓取。我们方法的关键在于利用从预训练NeRF中得到的视觉与几何特征的学习隐式表示。尽管仅在具有简化物体和4-DoF自上而下抓取的仿真环境中训练,我们的评估模型和优化流程在仿真和真实环境中均能泛化到6-DoF抓取和新物体,而无需额外数据。补充材料见:https://gergely-soti.github.io/grasp

摘要: We address the problem of robotic grasping of known and unknown objects using implicit behavior cloning. We train a grasp evaluation model from a small number of demonstrations that outputs higher values for grasp candidates that are more likely to succeed in grasping. This evaluation model serves as an objective function, that we maximize to identify successful grasps. Key to our approach is the utilization of learned implicit representations of visual and geometric features derived from a pre-trained NeRF. Though trained exclusively in a simulated environment with simplified objects and 4-DoF top-down grasps, our evaluation model and optimization procedure demonstrate generalization to 6-DoF grasps and novel objects both in simulation and in real-world settings, without the need for additional data. Supplementary material is available at: https://gergely-soti.github.io/grasp

[Downlink:]http://arxiv.org/abs/2401.07935v1

[Project:]https://gergely-soti.github.io/grasp|


标题: Towards A Better Metric for Text-to-Video Generation

作者: Jay Zhangjie Wu, Guian Fang, Haoning Wu

中文摘要: 生成模型在合成高质量文本、图像和视频方面表现出了非凡的能力。在视频生成方面,当代文本到视频模型展现出令人印象深刻的功能,能够制作出视觉上令人惊艳的视频。尽管如此,评估此类视频仍然面临重大挑战。目前的研究主要采用FVD、IS和CLIP Score等自动化指标。然而,这些指标提供的分析并不完整,尤其是在视频内容的时间维度评估上,因此它们并不是真实视频质量的可靠指示。此外,虽然用户研究有可能准确反映人类感知,但其耗时费力,结果也常常受到主观偏差的影响。在本文中,我们分析了现有指标的固有局限,并提出了一个新的评估流程——文本到视频评分(T2VScore)。该指标综合两个关键标准:(1)文本-视频对齐,仔细检查视频对给定文本描述的忠实程度;(2)视频质量,利用专家混合(mixture of experts)评估视频的整体制作水平。此外,为了评估所提出的指标并促进其未来改进,我们构建了TVGE数据集,收集了对2543个文本到视频生成视频在上述两个标准上的人工评判。在TVGE数据集上的实验证明了所提出的T2VScore在为文本到视频生成提供更好度量方面的优势。

摘要: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video’s overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.

[Downlink:]http://arxiv.org/abs/2401.07781v1

[Project:]https://showlab.github.io/T2VScore/|


标题: DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models

作者: Ximing Xing, Chuang Wang, Haitao Zhou

中文摘要: 尽管主要在图像上训练,我们发现预训练的扩散模型在指导草图合成方面表现出令人印象深刻的能力。在本文中,我们提出了DiffSketcher,一种利用自然语言输入生成矢量化徒手草图的创新算法。DiffSketcher基于预训练的文本到图像扩散模型开发。它通过分数蒸馏采样(SDS)损失的扩展版本直接优化一组贝塞尔(Bézier)曲线来完成任务,这使我们能够将光栅级扩散模型作为先验,来优化参数化的矢量草图生成器。此外,我们探索了嵌入在扩散模型中的注意力图,用于有效的笔画初始化,以加快生成过程。生成的草图展示了多个抽象层次,同时保持了所画主体的可辨识性、底层结构和基本视觉细节。我们的实验表明,DiffSketcher的质量优于先前工作。DiffSketcher的代码和演示可在https://ximinng.github.io/DiffSketcher-project/获取。

摘要: Even though trained mainly on images, we discover that pretrained diffusion models show impressive power in guiding sketch synthesis. In this paper, we present DiffSketcher, an innovative algorithm that creates vectorized free-hand sketches using natural language input. DiffSketcher is developed based on a pre-trained text-to-image diffusion model. It performs the task by directly optimizing a set of Bézier curves with an extended version of the score distillation sampling (SDS) loss, which allows us to use a raster-level diffusion model as a prior for optimizing a parametric vectorized sketch generator. Furthermore, we explore attention maps embedded in the diffusion model for effective stroke initialization to speed up the generation process. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual details of the subject drawn. Our experiments show that DiffSketcher achieves greater quality than prior work. The code and demo of DiffSketcher can be found at https://ximinng.github.io/DiffSketcher-project/.

[Downlink:]http://arxiv.org/abs/2306.14685v4

[Project:]https://ximinng.github.io/DiffSketcher-project/|
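
下面用PyTorch给出"以SDS损失优化贝塞尔控制点"这一核心思想的骨架示意(非官方实现);rasterize与unet_eps均为占位函数,真实实现中分别对应可微矢量渲染器(如diffvg)和预训练文本到图像扩散模型的噪声预测网络,噪声日程也做了简化假设。

```python
import torch

# SDS(score distillation sampling)优化贝塞尔控制点的骨架示意(非官方实现)。

def rasterize(ctrl_points: torch.Tensor) -> torch.Tensor:
    # 占位:把 [num_strokes, 4, 2] 的控制点"渲染"成 [1, 3, 64, 64] 的图像,
    # 真实实现应为可微矢量渲染器
    return ctrl_points.mean() * torch.ones(1, 3, 64, 64)

def unet_eps(x_t: torch.Tensor, t: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # 占位:预训练扩散模型的噪声预测,真实实现应替换为 Stable Diffusion 等
    return torch.zeros_like(x_t)

ctrl_points = torch.randn(16, 4, 2, requires_grad=True)   # 16 条三次贝塞尔曲线
opt = torch.optim.Adam([ctrl_points], lr=0.1)
text_emb = torch.randn(1, 77, 768)                        # 假设的文本嵌入

for step in range(100):
    img = rasterize(ctrl_points)
    t = torch.randint(50, 950, (1,))
    noise = torch.randn_like(img)
    alpha_bar = 1.0 - t.float() / 1000.0                  # 简化的噪声日程
    x_t = alpha_bar.sqrt() * img + (1 - alpha_bar).sqrt() * noise
    eps_pred = unet_eps(x_t, t, text_emb)
    # SDS:把 (eps_pred - noise) 当作对渲染图像的梯度,不经过 UNet 反传
    grad = (eps_pred - noise).detach()
    loss = (grad * img).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```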


== diffusion model ==

标题: Key-point Guided Deformable Image Manipulation Using Diffusion Model

作者: Seok-Hwan Oh, Guil Jung, Myeong-Gee Kim

中文摘要: 在本文中,我们介绍了一种关键点引导的扩散概率模型(KDM),该模型通过操纵对象的关键点来获得对图像的精确控制。我们提出了一个两阶段生成模型,其中包含一个光流图作为中间输出。通过这样做,可以对图像和稀疏关键点之间的语义关系进行密集的像素理解,从而生成更真实的图像。此外,光流的集成有助于调节序列图像的帧间方差,证明了真实的序列图像生成。KDM通过各种关键点条件图像合成任务进行评估,包括面部图像生成、人体姿态合成和超声心动图视频预测,证明与最先进的模型相比,KDM能够增强图像的一致性和照片逼真度

摘要: In this paper, we introduce a Key-point-guided Diffusion probabilistic Model (KDM) that gains precise control over images by manipulating the object’s key-point. We propose a two-stage generative model incorporating an optical flow map as an intermediate output. By doing so, a dense pixel-wise understanding of the semantic relation between the image and sparse key point is configured, leading to more realistic image generation. Additionally, the integration of optical flow helps regulate the inter-frame variance of sequential images, demonstrating an authentic sequential image generation. The KDM is evaluated with diverse key-point conditioned image synthesis tasks, including facial image generation, human pose synthesis, and echocardiography video prediction, demonstrating the KDM is proving consistency enhanced and photo-realistic images compared with state-of-the-art models.

[Downlink:]http://arxiv.org/abs/2401.08178v1

[GitHub:]https://github.com/joseph9337/Key-point-Guided-Deformable-Image-Manipulation-Using-Diffusion-Mode|


标题: DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models

作者: Ximing Xing, Chuang Wang, Haitao Zhou

中文摘要: 尽管主要在图像上训练,我们发现预训练的扩散模型在指导草图合成方面表现出令人印象深刻的能力。在本文中,我们提出了DiffSketcher,一种利用自然语言输入生成矢量化徒手草图的创新算法。DiffSketcher基于预训练的文本到图像扩散模型开发。它通过分数蒸馏采样(SDS)损失的扩展版本直接优化一组贝塞尔(Bézier)曲线来完成任务,这使我们能够将光栅级扩散模型作为先验,来优化参数化的矢量草图生成器。此外,我们探索了嵌入在扩散模型中的注意力图,用于有效的笔画初始化,以加快生成过程。生成的草图展示了多个抽象层次,同时保持了所画主体的可辨识性、底层结构和基本视觉细节。我们的实验表明,DiffSketcher的质量优于先前工作。DiffSketcher的代码和演示可在https://ximinng.github.io/DiffSketcher-project/获取。

摘要: Even though trained mainly on images, we discover that pretrained diffusion models show impressive power in guiding sketch synthesis. In this paper, we present DiffSketcher, an innovative algorithm that creates vectorized free-hand sketches using natural language input. DiffSketcher is developed based on a pre-trained text-to-image diffusion model. It performs the task by directly optimizing a set of Bézier curves with an extended version of the score distillation sampling (SDS) loss, which allows us to use a raster-level diffusion model as a prior for optimizing a parametric vectorized sketch generator. Furthermore, we explore attention maps embedded in the diffusion model for effective stroke initialization to speed up the generation process. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual details of the subject drawn. Our experiments show that DiffSketcher achieves greater quality than prior work. The code and demo of DiffSketcher can be found at https://ximinng.github.io/DiffSketcher-project/.

[Downlink:]http://arxiv.org/abs/2306.14685v4

[Project:]https://ximinng.github.io/DiffSketcher-project/|


标题: InstantID: Zero-shot Identity-Preserving Generation in Seconds

作者: Qixun Wang, Xu Bai, Haofan Wang

中文摘要: 使用Textual Inversion(文本反演)、DreamBooth和LoRA等方法进行个性化图像合成已取得重大进展。然而,它们在现实世界中的适用性受到高存储需求、漫长的微调过程以及需要多张参考图像的限制。相反,现有的基于ID嵌入的方法虽然只需单次前向推理,但也面临挑战:要么需要对大量模型参数进行广泛微调,要么与社区预训练模型不兼容,要么无法保持较高的人脸保真度。为了解决这些限制,我们提出了InstantID,一种强大的基于扩散模型的解决方案。我们的即插即用模块仅凭一张人脸图像即可熟练处理各种风格的图像个性化,同时确保高保真度。为此,我们设计了一个新颖的IdentityNet,通过施加强语义和弱空间条件,将人脸图像、关键点图像与文本提示相结合来引导图像生成。InstantID展现了卓越的性能和效率,在身份保持至关重要的现实应用中极具价值。此外,我们的工作可作为可适配的插件,与SD1.5和SDXL等流行的预训练文本到图像扩散模型无缝集成。我们的代码和预训练检查点将发布于https://github.com/InstantID/InstantID。

摘要: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.

[Downlink:]http://arxiv.org/abs/2401.07519v1

[Project:]https://instantid.github.io/|

[GitHub:]https://github.com/InstantID/InstantID|


标题: IVIM-Morph: Motion-compensated quantitative Intra-voxel Incoherent Motion (IVIM) analysis for functional fetal lung maturity assessment from diffusion-weighted MRI data

作者: Noga Kertes, Yael Zaffrani-Reznikov, Onur Afacan

中文摘要: 对扩散加权磁共振成像(DWI)数据中伪扩散的定量分析显示出评估胎肺成熟度并生成有价值影像生物标志物的潜力。然而,DWI数据的临床应用受到采集过程中不可避免的胎儿运动的阻碍。我们提出了IVIM-Morph,一种自监督深度神经网络模型,用于基于体素内非相干运动(IVIM)模型对DWI数据进行运动校正的定量分析。IVIM-Morph结合了两个子网络——配准子网络和IVIM模型拟合子网络,能够同时估计IVIM模型参数和运动。为了促使图像配准在物理上合理,我们引入了一个生物物理知情的损失函数,有效平衡配准与模型拟合的质量。我们利用39名受试者的胎儿DWI数据,通过建立预测的肺部IVIM模型参数与胎龄(GA)之间的相关性,验证了IVIM-Morph的有效性。在对小管期胎肺DWI数据进行体内定量分析时,IVIM-Morph与胎龄(GA)的相关性显著提高。IVIM-Morph显示出利用DWI数据开发无创评估胎肺成熟度生物标志物的潜力。此外,它的适应性也为其他需要运动补偿的定量DWI分析临床场景中的应用打开了大门。IVIM-Morph的代码见:https://github.com/TechnionComputationalMRILab/qDWI-Morph。

摘要: Quantitative analysis of pseudo-diffusion in diffusion-weighted magnetic resonance imaging (DWI) data shows potential for assessing fetal lung maturation and generating valuable imaging biomarkers. Yet, the clinical utility of DWI data is hindered by unavoidable fetal motion during acquisition. We present IVIM-morph, a self-supervised deep neural network model for motion-corrected quantitative analysis of DWI data using the Intra-voxel Incoherent Motion (IVIM) model. IVIM-morph combines two sub-networks, a registration sub-network, and an IVIM model fitting sub-network, enabling simultaneous estimation of IVIM model parameters and motion. To promote physically plausible image registration, we introduce a biophysically informed loss function that effectively balances registration and model-fitting quality. We validated the efficacy of IVIM-morph by establishing a correlation between the predicted IVIM model parameters of the lung and gestational age (GA) using fetal DWI data of 39 subjects. IVIM-morph exhibited a notably improved correlation with gestational age (GA) when performing in-vivo quantitative analysis of fetal lung DWI data during the canalicular phase. IVIM-morph shows potential in developing valuable biomarkers for non-invasive assessment of fetal lung maturity with DWI data. Moreover, its adaptability opens the door to potential applications in other clinical contexts where motion compensation is essential for quantitative DWI analysis. The IVIM-morph code is readily available at: https://github.com/TechnionComputationalMRILab/qDWI-Morph.

[Downlink:]http://arxiv.org/abs/2401.07126v1

[GitHub:]https://github.com/TechnionComputationalMRILab/qDWI-Morph|
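
作为补充,下面给出IVIM双指数信号模型及其经典最小二乘拟合的示意(使用SciPy,而非论文中的自监督神经网络实现),用于说明该工作所拟合的物理模型;b值与参数取值均为示例假设。

```python
import numpy as np
from scipy.optimize import curve_fit

# IVIM 双指数信号模型:
# S(b) = S0 * ( f * exp(-b * D_star) + (1 - f) * exp(-b * D) )
# 其中 f 为伪扩散体积分数,D_star 为伪扩散系数,D 为组织扩散系数。

def ivim_signal(b, s0, f, d_star, d):
    return s0 * (f * np.exp(-b * d_star) + (1 - f) * np.exp(-b * d))

if __name__ == "__main__":
    b_values = np.array([0, 50, 100, 200, 400, 600, 800], dtype=float)
    true_params = (1.0, 0.15, 0.05, 0.0015)          # 示例真值
    rng = np.random.default_rng(0)
    signal = ivim_signal(b_values, *true_params) + rng.normal(0, 0.01, b_values.size)

    popt, _ = curve_fit(
        ivim_signal, b_values, signal,
        p0=(1.0, 0.1, 0.02, 0.001),
        bounds=([0.5, 0.0, 0.003, 0.0001], [1.5, 0.5, 0.3, 0.003]),
    )
    print("拟合得到的 (S0, f, D*, D):", popt)
```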


标题: Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking

作者: Wei Cao, Chang Luo, Biao Zhang

中文摘要: 我们提出了Motion2VecSets,一种用于从点云序列重建动态表面的4D扩散模型。虽然现有最先进的方法已经证明可以使用神经场表示重建非刚性物体,但传统的前馈网络在面对来自噪声、部分或稀疏点云的模糊观测时会遇到困难。为了解决这些挑战,我们引入了一个扩散模型,它通过对压缩潜在表示的迭代去噪过程,显式地学习非刚性物体的形状和运动分布。在处理模糊输入时,基于扩散的先验能够给出更合理的概率化重建。我们用潜向量集而非全局潜变量来参数化4D动力学。这种新颖的4D表示使我们能够学习局部表面形状和变形模式,从而实现更准确的非线性运动捕捉,并显著提高对未见过的运动和个体的泛化能力。为了获得时间上更连贯的目标跟踪,我们同步地对变形潜集进行去噪,并在多帧之间交换信息。为了避免计算开销,我们设计了一个交错的空间和时间注意力块,交替地沿空间和时间维度聚合变形潜变量。与最先进方法的大量比较证明了Motion2VecSets在从各种不完美观测进行4D重建方面的优势,特别是在DeformingThings4D-Animals数据集上从稀疏点云重建未见过的个体时,其交并比(IoU)相比CaDex提高了19%。更多详细信息请访问https://vveicao.github.io/projects/Motion2VecSets/。

摘要: We introduce Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations, conventional feed-forward networks encounter challenges with ambiguous observations from noisy, partial, or sparse point clouds. To address these challenges, we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based prior enables more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent vector sets instead of using a global latent. This novel 4D representation allows us to learn local surface shape and deformation patterns, leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporal-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames. To avoid the computational overhead, we design an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against the state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations, notably achieving a 19% improvement in Intersection over Union (IoU) compared to CaDex for reconstructing unseen individuals from sparse point clouds on the DeformingThings4D-Animals dataset. More detailed information can be found at https://vveicao.github.io/projects/Motion2VecSets/.

[Downlink:]http://arxiv.org/abs/2401.06614v1

[Project:]https://vveicao.github.io/projects/Motion2VecSets/|
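
下面用PyTorch给出"交错的空间和时间注意力块"这一设计思想的最小示意(非官方实现):对形状为[T, N, D]的变形潜集交替做空间维与时间维的自注意力;层结构、维度与类名均为示例假设。

```python
import torch
import torch.nn as nn

# 交错时空注意力思想的最小示意(非官方实现):
# 交替沿空间维(每帧内的 N 个潜向量)与时间维(T 帧)做自注意力,
# 以降低一次性对 T*N 个 token 做全注意力的开销。

class InterleavedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T 帧, N 个潜向量, D 维]
        # 空间注意力:把每一帧当作一个 batch,在 N 个潜向量之间做注意力
        h = self.norm1(x)
        h, _ = self.spatial_attn(h, h, h)
        x = x + h
        # 时间注意力:把每个潜向量位置当作一个 batch,在 T 帧之间做注意力
        h = self.norm2(x).transpose(0, 1)          # [N, T, D]
        h, _ = self.temporal_attn(h, h, h)
        return x + h.transpose(0, 1)

if __name__ == "__main__":
    block = InterleavedSpaceTimeBlock()
    out = block(torch.randn(8, 64, 256))           # 8 帧 × 64 个潜向量
    print(out.shape)
```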


标题: DDPM-CD: Denoising Diffusion Probabilistic Models as Feature Extractors for Change Detection

作者: Wele Gedara Chaminda Bandara, Nithin Gopalakrishnan Nair, Vishal M. Patel

中文摘要: 遥感变化检测对于了解地球表面的动态、监测环境变化、评估人类影响、预测未来趋势和支持决策至关重要。在这项工作中,我们介绍了一种新的变化检测方法,它通过预训练去噪扩散概率模型(DDPM)——一类用于图像合成的生成模型——在训练过程中利用现成的未标注遥感图像。DDPM通过马尔可夫链将训练图像逐渐转换为高斯分布,从而学习训练数据分布。在推理(即采样)过程中,它们可以从高斯噪声出发,生成更接近训练分布的多样化样本,取得最先进的图像合成效果。然而,本工作的重点不在图像合成,而是将其用作预训练特征提取器,服务于变化检测这一下游应用。具体来说,我们利用预训练DDPM产生的特征表示,并结合变化标签,微调一个轻量级变化分类器。在LEVIR-CD、WHU-CD、DSIFN-CD和CDD数据集上的实验表明,所提出的DDPM-CD方法在F1分数、IoU和总体精度方面显著优于现有最先进的变化检测方法,凸显了预训练DDPM作为下游应用特征提取器的关键作用。代码和预训练模型已发布于https://github.com/wgcban/ddpm-cd。

摘要: Remote sensing change detection is crucial for understanding the dynamics of our planet’s surface, facilitating the monitoring of environmental changes, evaluating human impact, predicting future trends, and supporting decision-making. In this work, we introduce a novel approach for change detection that can leverage off-the-shelf, unlabeled remote sensing images in the training process by pre-training a Denoising Diffusion Probabilistic Model (DDPM) - a class of generative models used in image synthesis. DDPMs learn the training data distribution by gradually converting training images into a Gaussian distribution using a Markov chain. During inference (i.e., sampling), they can generate a diverse set of samples closer to the training distribution, starting from Gaussian noise, achieving state-of-the-art image synthesis results. However, in this work, our focus is not on image synthesis but on utilizing it as a pre-trained feature extractor for the downstream application of change detection. Specifically, we fine-tune a lightweight change classifier utilizing the feature representations produced by the pre-trained DDPM alongside change labels. Experiments conducted on the LEVIR-CD, WHU-CD, DSIFN-CD, and CDD datasets demonstrate that the proposed DDPM-CD method significantly outperforms the existing state-of-the-art change detection methods in terms of F1 score, IoU, and overall accuracy, highlighting the pivotal role of pre-trained DDPM as a feature extractor for downstream applications. We have made both the code and pre-trained models available at https://github.com/wgcban/ddpm-cd

[Downlink:]http://arxiv.org/abs/2206.11892v3

[GitHub:]https://github.com/wgcban/ddpm-cd|
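
下面用PyTorch给出"以冻结DDPM为特征提取器、微调轻量变化分类器"这一思路的最小示意(非官方实现);ddpm_features是占位函数,真实实现应返回预训练去噪U-Net在选定时间步的中间激活,网络结构与通道数均为示例假设。

```python
import torch
import torch.nn as nn

# DDPM-CD 思路的最小示意(非官方实现):冻结的预训练 DDPM 只作为特征提取器,
# 对双时相遥感影像分别提取特征后送入一个轻量变化分类头。

def ddpm_features(img: torch.Tensor) -> torch.Tensor:
    # 占位:img [B, 3, H, W] -> 特征 [B, 256, H/4, W/4];
    # 真实实现应返回去噪 U-Net 在选定时间步的中间激活
    return torch.randn(img.size(0), 256, img.size(2) // 4, img.size(3) // 4)

class ChangeClassifier(nn.Module):
    """轻量变化分类头:拼接双时相特征,输出逐像素变化概率图。"""
    def __init__(self, in_ch: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(torch.cat([feat_a, feat_b], dim=1)))

if __name__ == "__main__":
    img_a, img_b = torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)
    with torch.no_grad():                     # DDPM 冻结,不参与反传
        fa, fb = ddpm_features(img_a), ddpm_features(img_b)
    clf = ChangeClassifier()
    change_map = clf(fa, fb)                  # [2, 1, 256, 256]
    print(change_map.shape)
```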


== Visual Navigation ==

标题: Multimotion Visual Odometry (MVO)

作者: Kevin M. Judd, Jonathan D. Gammell

中文摘要: 视觉运动估计是自主导航中一个研究得很好的挑战。最近的工作集中于解决高度动态环境中的多运动估计问题。这些环境不仅包括多个复杂的运动,而且往往表现出显著的遮挡。很难同时估计第三方运动和传感器自运动,因为物体的观测运动包括其真实运动和传感器运动。先前在多运动估计中的大多数工作通过依赖于基于外观的对象检测或特定于应用程序的运动约束来简化这个问题。这些方法在特定的应用程序和环境中是有效的,但不能很好地推广到完整的多运动估计问题(MEP)。本文介绍了Multimotion Visual Odometry(MVO),这是一种多运动估计管道,它估计场景中每个运动的完整SE(3)轨迹,包括传感器自身运动,而不依赖于基于外观的信息。MVO通过多运动分割和跟踪技术扩展了传统的视觉里程计(VO)管道。它使用物理建立的运动先验来推断通过临时遮挡的运动,并通过运动闭合来识别运动的再现。对牛津多运动数据集(OMD)和KITTI Vision Benchmark Suite的真实世界数据的评估表明,与类似方法相比,MVO实现了良好的估计精度,并适用于各种多运动估计挑战

摘要: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object’s observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.

[Downlink:]http://arxiv.org/abs/2110.15169v3

[Project:]https://www.youtube.com/watch?v=mNj3s1nf-6A|https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi|


标题: Learning Interactive Real-World Simulators

作者: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

中文摘要: 基于互联网数据训练的生成模型彻底改变了文本、图像和视频内容的创建方式。生成模型的下一个里程碑,或许是模拟真实体验以响应人类、机器人和其他交互式智能体所采取的行动。真实世界模拟器的应用范围从游戏和电影中的可控内容创作,到纯粹在模拟中训练可直接部署到现实世界的具身智能体。我们探索了通过生成式建模学习一个通用真实世界交互模拟器的可能性。我们首先提出一个重要观察:可用于学习真实世界模拟器的自然数据集通常在不同维度上各有丰富之处(例如,图像数据中的大量物体、机器人数据中密集采样的动作以及导航数据中多样的运动)。通过精心编排各自提供整体体验不同侧面的多样数据集,我们可以从原本静态的场景和物体出发,模拟"打开抽屉"这类高级指令和"沿x、y移动"这类低级控制的视觉结果。我们使用该模拟器训练高级视觉语言策略和低级强化学习策略,二者在纯模拟训练后都可以零样本部署到现实世界。我们还表明,视频字幕模型等其他类型的智能也能从模拟经验的训练中受益,从而开辟更广泛的应用。视频演示见https://universal-simulator.github.io。

摘要: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as ``open the drawer’’ and low-level controls such as “move by x, y” from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.

[Downlink:]http://arxiv.org/abs/2310.06114v2

[Project:]https://universal-simulator.github.io|


标题: Multi-Technique Sequential Information Consistency For Dynamic Visual Place Recognition In Changing Environments

作者: Bruno Arcanjo, Bruno Ferrarini, Michael Milford

中文摘要: 视觉位置识别(VPR)是机器人导航和定位系统的重要组成部分,它允许机器人仅使用图像数据来识别位置。VPR具有挑战性,因为每日光照差异、季节性天气变化和不同视角会使同一地点的外观发生显著变化。目前,没有任何一种VPR技术能在所有环境条件下都表现出色,每种技术都有各自的优点和缺点,因此结合多种技术可以获得更可靠的VPR性能。现有的多方法途径要么依赖通常无法获得的在线真值信息,要么依赖暴力的技术组合,在技术集合方差较大时可能反而降低性能。针对这些不足,我们提出了一种名为多序列信息一致性(MuSIC)的VPR系统,它利用序列信息在线地逐帧选择最具连贯性的技术。对于集合中的每种技术,MuSIC通过分析其top匹配候选的帧间连续性来计算各自的序列一致性,然后直接比较这些一致性,为当前查询图像选择最优技术。利用序列信息在多种VPR方法之间进行选择,在不同基准数据集上带来了整体VPR性能的提升,同时无需运行环境的额外真值信息。

摘要: Visual place recognition (VPR) is an essential component of robot navigation and localization systems that allows them to identify a place using only image data. VPR is challenging due to the significant changes in a place’s appearance driven by different daily illumination, seasonal weather variations and diverse viewpoints. Currently, no single VPR technique excels in every environmental condition, each exhibiting unique benefits and shortcomings, and therefore combining multiple techniques can achieve more reliable VPR performance. Present multi-method approaches either rely on online ground-truth information, which is often not available, or on brute-force technique combination, potentially lowering performance with high variance technique sets. Addressing these shortcomings, we propose a VPR system dubbed Multi-Sequential Information Consistency (MuSIC) which leverages sequential information to select the most cohesive technique on an online per-frame basis. For each technique in a set, MuSIC computes their respective sequential consistencies by analysing the frame-to-frame continuity of their top match candidates, which are then directly compared to select the optimal technique for the current query image. The use of sequential information to select between VPR methods results in an overall VPR performance increase across different benchmark datasets, while avoiding the need for extra ground-truth of the runtime environment.

[Downlink:]http://arxiv.org/abs/2401.08263v1
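
下面用NumPy给出MuSIC核心思想的最小示意(非官方实现):对每种VPR技术统计其top-1匹配索引的帧间连续性,并在当前帧选择最连贯的技术;窗口长度、连续性阈值与技术名称均为示例假设。

```python
import numpy as np

# MuSIC 思路的最小示意(非官方实现):对每种 VPR 技术,统计其最近若干帧
# top-1 匹配索引的帧间连续性,选取当前帧上最"连贯"的技术。

def sequential_consistency(top_matches: np.ndarray, window: int, tol: int = 2) -> float:
    """top_matches: 最近 window+1 帧的 top-1 匹配的参考图像索引。
    连续性定义为相邻帧匹配索引之差不超过 tol 的比例(tol 为示例假设)。"""
    recent = top_matches[-(window + 1):]
    diffs = np.abs(np.diff(recent))
    return float((diffs <= tol).mean())

def select_technique(history: dict, window: int = 5) -> str:
    """history: 每种技术的 top-1 匹配索引序列;返回当前帧应采用的技术名。"""
    scores = {name: sequential_consistency(np.array(seq), window)
              for name, seq in history.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    history = {
        "NetVLAD": [10, 11, 12, 13, 14, 15, 16],   # 匹配平滑递增,连续性高
        "HOG":     [10, 40, 12, 80, 14, 3, 16],    # 匹配跳变,连续性低
    }
    print("当前帧选择的技术:", select_technique(history))
```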


标题: Haptic search with the Smart Suction Cup on adversarial objects

作者: Jungpyo Lee, Sebastian D. Lee, Tae Myung Huh

中文摘要: 吸盘是工业机器人应用中的一类重要抓取器,现有文献侧重于使用基于视觉的规划器来提高此类任务中的抓取成功率。在不重新训练所学算法的情况下,基于视觉的规划器可能会因对抗性物体而失败,或对未见过的场景失去泛化能力。我们提出在视觉抓取规划失败时,用触觉探索来改进吸盘抓取。我们介绍了智能吸盘(Smart Suction Cup),一种利用内部流量测量进行触觉感知的末端执行器。我们表明,在这些流量测量的引导下,基于模型的触觉搜索方法与在料箱拣选(bin-picking)任务中仅使用视觉规划器相比,可将抓取成功率最高提升2.5倍。在对智能吸盘在几何边缘和曲面上的特性进行表征时,我们发现即使存在较大的位姿误差,流量也能准确预测理想的运动方向。智能吸盘本身不包含任何电子元件,因此易于制造,触觉探索也不会损坏传感器。这项工作推动了在对抗性场景中使用具备自主触觉搜索能力的吸盘。

摘要: Suction cups are an important gripper type in industrial robot applications, and prior literature focuses on using vision-based planners to improve grasping success in these tasks. Vision-based planners can fail due to adversarial objects or lose generalizability for unseen scenarios, without retraining learned algorithms. We propose haptic exploration to improve suction cup grasping when visual grasp planners fail. We present the Smart Suction Cup, an end-effector that utilizes internal flow measurements for tactile sensing. We show that model-based haptic search methods, guided by these flow measurements, improve grasping success by up to 2.5x as compared with using only a vision planner during a bin-picking task. In characterizing the Smart Suction Cup on both geometric edges and curves, we find that flow rate can accurately predict the ideal motion direction even with large postural errors. The Smart Suction Cup includes no electronics on the cup itself, such that the design is easy to fabricate and haptic exploration does not damage the sensor. This work motivates the use of suction cups with autonomous haptic search capabilities in especially adversarial scenarios.

[Downlink:]http://arxiv.org/abs/2309.07360v2


标题: Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning

作者: Ming Lu, Tong Chen, Dandan Ding

中文摘要: 受视网膜细胞实际上会将视觉场景分离为不同属性(例如空间细节、时间运动)以供各自神经元处理这一事实的启发,我们建议首先将输入视频分解为保留丰富空间细节、保持原生空间分辨率的空间纹理帧(STF),以及保持运动平滑性、空间分辨率较低的时间运动帧(TMF);然后使用任意流行的视频编码器将二者一起压缩;最后对解码后的STF和TMF进行合成,以与原生输入相同的分辨率实现高保真视频重建。本工作在分解中仅采用双三次重采样,在压缩中采用兼容HEVC的编解码器,并把重点放在合成部分。为了实现分辨率自适应合成,我们在TMF上设计了运动补偿网络(MCN),以高效地对齐和聚合时间运动特征,这些特征将与对应的STF一起由非局部纹理迁移网络(NL-TTN)联合处理,从而更好地增强空间细节,使压缩和分辨率重采样噪声能够以更好的率失真效率得到有效缓解。这种基于"分解、压缩、合成(DCS)"的方案与编解码器无关,目前相对于使用参考软件的HEVC锚点,平均可带来约1 dB的PSNR增益或约25%的BD-rate节省。此外,我们还进行了与最先进方法的实验比较和消融研究,进一步报告了DCS算法的效率和泛化能力,为未来的视频编码提供了一个令人鼓舞的方向。

摘要: Inspired by the facts that retinal cells actually segregate the visual scene into different attributes (e.g., spatial details, temporal motion) for respective neuronal processing, we propose to first decompose the input video into respective spatial texture frames (STF) at its native spatial resolution that preserve the rich spatial details, and the other temporal motion frames (TMF) at a lower spatial resolution that retain the motion smoothness; then compress them together using any popular video coder; and finally synthesize decoded STFs and TMFs for high-fidelity video reconstruction at the same resolution as its native input. This work simply applies the bicubic resampling in decomposition and HEVC compliant codec in compression, and puts the focus on the synthesis part. For resolution-adaptive synthesis, a motion compensation network (MCN) is devised on TMFs to efficiently align and aggregate temporal motion features that will be jointly processed with corresponding STFs using a non-local texture transfer network (NL-TTN) to better augment spatial details, by which the compression and resolution resampling noises can be effectively alleviated with better rate-distortion efficiency. Such “Decomposition, Compression, Synthesis (DCS)” based scheme is codec agnostic, currently exemplifying averaged $\approx$1 dB PSNR gain or $\approx$25% BD-rate saving, against the HEVC anchor using reference software. In addition, experimental comparisons to the state-of-the-art methods and ablation studies are conducted to further report the efficiency and generalization of DCS algorithm, promising an encouraging direction for future video coding.

[Downlink:]http://arxiv.org/abs/2012.00650v5
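
下面用OpenCV给出DCS中"分解"一步的最小示意(非官方实现):按固定间隔保留原生分辨率的空间纹理帧(STF),其余帧经双三次下采样得到时间运动帧(TMF);关键帧间隔和下采样倍率为示例假设,压缩与合成(MCN/NL-TTN)部分从略。

```python
import cv2
import numpy as np

# DCS 中"分解"一步的示意(非官方实现):保留原分辨率的 STF,
# 其余帧经双三次下采样得到低分辨率的 TMF。

def decompose(frames: list, gop: int = 8, scale: int = 2):
    stf, tmf = [], []
    for i, frame in enumerate(frames):
        if i % gop == 0:
            stf.append(frame)  # 原生分辨率,保留空间细节
        else:
            h, w = frame.shape[:2]
            small = cv2.resize(frame, (w // scale, h // scale),
                               interpolation=cv2.INTER_CUBIC)
            tmf.append(small)  # 低分辨率,保留运动平滑性
    return stf, tmf

if __name__ == "__main__":
    video = [np.random.randint(0, 256, (720, 1280, 3), np.uint8) for _ in range(16)]
    stf, tmf = decompose(video)
    print(len(stf), "个 STF,", len(tmf), "个 TMF,TMF 分辨率:", tmf[0].shape)
```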

