[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--强化学习、模仿学习、机器人、开放词汇

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有

分类:

== RL ==

标题: Diffusion Models for Reinforcement Learning: A Survey

作者: Zhengbang Zhu, Hanye Zhao, Haoran He

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2311.01223v3

GitHub: https://github.com/apexrl/Diff4RLSurvey

中文摘要: 扩散模型在样本质量和训练稳定性方面优于以前的生成模型。最近的工作显示了扩散模型在改进强化学习(RL)解决方案方面的优势。这篇综述旨在提供这一新兴领域的概述,并希望激发新的研究方向。首先,我们考察RL算法面临的几个挑战。然后,我们根据扩散模型在RL中所扮演的角色,对现有方法进行分类,并探讨这些挑战是如何被解决的。我们进一步概述了扩散模型在各类RL相关任务中的成功应用。最后,我们总结全文,并对未来的研究方向提出见解。我们正在积极维护一个GitHub仓库,收录在RL中利用扩散模型的论文及相关资源:https://github.com/apexrl/Diff4RLSurvey。

摘要: Diffusion models surpass previous generative models in sample quality and training stability. Recent works have shown the advantages of diffusion models in improving reinforcement learning (RL) solutions. This survey aims to provide an overview of this emerging field and hopes to inspire new avenues of research. First, we examine several challenges encountered by RL algorithms. Then, we present a taxonomy of existing methods based on the roles of diffusion models in RL and explore how the preceding challenges are addressed. We further outline successful applications of diffusion models in various RL-related tasks. Finally, we conclude the survey and offer insights into future research directions. We are actively maintaining a GitHub repository for papers and other related resources in utilizing diffusion models in RL: https://github.com/apexrl/Diff4RLSurvey.
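
下面给出一个极简的示意性草图(并非该综述或其中任何论文的官方实现),用于说明“扩散模型作为RL策略”这一类方法的基本思路:给定状态,通过DDPM式的反向去噪过程采样动作。其中的网络结构、噪声调度(步数T=50、beta范围)以及状态/动作维度均为演示用的假设。

```python
# 示意性草图(非官方实现):以"扩散模型作为策略"为例,
# 给定状态 s,通过反向去噪过程采样动作 a。
import torch
import torch.nn as nn

class EpsNet(nn.Module):
    """预测噪声 eps(a_t, s, t) 的小型 MLP,仅作演示。"""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, s, t):
        # t 归一化到 [0, 1] 后与 (a_t, s) 拼接
        return self.net(torch.cat([a_t, s, t], dim=-1))

@torch.no_grad()
def sample_action(eps_net, s, action_dim, T=50):
    """DDPM 式反向去噪:从高斯噪声出发,逐步还原出条件于状态 s 的动作。"""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(s.shape[0], action_dim)          # a_T ~ N(0, I)
    for i in reversed(range(T)):
        t = torch.full((s.shape[0], 1), i / T)
        eps = eps_net(a, s, t)
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        mean = (a - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(a) if i > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[i]) * noise
    return a

# 用法示例:4 维状态、2 维动作
eps_net = EpsNet(state_dim=4, action_dim=2)
actions = sample_action(eps_net, torch.randn(8, 4), action_dim=2)
print(actions.shape)  # torch.Size([8, 2])
```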


标题: Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models

作者: Anthony Sicilia, Hyunwoo Kim, Khyathi Raghavi Chandu

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03284v1

中文摘要: 有效的对话者会考虑他人不确定的目标、信念和情绪。但即使是最优秀的人类交谈者,也无法完美地预测对话的走向。语言模型能在多大程度上表示对话中固有的不确定性?我们提出了FortUne Dial,它是长期存在的“对话预测”任务的扩展:评估不再只看准确率,而是使用不确定性感知指标,从而允许模型在单个实例上弃权。我们研究了语言模型表示结果不确定性的两种潜在方式(在内部以打分形式,或直接以token形式),并提出了微调策略来改善这两种表示的校准。在八个高难度谈判语料库上的实验表明,我们提出的微调策略(一种传统的监督策略和一种离策略强化学习策略)可以校准较小的开源模型,使其与10倍于其规模的预训练模型相竞争。

摘要: Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing “conversation forecasting” task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.
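
下面是一个示意性草图(非论文官方实现),展示摘要中“不确定性感知评估 + 允许在单个实例上弃权”的基本含义:给定模型对每段对话结果的预测概率,计算Brier分数,并按置信度阈值决定是否作答。函数名与阈值0.7均为演示用的假设。

```python
# 示意性草图(非论文官方实现):用"不确定性感知"指标评估对话结果预测,
# 并允许模型在置信度不足时弃权(abstain)。
import numpy as np

def brier_score(probs, labels):
    """Brier 分数:预测概率与 0/1 标签的均方误差,越低越好。"""
    probs, labels = np.asarray(probs), np.asarray(labels)
    return float(np.mean((probs - labels) ** 2))

def accuracy_with_abstention(probs, labels, threshold=0.7):
    """只在 max(p, 1-p) >= threshold 时作答,其余实例弃权;
    返回 (作答部分的准确率, 覆盖率)。"""
    probs, labels = np.asarray(probs), np.asarray(labels)
    conf = np.maximum(probs, 1.0 - probs)
    answered = conf >= threshold
    if answered.sum() == 0:
        return float("nan"), 0.0
    preds = (probs[answered] >= 0.5).astype(int)
    acc = float((preds == labels[answered]).mean())
    return acc, float(answered.mean())

# 用法示例:假设模型对 5 段谈判对话预测"达成协议"的概率
probs = [0.92, 0.55, 0.18, 0.60, 0.81]
labels = [1, 0, 0, 1, 1]
print(brier_score(probs, labels))
print(accuracy_with_abstention(probs, labels, threshold=0.7))
```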


标题: A Framework for Partially Observed Reward-States in RLHF

作者: Chinmaya Kausik, Mirco Mutti, Aldo Pacchiano

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03282v1

中文摘要: 近年来,基于人类反馈的强化学习(RLHF)因其在LLMs开发中的作用而受到广泛关注。神经科学研究表明,人类对刺激的反应取决于部分可观测的“内部状态”。不幸的是,目前的RLHF模型没有考虑这一点。此外,大多数RLHF模型没有考虑中间反馈,而中间反馈在实证工作中越来越重要,有助于改善样本复杂度和对齐效果。为了解决这些局限,我们将RLHF建模为具有部分可观测奖励状态的强化学习(PORRL)。我们给出了从RLHF中两种主要的人类反馈形式——基数反馈与决斗(成对比较)反馈——到PORRL的归约。对于基数反馈,我们开发了通用的、统计上高效的算法,并将其实例化为POR-UCRL和POR-UCBVI。对于决斗反馈,我们证明了向基数反馈的朴素归约无法实现次线性的决斗遗憾。随后,我们给出了第一个将基数遗憾保证转换为决斗遗憾保证的显式归约。我们表明,我们的模型和保证在这两种设定下都推广并扩展了已有结果。最后,我们在模型上识别出一种递归结构,它可以改善PORRL的统计与计算可处理性,并给出了来自以往RLHF工作以及学习完美奖励机(reward machines)的例子,而这些都被PORRL所涵盖。

摘要: The study of reinforcement learning from human feedback (RLHF) has gained prominence in recent years due to its role in the development of LLMs. Neuroscience research shows that human responses to stimuli are known to depend on partially-observed “internal states.” Unfortunately, current models of RLHF do not take this into consideration. Moreover, most RLHF models do not account for intermediate feedback, which is gaining importance in empirical work and can help improve both sample complexity and alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We show reductions from the two dominant forms of human feedback in RLHF - cardinal and dueling feedback - to PORRL. For cardinal feedback, we develop generic statistically efficient algorithms and instantiate them to present POR-UCRL and POR-UCBVI. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. We show that our models and guarantees in both settings generalize and extend existing ones. Finally, we identify a recursive structure on our model that could improve the statistical and computational tractability of PORRL, giving examples from past work on RLHF as well as learning perfect reward machines, which PORRL subsumes.


标题: Mixed Traffic Control and Coordination from Pixels

作者: Michael Villarreal, Bibek Poudel, Jia Pan

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2302.09167v4

中文摘要: 交通拥堵是我们社会中一个长期存在的问题。以往的交通控制方法在缓解当前拥堵水平方面收效甚微;鉴于道路上具有不同自主水平的车辆日益增多,研究人员开始探索利用机器人车辆的思路。由此产生了混合交通控制,即机器人车辆通过强化学习(RL)来调控人类驾驶的车辆。然而,大多数现有研究使用精确观测,这需要领域专业知识,并且要为每个路网的观测空间进行手工设计。此外,精确观测既使用全局信息(如环境流出量),也使用局部信息(即车辆位置和速度)。获取这些信息需要为现有道路基础设施部署大量传感器,并需要与可能不情愿配合的人类驾驶员通信。我们考虑以图像观测作为替代方案——这是一种尚未在基于RL的混合交通控制中被充分探索的模态:1)图像不需要在不同环境之间对观测空间进行彻底的重新设计;2)图像可以通过卫星影像、车载摄像系统和交通监控系统随处获得;3)只需与(摄像)设备通信即可。在这项工作中,我们展示了使用图像观测的机器人车辆,在环形、8字形、交叉口、合流和瓶颈等环境中可以取得与使用精确信息相当的性能。在某些场景下,我们的方法甚至优于使用精确观测,例如在合流环境中平均车速最多可提升8%,尽管只使用了局部交通信息而非全局交通信息。

摘要: Traffic congestion is a persistent problem in our society. Previous methods for traffic control have proven futile in alleviating current congestion levels leading researchers to explore ideas with robot vehicles given the increased emergence of vehicles with different levels of autonomy on our roads. This gives rise to mixed traffic control, where robot vehicles regulate human-driven vehicles through reinforcement learning (RL). However, most existing studies use precise observations that require domain expertise and hand engineering for each road network’s observation space. Additionally, precise observations use global information, such as environment outflow, and local information, i.e., vehicle positions and velocities. Obtaining this information requires updating existing road infrastructure with vast sensor environments and communication to potentially unwilling human drivers. We consider image observations, a modality that has not been extensively explored for mixed traffic control via RL, as the alternative: 1) images do not require a complete re-imagination of the observation space from environment to environment; 2) images are ubiquitous through satellite imagery, in-car camera systems, and traffic monitoring systems; and 3) images only require communication to equipment. In this work, we show robot vehicles using image observations can achieve competitive performance to using precise information on environments, including ring, figure eight, intersection, merge, and bottleneck. In certain scenarios, our approach even outperforms using precision observations, e.g., up to 8% increase in average vehicle velocity in the merge environment, despite only using local traffic information as opposed to global traffic information.
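
下面给出一个示意性草图(非论文官方实现),说明“以图像观测代替人工设计的精确观测”的基本做法:用一个小型CNN编码俯视图像,再由策略头输出机器人车辆的控制量。网络结构、输入分辨率84x84与输出维度均为演示用的假设。

```python
# 示意性草图(非论文官方实现):用小型 CNN 把图像观测编码为特征,
# 再接策略头输出机器人车辆的加速度,替代人工设计的精确观测向量。
import torch
import torch.nn as nn

class ImageObsPolicy(nn.Module):
    def __init__(self, in_channels=3, action_dim=1):
        super().__init__()
        # 参考 DQN 风格的卷积编码器,输入假定为 84x84 的俯视图像
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, in_channels, 84, 84)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # 输出归一化加速度 [-1, 1]
        )

    def forward(self, image):
        return self.head(self.encoder(image))

policy = ImageObsPolicy()
obs = torch.rand(4, 3, 84, 84)    # 4 张来自交通监控/俯视相机的局部图像
print(policy(obs).shape)          # torch.Size([4, 1])
```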


标题: MobilityGPT: Enhanced Human Mobility Modeling with a GPT model

作者: Ammar Haydari, Dongjie Chen, Zhengfeng Lai

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03264v1

中文摘要: 生成模型在捕捉人类移动特征和生成合成轨迹方面已显示出有希望的结果。然而,要确保生成的地理空间移动数据在语义上是真实的(包括一致的位置序列),并反映真实世界的特征(例如受地理空间边界约束),仍然具有挑战性。为了解决这些问题,我们利用生成式预训练Transformer(GPT),将人类移动建模重新表述为自回归生成任务。为了实现可控生成以缓解上述挑战,我们提出了一个地理空间感知的生成模型MobilityGPT。我们提出了一种基于重力模型的采样方法,用于训练Transformer以捕捉语义序列相似性。然后,我们通过道路连通性矩阵约束训练过程,该矩阵提供了轨迹生成中序列的连通性,从而使生成的轨迹保持在地理空间限制之内。最后,我们构建了基于轨迹反馈的强化学习(RLTF)机制,以最小化训练轨迹与合成生成轨迹之间的出行距离差异。我们在真实世界数据集上的实验表明,MobilityGPT在生成高质量移动轨迹方面优于最先进的方法,其在起讫点(OD)相似性、行程长度、出行半径、路段(link)分布和重力分布方面最接近真实数据。

摘要: Generative models have shown promising results in capturing human mobility characteristics and generating synthetic trajectories. However, it remains challenging to ensure that the generated geospatial mobility data is semantically realistic, including consistent location sequences, and reflects real-world characteristics, such as constraining on geospatial limits. To address these issues, we reformat human mobility modeling as an autoregressive generation task, leveraging Generative Pre-trained Transformer (GPT). To ensure its controllable generation to alleviate the above challenges, we propose a geospatially-aware generative model, MobilityGPT. We propose a gravity-based sampling method to train a transformer for semantic sequence similarity. Then, we constrained the training process via a road connectivity matrix that provides the connectivity of sequences in trajectory generation, thereby keeping generated trajectories in geospatial limits. Lastly, we constructed a Reinforcement Learning from Trajectory Feedback (RLTF) to minimize the travel distance between training and the synthetically generated trajectories. Our experiments on real-world datasets demonstrate that MobilityGPT outperforms state-of-the-art methods in generating high-quality mobility trajectories that are closest to real data in terms of origin-destination similarity, trip length, travel radius, link, and gravity distributions.
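
下面给出一个示意性草图(非论文官方实现),演示摘要中“用道路连通性矩阵约束轨迹生成”的思路:自回归解码每一步时,把不与当前路段相邻的候选路段的logits置为负无穷,从而保证生成轨迹满足地理空间约束。其中的玩具路网和`logits_fn`均为演示用的假设,实际应由训练好的MobilityGPT给出。

```python
# 示意性草图(非论文官方实现):自回归生成路段(link)序列时,
# 用道路连通性矩阵屏蔽不可达的下一路段,使生成轨迹始终满足地理空间约束。
import torch

def constrained_decode(logits_fn, adjacency, start_link, max_len=20):
    """logits_fn(seq) -> 对词表(所有路段)的下一步 logits;
    adjacency[i, j] = 1 表示路段 i 可直接驶入路段 j。"""
    seq = [start_link]
    for _ in range(max_len - 1):
        logits = logits_fn(seq)                          # (num_links,)
        mask = adjacency[seq[-1]] > 0                    # 仅允许相邻路段
        logits = logits.masked_fill(~mask, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        seq.append(nxt)
    return seq

# 用法示例:5 条路段的玩具路网,用随机 logits 代替真实的 GPT 模型
num_links = 5
adjacency = torch.tensor([
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
])
dummy_logits_fn = lambda seq: torch.randn(num_links)
print(constrained_decode(dummy_logits_fn, adjacency, start_link=0, max_len=8))
```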


标题: Multi-agent Reinforcement Learning for Energy Saving in Multi-Cell Massive MIMO Systems

作者: Tianzhang Cai, Qichen Wang, Shuai Zhang

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03204v1

摘要: We develop a multi-agent reinforcement learning (MARL) algorithm to minimize the total energy consumption of multiple massive MIMO (multiple-input multiple-output) base stations (BSs) in a multi-cell network while preserving the overall quality-of-service (QoS) by making decisions on the multi-level advanced sleep modes (ASMs) and antenna switching of these BSs. The problem is modeled as a decentralized partially observable Markov decision process (DEC-POMDP) to enable collaboration between individual BSs, which is necessary to tackle inter-cell interference. A multi-agent proximal policy optimization (MAPPO) algorithm is designed to learn a collaborative BS control policy. To enhance its scalability, a modified version called MAPPO-neighbor policy is further proposed. Simulation results demonstrate that the trained MAPPO agent achieves better performance compared to baseline policies. Specifically, compared to the auto sleep mode 1 (symbol-level sleeping) algorithm, the MAPPO-neighbor policy reduces power consumption by approximately 8.7% during low-traffic hours and improves energy efficiency by approximately 19% during high-traffic hours, respectively.
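
下面给出一个示意性草图(非论文官方实现),展示MAPPO所采用的“集中式训练、分散式执行”结构在这一场景中的大致形态:每个基站有自己的策略网络(执行者),训练时由一个集中式评论家利用全体基站的观测估计价值。观测维度、动作数量(休眠档位×天线配置)等均为演示用的假设,且未包含PPO的裁剪目标与训练循环。

```python
# 示意性草图(非论文官方实现):MAPPO 的"集中式评论家 + 分散式执行者"结构,
# 每个基站(BS)智能体根据本地观测选择休眠档位与天线开关的联合离散动作。
import torch
import torch.nn as nn

class BSActor(nn.Module):
    """单个基站的策略网络:本地观测 -> 离散动作分布。"""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """集中式评论家:拼接所有基站的观测,估计联合状态价值。"""
    def __init__(self, obs_dim, n_agents, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs):                  # all_obs: (batch, n_agents, obs_dim)
        return self.net(all_obs.flatten(1)).squeeze(-1)

# 用法示例:3 个基站,每个观测 8 维(流量负载、QoS 指标等,均为假设),
# 动作假设为 4 个休眠档位 x 2 种天线配置 = 8 个离散动作
n_agents, obs_dim, n_actions = 3, 8, 8
actors = [BSActor(obs_dim, n_actions) for _ in range(n_agents)]
critic = CentralCritic(obs_dim, n_agents)

obs = torch.randn(16, n_agents, obs_dim)         # batch=16
actions = [actors[i](obs[:, i]).sample() for i in range(n_agents)]
values = critic(obs)
print(actions[0].shape, values.shape)            # torch.Size([16]) torch.Size([16])
```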


== Imitation Learning ==

标题: Vision-Language Foundation Models as Effective Robot Imitators

作者: Xinghang Li, Minghuan Liu, Hanbo Zhang

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2311.01378v3

Project: https://roboflamingo.github.io

中文摘要: 视觉语言基础模型的最新进展显示了它们理解多模态数据和解决复杂视觉语言任务(包括机器人操作)的能力。我们寻求一种直接的方法来利用现有的视觉语言模型(VLM),只需在机器人数据上进行简单的微调。为此,我们推导出一个简单而新颖的视觉语言操作框架,称为RoboFlamingo,它建立在开源VLM OpenFlamingo的基础上。与之前的工作不同,RoboFlamingo利用预训练的VLM进行单步视觉语言理解,用显式的策略头对序列历史信息进行建模,并仅在语言条件操作数据集上通过模仿学习进行轻量微调。这种分解为RoboFlamingo提供了开环控制以及部署在低性能平台上的灵活性。通过在测试基准上大幅超越最先进的性能,我们表明RoboFlamingo可以成为使VLM适配机器人控制的一种有效且有竞争力的方案。我们大量的实验结果还揭示了关于不同预训练VLM在操作任务中表现的若干有趣结论。我们相信RoboFlamingo有潜力成为一种经济高效且易于使用的机器人操作解决方案,使每个人都有能力微调自己的机器人策略。

摘要: Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
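
下面给出一个示意性草图(非论文官方实现),说明“在冻结的VLM特征之上加显式策略头、仅用模仿学习微调”的结构:LSTM对逐步的视觉-语言特征建模历史,MLP头回归机械臂动作与夹爪开合。特征维度1024、动作维度7等均为演示用的假设。

```python
# 示意性草图(非论文官方实现):在冻结的 VLM 每步视觉-语言特征之上,
# 用一个显式的策略头(LSTM + MLP)建模历史并回归机器人动作,仅以模仿学习训练。
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    def __init__(self, feat_dim=1024, action_dim=7, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.arm = nn.Linear(hidden, action_dim - 1)   # 末端位姿增量(6 维)
        self.gripper = nn.Linear(hidden, 1)            # 夹爪开合(二值)

    def forward(self, vlm_feats):                      # (batch, T, feat_dim)
        h, _ = self.lstm(vlm_feats)
        return self.arm(h), self.gripper(h)

def bc_loss(pred_arm, pred_grip, target_arm, target_grip):
    """模仿学习损失:连续动作用 MSE,夹爪用二元交叉熵。"""
    return (nn.functional.mse_loss(pred_arm, target_arm)
            + nn.functional.binary_cross_entropy_with_logits(pred_grip, target_grip))

# 用法示例:假设冻结的 VLM 已把每个时间步的(图像, 指令)编码成 1024 维特征
head = PolicyHead()
feats = torch.randn(4, 10, 1024)                       # batch=4, 10 个时间步
arm, grip = head(feats)
loss = bc_loss(arm, grip, torch.randn_like(arm), torch.randint(0, 2, grip.shape).float())
loss.backward()
print(loss.item())
```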


标题: ILBiT: Imitation Learning for Robot Using Position and Torque Information based on Bilateral Control with Transformer

作者: Masato Kobayashi, Thanpimon Buamanee, Yuki Uranishi

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2401.16653v2

中文摘要: 机器人手臂的自主操作是机器人学中一个复杂且不断发展的研究领域。本文介绍了一种创新的方法来应对这一挑战,重点是模仿学习(IL)。与传统的模仿方法不同,我们的方法使用基于双边控制的IL,从而实现更精确、适应性更强的机器人运动。传统的基于双边控制的IL方法依赖于长短期记忆(LSTM)网络。在本文中,我们提出了基于Transformer的双边控制、利用位置与力矩信息的机器人模仿学习方法(ILBiT)。该方法采用Transformer模型,其以处理多样数据集时的稳健性能以及超越LSTM局限的能力而著称,尤其是在需要精细力调节的任务中。ILBiT的一个突出特点是其100 Hz的高频运行,这大大提高了系统对不同环境和不同硬度物体的适应性与响应能力。基于Transformer的ILBiT方法的有效性通过全面的真实世界实验得到了验证。

摘要: Autonomous manipulation in robot arms is a complex and evolving field of study in robotics. This paper introduces an innovative approach to this challenge by focusing on imitation learning (IL). Unlike traditional imitation methods, our approach uses IL based on bilateral control, allowing for more precise and adaptable robot movements. The conventional IL based on bilateral control method have relied on Long Short-Term Memory (LSTM) networks. In this paper, we present the IL for robot using position and torque information based on Bilateral control with Transformer (ILBiT). This proposed method employs the Transformer model, known for its robust performance in handling diverse datasets and its capability to surpass LSTM’s limitations, especially in tasks requiring detailed force adjustments. A standout feature of ILBiT is its high-frequency operation at 100 Hz, which significantly improves the system’s adaptability and response to varying environments and objects of different hardness levels. The effectiveness of the Transformer-based ILBiT method can be seen through comprehensive real-world experiments.
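
下面给出一个示意性草图(非论文官方实现),说明基于Transformer、同时使用位置与力矩信息的双边控制模仿学习的输入输出形式。关节数、观测窗口长度与网络超参数均为演示用的假设。

```python
# 示意性草图(非论文官方实现):基于 Transformer 的双边控制模仿学习,
# 输入从动臂(follower)各关节的位置与力矩序列,预测主动臂(leader)下一步的位置与力矩指令。
import torch
import torch.nn as nn

class ILBiTSketch(nn.Module):
    def __init__(self, n_joints=7, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        in_dim = out_dim = 2 * n_joints            # 每个关节的 (位置, 力矩)
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, out_dim)

    def forward(self, seq):                        # seq: (batch, T, 2*n_joints)
        h = self.encoder(self.embed(seq))
        return self.out(h[:, -1])                  # 只取最后一步,作为下一个控制周期的指令

# 用法示例:100 Hz 控制,取最近 50 步(0.5 s)的观测窗口
model = ILBiTSketch(n_joints=7)
obs_seq = torch.randn(8, 50, 14)
cmd = model(obs_seq)                               # (8, 14):7 个关节的位置 + 力矩目标
loss = nn.functional.mse_loss(cmd, torch.randn_like(cmd))
print(cmd.shape, loss.item())
```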


== Robotic Agent ==

标题: Exploring the Effects of Shared Autonomy on Cognitive Load and Trust in Human-Robot Interaction

作者: Jiahe Pan, Jonathan Eden, Denny Oetomo

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02758v1

中文摘要: 遥操作越来越被认为是在危险环境中部署机器人的可行方案。控制机器人执行复杂或高要求的任务可能会使操作员负荷过重,导致性能下降。为了设计一个能在此类高难度任务中辅助人类的机器人控制器,必须全面理解机器人自主行为与操作员内部状态之间的相互作用。在本文中,我们研究了机器人自主性与人类用户的认知负荷和信任水平之间的关系,以及在机器人辅助任务执行中是否存在三重交互效应。我们的用户研究(N=24)结果表明,虽然自主水平会影响遥操作者感知到的认知负荷和信任,但这些因素之间并不存在明显的交互作用。相反,这些因素似乎是独立起作用的,这凸显了在共享控制设置中调整机器人自主水平时,需要将认知负荷和信任视为不同但相互关联的因素加以考虑。这一洞见对于开发更有效、适应性更强的辅助机器人系统至关重要。

摘要: Teleoperation is increasingly recognized as a viable solution for deploying robots in hazardous environments. Controlling a robot to perform a complex or demanding task may overload operators resulting in poor performance. To design a robot controller to assist the human in executing such challenging tasks, a comprehensive understanding of the interplay between the robot’s autonomous behavior and the operator’s internal state is essential. In this paper, we investigate the relationships between robot autonomy and both the human user’s cognitive load and trust levels, and the potential existence of three-way interactions in the robot-assisted execution of the task. Our user study (N=24) results indicate that while autonomy level influences the teleoperator’s perceived cognitive load and trust, there is no clear interaction between these factors. Instead, these elements appear to operate independently, thus highlighting the need to consider both cognitive load and trust as distinct but interrelated factors in varying the robot autonomy level in shared-control settings. This insight is crucial for the development of more effective and adaptable assistive robotic systems.


标题: Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

作者: Mengyuan Liu, Chen Chen, Songtao Wu

PubTime: 2024-02-04

Downlink: http://arxiv.org/abs/2402.02431v1

中文摘要: 识别交互动作(包括手与手的交互和人与人的交互)在视频分析和人机交互领域的各种应用中引起了越来越多的关注。鉴于图卷积在从骨架数据建模拓扑感知特征方面的成功,最近的方法通常对各个实体分别进行图卷积,并使用后期融合来识别交互动作,这几乎无法建模成对实体之间的相互语义关系。为此,我们通过堆叠互激励图卷积(me-GC)层,提出了互激励图卷积网络(me-GCN)。具体来说,me-GC使用互拓扑激励模块,首先从单个实体中提取邻接矩阵,然后自适应地建模它们之间的相互约束。此外,me-GC扩展了上述思想,进一步使用互特征激励模块从成对实体中提取并融合深度特征。与普通图卷积相比,我们提出的me-GC在图卷积运算的每一层和每一阶段逐步学习互信息。在具有挑战性的手与手交互数据集(即Assembly101数据集)以及两个大规模人与人交互数据集(即NTU60-Interaction和NTU120-Interaction)上的大量实验一致验证了我们所提方法的优越性,其性能优于最先进的基于GCN和基于Transformer的方法。

摘要: Recognizing interactive actions, including hand-to-hand interaction and human-to-human interaction, has attracted increasing attention for various applications in the field of video analysis and human-robot interaction. Considering the success of graph convolution in modeling topology-aware features from skeleton data, recent methods commonly operate graph convolution on separate entities and use late fusion for interactive action recognition, which can barely model the mutual semantic relationships between pairwise entities. To this end, we propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution (me-GC) layers. Specifically, me-GC uses a mutual topology excitation module to firstly extract adjacency matrices from individual entities and then adaptively model the mutual constraints between them. Moreover, me-GC extends the above idea and further uses a mutual feature excitation module to extract and merge deep features from pairwise entities. Compared with graph convolution, our proposed me-GC gradually learns mutual information in each layer and each stage of graph convolution operations. Extensive experiments on a challenging hand-to-hand interaction dataset, i.e., the Assembely101 dataset, and two large-scale human-to-human interaction datasets, i.e., NTU60-Interaction and NTU120-Interaction consistently verify the superiority of our proposed method, which outperforms the state-of-the-art GCN-based and Transformer-based methods.
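
下面给出一个极简的示意性草图(非论文官方实现),演示“互拓扑激励”的核心想法:分别由两个实体的特征构造数据驱动的邻接矩阵,再让彼此的邻接结构相互调制后进行图卷积聚合。`A_prior`、嵌入维度等均为演示用的假设,互特征激励模块未包含在内。

```python
# 示意性草图(非论文官方实现):"互激励图卷积"思路的极简演示——
# 分别从两个实体(如两只手/两个人)的骨架特征得到邻接矩阵,再互相调制后做图卷积。
import torch
import torch.nn as nn

class MutualExcitationGC(nn.Module):
    def __init__(self, in_dim, out_dim, n_joints):
        super().__init__()
        self.theta = nn.Linear(in_dim, 64)          # 用于构造数据驱动邻接的嵌入
        self.phi = nn.Linear(in_dim, 64)
        self.proj = nn.Linear(in_dim, out_dim)
        # 可学习的骨架先验邻接(这里用单位阵初始化,实际应来自骨骼连接关系)
        self.A_prior = nn.Parameter(torch.eye(n_joints))

    def adjacency(self, x):                          # x: (batch, n_joints, in_dim)
        return torch.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)

    def forward(self, x_a, x_b):
        A_a, A_b = self.adjacency(x_a), self.adjacency(x_b)
        # 互激励:每个实体的聚合同时受到对方邻接结构的调制
        A_mutual_a = self.A_prior + A_a * A_b
        A_mutual_b = self.A_prior + A_b * A_a
        out_a = torch.relu(A_mutual_a @ self.proj(x_a))
        out_b = torch.relu(A_mutual_b @ self.proj(x_b))
        return out_a, out_b

# 用法示例:两个实体、各 25 个关节、每关节 3 维坐标特征
layer = MutualExcitationGC(in_dim=3, out_dim=64, n_joints=25)
xa, xb = torch.randn(8, 25, 3), torch.randn(8, 25, 3)
ya, yb = layer(xa, xb)
print(ya.shape, yb.shape)                            # torch.Size([8, 25, 64]) x2
```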


标题: Utilization of Non-verbal Behaviour and Social Gaze in Classroom Human-Robot Interaction Communications

作者: Sahand Shaghaghi, Pourya Aliasghari, Bryan Tripp

PubTime: 2024-02-04

Downlink: http://arxiv.org/abs/2312.06825v2

中文摘要: 本摘要探讨了课堂人机交互(HRI)场景,重点是在机器人认知架构中适应人类启发的社交凝视模型,以促进更无缝的社交交互。首先,我们详细介绍了我们在研究中探索的HRI场景,然后描述了我们研究中使用的社会凝视模型。我们强调了在课堂HRI场景中使用这种注意力模型的优势。我们还详细介绍了我们即将进行的研究的预期目标,涉及这个社会凝视模型。

摘要: This abstract explores classroom Human-Robot Interaction (HRI) scenarios with an emphasis on the adaptation of human-inspired social gaze models in robot cognitive architecture to facilitate a more seamless social interaction. First, we detail the HRI scenarios explored by us in our studies followed by a description of the social gaze model utilized for our research. We highlight the advantages of utilizing such an attentional model in classroom HRI scenarios. We also detail the intended goals of our upcoming study involving this social gaze model.


== Segmentation ==

标题: HASSOD: Hierarchical Adaptive Self-Supervised Object Detection

作者: Shengcao Cao, Dhiraj Joshi, Liang-Yan Gui

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03311v1

Project: https://HASSOD-NeurIPS23.github.io

中文摘要: 人类视觉感知系统在无需明确监督的情况下即可学习,并能理解物体从部分到整体的组成,表现出非凡的能力。受这两种能力的启发,我们提出了分层自适应自监督目标检测(HASSOD),这是一种在没有人类监督的情况下学习检测物体并理解其组成的新方法。HASSOD采用分层自适应聚类策略,基于自监督视觉表示将图像区域分组为对象掩模,并自适应地确定每张图像的对象数量。此外,HASSOD通过分析掩模之间的覆盖关系并构建树结构,识别对象在组成意义上的层级。这一额外的自监督学习任务带来了检测性能的提升和可解释性的增强。最后,我们摒弃了先前方法中低效的多轮自训练过程,转而采用半监督学习中的平均教师(Mean Teacher)框架,使训练过程更平滑、更高效。通过在常用图像数据集上的大量实验,我们证明了HASSOD优于现有方法,从而推进了自监督目标检测的最新水平。值得注意的是,我们将LVIS上的掩模AR从20.2提高到22.5,将SA-1B上的掩模AR从17.0提高到26.0。项目页面:https://HASSOD-NeurIPS23.github.io。

摘要: The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision. HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process. Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Project page: https://HASSOD-NeurIPS23.github.io.
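
下面给出一个示意性草图(非论文官方实现),演示“基于自监督特征的层次自适应聚类”这一步的大致做法:对图像块特征做凝聚层次聚类,用距离阈值(而非固定簇数)自适应决定每张图的对象数量。特征来源、阈值取值均为演示用的假设。

```python
# 示意性草图(非论文官方实现):对自监督视觉特征的图像块做层次聚类,
# 用距离阈值自适应地决定每张图的"对象"数量,并把聚类结果还原成掩模。
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def patches_to_masks(patch_feats, grid_h, grid_w, distance_threshold=1.0):
    """patch_feats: (grid_h*grid_w, feat_dim) 的自监督特征(如 DINO patch 特征)。
    返回每个簇一张 (grid_h, grid_w) 的布尔掩模。"""
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # 不固定簇数,由阈值自适应决定
        distance_threshold=distance_threshold,
        linkage="average",
        metric="cosine",                      # scikit-learn>=1.2 的参数名(旧版为 affinity)
    )
    labels = clustering.fit_predict(patch_feats).reshape(grid_h, grid_w)
    return [labels == k for k in np.unique(labels)]

# 用法示例:14x14 的特征网格、384 维特征(数值为随机示例)
feats = np.random.randn(14 * 14, 384).astype(np.float32)
masks = patches_to_masks(feats, 14, 14, distance_threshold=1.2)
print(len(masks), masks[0].shape)
```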


标题: Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining

作者: Jiarun Liu, Hao Yang, Hong-Yu Zhou

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03302v1

GitHub: https://github.com/JiarunLiu/Swin-UMamba

中文摘要: 精确的医学图像分割需要整合多尺度信息,从局部特征到全局依赖关系。然而,现有方法难以建模长程全局信息:卷积神经网络(CNN)受限于其局部感受野,而视觉Transformer(ViT)则受其注意力机制二次复杂度的制约。最近,基于Mamba的模型因其在长序列建模方面令人印象深刻的能力而获得了极大关注。多项研究表明,这类模型可以在各种任务中优于流行的视觉模型,提供更高的准确率、更低的内存消耗和更少的计算负担。然而,现有的基于Mamba的模型大多从头开始训练,没有发挥预训练的作用,而预训练已被证明对于数据高效的医学图像分析相当有效。本文介绍了一种新的基于Mamba的模型Swin-UMamba,它专门为医学图像分割任务设计,充分利用基于ImageNet的预训练的优势。我们的实验结果揭示了基于ImageNet的训练在增强基于Mamba的模型性能方面的重要作用。与CNN、ViT和最新的基于Mamba的模型相比,Swin-UMamba以较大优势表现出卓越性能。值得注意的是,在腹部核磁共振成像(AbdomenMRI)、内窥镜(Endoscopy)和显微镜(Microscopy)数据集上,Swin-UMamba比其最接近的对手U-Mamba平均高出3.58%。Swin-UMamba的代码和模型可公开获取:https://github.com/JiarunLiu/Swin-UMamba。

摘要: Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information, where convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from high quadratic complexity of their attention mechanism. Recently, Mamba-based models have gained great attention for their impressive ability in long sequence modeling. Several studies have demonstrated that these models can outperform popular vision models in various tasks, offering higher accuracy, lower memory consumption, and less computational burden. However, existing Mamba-based models are mostly trained from scratch and do not explore the power of pretraining, which has been proven to be quite effective for data-efficient medical image analysis. This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks, leveraging the advantages of ImageNet-based pretraining. Our experimental results reveal the vital role of ImageNet-based training in enhancing the performance of Mamba-based models. Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models. Notably, on AbdomenMRI, Endoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba by an average score of 3.58%. The code and models of Swin-UMamba are publicly available at: https://github.com/JiarunLiu/Swin-UMamba


标题: InstanceDiffusion: Instance-level Control for Image Generation

作者: Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03290v1

Project: https://people.eecs.berkeley.edu/

中文摘要: 文本到图像扩散模型能够生成高质量的图像,但不提供对图像中单个实例的控制。我们引入了InstanceDiffusion,它为文本到图像的扩散模型添加了精确的实例级控制。InstanceDiffusion支持每个实例的自由形式语言条件,并允许以灵活的方式指定实例位置,如简单的单点、涂鸦、边界框或复杂的实例分割遮罩及其组合。我们对文本到图像模型提出了三项主要改进,以实现精确的实例级控制:UniFusion模块为文本到图像模型提供实例级条件,ScaleU模块提高图像保真度,多实例采样器(Multi-instance Sampler)改善多实例的生成效果。对于每种位置条件,InstanceDiffusion都大幅超越了专门的最先进模型。值得注意的是,在COCO数据集上,我们在框输入方面比以前的最先进水平高出20.4%的AP$_{50}^{\text{box}}$,在掩码输入方面高出25.4%的IoU。

摘要: Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^{\text{box}}$ for box inputs, and 25.4% IoU for mask inputs.


标题: ActiveAnno3D -- An Active Learning Framework for Multi-Modal 3D Object Detection

作者: Ahmed Ghita, Bjørk Antoniussen, Walter Zimmer

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03235v1

Project: https://active3d-framework.github.io/active3d-framework

中文摘要: 大规模数据集的构建与整理仍然成本高昂,需要大量时间和资源。数据通常需要人工标注,创建高质量数据集依旧充满挑战。在这项工作中,我们将主动学习用于多模态三维目标检测,以填补这一研究空白。我们提出了ActiveAnno3D,这是一个主动学习框架,用于选择对训练最具信息量的数据样本进行标注。我们探索了多种持续训练方法,并综合计算需求和检测性能,集成了其中最高效的方法。此外,我们使用BEVFusion和PV-RCNN在nuScenes和TUM交通交叉口数据集上进行了广泛的实验和消融研究。我们表明,在TUM交通交叉口数据集上仅使用一半训练数据时,采用PV-RCNN和基于熵的查询策略几乎可以达到与全量数据相同的性能(77.25 mAP对83.50 mAP)。BEVFusion在使用一半训练数据时达到64.31 mAP,在使用完整的nuScenes数据集时达到75.0 mAP。我们将主动学习框架集成到proAnno标注工具中,以实现AI辅助的数据选择与标注,并最大限度地降低标注成本。最后,我们在网站上提供代码、权重和可视化结果:https://active3d-framework.github.io/active3d-framework。

摘要: The curation of large-scale datasets is still costly and requires much time and resources. Data is often manually labeled, and the challenge of creating high-quality datasets remains. In this work, we fill the research gap using active learning for multi-modal 3D object detection. We propose ActiveAnno3D, an active learning framework to select data samples for labeling that are of maximum informativeness for training. We explore various continuous training methods and integrate the most efficient method regarding computational demand and detection performance. Furthermore, we perform extensive experiments and ablation studies with BEVFusion and PV-RCNN on the nuScenes and TUM Traffic Intersection dataset. We show that we can achieve almost the same performance with PV-RCNN and the entropy-based query strategy when using only half of the training data (77.25 mAP compared to 83.50 mAP) of the TUM Traffic Intersection dataset. BEVFusion achieved an mAP of 64.31 when using half of the training data and 75.0 mAP when using the complete nuScenes dataset. We integrate our active learning framework into the proAnno labeling tool to enable AI-assisted data selection and labeling and minimize the labeling costs. Finally, we provide code, weights, and visualization results on our website: https://active3d-framework.github.io/active3d-framework.
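
下面给出一个示意性草图(非论文官方实现),演示摘要中提到的基于熵的查询策略:对未标注帧计算检测器类别概率的平均预测熵,选出最不确定的若干帧交给标注者。数据结构与预算参数均为演示用的假设。

```python
# 示意性草图(非论文官方实现):基于熵的主动学习查询策略——
# 用当前检测器的类别概率计算每帧的平均预测熵,选出信息量最大的样本送去标注。
import numpy as np

def predictive_entropy(class_probs):
    """class_probs: (num_boxes, num_classes),单帧中每个候选框的类别概率。"""
    eps = 1e-12
    per_box = -np.sum(class_probs * np.log(class_probs + eps), axis=1)
    return float(per_box.mean()) if len(per_box) else 0.0

def select_for_labeling(unlabeled_frames, budget):
    """unlabeled_frames: {frame_id: class_probs};返回熵最高的 budget 个帧 id。"""
    scores = {fid: predictive_entropy(p) for fid, p in unlabeled_frames.items()}
    return sorted(scores, key=scores.get, reverse=True)[:budget]

# 用法示例:3 帧数据,各有 5 个候选框的 4 类概率(随机示例)
rng = np.random.default_rng(0)
pool = {f"frame_{i}": rng.dirichlet(np.ones(4), size=5) for i in range(3)}
print(select_for_labeling(pool, budget=2))
```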


标题: RRWNet: Recursive Refinement Network for Effective Retinal Artery/Vein Segmentation and Classification

作者: José Morano, Guilherme Aresta, Hrvoje Bogunović

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03166v1

GitHub: https://github.com/j-morano/rrwnet

中文摘要: 视网膜血管的口径和构型是多种疾病和健康状况的重要生物标志物。对视网膜脉管系统的全面分析需要对血管进行分割,并将其分类为动脉和静脉,这通常在通过视网膜造影(一种广泛使用的成像技术)获得的彩色眼底图像上进行。然而,人工执行这些任务既费时费力,又容易出错。目前已有多种自动化方法被提出来解决这一问题。然而,动脉/静脉分割与分类的现有技术仍面临挑战:明显的分类错误会破坏分割图的拓扑一致性。本研究提出了一个创新的端到端框架RRWNet,旨在递归地细化语义分割图并纠正明显的分类错误。该框架是一个全卷积神经网络,由从输入图像生成基础分割图的基础(Base)子网络,以及对这些分割图进行迭代、递归改进的递归细化(Recursive Refinement)子网络组成。在公共数据集上的评估表明,所提方法达到了最先进的性能,生成的分割图比现有方法在拓扑上更一致、明显分类错误更少。此外,递归细化模块还能有效地对其他方法生成的分割图进行后处理,自动纠正分类错误并提高拓扑一致性。模型代码、权重和预测结果可在 https://github.com/j-morano/rrwnet 公开获取。

摘要: The caliber and configuration of retinal blood vessels serve as important biomarkers for various diseases and medical conditions. A thorough analysis of the retinal vasculature requires the segmentation of blood vessels and their classification into arteries and veins, which is typically performed on color fundus images obtained by retinography, a widely used imaging technique. Nonetheless, manually performing these tasks is labor-intensive and prone to human error. Various automated methods have been proposed to address this problem. However, the current state of art in artery/vein segmentation and classification faces challenges due to manifest classification errors that affect the topological consistency of segmentation maps. This study presents an innovative end-to-end framework, RRWNet, designed to recursively refine semantic segmentation maps and correct manifest classification errors. The framework consists of a fully convolutional neural network with a Base subnetwork that generates base segmentation maps from input images, and a Recursive Refinement subnetwork that iteratively and recursively improves these maps. Evaluation on public datasets demonstrates the state-of-the-art performance of the proposed method, yielding more topologically consistent segmentation maps with fewer manifest classification errors than existing approaches. In addition, the Recursive Refinement module proves effective in post-processing segmentation maps from other methods, automatically correcting classification errors and improving topological consistency. The model code, weights, and predictions are publicly available at https://github.com/j-morano/rrwnet.
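
下面给出一个示意性草图(非论文官方实现,也并非论文的具体网络结构),演示“基础子网络 + 共享权重的递归细化子网络”的最小结构:细化子网络以原图与上一轮分割图为输入,被反复调用若干次。通道数、迭代次数与类别设置均为演示用的假设。

```python
# 示意性草图(非论文官方实现):"基础子网络 + 递归细化子网络"的最小化结构,
# 细化子网络被同一组权重反复调用,以迭代修正动/静脉分割图中的明显分类错误。
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
    )

class RecursiveRefineSketch(nn.Module):
    def __init__(self, in_ch=3, n_classes=3, iterations=3):  # 类别假设:动脉/静脉/血管
        super().__init__()
        self.iterations = iterations
        self.base = nn.Sequential(conv_block(in_ch, 32), nn.Conv2d(32, n_classes, 1))
        # 细化子网络输入 = 原图 + 上一轮分割图
        self.refine = nn.Sequential(conv_block(in_ch + n_classes, 32),
                                    nn.Conv2d(32, n_classes, 1))

    def forward(self, image):
        maps = [torch.sigmoid(self.base(image))]
        for _ in range(self.iterations):
            logits = self.refine(torch.cat([image, maps[-1]], dim=1))
            maps.append(torch.sigmoid(logits))
        return maps                                  # 所有中间结果都可用于深度监督

model = RecursiveRefineSketch()
outs = model(torch.rand(2, 3, 64, 64))
print(len(outs), outs[-1].shape)                     # 4 torch.Size([2, 3, 64, 64])
```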


标题: Context-self contrastive pretraining for crop type semantic segmentation

作者: Michail Tarasiou, Riza Alp Guler, Stefanos Zafeiriou

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2104.04310v3

GitHub: https://github.com/michaeltrs/DeepSatModels

摘要: In this paper, we propose a fully supervised pre-training scheme based on contrastive learning particularly tailored to dense classification tasks. The proposed Context-Self Contrastive Loss (CSCL) learns an embedding space that makes semantic boundaries pop-up by use of a similarity metric between every location in a training sample and its local context. For crop type semantic segmentation from Satellite Image Time Series (SITS) we find performance at parcel boundaries to be a critical bottleneck and explain how CSCL tackles the underlying cause of that problem, improving the state-of-the-art performance in this task. Additionally, using images from the Sentinel-2 (S2) satellite missions we compile the largest, to our knowledge, SITS dataset densely annotated by crop type and parcel identities, which we make publicly available together with the data generation pipeline. Using that data we find CSCL, even with minimal pre-training, to improve all respective baselines and present a process for semantic segmentation at super-resolution for obtaining crop classes at a more granular level. The code and instructions to download the data can be found in https://github.com/michaeltrs/DeepSatModels.
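
下面给出一个简化的示意性草图(非论文官方实现),表达Context-Self对比损失的核心思想:对每个位置,与其局部邻域内的位置计算余弦相似度,标签相同的邻居作为正对、不同的作为负对,从而让语义边界在嵌入空间中“凸显”。这里用二元交叉熵形式代替论文中的具体损失形式,邻域大小、温度系数等均为演示用的假设,边界填充也未做特殊处理。

```python
# 示意性草图(非论文官方实现):Context-Self 对比损失的简化版——
# 对每个位置,与其 k x k 邻域内的位置计算余弦相似度:
# 标签相同的邻居作为正对(拉近),标签不同的邻居作为负对(推远)。
import torch
import torch.nn.functional as F

def context_self_contrastive_loss(emb, labels, k=3, temperature=0.1):
    """emb: (B, C, H, W) 像素嵌入;labels: (B, H, W) 稠密类别标注(如作物类型)。"""
    B, C, H, W = emb.shape
    emb = F.normalize(emb, dim=1)
    # 取每个位置的 k x k 邻域嵌入与标签(边界用零填充,仅作演示)
    nb_emb = F.unfold(emb, k, padding=k // 2).view(B, C, k * k, H * W)
    nb_lab = F.unfold(labels.unsqueeze(1).float(), k, padding=k // 2).view(B, k * k, H * W)
    center_emb = emb.view(B, C, 1, H * W)
    center_lab = labels.view(B, 1, H * W).float()
    sim = (center_emb * nb_emb).sum(dim=1) / temperature   # (B, k*k, H*W) 余弦相似度
    same = (nb_lab == center_lab).float()                   # 正对 = 邻居标签与中心相同
    return F.binary_cross_entropy_with_logits(sim, same)

# 用法示例:8 类作物、32 维像素嵌入(数值为随机示例)
emb = torch.randn(2, 32, 24, 24, requires_grad=True)
labels = torch.randint(0, 8, (2, 24, 24))
loss = context_self_contrastive_loss(emb, labels)
loss.backward()
print(loss.item())
```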


