[晓理紫] Daily paper digest (with abstracts and source code or project links) -- Reinforcement Learning, Imitation Learning, Robotics, Open Vocabulary

Subscribe to papers in your field of interest

Follow {晓理紫} for daily paper updates. If you find them useful, please share them with others who may need them. Thanks for your support!

If you find this helpful, please follow me for the latest papers, delivered on time every day.

Categories:

== RL ==

Title: Exploration and Anti-Exploration with Distributional Random Network Distillation

Authors: Kai Yang, Jian Tao, Jiafei Lyu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.09750v2

Abstract: Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing exploration method Random Network Distillation (RND) has been demonstrated to be effective in numerous environments, it often needs more discriminative power in bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, pinpointing its primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks.
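The bonus mechanism that DRND refines is easiest to see in the vanilla RND scheme it builds on: a frozen random target network plus a trained predictor, with the prediction error as the exploration bonus. The sketch below is a minimal pure-Python illustration of that baseline only; the tiny linear "networks", dimensions, and learning rate are illustrative assumptions, not the paper's setup.

```python
import random

random.seed(0)

def make_net(din, dout):
    """A tiny fixed random linear map standing in for a neural network."""
    w = [[random.gauss(0, 1) for _ in range(din)] for _ in range(dout)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

target = make_net(4, 8)                      # frozen random target network
predictor_w = [[0.0] * 4 for _ in range(8)]  # trainable predictor weights

def predictor(x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in predictor_w]

def rnd_bonus(state):
    """Exploration bonus = squared prediction error against the target."""
    t, p = target(state), predictor(state)
    return sum((ti - pi) ** 2 for ti, pi in zip(t, p))

def train_predictor(state, lr=0.01, steps=100):
    """Regress the predictor toward the frozen target on a visited state."""
    for _ in range(steps):
        t, p = target(state), predictor(state)
        for i in range(len(predictor_w)):
            err = p[i] - t[i]
            for j in range(len(state)):
                predictor_w[i][j] -= lr * err * state[j]

s_visited = [1.0, 0.0, 0.5, -0.5]
before = rnd_bonus(s_visited)
train_predictor(s_visited)
after = rnd_bonus(s_visited)
# the bonus on a frequently visited state shrinks as the predictor fits it
```

DRND's change, per the abstract, is to distill a distribution of such random targets rather than a single one, which sharpens how the bonus discriminates between states.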


Title: Demand response for residential building heating: Effective Monte Carlo Tree Search control based on physics-informed neural networks

Authors: Fabio Pavirani, Gargya Gokhale, Bert Claessens

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2312.03365v3

Abstract: Controlling energy consumption in buildings through demand response (DR) has become increasingly important to reduce global carbon emissions and limit climate change. In this paper, we specifically focus on controlling the heating system of a residential building to optimize its energy consumption while respecting the user's thermal comfort. Recent works in this area have mainly focused on either model-based control, e.g., model predictive control (MPC), or model-free reinforcement learning (RL) to implement practical DR algorithms. A specific RL method that has recently achieved impressive success in domains such as board games (Go, chess) is Monte Carlo Tree Search (MCTS). Yet, for building control it has remained largely unexplored. Thus, we study MCTS specifically for building demand response. Its natural structure allows a flexible optimization that implicitly integrates exogenous constraints (as opposed, for example, to conventional RL solutions), making MCTS a promising candidate for DR control problems. We demonstrate how to improve MCTS control performance by incorporating a physics-informed neural network (PiNN) model for its underlying thermal state prediction, as opposed to traditional purely data-driven black-box approaches. Our MCTS implementation paired with a PiNN model obtains a 3% higher reward than a rule-based controller, leading to a 10% cost reduction and a 35% smaller deviation from the desired temperature when applied to an artificial price profile. We further integrated a deep learning layer into the MCTS technique, using a neural network that guides the tree search through more optimal nodes. We then compared this addition with its vanilla version, showing an improvement in the required computational cost.
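For readers unfamiliar with MCTS, the loop below is a minimal UCT sketch on a toy deterministic heating MDP. The one-dimensional temperature dynamics, the 21 °C setpoint, and the reward are invented stand-ins for the paper's PiNN-based building model; only the select/expand/simulate/backpropagate structure is the point.

```python
import math
import random

random.seed(1)

def step(temp, action):
    """Toy heating dynamics: action 1 = heat on, 0 = off.
    Reward penalizes deviation from a 21 °C comfort setpoint."""
    temp = temp + (1.5 if action == 1 else -1.0)
    return temp, -abs(temp - 21.0)

class Node:
    def __init__(self, temp):
        self.temp = temp
        self.children = {}   # action -> child Node
        self.visits = 0
        self.value = 0.0     # sum of returns observed through this node

def rollout(temp, depth=5):
    """Random-policy simulation from a leaf."""
    total = 0.0
    for _ in range(depth):
        temp, r = step(temp, random.choice((0, 1)))
        total += r
    return total

def ucb(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits + 1) / child.visits)

def mcts(root, iters=500):
    for _ in range(iters):
        node, temp, path = root, root.temp, []
        # selection: descend while the node is fully expanded
        while len(node.children) == 2:
            action, child = max(node.children.items(),
                                key=lambda kv: ucb(node, kv[1]))
            temp, r = step(temp, action)
            path.append((child, r))
            node = child
        # expansion: try one untried action
        untried = [a for a in (0, 1) if a not in node.children]
        if untried:
            a = random.choice(untried)
            temp, r = step(temp, a)
            node.children[a] = Node(temp)
            path.append((node.children[a], r))
        # simulation + backpropagation
        ret = rollout(temp)
        root.visits += 1
        for child, r in reversed(path):
            ret += r
            child.visits += 1
            child.value += ret
    return max(root.children, key=lambda a: root.children[a].visits)

root = Node(18.0)   # start below the comfort band, so heating should win
best_action = mcts(root)
```

Starting below the setpoint, the most-visited root action ends up being "heat on"; the paper's contribution is to replace the toy `step` dynamics with a PiNN thermal model and to guide the search with a learned network.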


Title: The Synergy Between Optimal Transport Theory and Multi-Agent Reinforcement Learning

Authors: Ali Baheri, Mykel J. Kochenderfer

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10949v2

Abstract: This paper explores the integration of optimal transport (OT) theory with multi-agent reinforcement learning (MARL). This integration uses OT to handle distributions and transportation problems to enhance the efficiency, coordination, and adaptability of MARL. There are five key areas where OT can impact MARL: (1) policy alignment, where OT's Wasserstein metric is used to align divergent agent strategies towards unified goals; (2) distributed resource management, employing OT to optimize resource allocation among agents; (3) addressing non-stationarity, using OT to adapt to dynamic environmental shifts; (4) scalable multi-agent learning, harnessing OT for decomposing large-scale learning objectives into manageable tasks; and (5) enhancing energy efficiency, applying OT principles to develop sustainable MARL systems. This paper articulates how the synergy between OT and MARL can address scalability issues, optimize resource distribution, align agent policies in cooperative environments, and ensure adaptability in dynamically changing conditions.
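The Wasserstein metric invoked above for policy alignment has a simple closed form in one dimension: the area between the two CDFs. The sketch below computes the 1-Wasserstein distance between two hypothetical agents' action distributions over an ordered, unit-spaced support; the distributions themselves are invented for illustration.

```python
def wasserstein_1d(p, q):
    """W1 distance between two distributions on the same ordered support
    with unit spacing: the accumulated absolute CDF difference."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

# two agents' action distributions over 4 ordered actions
agent_a = [0.7, 0.2, 0.1, 0.0]
agent_b = [0.0, 0.1, 0.2, 0.7]
d = wasserstein_1d(agent_a, agent_b)
```

Here `d` equals 2.2, matching the cost of the monotone optimal transport plan; unlike KL divergence, this stays finite and meaningful even when the two policies share no support, which is what makes it attractive for aligning divergent agent strategies.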


Title: Safe and Generalized end-to-end Autonomous Driving System with Reinforcement Learning and Demonstrations

Authors: Zuojin Tang, Xiaoyu Chen, YongQiang Li

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.11792v3

Abstract: An intelligent driving system should be capable of dynamically formulating appropriate driving strategies based on the current environment and vehicle status, while ensuring the security and reliability of the system. However, existing methods based on reinforcement learning and imitation learning suffer from low safety, poor generalization, and inefficient sampling. Additionally, they cannot accurately predict future driving trajectories, and accurate prediction of future driving trajectories is a precondition for making optimal decisions. To solve these problems, in this paper we introduce a Safe and Generalized end-to-end Autonomous Driving System (SGADS) for complex and diverse scenarios. Our SGADS incorporates variational inference with normalizing flows, enabling the intelligent vehicle to accurately predict future driving trajectories. Moreover, we propose a formulation of robust safety constraints. Furthermore, we combine reinforcement learning with demonstrations to augment the agent's search process. The experimental results demonstrate that our SGADS can significantly improve safety performance, exhibit strong generalization, and enhance the training efficiency of intelligent vehicles in complex urban scenarios compared to existing methods.


Title: Obstacle-Aware Navigation of Soft Growing Robots via Deep Reinforcement Learning

Authors: Haitham El-Hussieny, Ibrahim Hameed

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.11203v2

Abstract: Soft growing robots are robots designed to move and adapt to their environment in a way similar to how plants grow and move, with potential applications in navigating tight spaces, dangerous terrain, and hard-to-reach areas. This research explores the application of a deep reinforcement Q-learning algorithm for facilitating the navigation of soft growing robots in cluttered environments. The proposed algorithm exploits the flexibility of the soft robot to adapt, and incorporates the interaction between the robot and the environment into the decision-making process. Results from simulations show that the proposed algorithm improves the soft robot's ability to navigate effectively and efficiently in confined spaces. This study presents a promising approach to addressing the challenges faced by growing robots in particular, and soft robots in general, in planning obstacle-aware paths in real-world scenarios.
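As rough intuition for the approach (not the paper's deep network), tabular Q-learning on a tiny grid with an obstacle exhibits the same obstacle-aware value propagation. The grid, rewards, and hyperparameters below are illustrative assumptions.

```python
import random

random.seed(2)

# 3x3 grid, start (0, 0), goal (2, 2), obstacle (1, 1). Tabular Q-learning
# stand-in for the paper's deep Q-network; moving into the obstacle is
# penalized and leaves the robot in place.
GOAL, OBSTACLE = (2, 2), (1, 1)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
Q = {}

def q(s, a):
    return Q.get((s, a), 0.0)

def step(s, a):
    nxt = (min(max(s[0] + a[0], 0), 2), min(max(s[1] + a[1], 0), 2))
    if nxt == OBSTACLE:
        return s, -1.0                        # collision: penalized, no move
    return nxt, (10.0 if nxt == GOAL else -0.1)

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(30):
            if random.random() < eps:         # epsilon-greedy exploration
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q(s, act))
            s2, r = step(s, a)
            # one-step Q-learning update toward the bootstrapped target
            target = r + gamma * max(q(s2, b) for b in ACTIONS)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
            if s == GOAL:
                break

train()

# greedy rollout: the learned policy should reach the goal, skirting (1, 1)
s, path = (0, 0), [(0, 0)]
for _ in range(10):
    s, _ = step(s, max(ACTIONS, key=lambda act: q(s, act)))
    path.append(s)
    if s == GOAL:
        break
```

The collision penalty is what makes the learned value function route around the obstacle, which is the tabular analogue of the robot-environment interaction the paper folds into its decision-making.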


Title: FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement Learning

Authors: Leiming Chen, Cihao Dong, Sibo Qiao

PubTime: 2024-01-21

Downlink: http://arxiv.org/abs/2307.13716v3

Abstract: Traditional federated learning uses the number of samples to calculate the weight of each client model and uses these fixed weights to fuse the global model. However, in practical scenarios, each client's device and data heterogeneity leads to differences in the quality of each client's model, so the contribution to the global model is not wholly determined by the sample size. In addition, if clients intentionally upload low-quality or malicious models, using these models for aggregation will lead to a severe decrease in global model accuracy. Traditional federated learning algorithms do not address these issues. To solve this problem, we propose FedDRL, a model fusion approach using reinforcement learning based on a two-stage approach. In the first stage, our method filters out malicious models and selects trusted client models to participate in the model fusion. In the second stage, the FedDRL algorithm adaptively adjusts the weights of the trusted client models and aggregates the optimal global model. We also define five model fusion scenarios and compare our method with two baseline algorithms in those scenarios. The experimental results show that our algorithm has higher reliability than other algorithms while maintaining accuracy.
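The two-stage idea can be sketched in a few lines: first filter suspicious client updates, then fuse the trusted ones with weights. The median-distance filter and the uniform weights below are simple placeholders for FedDRL's learned, RL-based stages; client models are toy 2-parameter vectors.

```python
def filter_malicious(models, tol=2.0):
    """Stage-1 stand-in: drop clients far from the coordinate-wise median
    (FedDRL itself learns this selection with reinforcement learning)."""
    dim = len(models[0])
    med = [sorted(m[i] for m in models)[len(models) // 2] for i in range(dim)]

    def dist(m):
        return sum((mi - di) ** 2 for mi, di in zip(m, med)) ** 0.5

    return [m for m in models if dist(m) <= tol]

def fuse(models, weights):
    """Stage-2 stand-in: weighted average of client parameter vectors
    (FedAvg-style, but with trust weights instead of sample-size weights)."""
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models)) / total
            for i in range(dim)]

clients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -10.0]]  # last: poisoned
trusted = filter_malicious(clients)
global_model = fuse(trusted, [1.0] * len(trusted))
```

The poisoned update is rejected in stage one, so the fused model lands at the honest clients' consensus `[1.0, 1.0]` instead of being dragged toward the outlier.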


== Imitation Learning ==

Title: Safe and Generalized end-to-end Autonomous Driving System with Reinforcement Learning and Demonstrations

Authors: Zuojin Tang, Xiaoyu Chen, YongQiang Li

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.11792v3

Abstract: An intelligent driving system should be capable of dynamically formulating appropriate driving strategies based on the current environment and vehicle status, while ensuring the security and reliability of the system. However, existing methods based on reinforcement learning and imitation learning suffer from low safety, poor generalization, and inefficient sampling. Additionally, they cannot accurately predict future driving trajectories, and accurate prediction of future driving trajectories is a precondition for making optimal decisions. To solve these problems, in this paper we introduce a Safe and Generalized end-to-end Autonomous Driving System (SGADS) for complex and diverse scenarios. Our SGADS incorporates variational inference with normalizing flows, enabling the intelligent vehicle to accurately predict future driving trajectories. Moreover, we propose a formulation of robust safety constraints. Furthermore, we combine reinforcement learning with demonstrations to augment the agent's search process. The experimental results demonstrate that our SGADS can significantly improve safety performance, exhibit strong generalization, and enhance the training efficiency of intelligent vehicles in complex urban scenarios compared to existing methods.


== Robotic Agent ==

Title: The Conversation is the Command: Interacting with Real-World Autonomous Robot Through Natural Language

Authors: Linus Nwankwo, Elmar Rueckert

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2401.11838v1

Project: https://osf.io/wzyf6

GitHub: https://github.com/LinusNEP/TCC_IRoNL.git

Abstract: In recent years, autonomous agents have surged in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leveraged the LLMs to decode high-level natural language instructions from humans and abstract them into precise robot actionable commands or queries. Further, we utilised the VLMs to provide a visual and semantic understanding of the robot's task environment. Our results, with 99.13% command recognition accuracy and 97.96% command execution success, show that our approach can enhance human-robot interaction in real-world applications. The video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).


Title: Augmented Reality User Interface for Command, Control, and Supervision of Large Multi-Agent Teams

Authors: Frank Regal, Chris Suarez, Fabian Parra

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2401.05665v1

Project: https://sites.google.com/view/xr-robotics-iros2023/home?authuser=0

Abstract: Multi-agent human-robot teaming allows for the potential to gather information about various environments more efficiently by exploiting and combining the strengths of humans and robots. In industries like defense, search and rescue, first response, and others alike, heterogeneous human-robot teams show promise to accelerate data collection and improve team safety by removing humans from unknown and potentially hazardous situations. This work builds upon AugRE, an Augmented Reality (AR) based scalable human-robot teaming framework that enables users to localize and communicate with 50+ autonomous agents. Through our efforts, users are able to command, control, and supervise agents in large teams, both line-of-sight and non-line-of-sight, without needing to modify the environment beforehand and without requiring users to operate typical hardware (i.e., joysticks, keyboards, laptops, tablets, etc.) in the field. The demonstrated work shows early indications that combining these AR-HMD-based user interaction modalities for command, control, and supervision will help improve human-robot team collaboration, robustness, and trust.


Title: Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction

Authors: Shaunak A. Mehta, Dylan P. Losey

PubTime: 2024-01-09

Downlink: http://arxiv.org/abs/2207.03395v2

Project: https://youtu.be/FSUJsTYvEKU

Abstract: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human's intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU
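A common way to turn "compare the human's input to nearby alternatives" into a trainable objective is a Boltzmann (softmax) choice model: the reward parameters are pushed to make the human's input more likely than the alternatives. The sketch below illustrates that general idea only, not the paper's exact ensemble loss; the linear reward, features, and step size are invented.

```python
import math

def reward(theta, phi):
    """Linear reward over trajectory features (illustrative)."""
    return sum(t * p for t, p in zip(theta, phi))

def preference_loss(theta, phi_human, phi_alts):
    """Negative log-probability that the human's input beats nearby
    alternatives under a Boltzmann choice model."""
    scores = [reward(theta, phi_human)] + [reward(theta, p) for p in phi_alts]
    z = sum(math.exp(s) for s in scores)
    return -math.log(math.exp(scores[0]) / z)

theta = [0.0, 0.0]                      # reward weights being learned
phi_human = [1.0, 0.5]                  # features of the human's input
phi_alts = [[0.2, 0.1], [-0.5, 0.3]]    # features of nearby alternatives

loss_before = preference_loss(theta, phi_human, phi_alts)

# one gradient step, with the gradient taken by finite differences
eps, lr = 1e-5, 0.5
grad = []
for i in range(len(theta)):
    bumped = theta[:]
    bumped[i] += eps
    grad.append((preference_loss(bumped, phi_human, phi_alts) - loss_before) / eps)
theta = [t - lr * g for t, g in zip(theta, grad)]

loss_after = preference_loss(theta, phi_human, phi_alts)
# the step raises the modeled probability of the human's chosen input
```

The same comparison-based loss covers all three feedback types: a demonstration or correction is treated as "preferred" over perturbed alternatives, and an explicit preference is a comparison directly.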


Title: StROL: Stabilized and Robust Online Learning from Humans

Authors: Shaunak A. Mehta, Forrest Meng, Andrea Bajcsy

PubTime: 2024-01-04

Downlink: http://arxiv.org/abs/2308.09863v2

GitHub: https://github.com/VT-Collab/StROL_RAL

Abstract: Robots often need to learn the human's reward function online, during the current interaction. This real-time learning requires fast but approximate learning rules: when the human's behavior is noisy or suboptimal, current approximations can result in unstable robot learning. Accordingly, in this paper we seek to enhance the robustness and convergence properties of gradient descent learning rules when inferring the human's reward parameters. We model the robot's learning algorithm as a dynamical system over the human preference parameters, where the human's true (but unknown) preferences are the equilibrium point. This enables us to perform Lyapunov stability analysis to derive the conditions under which the robot's learning dynamics converge. Our proposed algorithm (StROL) uses these conditions to learn robust-by-design learning rules: given the original learning dynamics, StROL outputs a modified learning rule that now converges to the human's true parameters under a larger set of human inputs. In practice, these autonomously generated learning rules can correctly infer what the human is trying to convey, even when the human is noisy, biased, and suboptimal. Across simulations and a user study we find that StROL results in a more accurate estimate and less regret than state-of-the-art approaches for online reward learning. See videos and code here: https://github.com/VT-Collab/StROL_RAL
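The core framing, a learning rule as a discrete-time dynamical system whose equilibrium is the human's true preference vector, can be illustrated with a plain gradient rule on a quadratic loss. This is a toy stand-in for the "original learning dynamics" StROL starts from, not StROL's modified rule; the true parameters, loss, and step size are invented.

```python
# Learning dynamics: theta_{k+1} = theta_k - lr * grad L(theta_k).
# With the quadratic loss L(theta) = ||theta - theta_true||^2, the human's
# true preferences theta_true are the equilibrium point, and the iteration
# contracts toward it at rate (1 - 2 * lr) per step.
theta_true = [0.8, -0.3]    # hidden human preferences (the equilibrium)
theta = [0.0, 0.0]          # robot's online estimate
lr = 0.1

for _ in range(200):
    grad = [2.0 * (t - tt) for t, tt in zip(theta, theta_true)]
    theta = [t - lr * g for t, g in zip(theta, grad)]

# distance to the equilibrium after 200 steps of the dynamics
err = sum((t - tt) ** 2 for t, tt in zip(theta, theta_true)) ** 0.5
```

With noisy or suboptimal human input the effective gradient is perturbed and such contraction can fail; StROL's contribution is to modify the rule so that convergence to the equilibrium holds for a larger set of inputs, certified via Lyapunov analysis.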


Title: Sample-efficient Reinforcement Learning in Robotic Table Tennis

Authors: Jonas Tebbe, Lukas Krauch, Yapeng Gao

PubTime: 2024-01-04

Downlink: http://arxiv.org/abs/2011.03275v4

Project: https://youtu.be/uRAtdoL6Wpw

Abstract: Reinforcement learning (RL) has achieved some impressive recent successes in various computer games and simulations. Most of these successes are based on having large numbers of episodes from which the agent can learn. In typical robotic applications, however, the number of feasible attempts is very limited. In this paper we present a sample-efficient RL algorithm applied to the example of a table tennis robot. In table tennis every stroke is different, with varying placement, speed and spin. An accurate return therefore has to be found depending on a high-dimensional continuous state space. To make learning in few trials possible, the method is embedded into our robot system. In this way we can use a one-step environment. The state space depends on the ball at hitting time (position, velocity, spin) and the action is the racket state (orientation, velocity) at hitting. An actor-critic based deterministic policy gradient algorithm was developed for accelerated learning. Our approach performs competitively both in a simulation and on the real robot in a number of challenging scenarios. Accurate results are obtained without pre-training in under 200 episodes of training. The video presenting our experiments is available at https://youtu.be/uRAtdoL6Wpw.
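The one-step setting reduces each stroke to a single state, action, and reward, which makes deterministic policy gradients especially simple: the actor follows the critic's gradient with respect to the action. Below is a scalar sketch of that idea, with a known quadratic reward standing in for the learned critic; all numbers are illustrative, not the paper's setup.

```python
# One-step environment: one state -> action -> reward interaction per
# episode. A deterministic scalar policy is improved by ascending the
# (here, known) reward surface, mimicking the actor update of a
# deterministic policy gradient method.
a_opt = 0.7                    # unknown-to-the-agent optimal racket parameter

def reward(a):
    """Stand-in for the learned critic Q(s, a)."""
    return -(a - a_opt) ** 2

policy, lr, eps = 0.0, 0.2, 1e-5
for _ in range(100):
    # finite-difference estimate of dQ/da at the current policy output
    grad = (reward(policy + eps) - reward(policy)) / eps
    policy += lr * grad
```

Because each episode yields an immediate reward with no bootstrapping over future states, the critic's action-gradient alone drives the actor, which is part of why so few episodes suffice in this setup.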


Title: Motion Control of Interactive Robotic Arms Based on Mixed Reality Development

Authors: Hanxiao Chen

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01644v1

Project: http://www.icca.net/

Abstract: Mixed Reality (MR) is constantly evolving to inspire new patterns of robot manipulation for more advanced human-robot interaction under the Fourth Industrial Revolution paradigm. Considering that Mixed Reality aims to connect physical and digital worlds to provide special immersive experiences, it is necessary to establish the information exchange platform and robot control systems within the developed MR scenarios. In this work, we mainly present multiple effective motion control methods applied on different interactive robotic arms (e.g., UR5, UR5e, myCobot) for the Unity-based development of MR applications, including a GUI control panel, a text input control panel, end-effector object dynamic tracking, and a ROS-Unity digital-twin connection.


== Object Detection ==

Title: SymTC: A Symbiotic Transformer-CNN Net for Instance Segmentation of Lumbar Spine MRI

Authors: Jiasong Chen, Linchen Qian, Linhai Ma

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.09627v2

GitHub: https://github.com/jiasongchen/SymTC

Abstract: Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and diagnosing and assessing this disease relies on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (discs and vertebrae) of the lumbar spine in an automated way, which is termed instance image segmentation. In this work, we propose SymTC, an innovative lumbar spine MR image segmentation model that combines the strengths of the Transformer and the Convolutional Neural Network (CNN). Specifically, we designed a parallel dual-path architecture to merge CNN layers and Transformer layers, and we integrated a novel position embedding into the self-attention module of the Transformer, enhancing the utilization of positional information for more accurate segmentation. To further improve model performance, we introduced a new data augmentation technique to create a synthetic yet realistic MR image dataset, named SSMSpine, which is made publicly available. We evaluated our SymTC and 15 other existing image segmentation models on our private in-house dataset and the public SSMSpine dataset, using two metrics, Dice Similarity Coefficient and 95% Hausdorff Distance. The results show that our SymTC has the best performance for segmenting vertebral bones and intervertebral discs in lumbar spine MR images. The SymTC code and SSMSpine dataset are available at https://github.com/jiasongchen/SymTC.


Title: SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

Authors: Zhaohu Xing, Tian Ye, Yijun Yang

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13560v2

GitHub: https://github.com/ge-xing/SegMamba

Abstract: The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images, which hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable approach for modeling long-range dependencies in sequential data, excelling in the natural language processing field with its remarkable memory efficiency and computational speed. Inspired by its success, we introduce SegMamba, a novel 3D medical image Segmentation Mamba model, designed to effectively capture long-range dependencies within whole-volume features at every scale. In contrast to Transformer-based methods, our SegMamba excels in whole-volume feature modeling from a state space model standpoint, maintaining superior processing speed even with volume features at a resolution of 64×64×64. Comprehensive experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba. The code for SegMamba is available at: https://github.com/ge-xing/SegMamba
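The core operation behind state space models such as Mamba is a linear recurrence scanned along the sequence, which is what gives them linear-time handling of long sequences. The one-dimensional, fixed-coefficient sketch below shows only that core scan; real models learn input-dependent, high-dimensional versions of it.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """1-D linear state-space recurrence: h_t = a*h_{t-1} + b*x_t,
    y_t = c*h_t. One pass over the sequence, O(len(xs)) time."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

# impulse response: the state carries information forward, decaying
# geometrically, which is how long-range dependencies are propagated
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

Here the impulse response is `1.0, 0.9, 0.81, 0.729, ...`: each output still "remembers" the first input, without the quadratic pairwise attention cost that makes Transformers expensive on whole 3D volumes.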


Title: CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion

Authors: Bin Dou, Tianyu Zhang, Yongjia Ma

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.05925v2

Project: https://David-Dou.github.io/CoSSegGaussians

Abstract: We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact 3D-consistent scene segmentation at fast rendering speed with only RGB images as input. Previous NeRF-based segmentation methods have relied on time-consuming neural scene optimization. While recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, resulting in a lack of robustness against cross-view inconsistent 2D machine-generated labels. Our method aims to address this problem by employing a Dual Feature Fusion Network as the Gaussians' segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian locating, DINO features extracted from images are applied through explicit unprojection, and are further incorporated with spatial features from an efficient point cloud processing network. Feature aggregation is utilized to fuse them in a global-to-local strategy for compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation tasks, meanwhile consuming less than 10% of the inference time compared to NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians.


Title: UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Authors: Wei Li, Xue Xu, Jiachen Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13388v2

Project: https://unimo-ptm.github.io/

Abstract: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, and demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: first pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.


Title: Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

Authors: Yuling Shi, Hongyu Zhang, Chengcheng Wan

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.06461v2

GitHub: https://github.com/YerbaPage/DetectCodeGPT

Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues for software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code; thus, their applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as length, lexical diversity, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the structural segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct structural patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
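The perturbation strategy described above, inserting spaces and newlines instead of querying an external LLM for rewrites, can be sketched directly. The insertion counts and the uniformly random placement below are illustrative choices, not the paper's exact recipe; the point is that each perturbed variant changes only whitespace structure, never token content.

```python
import random

random.seed(3)

def perturb(code, n_spaces=2, n_newlines=1):
    """Perturb a code snippet by inserting spaces and newlines at random
    positions (whitespace-only perturbation in the spirit of
    DetectCodeGPT; placement policy here is illustrative)."""
    chars = list(code)
    for _ in range(n_spaces):
        chars.insert(random.randrange(len(chars) + 1), " ")
    for _ in range(n_newlines):
        chars.insert(random.randrange(len(chars) + 1), "\n")
    return "".join(chars)

snippet = "def add(a, b):\n    return a + b\n"
variants = [perturb(snippet) for _ in range(4)]

# every variant is exactly 3 characters longer, and stripping whitespace
# recovers the original token content unchanged
lengths_ok = all(len(v) == len(snippet) + 3 for v in variants)
content_ok = all(v.replace(" ", "").replace("\n", "") ==
                 snippet.replace(" ", "").replace("\n", "")
                 for v in variants)
```

A detector in this style would then compare the model's scores on the original versus the perturbed variants; machine-generated code tends to sit at a sharper likelihood optimum under such structural perturbations.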


Title: Recurrent Generic Contour-based Instance Segmentation with Progressive Learning

Authors: Hao Feng, Keyi Zhou, Wengang Zhou

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2301.08898v3

GitHub: https://github.com/fh2019ustc/PolySnake

Abstract: Contour-based instance segmentation has been actively studied, thanks to its flexibility and elegance in processing visual objects within complex backgrounds. In this work, we propose a novel deep network architecture, i.e., PolySnake, for generic contour-based instance segmentation. Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance with an iterative and progressive contour refinement strategy. Technically, PolySnake introduces a recurrent update operator to estimate the object contour iteratively. It maintains a single estimate of the contour that is progressively deformed toward the object boundary. At each iteration, PolySnake builds a semantic-rich representation for the current contour and feeds it to the recurrent operator for further contour adjustment. Through the iterative refinements, the contour progressively converges to a stable status that tightly encloses the object instance. Beyond the scope of general instance segmentation, extensive experiments are conducted to validate the effectiveness and generalizability of our PolySnake in two additional specific task scenarios, including scene text detection and lane detection. The results demonstrate that the proposed PolySnake outperforms existing advanced methods on multiple prevalent benchmarks across the three tasks. The codes and pre-trained models are available at https://github.com/fh2019ustc/PolySnake

