[晓理紫] Daily Paper Digest (with abstracts and source code or project links): Reinforcement Learning, Imitation Learning, Robotics

Article link: https://blog.csdn.net/u011573853/article/details/136224085

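Domain-specific paper subscription

Follow {晓理紫} on VX (WeChat) for free; papers are updated daily. If you find this useful, please share it with others who may need it. Thank you for your support.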

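Categories:

Large language models (LLM)
Vision models (VLM)
Diffusion models
Vision-language navigation (VLN)
Reinforcement learning (RL)
Imitation learning (IL)
Robotics
Open vocabulary, detection and segmentation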

== RLHF ==

Title: MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces

Authors: Tianyu Zheng, Ge Zhang, Xingwei Qu

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.12845v1

GitHub: https://github.com/Zheng0428/MORE_

Abstract: Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective on the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning states’ and actions’ representation with languages’ representation. Our method significantly outperforms current baselines as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.

Title: Learning in Mean Field Games: A Survey

Authors: Mathieu Laurière, Sarah Perrin, Julien Pérolat

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2205.12944v3

Abstract: Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising to solve complex problems at scale. The combination of RL and MFGs is promising to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn equilibria and social optima in MFGs. We first identify the most common settings (static, stationary, and evolutive) of MFGs. We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.

Title: Analyzing Operator States and the Impact of AI-Enhanced Decision Support in Control Rooms: A Human-in-the-Loop Specialized Reinforcement Learning Framework for Intervention Strategies

Authors: Ammar N. Abbas, Chidera W. Amazu, Joseph Mietkiewicz

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13219v1

Abstract: In complex industrial and chemical process control rooms, effective decision-making is crucial for safety and efficiency. The experiments in this paper evaluate the impact and applications of an AI-based decision support system integrated into an improved human-machine interface, using dynamic influence diagrams, a hidden Markov model, and deep reinforcement learning. The enhanced support system aims to reduce operator workload, improve situational awareness, and provide different intervention strategies to the operator adapted to the current state of both the system and human performance. Such a system can be particularly useful in cases of information overload when many alarms and inputs are presented all within the same time window, or for junior operators during training. A comprehensive cross-data analysis was conducted, involving 47 participants and a diverse range of data sources such as smartwatch metrics, eye-tracking data, process logs, and responses from questionnaires. The results indicate interesting insights regarding the effectiveness of the approach in aiding decision-making, decreasing perceived workload, and increasing situational awareness for the scenarios considered. Additionally, the results provide valuable insights to compare differences between styles of information gathering when using the system by individual participants. These findings are particularly relevant when predicting the overall performance of the individual participant and their capacity to successfully handle a plant upset and the alarms connected to it using process and human-machine interaction logs in real-time. These predictions enable the development of more effective intervention strategies.

Title: SONATA: Self-adaptive Evolutionary Framework for Hardware-aware Neural Architecture Search

Authors: Halima Bouzidi, Smail Niar, Hamza Ouarnoughi

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13204v1

Abstract: Recent advancements in Artificial Intelligence (AI), driven by Neural Networks (NN), demand innovative neural architecture designs, particularly within the constrained environments of Internet of Things (IoT) systems, to balance performance and efficiency. HW-aware Neural Architecture Search (HW-aware NAS) emerges as an attractive strategy to automate the design of NN using multi-objective optimization approaches, such as evolutionary algorithms. However, the intricate relationship between NN design parameters and HW-aware NAS optimization objectives remains an underexplored research area, overlooking opportunities to effectively leverage this knowledge to guide the search process accordingly. Furthermore, the large amount of evaluation data produced during the search holds untapped potential for refining the optimization strategy and improving the approximation of the Pareto front. Addressing these issues, we propose SONATA, a self-adaptive evolutionary algorithm for HW-aware NAS. Our method leverages adaptive evolutionary operators guided by the learned importance of NN design parameters. Specifically, through tree-based surrogate models and a Reinforcement Learning agent, we aspire to gather knowledge on ‘How’ and ‘When’ to evolve NN architectures. Comprehensive evaluations across various NAS search spaces and hardware devices on the ImageNet-1k dataset have shown the merit of SONATA with up to 0.25% improvement in accuracy and up to 2.42x gains in latency and energy. Our SONATA has seen up to approximately 93.6% Pareto dominance over the native NSGA-II, further stipulating the importance of self-adaptive evolution operators in HW-aware NAS.

Title: Tiny Reinforcement Learning for Quadruped Locomotion using Decision Transformers

Authors: Orhan Eren Akgün, Néstor Cuevas, Matheus Farias

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13201v1

Abstract: Resource-constrained robotic platforms are particularly useful for tasks that require low-cost hardware alternatives due to the risk of losing the robot, like in search-and-rescue applications, or the need for a large number of devices, like in swarm robotics. For this reason, it is crucial to find mechanisms for adapting reinforcement learning techniques to the constraints imposed by lower computational power and smaller memory capacities of these ultra low-cost robotic platforms. We try to address this need by proposing a method for making imitation learning deployable onto resource-constrained robotic platforms. Here we cast the imitation learning problem as a conditional sequence modeling task and we train a decision transformer using expert demonstrations augmented with a custom reward. Then, we compress the resulting generative model using software optimization schemes, including quantization and pruning. We test our method in simulation using Isaac Gym, a realistic physics simulation environment designed for reinforcement learning. We empirically demonstrate that our method achieves natural looking gaits for Bittle, a resource-constrained quadruped robot. We also run multiple simulations to show the effects of pruning and quantization on the performance of the model. Our results show that quantization (down to 4 bits) and pruning reduce model size by around 30% while maintaining a competitive reward, making the model deployable in a resource-constrained system.
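To make the two compression steps named in the abstract concrete, here is a minimal sketch of magnitude pruning followed by 4-bit uniform quantization on a plain list of weights. It is an illustration under stated assumptions, not the paper's pipeline: the function names, the 25% pruning fraction, the symmetric quantization scheme, and the example weights are all hypothetical.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_symmetric(weights, bits=4):
    """Uniform symmetric quantization to signed integer codes, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, [c * scale for c in codes]


# Hypothetical weights standing in for one layer of a trained model.
weights = [0.82, -0.05, 0.31, -0.67, 0.02, 0.49, -0.11, 0.90]
pruned = prune_by_magnitude(weights, fraction=0.25)
codes, dequantized = quantize_symmetric(pruned, bits=4)
```

Each weight is stored as one of at most 15 integer codes plus a shared scale, and the rounding error per weight is bounded by half the scale.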

Title: Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

Authors: Xiangyu Liu, Souradip Chakraborty, Yanchao Sun

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2305.17342v3

Abstract: Most existing works focus on direct perturbations to the victim’s state/action or the underlying transition dynamics to demonstrate the vulnerability of reinforcement learning agents to adversarial attacks. However, such direct manipulations may not be always realizable. In this paper, we consider a multi-agent setting where a well-trained victim agent $\nu$ is exploited by an attacker controlling another agent $\alpha$ with an adversarial policy. Previous models do not account for the possibility that the attacker may only have partial control over $\alpha$ or that the attack may produce easily detectable “abnormal” behaviors. Furthermore, there is a lack of provably efficient defenses against these adversarial policies. To address these limitations, we introduce a generalized attack framework that has the flexibility to model to what extent the adversary is able to control the agent, and allows the attacker to regulate the state distribution shift and produce stealthier adversarial policies. Moreover, we offer a provably efficient defense with polynomial convergence to the most robust victim policy through adversarial training with timescale separation. This stands in sharp contrast to supervised learning, where adversarial training typically provides only empirical defenses. Using the Robosumo competition experiments, we show that our generalized attack formulation results in much stealthier adversarial policies when maintaining the same winning rate as baselines. Additionally, our adversarial training approach yields stable learning dynamics and less exploitable victim policies.

== Imitation Learning ==

Title: DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

Authors: Norman Di Palo, Edward Johns

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13181v1

Project: https://www.robot-learning.uk/dinobot

Abstract: We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
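The retrieval step described in the abstract can be pictured as a nearest-neighbour lookup over image-level features. The following is a minimal sketch, assuming cosine similarity as the metric and using made-up three-dimensional vectors in place of real DINO ViT features; the names `retrieve_most_similar` and `memory` are illustrative and are not taken from the DINOBot code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_most_similar(query, memory):
    """Return the name of the stored demonstration whose feature is closest to the query."""
    return max(memory, key=lambda name: cosine_similarity(query, memory[name]))


# Hypothetical image-level features; in practice these would come from a vision backbone.
memory = {
    "mug": [0.9, 0.1, 0.0],
    "hammer": [0.1, 0.8, 0.3],
    "bottle": [0.7, 0.0, 0.6],
}
match = retrieve_most_similar([0.8, 0.2, 0.1], memory)
```

The pixel-level alignment that follows retrieval in the paper is not covered by this sketch.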

Title: SubIQ: Inverse Soft-Q Learning for Offline Imitation with Suboptimal Demonstrations

Authors: Huy Hoang, Tien Mai, Pradeep Varakantham

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13147v1

Abstract: We consider offline imitation learning (IL), which aims to mimic the expert’s behavior from its demonstration without further interaction with the environment. One of the main challenges in offline IL is dealing with the limited support of expert demonstrations that cover only a small fraction of the state-action spaces. In this work, we consider offline IL, where expert demonstrations are limited but complemented by a larger set of sub-optimal demonstrations of lower expertise levels. Most of the existing offline IL methods developed for this setting are based on behavior cloning or distribution matching, where the aim is to match the occupancy distribution of the imitation policy with that of the expert policy. Such an approach often suffers from over-fitting, as expert demonstrations are too limited to accurately represent any occupancy distribution. On the other hand, since sub-optimal sets are much larger, there is a high chance that the imitation policy is trained towards sub-optimal policies. In this paper, to address these issues, we propose a new approach based on inverse soft-Q learning, where a regularization term is added to the training objective, with the aim of aligning the learned rewards with a pre-assigned reward function that allocates higher weights to state-action pairs from expert demonstrations, and lower weights to those from lower expertise levels. On standard benchmarks, our inverse soft-Q learning outperforms other offline IL baselines by a large margin.

Title: Imitation Learning-Based Online Time-Optimal Control with Multiple-Waypoint Constraints for Quadrotors

Authors: Jin Zhou, Jiahao Mei, Fangguo Zhao

PubTime: 2024-02-18

Downlink: http://arxiv.org/abs/2402.11570v1

Abstract: Over the past decade, there has been a remarkable surge in utilizing quadrotors for various purposes, such as search and rescue, delivery, and autonomous drone racing, due to their simple structure and aggressive maneuverability. One of the key challenges preventing quadrotors from being widely used in these scenarios is online waypoint-constrained time-optimal trajectory generation and control. This letter proposes an imitation learning-based online solution to efficiently navigate the quadrotor through multiple waypoints with time-optimal performance. The neural networks (WN&CNets) are trained to learn the control law from the dataset generated by the time-consuming CPC algorithm and then deployed to generate the optimal control commands online to guide the quadrotors. To address the challenge of limited training data and the hover maneuver at the final waypoint, we propose a transition phase strategy that utilizes polynomials to help the quadrotor ‘jump over’ the stop-and-go maneuver when switching waypoints. Our method is demonstrated in both simulation and real-world experiments, achieving a maximum speed of 7 m/s while navigating through 7 waypoints in a confined space of 6.0 m × 4.0 m × 2.0 m. The results show that with a slight loss in optimality, the WN&CNets significantly reduce the processing time and enable online optimal control for multiple-waypoint-constrained flight tasks.

== Embodied Artificial Intelligence@robotic agent@human robot interaction ==

Title: PROGrasp: Pragmatic Human-Robot Communication for Object Grasping

Authors: Gi-Cheon Kang, Junghyun Kim, Jaein Kim

PubTime: 2024-02-18

Downlink: http://arxiv.org/abs/2309.07759v2

GitHub: https://github.com/gicheonkang/prograsp

Abstract: Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction. Current IOG systems assume that a human user initially specifies the target object’s category (e.g., bottle). Inspired by pragmatics, where humans often convey their intentions by relying on context to achieve goals, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In our proposed task scenario, an intention-oriented utterance (e.g., “I am thirsty”) is initially given to the robot. The robot should then identify the target object by interacting with a human user. Based on the task setup, we propose a new robotic system that can interpret the user’s intention and pick up the target object, Pragmatic Object Grasping (PROGrasp). PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and most importantly, answer interpretation for pragmatic inference. Experimental results show that PROGrasp is effective in offline (i.e., target object discovery) and online (i.e., IOG with a physical robot arm) settings. Code and data are available at https://github.com/gicheonkang/prograsp.

Title: Towards Unified Interactive Visual Grounding in The Wild

Authors: Jie Xu, Hanbo Zhang, Qingyi Si

PubTime: 2024-02-18

Downlink: http://arxiv.org/abs/2401.16699v2

GitHub: https://github.com/jxu124/TiO

Abstract: Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained on a joint of extensive public data, and shows superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on the carefully selected 150 challenging scenes as well as real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Codes and demos are available at https://github.com/jxu124/TiO.

Title: Particle Filter SLAM for Vehicle Localization

Authors: Tianrui Liu, Changxin Xu, Yuxin Qiao

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.07429v2

Abstract: Simultaneous Localization and Mapping (SLAM) presents a formidable challenge in robotics, involving the dynamic construction of a map while concurrently determining the precise location of the robotic agent within an unfamiliar environment. This intricate task is further compounded by the inherent “chicken-and-egg” dilemma, where accurate mapping relies on a dependable estimation of the robot’s location, and vice versa. Moreover, the computational intensity of SLAM adds an additional layer of complexity, making it a crucial yet demanding topic in the field. In our research, we address the challenges of SLAM by adopting the Particle Filter SLAM method. Our approach leverages encoded data and fiber optic gyro (FOG) information to enable precise estimation of vehicle motion, while lidar technology contributes to environmental perception by providing detailed insights into surrounding obstacles. The integration of these data streams culminates in the establishment of a Particle Filter SLAM framework, representing a key endeavor in this paper to effectively navigate and overcome the complexities associated with simultaneous localization and mapping in robotic systems.
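For readers unfamiliar with particle filters, the localization half of the method can be illustrated with a minimal sketch of the predict, weight, and resample cycle. It is a deliberately simplified stand-in: a vehicle on a 1D track with a direct noisy position measurement, no map building, and none of the encoder, FOG, or lidar processing the paper describes. The noise levels, particle count, and function name are assumptions chosen for the example.

```python
import math
import random


def particle_filter_step(particles, control, measurement, rng,
                         motion_std=0.2, meas_std=0.5):
    """One predict / weight / resample cycle for a 1D position estimate."""
    # Predict: apply the control input plus motion noise to every particle.
    moved = [p + control + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in moved]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(moved)] * len(moved)
    else:
        weights = [w / total for w in weights]
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
true_pos = 0.0
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
for _ in range(30):
    true_pos += 1.0                        # vehicle moves 1 unit per step
    z = true_pos + rng.gauss(0.0, 0.5)     # noisy position measurement
    particles = particle_filter_step(particles, 1.0, z, rng)
estimate = sum(particles) / len(particles)
```

In a full SLAM system each particle would also carry its own map hypothesis, which is omitted here.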

Title: Developing Autonomous Robot-Mediated Behavior Coaching Sessions with Haru

Authors: Matouš Jelínek, Eric Nichols, Randy Gomez

PubTime: 2024-02-18

Downlink: http://arxiv.org/abs/2402.11569v1

Abstract: This study presents an empirical investigation into the design and impact of autonomous dialogues in human-robot interaction for behavior change coaching. We focus on the use of Haru, a tabletop social robot, and explore the implementation of the Tiny Habits method for fostering positive behavior change. The core of our study lies in developing a fully autonomous dialogue system that maximizes Haru’s emotional expressiveness and unique personality. Our methodology involved iterative design and extensive testing of the dialogue system, ensuring it effectively embodied the principles of the Tiny Habits method while also incorporating strategies for trust-raising and trust-dampening. The effectiveness of the final version of the dialogue was evaluated in an experimental study with human participants (N=12). The results indicated a significant improvement in perceptions of Haru’s liveliness, interactivity, and neutrality. Additionally, our study contributes to the broader understanding of dialogue design in social robotics, offering practical insights for future developments in the field.

Title: What's the Plan? Evaluating and Developing Planning-Aware Techniques for LLMs

Authors: Eran Hirsch, Guy Uziel, Ateret Anaby-Tavor

PubTime: 2024-02-18

Downlink: http://arxiv.org/abs/2402.11489v1

Abstract: Planning is a fundamental task in artificial intelligence that involves finding a sequence of actions that achieve a specified goal in a given environment. Large language models (LLMs) are increasingly used for applications that require planning capabilities, such as web or embodied agents. In line with recent studies, we demonstrate through experimentation that LLMs lack necessary skills required for planning. Based on these observations, we advocate for the potential of a hybrid approach that combines LLMs with classical planning methodology. Then, we introduce SimPlan, a novel hybrid method, and evaluate its performance in a new challenging setup. Our extensive experiments across various planning domains demonstrate that SimPlan significantly outperforms existing LLM-based planners.

Title: A Novel Multivariate Skew-Normal Mixture Model and Its Application in Path-Planning for Very-Large-Scale Robotic Systems

Authors: Pingping Zhu, Chang Liu, Peter Estephan

PubTime: 2024-02-16

Downlink: http://arxiv.org/abs/2402.11091v1

Abstract: This paper addresses the path-planning challenge for very large-scale robotic systems (VLSR) operating in complex and cluttered environments. VLSR systems consist of numerous cooperative agents or robots working together autonomously. Traditionally, many approaches for VLSR systems are developed based on Gaussian mixture models (GMMs), where the GMMs represent agents’ evolving spatial distribution, serving as a macroscopic view of the system’s state. However, our recent research into VLSR systems has unveiled limitations in using GMMs to represent agent distributions, especially in cluttered environments. To overcome these limitations, we propose a novel model called the skew-normal mixture model (SNMM) for representing agent distributions. Additionally, we present a parameter learning algorithm designed to estimate the SNMM’s parameters using sample data. Furthermore, we develop two SNMM-based path-planning algorithms to guide VLSR systems through complex and cluttered environments. Our simulation results demonstrate the effectiveness and superiority of these algorithms compared to GMM-based path-planning methods.
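To show what a skew-normal mixture adds over a Gaussian mixture, here is a minimal sketch of the standard univariate skew-normal density and a weighted mixture of two such components. The paper works with a multivariate model and a dedicated parameter learning algorithm, neither of which is reproduced here; the one-dimensional simplification, the component parameters, and the function names are assumptions for illustration. Setting the shape parameter to zero recovers an ordinary Gaussian component.

```python
import math


def skew_normal_pdf(x, location, scale, shape):
    """Univariate skew-normal density: (2 / scale) * phi(z) * Phi(shape * z)."""
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * big_phi


def mixture_pdf(x, components):
    """Weighted sum of skew-normal components given as (weight, location, scale, shape)."""
    return sum(w * skew_normal_pdf(x, loc, s, a) for w, loc, s, a in components)


# Hypothetical two-component mixture with opposite skew directions.
components = [(0.6, -1.0, 1.0, 4.0), (0.4, 2.0, 0.5, -3.0)]

# Riemann-sum check that the mixture integrates to about 1 over a wide interval.
step = 0.01
area = sum(mixture_pdf(-10.0 + i * step, components) * step for i in range(2000))
```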

== Object Detection@ Segmentation@Open vocabulary detection@SAM ==

Title: GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation

Authors: Ayan Banerjee, Sanket Biswas, Josep Lladós

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.11401v2

GitHub: https://github.com/ayanban011/GraphKD

Abstract: Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource-constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. We design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias, an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: https://github.com/ayanban011/GraphKD.
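As background for the teacher-to-student transfer mentioned in the abstract, the following sketch shows the classic soft-target distillation loss: a KL divergence between temperature-softened class distributions. This is the generic formulation, swapped in for illustration; it is not the graph-based distillation loss the paper proposes, and the logits, temperature, and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * kl


teacher = [4.0, 1.0, 0.5]
same = distillation_loss(teacher, teacher)
different = distillation_loss(teacher, [0.5, 1.0, 4.0])
```

The loss is zero when the student matches the teacher exactly and grows as their predicted distributions diverge.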

Title: OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection

Authors: Chenxi Huang, Tong He, Haidong Ren

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2212.10049v2

GitHub: https://github.com/mrsempress/OBMO

Abstract: Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are 1.82% to 10.91% mAP in BEV and 1.18% to 9.36% mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
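The depth ambiguity described in the abstract follows directly from pinhole projection: scaling a 3D point along the ray through the camera origin leaves its pixel location unchanged. The sketch below demonstrates this for a box center only. It is an illustration under stated assumptions, not the OBMO implementation: the intrinsics (`focal`, `cx`, `cy`), the depth offsets, and the function names are hypothetical, and the paper's label scoring strategies are not shown.

```python
def project(point, focal=700.0, cx=600.0, cy=180.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def shift_along_ray(center, new_depth):
    """Move a 3D box center to a new depth along the ray through the camera origin."""
    x, y, z = center
    s = new_depth / z
    return (x * s, y * s, new_depth)


# Hypothetical ground-truth box center, 20 m in front of the camera.
center = (2.0, 1.0, 20.0)
pseudo_centers = [shift_along_ray(center, center[2] + dz)
                  for dz in (-1.0, -0.5, 0.5, 1.0)]
pixels = [project(c) for c in pseudo_centers]
```

Every entry of `pixels` coincides with the projection of the original center, which is why image evidence alone cannot separate these depth hypotheses.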

标题: mBEST: Realtime Deformable Linear Object Detection Through Minimal Bending Energy Skeleton Pixel Traversals

作者: Andrew Choi, Dezhong Tong, Brian Park

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2302.09444v5

Project: https://youtu.be/q84I9i0DOK4

中文摘要: 可变形材料的机器人操作是一项具有挑战性的任务，通常需要实时视觉反馈。对于可变形线性物体（DLO）或“杆”来说尤其如此，其细长而柔性的结构使得正确的跟踪和检测并非易事。为了应对这一挑战，我们提出了mBEST，这是一种用于实时检测DLO的鲁棒算法，能够生成每个DLO中心线的有序像素序列以及分割掩码。我们的算法先获得DLO的二值掩码，然后对其进行细化以得到骨架像素表示。在修正骨架以确保拓扑正确性之后，遍历像素以沿着每个独立的DLO生成路径。作为算法的核心，我们假设通过选择使DLO累积弯曲能量最小的路径组合，可以鲁棒地处理交叉点。我们表明，在检测具有大量零星交叉（从曲率变化剧烈的情形到近乎平行的配置）的DLO时，这种简单而直观的表述优于最先进的方法。此外，与现有技术相比，我们的方法取得了显著的性能提升：运行速度提高约50%，且可扩展性更好。

摘要: Robotic manipulation of deformable materials is a challenging task that often requires realtime visual feedback. This is especially true for deformable linear objects (DLOs) or “rods”, whose slender and flexible structures make proper tracking and detection nontrivial. To address this challenge, we present mBEST, a robust algorithm for the realtime detection of DLOs that is capable of producing an ordered pixel sequence of each DLO’s centerline along with segmentation masks. Our algorithm obtains a binary mask of the DLOs and then thins it to produce a skeleton pixel representation. After refining the skeleton to ensure topological correctness, the pixels are traversed to generate paths along each unique DLO. At the core of our algorithm, we postulate that intersections can be robustly handled by choosing the combination of paths that minimizes the cumulative bending energy of the DLO(s). We show that this simple and intuitive formulation outperforms the state-of-the-art methods for detecting DLOs with large numbers of sporadic crossings ranging from curvatures with high variance to nearly-parallel configurations. Furthermore, our method achieves a significant performance improvement of approximately 50% faster runtime and better scaling over the state of the art.
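
The idea of resolving a crossing by minimizing bending energy can be sketched as follows. This is only an illustration, not the mBEST algorithm: it assumes a single crossing with exactly four skeleton branches, represents each branch by a hypothetical outward direction vector, and uses the squared turning angle as a stand-in for bending energy.

```python
import math


def turning_angle(d_in, d_out):
    """Angle (radians) a path turns when it enters along one branch and leaves along another.

    Both arguments are outward-pointing branch directions, so the travel
    direction on entry is the negation of d_in.
    """
    ax, ay = -d_in[0], -d_in[1]
    bx, by = d_out
    dot = ax * bx + ay * by
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def best_pairing(dirs):
    """Pick the pairing of four branches with the smallest summed squared turning angle."""
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

    def energy(pairing):
        return sum(turning_angle(dirs[i], dirs[j]) ** 2 for i, j in pairing)

    return min(pairings, key=energy)


if __name__ == "__main__":
    # Two straight strands crossing at right angles.
    branches = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    print(best_pairing(branches))
```

For two straight strands crossing at right angles, the sketch pairs opposite branches, since any other pairing would require two 90-degree bends.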

标题: Enhancing Self-Supervised Learning for Remote Sensing with Elevation Data: A Case Study with Scarce And High Level Semantic Labels

作者: Omar A. Castaño-Idarraga, Raul Ramos-Pollán, Freddie Kalaitzis

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2304.06857v3

GitHub: https://github.com/omarcastano/Elevation-Aware-SSL

中文摘要: 这项工作提出了一种无监督与有监督相结合的混合学习方法，用于在只有少量表示非常一般的语义概念的标签可用时，预训练应用于地球观测下游任务的模型。我们将预训练模型的对比学习方法与逐像素回归前置任务相结合，以预测在世界范围内普遍可用的粗略高程图。我们假设这将使模型预先学习到有用的表示，因为在许多遥感任务中，高程图和目标之间通常存在一定的相关性。我们在一个二类语义分割任务和一个二类图像分类任务上评估了我们方法的性能，这两个任务都来自为哥伦比亚西北部创建的数据集。在这两种情况下，我们用39k张未标记图像预训练模型，只用80张标记图像在下游任务上微调，并用2944张标记图像进行评估。我们的实验表明，我们的方法，即用于分割的GLCNet+Elevation和用于分类的SimCLR+Elevation，在宏平均F1分数和平均交并比（MIoU）方面优于没有逐像素回归前置任务的对应方法，即SimCLR和GLCNet。我们的研究不仅鼓励开发利用容易获得的地理信息（如高程数据）的预训练方法，以增强自监督方法在地球观测任务中的性能，而且还促进了具有高级语义标签的数据集的使用，这些数据集更有可能频繁更新。项目代码可以在以下链接中找到：https://github.com/omarcastano/Elevation-Aware-SSL。

摘要: This work proposes a hybrid unsupervised and supervised learning method to pre-train models applied in Earth observation downstream tasks when only a handful of labels denoting very general semantic concepts are available. We combine a contrastive approach to pre-train models with a pixel-wise regression pre-text task to predict coarse elevation maps, which are commonly available worldwide. We hypothesize that this will allow the model to pre-learn useful representations, as there is generally some correlation between elevation maps and targets in many remote sensing tasks. We assess the performance of our approach on a binary semantic segmentation task and a binary image classification task, both derived from a dataset created for the northwest of Colombia. In both cases, we pre-train our models with 39k unlabeled images, fine-tune them on the downstream tasks with only 80 labeled images, and evaluate them with 2944 labeled images. Our experiments show that our methods, GLCNet+Elevation for segmentation, and SimCLR+Elevation for classification, outperform their counterparts without the pixel-wise regression pre-text task, namely SimCLR and GLCNet, in terms of macro-average F1 Score and Mean Intersection over Union (MIoU). Our study not only encourages the development of pre-training methods that leverage readily available geographical information, such as elevation data, to enhance the performance of self-supervised methods when applied to Earth observation tasks, but also promotes the use of datasets with high-level semantic labels, which are more likely to be updated frequently. Project code can be found at https://github.com/omarcastano/Elevation-Aware-SSL.
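
The combination of a contrastive objective with a pixel-wise elevation regression task can be sketched as a weighted sum of two losses. The abstract does not state how the two terms are combined, so the additive form, the `weight` parameter, and the use of mean squared error are all assumptions here, and the contrastive loss is taken as an already-computed number rather than implemented.

```python
def mse(pred, target):
    """Mean squared error over two equally shaped 2D grids (lists of rows)."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(contrastive_loss, pred_elevation, true_elevation, weight=1.0):
    """Contrastive term plus a weighted pixel-wise elevation regression term.

    `weight` is a hypothetical balancing hyperparameter, not taken from the paper.
    """
    return contrastive_loss + weight * mse(pred_elevation, true_elevation)


if __name__ == "__main__":
    predicted = [[1.0, 2.0], [3.0, 4.0]]
    reference = [[1.0, 2.0], [3.0, 6.0]]
    print(combined_loss(0.5, predicted, reference, weight=0.5))
```

In a real training loop both terms would be computed on the same batch, so that the encoder is shaped by the contrastive signal and the elevation signal at the same time.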

标题: Self-Supervised Pre-Training for Precipitation Post-Processor

作者: Sojung An, Junha Lee, Jiyeon Jang

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2310.20187v3

Project: https://www.climatechange.ai/papers/neurips2023/18

摘要: Obtaining a sufficient forecast lead time for local precipitation is essential in preventing hazardous weather events. Global warming-induced climate change increases the challenge of accurately predicting severe precipitation events, such as heavy rainfall. In this paper, we propose a deep learning-based precipitation post-processor for numerical weather prediction (NWP) models. The precipitation post-processor consists of (i) employing self-supervised pre-training, where the parameters of the encoder are pre-trained on the reconstruction of the masked variables of the atmospheric physics domain; and (ii) conducting transfer learning on precipitation segmentation tasks (the target domain) from the pre-trained encoder. In addition, we introduced a heuristic labeling approach to effectively train class-imbalanced datasets. Our experiments on precipitation correction for regional NWP show that the proposed method outperforms other approaches.

标题: Three Pillars improving Vision Foundation Model Distillation for Lidar

作者: Gilles Puy, Spyros Gidaris, Alexandre Boulch

PubTime: 2024-02-19

Downlink: http://arxiv.org/abs/2310.17504v2

GitHub: https://github.com/valeoai/ScaLR

摘要: Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. Ideally, 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results, obtained thanks to distillation methods that keep improving. Yet, we still notice a large performance gap when measuring the quality of distilled and fully supervised features by linear probing. In this work, instead of focusing only on the distillation method, we study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbones, and the pretraining dataset. In particular, thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features, and to improve the robustness of the pretrained backbones to domain gaps and perturbations.