[晓理紫] Daily Paper Digest (with abstracts and source code or project links) -- Robotics, Reinforcement Learning

Paper subscription for your specific field

Follow {晓理紫|小李子} for daily paper updates. If you find them interesting, please forward this to classmates who may need it. Thanks for your support.

If this helps you, please follow me, and the latest papers will be delivered to you on time every day.


Category:

[晓理紫] Daily Paper Digest (with abstracts and source code or project links)

== Human-Robot Interaction ==

Title: The Conversation is the Command: Interacting with Real-World Autonomous Robot Through Natural Language

Authors: Linus Nwankwo, Elmar Rueckert

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2401.11838v1

Project: https://osf.io/wzyf6

GitHub: https://github.com/LinusNEP/TCC_IRoNL.git

Abstract: In recent years, autonomous agents have surged in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leveraged the LLMs to decode the high-level natural language instructions from humans and abstract them into precise robot actionable commands or queries. Further, we utilised the VLMs to provide a visual and semantic understanding of the robot's task environment. Our results with 99.13% command recognition accuracy and 97.96% command execution success show that our approach can enhance human-robot interaction in real-world applications. The video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).
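To make the instruction-decoding step concrete, here is a minimal, hypothetical sketch of mapping a free-form utterance to a constrained, robot-actionable command. The command set, prompt, and `llm` callable are illustrative assumptions, not the interface of the TCC repository.

```python
import json

# Hypothetical command vocabulary; the paper's actual action set is not reproduced here.
COMMANDS = {"go_to", "describe_scene", "stop"}

PROMPT = (
    "Translate the user's request into JSON with keys 'command' "
    "(one of go_to, describe_scene, stop) and 'target' (a string or null).\n"
    "Request: {utterance}\nJSON:"
)

def decode_instruction(utterance: str, llm) -> dict:
    """Ask an LLM (any callable str -> str) for a structured command and validate it."""
    raw = llm(PROMPT.format(utterance=utterance))
    try:
        cmd = json.loads(raw)
    except json.JSONDecodeError:
        return {"command": "stop", "target": None}  # conservative fallback
    if cmd.get("command") not in COMMANDS:
        return {"command": "stop", "target": None}
    return cmd

# Example with a stub LLM standing in for a real model call:
print(decode_instruction("Please drive to the kitchen",
                         lambda p: '{"command": "go_to", "target": "kitchen"}'))
```

The validation step is the important part: whatever the LLM produces, only commands inside the fixed vocabulary ever reach the robot.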


Title: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Authors: Dong An, Hanqing Wang, Wenguan Wang

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2304.03047v3

GitHub: https://github.com/MarSaKi/ETPNav

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It allows the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
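The core of online topological mapping is fusing predicted waypoints into a graph as the agent moves. Below is a rough sketch of that idea; the merge radius and the `networkx` representation are assumptions for illustration, not ETPNav's actual implementation.

```python
import math
import networkx as nx

MERGE_RADIUS = 0.5  # metres; assumed threshold for fusing nearby waypoints

def update_topo_map(graph: nx.Graph, current_node: int, predicted_waypoints):
    """Fuse predicted waypoints into the graph and connect them to the current node."""
    for wp in predicted_waypoints:
        # Reuse an existing node if one lies within the merge radius (self-organisation).
        match = next((n for n, d in graph.nodes(data=True)
                      if math.dist(d["pos"], wp) < MERGE_RADIUS), None)
        if match is None:
            match = graph.number_of_nodes()
            graph.add_node(match, pos=tuple(wp))
        if match != current_node:
            graph.add_edge(current_node, match,
                           weight=math.dist(graph.nodes[current_node]["pos"], wp))
    return graph

# Toy usage: start at the origin and fold two predicted waypoints into the map.
g = nx.Graph()
g.add_node(0, pos=(0.0, 0.0))
update_topo_map(g, 0, [(1.0, 0.2), (0.1, 0.1)])
print(list(g.nodes(data=True)), list(g.edges(data="weight")))
```

A high-level planner can then search this graph for a subgoal, leaving obstacle avoidance to a separate low-level controller.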


Title: Design, Development, and Deployment of Context-Adaptive AI Systems for Enhanced End-User Adoption

Authors: Christine P Lee

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13643v1

Abstract: My research centers on the development of context-adaptive AI systems to improve end-user adoption through the integration of technical methods. I deploy these AI systems across various interaction modalities, including user interfaces and embodied agents like robots, to expand their practical applicability. My research unfolds in three key stages: design, development, and deployment. In the design phase, user-centered approaches were used to understand user experiences with AI systems and create design tools for user participation in crafting AI explanations. In the ongoing development stage, a safety-guaranteed AI system for a robot agent was created to automatically provide adaptive solutions and explanations for unforeseen scenarios. The next steps will involve the implementation and evaluation of context-adaptive AI systems in various interaction forms. I seek to prioritize human needs in technology development, creating AI systems that tangibly benefit end-users in real-world applications and enhance interaction experiences.


Title: VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Authors: Raphael Schumann, Wanrong Zhu, Weixi Feng

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2307.06082v2

Abstract: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
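The landmark-visibility step can be approximated with an off-the-shelf CLIP checkpoint: score each extracted landmark against the current panorama and keep those above a threshold. The checkpoint choice and the similarity threshold below are assumptions for illustration, not VELMA's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visible_landmarks(panorama: Image.Image, landmarks: list[str],
                      threshold: float = 25.0) -> list[str]:
    """Return landmarks whose CLIP image-text similarity exceeds an (assumed) threshold."""
    inputs = processor(text=[f"a photo of {l}" for l in landmarks],
                       images=panorama, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one score per landmark text
    return [l for l, s in zip(landmarks, logits.tolist()) if s > threshold]

# Example usage (any RGB panorama image will do):
# pano = Image.open("panorama.jpg")
# print(visible_landmarks(pano, ["a church", "a bus stop", "a red awning"]))
```

The visible landmarks are then verbalized ("you see a church on your right", for instance) and appended to the text prompt that conditions the next action.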


Title: TraKDis: A Transformer-based Knowledge Distillation Approach for Visual Reinforcement Learning with Application to Cloth Manipulation

Authors: Wei Chen, Nicolas Rojas

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13362v1

Abstract: Approaching robotic cloth manipulation using reinforcement learning based on visual feedback is appealing as robot perception and control can be learned simultaneously. However, major challenges arise from the intricate dynamics of cloth and the high dimensionality of the corresponding states, which overshadows the practicality of the idea. To tackle these issues, we propose TraKDis, a novel Transformer-based Knowledge Distillation approach that decomposes the visual reinforcement learning problem into two distinct stages. In the first stage, a privileged agent is trained, which possesses complete knowledge of the cloth state information. This privileged agent acts as a teacher, providing valuable guidance and training signals for subsequent stages. The second stage involves a knowledge distillation procedure, where the knowledge acquired by the privileged agent is transferred to a vision-based agent by leveraging pre-trained state estimation and weight initialization. TraKDis demonstrates better performance when compared to state-of-the-art RL techniques, showing a higher performance of 21.9%, 13.8%, and 8.3% in cloth folding tasks in simulation. Furthermore, to validate robustness, we evaluate the agent in a noisy environment; the results indicate its ability to handle and adapt to environmental uncertainties effectively. Real robot experiments are also conducted to showcase the efficiency of our method in real-world scenarios.
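A simplified sketch of the second-stage distillation objective is shown below, assuming a frozen privileged teacher and a vision student built from a pretrained state estimator plus a policy head. The network sizes, image resolution, and loss weighting are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

state_dim, action_dim, feat_dim = 32, 8, 128

# Stage 1 artefact: a privileged teacher policy that sees the full cloth state.
teacher = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

# Stage 2: vision student = (pretrained) encoder/state estimator + policy head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
state_head = nn.Linear(feat_dim, state_dim)    # pretrained to regress the cloth state
policy_head = nn.Linear(feat_dim, action_dim)

def distillation_loss(images, true_states, beta=1.0):
    """Imitate the frozen teacher's actions while keeping the state estimate accurate."""
    feats = encoder(images)
    with torch.no_grad():
        target_actions = teacher(true_states)
    action_loss = nn.functional.mse_loss(policy_head(feats), target_actions)
    state_loss = nn.functional.mse_loss(state_head(feats), true_states)
    return action_loss + beta * state_loss

imgs = torch.rand(4, 3, 64, 64)
states = torch.rand(4, state_dim)
print(distillation_loss(imgs, states).item())
```

Pretraining the state estimator and reusing its weights is what lets the vision student start close to the teacher instead of learning perception from scratch.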


Title: Adversarial Imitation Learning from Visual Observations using Latent Information

Authors: Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.17371v2

Abstract: We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our algorithm matches state-of-the-art performance while providing significant computational advantages. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to our code.


== Reinforcement Learning @ RL ==

Title: The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

Authors: Matthias Lehmann

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13662v1

GitHub: https://github.com/Matt00n/PolicyGradientsJax

Abstract: In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.
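For reference, all of the surveyed on-policy algorithms build on the Policy Gradient Theorem, which in its common advantage-weighted form reads

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],$$

where $\hat{A}_t$ is an advantage estimate (the raw return in REINFORCE, a learned baseline or GAE in actor-critic variants). This is only the standard textbook statement; the paper itself gives a detailed proof of the continuous version of the theorem.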


Title: DittoGym: Learning to Control Soft Shape-Shifting Robots

Authors: Suning Huang, Boyuan Chen, Huazhe Xu

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13231v1

Project: https://dittogym.github.io

Abstract: Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.
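One way to picture the coarse-to-fine curriculum is a policy that outputs a low-resolution actuation map which is upsampled to the robot's full actuation grid, with the resolution growing over training stages. The grid sizes and interface below are illustrative assumptions, not DittoGym's API.

```python
import numpy as np

FINE_RES = 32  # assumed actuation-grid resolution of the simulated soft robot

def upsample_action(coarse_action: np.ndarray, fine_res: int = FINE_RES) -> np.ndarray:
    """Nearest-neighbour upsampling of a (k, k) actuation map to (fine_res, fine_res)."""
    k = coarse_action.shape[0]
    assert fine_res % k == 0, "curriculum stages assumed to divide the fine grid"
    return np.kron(coarse_action, np.ones((fine_res // k, fine_res // k)))

# Curriculum: the policy's effective action resolution grows over training stages.
for stage, k in enumerate([2, 4, 8, 16]):
    a = np.random.uniform(-1, 1, size=(k, k))
    print(f"stage {stage}: coarse {a.shape} -> fine {upsample_action(a).shape}")
```

Early stages with a tiny grid make exploration tractable; later stages refine the same policy toward fine-grained morphology control.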


Title: Reward Engineering for Generating Semi-structured Explanation

Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2309.08347v2

GitHub: https://github.com/Jiuzhouh/Reward-Engineering-for-Generating-SEG

Abstract: Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is utilised and supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed method on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.
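At its core, the reward-engineering step turns several per-aspect scores of a generated explanation into one scalar for the RL update. The scorers below are placeholders, and the weighted sum is only one of the aggregation schemes the paper compares.

```python
def aggregate_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted-sum aggregation of per-aspect rewards (one scheme among several)."""
    return sum(weights[k] * scores[k] for k in scores)

# Hypothetical per-aspect scores for one generated semi-structured explanation:
scores = {"structure_valid": 1.0, "faithfulness": 0.7, "fluency": 0.9}
weights = {"structure_valid": 0.5, "faithfulness": 0.3, "fluency": 0.2}
print(aggregate_reward(scores, weights))  # scalar reward fed to the RL update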


Title: Generalization of Heterogeneous Multi-Robot Policies via Awareness and Communication of Capabilities

Authors: Pierce Howell, Max Rudolph, Reza Torbati

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13127v1

Project: https://sites.google.com/view/cap-comm

Abstract: Recent advances in multi-agent reinforcement learning (MARL) are enabling impressive coordination in heterogeneous multi-robot teams. However, existing approaches often overlook the challenge of generalizing learned policies to teams of new compositions, sizes, and robots. While such generalization might not be important in teams of virtual agents that can retrain policies on-demand, it is pivotal in multi-robot systems that are deployed in the real-world and must readily adapt to inevitable changes. As such, multi-robot policies must remain robust to team changes – an ability we call adaptive teaming. In this work, we investigate if awareness and communication of robot capabilities can provide such generalization by conducting detailed experiments involving an established multi-robot test bed. We demonstrate that shared decentralized policies, that enable robots to be both aware of and communicate their capabilities, can achieve adaptive teaming by implicitly capturing the fundamental relationship between collective capabilities and effective coordination. Videos of trained policies can be viewed at: https://sites.google.com/view/cap-comm


Title: Getting the Ball Rolling: Learning a Dexterous Policy for a Biomimetic Tendon-Driven Hand with Rolling Contact Joints

Authors: Yasunori Toshimitsu, Benedek Forrai, Barnabas Gavin Cangan

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2308.02453v3

Project: https://srl-ethz.github.io/get-ball-rolling/ | https://youtu.be/YahsMhqNU8o

GitHub: https://github.com/srl-ethz/faive_gym_oss

Abstract: Biomimetic, dexterous robotic hands have the potential to replicate many of the tasks that a human can do, and to achieve status as a general manipulation platform. Recent advances in reinforcement learning (RL) frameworks have achieved remarkable performance in quadrupedal locomotion and dexterous manipulation tasks. Combined with GPU-based highly parallelized simulations capable of simulating thousands of robots in parallel, RL-based controllers have become more scalable and approachable. However, in order to bring RL-trained policies to the real world, we require training frameworks that output policies that can work with physical actuators and sensors as well as a hardware platform that can be manufactured with accessible materials yet is robust enough to run interactive policies. This work introduces the biomimetic tendon-driven Faive Hand and its system architecture, which uses tendon-driven rolling contact joints to achieve a 3D printable, robust high-DoF hand design. We model each element of the hand and integrate it into a GPU simulation environment to train a policy with RL, and achieve zero-shot transfer of a dexterous in-hand sphere rotation skill to the physical robot hand.


Title: Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

Authors: Zihao Zhou, Bin Hu, Chenyang Zhao

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2311.13373v4

GitHub: https://github.com/ZJLAB-AMMI/LLM4Teach

Abstract: Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
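A common way to realise this kind of teacher guidance is a KL term that pulls the student's action distribution toward the distribution suggested by the LLM teacher. The sketch below is a generic version of that idea with an assumed weighting, not the exact objective from LLM4Teach.

```python
import torch
import torch.nn.functional as F

def teacher_guidance_loss(student_logits: torch.Tensor,
                          teacher_probs: torch.Tensor,
                          rl_loss: torch.Tensor,
                          lam: float = 0.5) -> torch.Tensor:
    """Combine the usual RL loss with a KL term toward the LLM teacher's
    suggested action distribution (lam is an assumed weight)."""
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")
    return rl_loss + lam * kl

# Example: 4 states, 6 discrete actions, teacher puts most mass on its suggestion.
student_logits = torch.randn(4, 6, requires_grad=True)
teacher_probs = torch.full((4, 6), 0.04)
teacher_probs[:, 2] = 0.8
print(teacher_guidance_loss(student_logits, teacher_probs, torch.tensor(0.1)))
```

Annealing `lam` toward zero over training lets the student eventually rely on environment feedback alone and surpass the teacher.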


== Object Detection @ Segmentation @ Open-Vocabulary Detection ==

Title: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Authors: Size Wu, Wenwei Zhang, Lumin Xu

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2310.01403v2

GitHub: https://github.com/wusize/CLIPSelf

Abstract: Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill themselves by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
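The self-distillation objective can be pictured as aligning a pooled region feature from the student's dense map with the frozen CLIP embedding of the corresponding crop. The average pooling and cosine loss below are a simplified proxy, not the paper's full training recipe.

```python
import torch
import torch.nn.functional as F

def clipself_alignment_loss(dense_features: torch.Tensor,
                            boxes: list[tuple[int, int, int, int]],
                            crop_embeddings: torch.Tensor) -> torch.Tensor:
    """Align pooled region features from the dense map with image-level crop embeddings.

    dense_features: (C, H, W) feature map of the ViT being fine-tuned.
    boxes: region boxes (x0, y0, x1, y1) in feature-map coordinates.
    crop_embeddings: (N, C) embeddings of the image crops from the frozen teacher.
    """
    losses = []
    for (x0, y0, x1, y1), target in zip(boxes, crop_embeddings):
        region = dense_features[:, y0:y1, x0:x1].mean(dim=(1, 2))  # simple average pooling
        losses.append(1.0 - F.cosine_similarity(region, target, dim=0))
    return torch.stack(losses).mean()

feats = torch.randn(512, 14, 14, requires_grad=True)   # student dense features
crops = torch.randn(2, 512)                             # frozen teacher crop embeddings
print(clipself_alignment_loss(feats, [(0, 0, 7, 7), (7, 7, 14, 14)], crops))
```

No region-text pairs appear anywhere in this loss: the supervision comes entirely from the model's own image-level embeddings of the crops.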


Title: SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

Authors: Zhaohu Xing, Tian Ye, Yijun Yang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13560v1

GitHub: https://github.com/ge-xing/SegMamba|

Abstract: The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images. This hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable approach for modeling long-range dependencies in sequences, excelling in the natural language processing field with its remarkable memory efficiency and computational speed. Inspired by its success, we introduce SegMamba, a novel 3D medical image Segmentation Mamba model, designed to effectively capture long-range dependencies within whole volume features at every scale. Our SegMamba, in contrast to Transformer-based methods, excels in whole volume feature modeling from a state space model standpoint, maintaining superior processing speed, even with volume features at a resolution of 64×64×64. Comprehensive experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba. The code for SegMamba is available at: https://github.com/ge-xing/SegMamba


Title: QAGait: Revisit Gait Recognition from a Quality Perspective

Authors: Zengbin Wang, Saihui Hou, Man Zhang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13531v1

GitHub: https://github.com/wzb-bupt/QAGait

Abstract: Gait recognition is a promising biometric method that aims to identify pedestrians from their unique walking patterns. Silhouette modality, renowned for its easy acquisition, simple structure, sparse representation, and convenient modeling, has been widely employed in controlled in-the-lab research. However, as gait recognition rapidly advances from in-the-lab to in-the-wild scenarios, various conditions raise significant challenges for silhouette modality, including 1) unidentifiable low-quality silhouettes (abnormal segmentation, severe occlusion, or even non-human shape), and 2) identifiable but challenging silhouettes (background noise, non-standard posture, slight occlusion). To address these challenges, we revisit the gait recognition pipeline and approach gait recognition from a quality perspective, namely QAGait. Specifically, we propose a series of cost-effective quality assessment strategies, including Maximal Connect Area and Template Match to eliminate background noise and unidentifiable silhouettes, and an Alignment strategy to handle non-standard postures. We also propose two quality-aware loss functions to integrate silhouette quality into optimization within the embedding space. Extensive experiments demonstrate our QAGait can guarantee both gait reliability and performance enhancement. Furthermore, our quality assessment strategies can seamlessly integrate with existing gait datasets, showcasing our superiority. Code is available at https://github.com/wzb-bupt/QAGait.
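The Maximal Connect Area idea, keeping only the dominant connected component of a silhouette and rejecting frames where it is too small, can be sketched with OpenCV as follows; the minimum-area ratio is an assumed value, not the paper's setting.

```python
import cv2
import numpy as np

MIN_AREA_RATIO = 0.01  # assumed: components smaller than 1% of the frame count as noise

def maximal_connect_area(silhouette: np.ndarray) -> np.ndarray:
    """Keep only the largest connected component of a binary silhouette;
    return an all-zero mask if no sufficiently large component exists."""
    mask = (silhouette > 0).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if num <= 1:
        return np.zeros_like(mask)
    areas = stats[1:, cv2.CC_STAT_AREA]          # skip label 0 (background)
    if areas.max() < MIN_AREA_RATIO * mask.size:
        return np.zeros_like(mask)               # unidentifiable silhouette
    best = 1 + int(np.argmax(areas))
    return (labels == best).astype(np.uint8) * 255

noisy = np.zeros((64, 44), np.uint8)
noisy[10:60, 10:30] = 255   # pedestrian silhouette
noisy[2:4, 40:42] = 255     # speck of background noise
print(maximal_connect_area(noisy).sum() // 255)  # area of the kept component
```

Frames whose cleaned silhouette is rejected can then be dropped before they ever reach the recognition network.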


Title: Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

Authors: Yuling Shi, Hongyu Zhang, Chengcheng Wan

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.06461v2

GitHub: https://github.com/YerbaPage/DetectCodeGPT|

Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, their applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine and human-authored code. Through a rigorous analysis of code attributes such as length, lexical diversity, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the structural segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct structural patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
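The perturbation itself is deliberately simple: insert extra spaces and newlines so the code's style changes while its tokens stay intact. A sketch of such a perturbation function is below; the insertion rates are assumed values, not those used in the paper.

```python
import random

def perturb_code(code: str, p_space: float = 0.05, p_newline: float = 0.02,
                 seed: int = 0) -> str:
    """Perturb a code snippet by randomly inserting extra spaces and newlines,
    a purely stylistic change (insertion rates are assumed values)."""
    rng = random.Random(seed)
    out = []
    for ch in code:
        out.append(ch)
        if ch == " " and rng.random() < p_space:
            out.append(" ")                 # duplicate a space
        elif ch == "\n" and rng.random() < p_newline:
            out.append("\n")                # insert a blank line
    return "".join(out)

snippet = "def add(a, b):\n    return a + b\n"
print(perturb_code(snippet, p_space=0.5, p_newline=0.5, seed=42))
```

Detection then compares how much the scoring model's log-likelihood of the original snippet drops relative to its perturbed copies, in the spirit of DetectGPT.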


Title: Modularized Zero-shot VQA with Pre-trained Models

Authors: Rui Cao, Jing Jiang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2305.17369v2

GitHub: https://github.com/abril4416/Mod-Zero-VQA

Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert sub-reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.


Title: Synthetic data enables faster annotation and robust segmentation for multi-object grasping in clutter

Authors: Dongmyoung Lee, Wei Chen, Nicolas Rojas

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13405v1

Project: https://sites.google.com/view/synthetic-dataset-generation

Abstract: Object recognition and object pose estimation in robotic grasping continue to be significant challenges, since building a labelled dataset can be time-consuming and financially costly in terms of data collection and annotation. In this work, we propose a synthetic data generation method that minimizes human intervention and makes downstream image segmentation algorithms more robust by combining a generated synthetic dataset with a smaller real-world dataset (hybrid dataset). Annotation experiments show that the proposed synthetic scene generation can diminish labelling time dramatically. RGB image segmentation is trained with the hybrid dataset and combined with depth information to produce pixel-to-point correspondence of individual segmented objects. The object to grasp is then determined by the confidence score of the segmentation algorithm. Pick-and-place experiments demonstrate that segmentation trained on our hybrid dataset (98.9%, 70%) outperforms the real dataset and a publicly available dataset by (6.7%, 18.8%) and (2.8%, 10%) in terms of labelling and grasping success rate, respectively. Supplementary material is available at https://sites.google.com/view/synthetic-dataset-generation.
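The grasp-selection step picks the instance mask with the highest segmentation confidence and back-projects its pixels to 3D points through the depth image. The sketch below uses placeholder pinhole intrinsics; they are assumptions, not values from the paper.

```python
import numpy as np

# Assumed pinhole intrinsics; replace with the calibrated camera values.
FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0

def pick_target(masks: np.ndarray, scores: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Choose the segmented object with the highest confidence and back-project its
    mask pixels to 3D points using the depth image (pixel-to-point correspondence)."""
    best = int(np.argmax(scores))
    ys, xs = np.nonzero(masks[best])
    z = depth[ys, xs]
    valid = z > 0                       # discard missing depth readings
    xs, ys, z = xs[valid], ys[valid], z[valid]
    points = np.stack([(xs - CX) * z / FX, (ys - CY) * z / FY, z], axis=1)
    return points                       # (N, 3) candidate grasp region, camera frame

masks = np.zeros((2, 480, 640), bool)
masks[0, 200:220, 300:320] = True
masks[1, 100:110, 100:110] = True
scores = np.array([0.91, 0.55])
depth = np.full((480, 640), 0.6)        # metres
print(pick_target(masks, scores, depth).shape)
```

The resulting point set can be handed to any grasp planner operating in the camera or robot frame.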


Paper subscription for your specific field

Follow {晓理紫|小李子} for daily paper updates. If you find them interesting, please forward this to classmates who may need it. Thanks for your support, and thanks for the suggestions.

If this helps you, please follow me, and the latest papers will be delivered to you on time every day.

