[晓理紫] Daily Paper Digest (with abstracts and source code or project links) -- Embodied AI, Reinforcement Learning

Paper subscriptions for dedicated research areas

Follow 晓理紫 on WeChat for daily paper updates. If you find this useful, please forward it to others who may need it. Thank you for your support!


== Embodied Artificial Intelligence ==

Title: Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

Authors: Yuanchen Ju, Kaizhe Hu, Guowei Zhang

Abstract: Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among objects, which naturally transfers the interaction experience of familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast availability of human videos on the Internet may serve as a valuable resource, from which we extract an affordance memory including the contact points. Inspired by the natural way humans think, we propose Robo-ABC: when confronted with unfamiliar objects that require generalization, the robot can acquire affordance by retrieving objects that share visual or semantic similarities from the affordance memory. The next step is to map the contact points of the retrieved objects to the new object. While establishing this correspondence may present formidable challenges at first glance, recent research finds it naturally arises from pre-trained diffusion models, enabling affordance mapping even across disparate object categories. Through the Robo-ABC framework, robots may generalize to manipulate out-of-category objects in a zero-shot manner without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC significantly enhances the accuracy of visual affordance retrieval by a large margin of 31.6% compared to state-of-the-art (SOTA) end-to-end affordance models. We also conduct real-world experiments of cross-category object-grasping tasks. Robo-ABC achieved a success rate of 85.7%, proving its capacity for real-world tasks.

[Downlink:]http://arxiv.org/abs/2401.07487v1
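The retrieval step Robo-ABC describes maps naturally onto a nearest-neighbor lookup. Below is a minimal Python sketch, assuming each memory entry pairs an embedding from some pretrained visual encoder with a 2-D contact point; the random arrays are stand-ins for real data, and the diffusion-based correspondence mapping is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
memory_embeddings = rng.normal(size=(500, 512))  # 500 remembered objects
memory_contacts = rng.uniform(size=(500, 2))     # their 2-D contact points

def retrieve(query_emb, k=5):
    """Indices of the k most similar memory entries by cosine similarity."""
    sims = memory_embeddings @ query_emb
    sims /= np.linalg.norm(memory_embeddings, axis=1) * np.linalg.norm(query_emb)
    return np.argsort(-sims)[:k]

query = rng.normal(size=512)  # embedding of the novel object
for idx in retrieve(query):
    # Robo-ABC would then map the retrieved contact point onto the novel
    # object via semantic correspondence from a pretrained diffusion model.
    print(idx, memory_contacts[idx])
```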


Title: AMC'24 "A Novel Stiffness Modulation Mechanism for Energy Efficient Variable Stiffness Actuators"

Authors: Sariyildiz Emre

Abstract: This paper presents a new stiffness modulation mechanism that enables infinite-range stiffness modulation in a fast manner. The proposed stiffness modulation mechanism can help improve many robot-environment interaction applications such as human-robot collaboration and robotic rehabilitation.

[Downlink:]http://arxiv.org/abs/2401.07430v1


== Reinforcement Learning (RL) ==

Title: Contrastive Active Inference

Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt

Abstract: Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of the shortcomings in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden in learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare to reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the case of distractors in the environment and that our method is able to generalize goals to variations in the background. Website and code: https://contrastive-aif.github.io/

[Downlink:]http://arxiv.org/abs/2110.10083v4

[Project:]https://contrastive-aif.github.io/
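For intuition about what replacing a likelihood term with a contrastive one looks like, here is a NumPy sketch of a generic InfoNCE-style loss between latent-state and observation embeddings. This is the standard contrastive estimator under assumed batch shapes, not the paper's exact objective.

```python
import numpy as np

def info_nce(z_state, z_obs, temperature=0.1):
    """InfoNCE over a batch: matched (state, observation) pairs are positives;
    every other observation in the batch serves as a negative."""
    z_state = z_state / np.linalg.norm(z_state, axis=1, keepdims=True)
    z_obs = z_obs / np.linalg.norm(z_obs, axis=1, keepdims=True)
    logits = z_state @ z_obs.T / temperature       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # cross-entropy on the diagonal

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(32, 64)), rng.normal(size=(32, 64))))
```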


Title: Pgx: Hardware-Accelerated Parallel Game Simulators for Reinforcement Learning

Authors: Sotetsu Koyamada, Shinri Okano, Soichiro Nishimori

Abstract: We propose Pgx, a suite of board game reinforcement learning (RL) environments written in JAX and optimized for GPU/TPU accelerators. By leveraging JAX's auto-vectorization and parallelization over accelerators, Pgx can efficiently scale to thousands of simultaneous simulations over accelerators. In our experiments on a DGX-A100 workstation, we discovered that Pgx can simulate RL environments 10-100x faster than existing implementations available in Python. Pgx includes RL environments commonly used as benchmarks in RL research, such as backgammon, chess, shogi, and Go. Additionally, Pgx offers miniature game sets and baseline models to facilitate rapid research cycles. We demonstrate the efficient training of the Gumbel AlphaZero algorithm with Pgx environments. Overall, Pgx provides high-performance environment simulators for researchers to accelerate their RL experiments. Pgx is available at http://github.com/sotetsuk/pgx.

[Downlink:]http://arxiv.org/abs/2303.17503v4

[GitHub:]http://github.com/sotetsuk/pgx
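The README-style usage pattern looks roughly like the sketch below: vmap the environment's init and step over a batch of games, then jit the result. The env ID and field names (legal_action_mask, terminated, truncated) follow the project's documentation as I recall it and may differ across versions; the policy here is uniform-random over legal moves.

```python
import jax
import jax.numpy as jnp
import pgx  # assumed installed: pip install pgx

env = pgx.make("go_9x9")
init = jax.jit(jax.vmap(env.init))
step = jax.jit(jax.vmap(env.step))

batch_size = 1024
key = jax.random.PRNGKey(42)
key, subkey = jax.random.split(key)
state = init(jax.random.split(subkey, batch_size))  # 1024 parallel games

while not (state.terminated | state.truncated).all():
    key, subkey = jax.random.split(key)
    # Uniform-random play restricted to legal moves (a learned policy would go here).
    logits = jnp.where(state.legal_action_mask, 0.0, -jnp.inf)
    action = jax.random.categorical(subkey, logits, axis=-1)
    state = step(state, action)
```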


Title: Efficient Reinforcement Learning with Decoupling Exploration and Utilization

Authors: Jingpu Yang, Qirui Zhao, Helin Wang

Abstract: Deep neural network (DNN) generalization is limited by the over-reliance of current offline reinforcement learning techniques on conservative processing of existing datasets. This method frequently results in algorithms that settle for suboptimal solutions that only adjust to a certain dataset. Similarly, in online reinforcement learning, the previously imposed punitive pessimism also deprives the model of its exploratory potential. Our research proposes a novel framework, Optimistic and Pessimistic Actor Reinforcement Learning (OPARL). OPARL employs a unique dual-actor approach: an optimistic actor dedicated to exploration and a pessimistic actor focused on utilization, thereby effectively differentiating between exploration and utilization strategies. This unique combination in reinforcement learning methods fosters a more balanced and efficient approach. It enables the optimization of policies that focus on actions yielding high rewards through pessimistic utilization strategies, while also ensuring extensive state coverage via optimistic exploration. Experiments and theoretical analysis demonstrate that OPARL improves agents' capacities for application and exploration. In most tasks of the DMControl benchmark and MuJoCo environments, OPARL performed better than state-of-the-art methods. Our code is released at https://github.com/yydsok/OPARL

[Downlink:]http://arxiv.org/abs/2312.15965v2

[GitHub:]https://github.com/yydsok/OPARL
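The dual-actor split can be illustrated with a toy value ensemble: score candidate actions by mean plus standard deviation when exploring and mean minus standard deviation when exploiting. This is a schematic of the optimism/pessimism idea with stand-in components, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_ensemble(action):
    """Stand-in for an ensemble of Q estimates of one state-action pair."""
    return rng.normal(loc=float(action), scale=1.0, size=4)

candidate_actions = np.linspace(-1.0, 1.0, 11)

def select_action(optimistic):
    scores = []
    for a in candidate_actions:
        qs = q_ensemble(a)
        # Optimistic actor explores (mean + std); pessimistic actor exploits (mean - std).
        scores.append(qs.mean() + qs.std() if optimistic else qs.mean() - qs.std())
    return candidate_actions[int(np.argmax(scores))]

print("explore:", select_action(optimistic=True))
print("exploit:", select_action(optimistic=False))
```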


Title: CoVO-MPC: Theoretical Analysis of Sampling-based MPC and Optimal Covariance Design

Authors: Zeji Yi, Chaoyi Pan, Guanqi He

Abstract: Sampling-based Model Predictive Control (MPC) has been a practical and effective approach in many domains, notably model-based reinforcement learning, thanks to its flexibility and parallelizability. Despite its appealing empirical performance, the theoretical understanding, particularly in terms of convergence analysis and hyperparameter tuning, remains absent. In this paper, we characterize the convergence property of a widely used sampling-based MPC method, Model Predictive Path Integral Control (MPPI). We show that MPPI enjoys at least linear convergence rates when the optimization is quadratic, which covers time-varying LQR systems. We then extend to more general nonlinear systems. Our theoretical analysis directly leads to a novel sampling-based MPC algorithm, CoVariance-Optimal MPC (CoVO-MPC), that optimally schedules the sampling covariance to optimize the convergence rate. Empirically, CoVO-MPC significantly outperforms standard MPPI by 43-54% in both simulations and real-world quadrotor agile control tasks. Videos and appendices are available at https://lecar-lab.github.io/CoVO-MPC/.

[Downlink:]http://arxiv.org/abs/2401.07369v1

[Project:]https://lecar-lab.github.io/CoVO-MPC/
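As background for the analysis, a bare-bones MPPI iteration is easy to state: sample action sequences around the current mean, weight them by the exponentiated return, and re-fit the mean. The NumPy sketch below shows that loop on a toy quadratic objective; CoVO-MPC's contribution, scheduling the sampling covariance optimally, is only flagged in a comment.

```python
import numpy as np

def mppi_step(mean, cov, rollout_return, n_samples=256, temperature=1.0, rng=None):
    """One MPPI update: sample action sequences around the current mean,
    weight by exp(return / temperature), and re-fit the mean. CoVO-MPC's
    addition would be to schedule `cov` optimally instead of fixing it."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_steps, act_dim = mean.shape
    noise = rng.multivariate_normal(np.zeros(act_dim), cov, size=(n_samples, n_steps))
    samples = mean[None] + noise                       # (n_samples, H, act_dim)
    returns = np.array([rollout_return(u) for u in samples])
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return (weights[:, None, None] * samples).sum(axis=0)

# Toy quadratic objective: the optimal action sequence is all zeros.
rollout = lambda u: -np.sum(u ** 2)
mean = np.ones((10, 2))
for _ in range(20):
    mean = mppi_step(mean, 0.1 * np.eye(2), rollout)
print(np.abs(mean).max())   # shrinks toward zero
```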


Title: Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Authors: Kun Lei, Zhengmao He, Chenhao Lu

Abstract: Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-O4, which utilizes an on-policy objective for both offline and online learning. Owing to the alignment of objectives in the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-O4 leverages diverse ensemble policies to address the mismatch issues between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-O4 can achieve multi-step policy improvement safely. We demonstrate that by employing the method above, the fusion of these two paradigms can yield superior offline initialization as well as stable and rapid online fine-tuning capabilities. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations using numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline and offline-to-online fine-tuning learning. Our website: https://lei-kun.github.io/uni-o4/.

[Downlink:]http://arxiv.org/abs/2311.03351v3

[Project:]https://lei-kun.github.io/uni-o4/
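The shared on-policy objective referred to here belongs to the standard clipped-surrogate (PPO-style) family. As a reference point, a minimal NumPy version of that surrogate is sketched below; it is the generic estimator, not Uni-O4's exact loss or its offline policy evaluation machinery.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO-style clipped on-policy objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(np.minimum(ratio * advantages,
                              np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages))

rng = np.random.default_rng(0)
logp_old = rng.normal(size=128)                    # behavior policy log-probs
logp_new = logp_old + 0.05 * rng.normal(size=128)  # current policy log-probs
advantages = rng.normal(size=128)
print(clipped_surrogate(logp_new, logp_old, advantages))
```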


Title: Learning Interactive Real-World Simulators

Authors: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

Abstract: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls such as "move by x, y" from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.

[Downlink:]http://arxiv.org/abs/2310.06114v2

[Project:]https://universal-simulator.github.io


== Open Vocabulary Detection ==

Title: SAMF: Small-Area-Aware Multi-focus Image Fusion for Object Detection

Authors: Xilai Li, Xiaosong Li, Haishu Tan

Abstract: Existing multi-focus image fusion (MFIF) methods often fail to preserve the uncertain transition region and detect small focus areas within large defocused regions accurately. To address this issue, this study proposes a new small-area-aware MFIF algorithm for enhancing object detection capability. First, we enhance the pixel attributes within the small focus and boundary regions, which are subsequently combined with visual saliency detection to obtain the pre-fusion results used to discriminate the distribution of focused pixels. To accurately ensure pixel focus, we consider the source image as a combination of focused, defocused, and uncertain regions and propose a three-region segmentation strategy. Finally, we design an effective pixel selection rule to generate segmentation decision maps and obtain the final fusion results. Experiments demonstrated that the proposed method can accurately detect small and smooth focus areas while improving object detection performance, outperforming existing methods in both subjective and objective evaluations. The source code is available at https://github.com/ixilai/SAMF.

[Downlink:]http://arxiv.org/abs/2401.08357v1

[GitHub:]https://github.com/ixilai/SAMF
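To make the three-region idea concrete, here is a generic NumPy/SciPy sketch that labels each pixel as focused in image A, focused in image B, or uncertain by comparing a simple local focus measure (energy of the Laplacian). The paper's actual focus discrimination and pixel-selection rule are more elaborate; this only illustrates the decomposition.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def focus_map(img, win=7):
    """Local energy of the Laplacian: a simple per-pixel focus measure."""
    return uniform_filter(laplace(img.astype(float)) ** 2, size=win)

def three_region_map(img_a, img_b, margin=0.1):
    """1 = focused in A, 0 = focused in B, -1 = uncertain transition region."""
    fa, fb = focus_map(img_a), focus_map(img_b)
    rel = (fa - fb) / (fa + fb + 1e-8)
    labels = np.full(img_a.shape, -1, dtype=int)
    labels[rel > margin] = 1
    labels[rel < -margin] = 0
    return labels

rng = np.random.default_rng(0)
print(np.unique(three_region_map(rng.random((64, 64)), rng.random((64, 64)))))
```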


Title: Generative Denoise Distillation: Simple Stochastic Noises Induce Efficient Knowledge Transfer for Dense Prediction

Authors: Zhaoge Liu, Xiaohao Xu, Yunkang Cao

Abstract: Knowledge distillation is the process of transferring knowledge from a more powerful large model (teacher) to a simpler counterpart (student). Numerous current approaches involve the student imitating the knowledge of the teacher directly. However, redundancy still exists in the learned representations through these prevalent methods, which tend to learn each spatial location's features indiscriminately. To derive a more compact representation (concept feature) from the teacher, inspired by human cognition, we suggest an innovative method, termed Generative Denoise Distillation (GDD), where stochastic noises are added to the concept feature of the student to embed them into the generated instance feature from a shallow network. Then, the generated instance feature is aligned with the knowledge of the instance from the teacher. We extensively experiment with object detection, instance segmentation, and semantic segmentation to demonstrate the versatility and effectiveness of our method. Notably, GDD achieves new state-of-the-art performance in the tasks mentioned above. We have achieved substantial improvements in semantic segmentation by enhancing PspNet and DeepLabV3, both of which are based on ResNet-18, resulting in mIoU scores of 74.67 and 77.69, respectively, surpassing their previous scores of 69.85 and 73.20 on the Cityscapes dataset of 20 categories. The source code of GDD is available at https://github.com/ZhgLiu/GDD.

[Downlink:]http://arxiv.org/abs/2401.08332v1

[GitHub:]https://github.com/ZhgLiu/GDD
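A schematic of the distillation step described above, written in PyTorch with assumed shapes and a stand-in shallow generator: noise is added to the student's feature, a shallow network generates an instance feature from it, and an MSE term aligns that feature with the teacher's. This illustrates the mechanism, not the released code.

```python
import torch
import torch.nn as nn

B, C, H, W = 8, 256, 32, 32
student_feat = torch.randn(B, C, H, W)   # student concept feature (assumed shape)
teacher_feat = torch.randn(B, C, H, W)   # teacher instance feature (assumed shape)

generator = nn.Sequential(               # stand-in shallow generation network
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(), nn.Conv2d(C, C, 3, padding=1)
)

noisy = student_feat + torch.randn_like(student_feat)   # add stochastic noise
generated = generator(noisy)                            # generated instance feature
loss = nn.functional.mse_loss(generated, teacher_feat)  # align with the teacher
loss.backward()
print(float(loss))
```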


Title: UV-SAM: Adapting Segment Anything Model for Urban Village Identification

Authors: Xin Zhang, Yu Liu, Yuming Lin

Abstract: Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments heavily depend on field survey methods to monitor the urban villages, which however are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. To accurately identify urban village boundaries from satellite images, we harness the power of the vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on the vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM.

[Downlink:]http://arxiv.org/abs/2401.08083v1

[GitHub:]https://github.com/tsinghua-fib-lab/UV-SAM
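The prompt-then-refine pipeline can be sketched with the public segment-anything API: a coarse segmentation model (not shown) proposes a bounding box for a candidate urban village, which is passed to SamPredictor for fine-grained boundary extraction. The checkpoint path, box values, and blank image below are placeholders, and UV-SAM's additional mask and image-representation prompts are omitted.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Requires the SAM ViT-B checkpoint to be downloaded separately.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in satellite tile
predictor.set_image(image)
coarse_box = np.array([100, 120, 380, 400])      # x0, y0, x1, y1 from a coarse model
masks, scores, _ = predictor.predict(box=coarse_box, multimask_output=False)
print(masks.shape, scores)                       # refined urban-village boundary mask
```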


Title: Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning

Authors: Wenjun Qiu, David Lie, Lisa Austin

Abstract: A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator agreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving class and data category balance in the data set. The combination of these techniques allows Calpric to produce models that are accurate over a wider range of data categories, and provide more detailed, fine-grain labels than previous work. Our crowdsourcing process enables Calpric to attain reliable labeled data at a cost of roughly $0.92-$1.71 per labeled text segment. Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 data categories with balanced positive and negative samples.

[Downlink:]http://arxiv.org/abs/2401.08038v1

[Project:]https://www.usenix.org/conference/usenixsecurity23/presentation/qiu
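The active-learning component is, at heart, conventional uncertainty sampling: spend the crowdsourcing budget on the segments the current model is least sure about. A minimal sketch under assumed model scores:

```python
import numpy as np

def uncertainty_sampling(probs, budget):
    """Pick the `budget` unlabeled segments whose predicted probability is
    closest to 0.5, i.e., where the current model is least certain."""
    return np.argsort(np.abs(probs - 0.5))[:budget]

rng = np.random.default_rng(0)
pool_probs = rng.uniform(size=1000)        # model scores for unlabeled segments
to_label = uncertainty_sampling(pool_probs, budget=20)
# These indices would be sent to crowdworkers; the returned labels are added
# to the training set and the model is retrained before the next round.
print(to_label[:5])
```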


Title: Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models

Authors: Dan Jacobellis, Daniel Cummings, Neeraja J. Yadwadkar

Abstract: In the field of neural data compression, the prevailing focus has been on optimizing algorithms for either classical distortion metrics, such as PSNR or SSIM, or human perceptual quality. With increasing amounts of data consumed by machines rather than humans, a new paradigm of machine-oriented compression – which prioritizes the retention of features salient for machine perception over traditional human-centric criteria – has emerged, creating several new challenges to the development, evaluation, and deployment of systems utilizing lossy compression. In particular, it is unclear how different approaches to lossy compression will affect the performance of downstream machine perception tasks. To address this under-explored area, we evaluate various perception models – including image classification, image segmentation, speech recognition, and music source separation – under severe lossy compression. We utilize several popular codecs spanning conventional, neural, and generative compression architectures. Our results indicate three key findings: (1) using generative compression, it is feasible to leverage highly compressed data while incurring a negligible impact on machine perceptual quality; (2) machine perceptual quality correlates strongly with deep similarity metrics, indicating a crucial role of these metrics in the development of machine-oriented codecs; and (3) using lossy compressed datasets (e.g., ImageNet) for pre-training can lead to counter-intuitive scenarios where lossy compression increases machine perceptual quality rather than degrading it. To encourage engagement on this growing area of research, our code and experiments are available at: https://github.com/danjacobellis/MPQ.

[Downlink:]http://arxiv.org/abs/2401.07957v1

[GitHub:]https://github.com/danjacobellis/MPQ


Title: SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Authors: Zhengze Xu, Dongyue Wu, Changqian Yu

Abstract: Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet utilizes a transformer as the training-only semantic branch, considering its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block CFBlock and the semantic information alignment module, SCTNet can capture rich semantic information from the transformer branch in training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and models are available at https://github.com/xzz777/SCTNet

[Downlink:]http://arxiv.org/abs/2312.17071v2

[GitHub:]https://github.com/xzz777/SCTNet
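The training-only alignment can be pictured as an auxiliary loss that pulls the CNN's features toward the transformer branch's features through a projection, with the transformer discarded at inference. The PyTorch sketch below uses assumed shapes and a 1x1 convolution as the projection; the paper's CFBlock and alignment module are more specific designs.

```python
import torch
import torch.nn as nn

B, C, H, W = 4, 128, 64, 64
cnn_feat = torch.randn(B, C, H, W, requires_grad=True)  # from the deployed CNN
with torch.no_grad():
    transformer_feat = torch.randn(B, C, H, W)          # training-only semantic branch

proj = nn.Conv2d(C, C, kernel_size=1)                   # align feature spaces
align_loss = nn.functional.mse_loss(proj(cnn_feat), transformer_feat)
total_loss = align_loss                                 # + the usual segmentation loss
total_loss.backward()                                   # transformer branch is dropped at inference
print(float(align_loss))
```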

