[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果觉得对你有所帮助,请关注我,每日准时为你推送最新论文。


分类:

== LLM ==

标题: RelayAttention for Efficient Large Language Model Serving with Long System Prompts

作者: Lei Zhu, Xinjiang Wang, Wayne Zhang

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.14808v2

GitHub: https://github.com/rayleizhu/vllm-ra

中文摘要: 实际的大型语言模型(LLM)服务往往涉及很长的系统提示,其中给出任务的指令、示例和知识文档,并在大量请求间复用。然而,由于生成下一个令牌的成本随序列长度增长,长系统提示会造成吞吐量/延迟瓶颈。本文旨在提高涉及长系统提示的LLM服务的效率。我们的关键观察是:在现有的因果注意力计算算法中,处理这些系统提示需要大量冗余的内存访问。具体来说,对于批处理请求,系统提示缓存的隐藏状态(即键值对)会从片外DRAM多次传输到片上SRAM,每次对应一个单独的请求。为了消除这种冗余,我们提出了RelayAttention,一种注意力算法,允许对一批输入令牌只从DRAM读取一次这些隐藏状态。RelayAttention相当于一顿免费的午餐:它基于因果注意力的数学重构,在保持生成质量的同时无需重新训练模型。代码可从\url{https://github.com/rayleizhu/vllm-ra}获得。

摘要: Practical large language model (LLM) services may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across numerous requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM multiple times, each corresponding to an individual request. To eliminate such a redundancy, we propose RelayAttention, an attention algorithm that allows reading these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains the generation quality while requiring no model retraining, as it is based on a mathematical reformulation of causal attention. Code is available at \url{https://github.com/rayleizhu/vllm-ra}.
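
下面给出一段极简的示意代码(非论文官方实现),用"两段注意力按 softmax 归一化常数精确合并"的方式说明 RelayAttention 数学重构的基本思路:共享系统提示的 KV 只保存、只读取一份,对整个 batch 计算一次,再与各请求自己的注意力结果合并。其中的张量形状、函数名均为本文作者的假设。

```python
# 示意性代码(非官方实现):用"两段注意力合并"说明 RelayAttention 的数学重构思路。
# 假设:batch 内所有请求共享同一段系统提示的 KV(sys_k/sys_v 只存一份,无 batch 维),
#       各请求自己的上下文 KV 为 req_k/req_v。
import torch

def merged_attention(q, sys_k, sys_v, req_k, req_v, scale):
    # q:       [B, H, 1, D]   当前解码步的查询
    # sys_k/v: [H, Ls, D]     共享系统提示的 KV,对整个 batch 只读取一次
    # req_k/v: [B, H, Lr, D]  各请求自己的 KV
    # 1) 对共享系统提示做一次注意力
    sys_scores = torch.einsum('bhqd,hkd->bhqk', q, sys_k) * scale
    sys_lse = torch.logsumexp(sys_scores, dim=-1, keepdim=True)
    sys_out = torch.softmax(sys_scores, dim=-1) @ sys_v   # 广播到整个 batch

    # 2) 对各请求自己的上下文做注意力
    req_scores = torch.einsum('bhqd,bhkd->bhqk', q, req_k) * scale
    req_lse = torch.logsumexp(req_scores, dim=-1, keepdim=True)
    req_out = torch.softmax(req_scores, dim=-1) @ req_v

    # 3) 按 softmax 归一化常数把两段结果精确合并
    lse = torch.logaddexp(sys_lse, req_lse)
    return torch.exp(sys_lse - lse) * sys_out + torch.exp(req_lse - lse) * req_out
```

这种合并与对拼接后的完整 KV 做一次注意力在数学上等价,因此不会影响生成质量,只是把共享前缀的 KV 读取次数从"每请求一次"降为"每批一次"。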


标题: Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

作者: Shuhai Zhang, Yiliao Song, Jiahao Yang

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.16041v2

GitHub: https://github.com/ZSHsh98/MMD-MP

中文摘要: ChatGPT等大型语言模型(LLM)在生成类人文本方面表现卓越。然而,机器生成文本(MGT)可能带来严重风险,如抄袭、误导信息或幻觉问题,因此在许多场景下检测MGT十分紧迫和重要。不幸的是,由于LLM性能出色,MGT与人类书写文本之间的分布差异往往非常细微,难以区分。本文尝试利用最大均值差异(maximum mean discrepancy, MMD)来解决这一问题,因为MMD能够很好地刻画分布差异。然而,由于MGT可能来自多种LLM、包含多个文本群体,直接用多样的MGT以MMD训练检测器会使MMD的方差显著增大,严重损害MMD度量两组样本差异的能力。为此,我们提出了一种新的面向多种群的MMD优化方法,称为MMD-MP,它可以避免方差增大,从而更稳定地度量分布差异。基于MMD-MP,我们分别开发了基于段落和基于句子的两种检测方法。在GPT2、ChatGPT等多种LLM上的大量实验表明,MMD-MP具有优异的检测性能。源代码可在\url{https://github.com/ZSHsh98/MMD-MP}获得。

摘要: Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. In this paper, we seek to exploit \textit{maximum mean discrepancy} (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain \textit{multiple text populations} due to various LLMs. This will severely impair MMD’s ability to measure the difference between two samples. To tackle this, we propose a novel \textit{multi-population} aware optimization method for MMD called MMD-MP, which can \textit{avoid variance increases} and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at \url{https://github.com/ZSHsh98/MMD-MP}.
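
作为补充,下面是一段最小化的示意代码(非论文实现),展示如何用高斯核对两组文本嵌入计算 MMD^2 的无偏估计,即本文用来区分人类文本与机器生成文本的基本统计量;嵌入来源、维度与核带宽均为本文作者的假设。

```python
# 示意性代码(非官方实现):高斯核下 MMD^2 的无偏估计
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # x: [n, d], y: [m, d]
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    # 无偏 U 统计量:去掉 K(x_i, x_i) 的对角项
    n, m = x.size(0), y.size(0)
    kxx, kyy, kxy = gaussian_kernel(x, x, sigma), gaussian_kernel(y, y, sigma), gaussian_kernel(x, y, sigma)
    term_xx = (kxx.sum() - kxx.diag().sum()) / (n * (n - 1))
    term_yy = (kyy.sum() - kyy.diag().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2 * kxy.mean()

# 用法示例:human_emb / machine_emb 为任意文本编码器得到的句向量(此处用随机数占位)
human_emb, machine_emb = torch.randn(64, 768), torch.randn(64, 768)
print(mmd2_unbiased(human_emb, machine_emb).item())
```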


标题: Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method

作者: Tian Xia, Zhiwei He, Tong Ren

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.15813v2

GitHub: https://github.com/TianXiaSJTU/AmazonPriceHistory

中文摘要: 讨价还价是人类谈判中重要而独特的一环。随着LLM驱动的智能体学会像真人一样谈判和行动,如何评估智能体的讨价还价能力仍是一个悬而未决的问题。我们首次将讨价还价任务形式化地描述为一个非对称不完全信息博弈,定义了买方和卖方在多轮讨价还价过程中的收益,从而可以定量评估智能体在讨价还价任务中的表现。我们收集了一个真实商品价格数据集AmazonHistoryPrice,并评估了多种LLM智能体的讨价还价能力。我们发现扮演买方比扮演卖方困难得多,而且增大模型规模并不能有效提升买方的表现。为应对这一挑战,我们提出了一种称为OG-Narrator的新方法,它集成了一个确定性报价生成器(Offer Generator)来控制买方报价的价格范围,以及一个LLM叙述器(Narrator)为生成的报价生成自然语言语句。实验结果表明,OG-Narrator将买方的成交率从26.67%提高到88.88%,并在所有基线上带来约十倍的利润提升,即使对于未经对齐的模型也是如此。

摘要: Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents’ bargaining abilities remains an open problem. For the first time, we formally described the Bargaining task as an asymmetric incomplete information game, defining the gains of the Buyer and Seller in multiple bargaining processes. It allows us to quantitatively assess an agent’s performance in the Bargain task. We collected a real product price dataset, AmazonHistoryPrice, and conducted evaluations of various LLM agents’ bargaining abilities. We find that playing a Buyer is much harder than a Seller, and increasing model size can not effectively improve the Buyer’s performance. To address the challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of Buyer’s offers, and an LLM Narrator to create natural language sentences for generated offers. Experimental results show that OG-Narrator improves the buyer’s deal rates from 26.67% to 88.88% and brings a ten times of multiplication of profits on all baselines, even a model that has not been aligned.
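
下面用一段示意代码(非论文官方实现)说明 OG-Narrator 中"报价由确定性生成器给出、LLM 只负责措辞"的分工方式;其中的报价策略与提示词模板均为本文作者的假设,仅用于说明思路。

```python
# 示意性代码(非官方实现):确定性报价生成器 + LLM 叙述器的分工示意

def offer_generator(budget: float, list_price: float, turn: int, max_turns: int = 8) -> float:
    """第 turn 轮的买方报价:从标价的 5 折起步,逐轮向预算靠拢,且永不超过预算(策略为假设)。"""
    start = min(budget, 0.5 * list_price)
    frac = min(turn / max_turns, 1.0)
    return round(start + (budget - start) * frac, 2)

def narrator_prompt(offer: float, seller_msg: str) -> str:
    """交给 LLM 叙述器的提示词:只围绕给定数值生成措辞,不允许改动报价本身(模板为假设)。"""
    return (
        "You are a buyer in a bargaining dialogue. "
        f"The seller just said: \"{seller_msg}\"\n"
        f"Reply politely and insist on offering exactly ${offer}. Do not mention any other price."
    )

# 用法示例:第 3 轮,预算 80,标价 120
offer = offer_generator(budget=80.0, list_price=120.0, turn=3)
print(offer, narrator_prompt(offer, "I can do $110, final offer."))
```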


标题: Vaccine: Perturbation-aware Alignment for Large Language Model

作者: Tiansheng Huang, Sihao Hu, Ling Liu

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.01109v3

GitHub: https://github.com/git-disl/Vaccine

中文摘要: 微调即服务(finetuning-as-a-service)这一新范式为大型语言模型(LLM)引入了新的攻击面:用户上传的少量有害数据就能轻易诱使微调产生一个对齐被破坏的模型。我们通过实证分析发现了一种有害嵌入漂移(harmful embedding drift)现象,揭示了对齐破坏效应的可能成因。受此启发,我们提出了Vaccine,一种扰动感知的对齐技术,用于缓解用户微调带来的安全风险。Vaccine的核心思想是在对齐阶段逐步向隐藏嵌入添加精心构造的扰动,从而得到不变的隐藏嵌入,使其能够在微调阶段抵御来自未经清洗的用户数据的有害扰动。我们在主流开源LLM(如Llama2、OPT、Vicuna)上的结果表明,Vaccine能够提升对齐对有害提示所诱发嵌入漂移的鲁棒性,同时保留对良性提示的推理能力。我们的代码可在\url{https://github.com/git-disl/Vaccine}找到。

摘要: The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data uploaded by users can easily trick the finetuning to produce an alignment-broken model. We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, showing a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of users finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. Our results on open source mainstream LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. Our code is available at \url{https://github.com/git-disl/Vaccine}.
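
下面给出一段示意代码(非论文官方实现),说明"扰动感知对齐"这一思路的大致形态:对齐训练时先沿损失上升方向给嵌入加一个小扰动,再在扰动后的嵌入上优化,使隐藏嵌入对扰动保持稳定。扰动施加的位置、半径 rho 等细节均为本文作者的假设,与论文具体算法可能不同。

```python
# 示意性代码(非官方实现):扰动感知对齐训练步的思路示意(基于 HuggingFace 因果语言模型接口)
import torch

def perturbation_aware_step(model, batch, optimizer, rho=0.1):
    # 第一次前向/反向:得到损失对输入嵌入的梯度方向
    emb = model.get_input_embeddings()(batch["input_ids"])
    emb.retain_grad()
    loss = model(inputs_embeds=emb, labels=batch["labels"]).loss
    loss.backward()

    # 构造归一化的嵌入扰动(rho 为假设的扰动半径)
    with torch.no_grad():
        g = emb.grad
        delta = rho * g / (g.norm() + 1e-12)

    # 第二次前向:在扰动后的嵌入上计算损失并更新模型参数
    optimizer.zero_grad()
    perturbed_loss = model(inputs_embeds=(emb.detach() + delta),
                           labels=batch["labels"]).loss
    perturbed_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return perturbed_loss.item()
```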


标题: ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation

作者: Jianghao Lin, Rong Shan, Chenxu Zhu

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2308.11131v4

GitHub: https://github.com/LaVieEnRose365/ReLLa

中文摘要: 随着大型语言模型(LLM)在自然语言处理(NLP)领域取得显著突破,LLM增强的推荐系统受到了广泛关注并正被积极探索。在本文中,我们专注于调整和增强纯大型语言模型,使其适用于零样本和少样本推荐任务。首先,我们指出并刻画了推荐领域中LLM的"终身序列行为理解缺失"问题,即LLM无法从长用户行为序列的文本上下文中提取有用信息,即使上下文长度远未达到LLM的上下文上限。为了解决这一问题并提升LLM的推荐性能,我们提出了一个新框架——检索增强大型语言模型(ReLLa),用于零样本和少样本两种设置下的推荐任务。对于零样本推荐,我们执行语义用户行为检索(SUBR)来提升测试样本的数据质量,从而大大降低LLM从用户行为序列中提取关键知识的难度。对于少样本推荐,我们进一步将SUBR用作训练样本的数据增强手段,设计了检索增强指令微调(ReiT)。具体来说,我们构建了一个由原始数据样本及其检索增强版本共同组成的混合训练数据集。我们在三个真实公开数据集上进行了大量实验,证明了ReLLa相比现有基线模型的优越性及其终身序列行为理解能力。需要强调的是,仅用不到10%的训练样本,少样本ReLLa即可超越在完整训练集上训练的传统CTR模型(如DCNv2、DIN、SIM)。代码可从\url{https://github.com/LaVieEnRose365/ReLLa}获得。

摘要: With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and have been actively explored currently. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequence, even if the length of context is far from reaching the context limitation of LLMs. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on three real-world public datasets to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension. To be highlighted, with less than 10% of the training samples, few-shot ReLLa can outperform traditional CTR models that are trained on the entire training set (e.g., DCNv2, DIN, SIM). The code is available at \url{https://github.com/LaVieEnRose365/ReLLa}.
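
下面是一段示意代码(非论文官方实现),展示语义用户行为检索(SUBR)的基本思路:按与目标物品的语义相似度,从用户的长行为序列中挑出 top-K 条最相关行为再拼入提示词。其中的编码器接口 encode 与提示模板均为本文作者的假设。

```python
# 示意性代码(非官方实现):语义用户行为检索(SUBR)的极简示意
import torch

def subr_topk(target_text: str, history_texts: list[str], encode, k: int = 10) -> list[str]:
    """encode: 把字符串列表映射为 [n, d] 向量的任意句向量模型(假设接口)。"""
    target_vec = encode([target_text])            # [1, d]
    hist_vecs = encode(history_texts)             # [n, d]
    sims = torch.nn.functional.cosine_similarity(hist_vecs, target_vec)  # [n]
    idx = sims.topk(min(k, len(history_texts))).indices.tolist()
    return [history_texts[i] for i in sorted(idx)]  # 保持原有时间顺序

def build_prompt(user_id: str, target_text: str, retrieved: list[str]) -> str:
    """把检索到的行为拼成推荐判定提示词(模板为假设)。"""
    lines = "\n".join(f"- {b}" for b in retrieved)
    return (f"User {user_id} has interacted with:\n{lines}\n"
            f"Will the user enjoy {target_text}? Answer Yes or No.")
```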


标题: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

作者: Mantas Mazeika, Long Phan, Xuwang Yin

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.04249v2

Project: https://www.harmbench.org

GitHub: https://github.com/centerforaisafety/HarmBench

摘要: Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.


== visual model ==

标题: Video ReCap: Recursive Captioning of Hour-Long Videos

作者: Md Mohaiminul Islam, Ngan Ho, Xitong Yang

PubTime: 2024-02-28

Downlink: http://arxiv.org/abs/2402.13250v3

Project: https://sites.google.com/view/vidrecap

中文摘要: 大多数视频字幕模型被设计为处理几秒钟的短视频片段,并输出描述低层视觉概念(如对象、场景、原子动作)的文本。然而,大多数真实世界的视频长达数分钟或数小时,并具有跨越不同时间粒度的复杂层次结构。我们提出了Video ReCap,一种递归视频字幕模型,可以处理长度差异极大(从1秒到2小时)的视频输入,并在多个层次上输出视频字幕。该递归视频—语言架构利用了不同视频层次之间的协同作用,能够高效地处理长达一小时的视频。我们采用课程学习训练方案来学习视频的层次结构:先从描述原子动作的片段级字幕开始,再处理段落级描述,最后为长达一小时的视频生成摘要。此外,我们通过为Ego4D补充8,267条人工收集的长时程视频摘要,构建了Ego4D-HCap数据集。我们的递归模型可以灵活地生成不同层次的字幕,同时也适用于其他复杂的视频理解任务,例如EgoSchema上的VideoQA。数据、代码和模型可从以下网址获得:https://sites.google.com/view/vidrecap

摘要: Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
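
下面用一段示意代码(非论文官方实现)概括这种递归分层字幕生成的流程:片段级字幕 → 段落级描述 → 整段视频摘要;caption_model 的接口与分组大小均为本文作者的假设。

```python
# 示意性代码(非官方实现):递归式分层字幕生成流程的示意

def recursive_captions(clips, caption_model, segment_size=20):
    """clips: 按时间排序的短视频片段(或其特征)列表;
    caption_model(inputs, level) -> str 为假设的统一接口。"""
    # 第 1 层:片段级字幕(描述原子动作)
    clip_caps = [caption_model(c, level="clip") for c in clips]

    # 第 2 层:把相邻片段的字幕分组,生成段落级描述
    segment_caps = [
        caption_model(clip_caps[i:i + segment_size], level="segment")
        for i in range(0, len(clip_caps), segment_size)
    ]

    # 第 3 层:基于所有段落级描述,生成整段(可长达数小时)视频的摘要
    video_summary = caption_model(segment_caps, level="video")
    return clip_caps, segment_caps, video_summary
```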


标题: Learning to Generalize towards Unseen Domains via a Content-Aware Style Invariant Model for Disease Detection from Chest X-rays

作者: Mohammad Zunaed, Md. Aynal Haque, Taufiq Hasan

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2302.13991v5

中文摘要: 分布差异导致的性能下降是智能医学影像中的长期挑战,对胸部X光片(CXR)尤其如此。最近的研究表明,与人类视觉系统截然不同,CNN偏向于风格(如无信息量的纹理)而非内容(如形状);而放射科医生倾向于从CXR中学习视觉线索,因此能在多个域上表现良好。受此启发,我们在图像级(SRM-IL)和特征级(SRM-FL)引入新颖的在线风格随机化模块,在保持内容不变的同时生成风格被充分扰动的特征,以获得稳健的跨域性能。以往方法通过在已有数据之间插值或交换风格来构造新风格、模拟未见域,因而在训练期间局限于可用的源域;而SRM-IL从CXR图像的可能取值范围而非训练数据中采样风格统计量,以获得更多样化的增强。此外,SRM-FL使用逐像素可学习参数作为风格嵌入,相比预定义的逐通道均值和标准差能捕获更具代表性的风格特征。我们还对同一CXR的风格扰动版本与未扰动版本的全局语义特征和预测分布施加一致性正则,引导模型关注内容标志以获得准确预测。我们提出的方法在CheXpert和MIMIC-CXR数据集上训练,在未见域测试集(BRAX、VinDr-CXR和NIH Chest X-ray14)上分别取得77.32±0.35、88.38±0.19、82.63±0.13的AUC(%);相比之下,最先进模型在五折交叉验证下分别为75.56±0.80、87.57±0.46、82.07±0.19,我们的结果在胸部疾病分类中具有统计显著性。

摘要: Performance degradation due to distribution discrepancy is a longstanding challenge in intelligent imaging, particularly for chest X-rays (CXRs). Recent studies have demonstrated that CNNs are biased toward styles (e.g., uninformative textures) rather than content (e.g., shape), in stark contrast to the human vision system. Radiologists tend to learn visual cues from CXRs and thus perform well across multiple domains. Motivated by this, we employ the novel on-the-fly style randomization modules at both image (SRM-IL) and feature (SRM-FL) levels to create rich style perturbed features while keeping the content intact for robust cross-domain performance. Previous methods simulate unseen domains by constructing new styles via interpolation or swapping styles from existing data, limiting them to available source domains during training. However, SRM-IL samples the style statistics from the possible value range of a CXR image instead of the training data to achieve more diversified augmentations. Moreover, we utilize pixel-wise learnable parameters in the SRM-FL compared to pre-defined channel-wise mean and standard deviations as style embeddings for capturing more representative style features. Additionally, we leverage consistency regularizations on global semantic features and predictive distributions from with and without style-perturbed versions of the same CXR to tweak the model’s sensitivity toward content markers for accurate predictions. Our proposed method, trained on CheXpert and MIMIC-CXR datasets, achieves $77.32\pm0.35$, $88.38\pm0.19$, and $82.63\pm0.13$ AUCs (%) on the unseen domain test datasets, i.e., BRAX, VinDr-CXR, and NIH chest X-ray14, respectively, compared to $75.56\pm0.80$, $87.57\pm0.46$, and $82.07\pm0.19$ from state-of-the-art models on five-fold cross-validation, with statistically significant results in thoracic disease classification.
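
下面给出图像级风格随机化(SRM-IL 思路)的极简示意代码(非论文官方实现):目标风格统计量直接在图像可能的取值范围内随机采样,而不是取自训练数据中的其他样本;采样范围与实现细节均为本文作者的假设。

```python
# 示意性代码(非官方实现):图像级风格随机化的示意
import torch

def style_randomize(x, mean_range=(0.2, 0.8), std_range=(0.05, 0.3)):
    # x: [B, C, H, W],假设像素已归一化到 [0, 1]
    b, c = x.size(0), x.size(1)
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + 1e-6

    # 在取值范围内随机采样目标风格统计量(与训练集中的其他样本无关)
    new_mu = torch.empty(b, c, 1, 1, device=x.device).uniform_(*mean_range)
    new_sigma = torch.empty(b, c, 1, 1, device=x.device).uniform_(*std_range)

    # 去除原风格、套上随机风格;内容(空间结构)保持不变
    return ((x - mu) / sigma) * new_sigma + new_mu
```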


标题: LIR: A Lightweight Baseline for Image Restoration

作者: Dongqi Fan, Ting Yue, Xin Zhao

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.01368v2

中文摘要: 近年来,基于CNN和Transformer的图像恢复取得了重大进展。然而,图像恢复任务的固有特性经常被忽视:许多工作只关注基本模块的设计,并将大量这样的模块堆叠进模型,导致参数冗余和不必要的计算,从而降低了图像恢复的效率。在本文中,我们提出了一个轻量级的图像恢复基线模型LIR,用于高效地重建图像并去除退化(模糊、雨痕、噪声、雾霾)。首先,LIR通过简单的结构设计,解决了被现代网络忽视的、存在于局部和全局残差连接中的退化问题。其次,为实现轻量化,我们根据图像恢复的固有特性引入了轻量级自适应注意力模块(LAA),它主要由所提出的自适应滤波器和注意力模块组成。LAA能够以计算友好的方式在各种图像恢复场景中自适应地锐化轮廓、去除退化并捕获全局信息。大量实验表明,我们的LIR在某些任务上以更少的参数和计算量取得了与最先进模型相当的性能。此外,值得注意的是,LIR产生的视觉效果优于最先进的网络,也更符合人类审美。

摘要: Recently, there have been significant advancements in Image Restoration based on CNN and transformer. However, the inherent characteristics of the Image Restoration task are often overlooked. Many works, instead, only focus on the basic block design and stack numerous such blocks to the model, leading to parameters redundant and computations unnecessary. Thus, the efficiency of the image restoration is hindered. In this paper, we propose a Lightweight Baseline for Image Restoration called LIR to efficiently reconstruct the image and remove degradations (blur, rain, noise, haze). First of all, LIR addresses the degradations existing in the local and global residual connections that are ignored by modern networks, through a simple structural design. Then, to achieve lightweight, a Lightweight Adaptive Attention (LAA) Block is introduced depending on the inherent characteristics of the Image Restoration, which is mainly composed of proposed Adaptive Filters and Attention Blocks. LAA is capable of adaptively sharpening contours, removing degradation, and capturing global information in various Image Restoration scenes in a computation-friendly manner. Extensive experiments demonstrate that our LIR achieves comparable performance to state-of-the-art models with fewer parameters and computations in certain tasks. In addition, it is worth noting that our LIR produces better visual results than state-of-the-art networks that are more in line with the human aesthetic.


标题: Black-box Adversarial Attacks Against Image Quality Assessment Models

作者: Yu Ran, Ao-Xiang Zhang, Mingjie Li

PubTime: 2024-02-28

Downlink: http://arxiv.org/abs/2402.17533v2

中文摘要: 无参考图像质量评估(NR-IQA)的目标是预测与主观评价一致的图像感知质量。要将NR-IQA模型付诸实践,就有必要研究其潜在漏洞,以便改进模型。本文首次尝试探索针对NR-IQA模型的黑盒对抗攻击。具体来说,我们首先将攻击问题表述为:在限制扰动图像失真以保持视觉质量的前提下,最大化原始图像与扰动图像的估计质量分数之间的偏差。在此表述下,我们设计了一个双向损失函数,引导对抗样本的估计质量分数向相反方向偏移并达到最大偏差。在此基础上,我们最终开发了一种针对NR-IQA模型的高效且有效的黑盒攻击方法。大量实验表明,所有被评估的NR-IQA模型都容易受到所提攻击方法的攻击,且所产生的扰动不可迁移,这使其可用于研究不同IQA模型各自的特性。

摘要: The goal of No-Reference Image Quality Assessment (NR-IQA) is to predict the perceptual quality of an image in line with its subjective evaluation. To put the NR-IQA models into practice, it is essential to study their potential loopholes for model refinement. This paper makes the first attempt to explore the black-box adversarial attacks on NR-IQA models. Specifically, we first formulate the attack problem as maximizing the deviation between the estimated quality scores of original and perturbed images, while restricting the perturbed image distortions for visual quality preservation. Under such formulation, we then design a Bi-directional loss function to mislead the estimated quality scores of adversarial examples towards an opposite direction with maximum deviation. On this basis, we finally develop an efficient and effective black-box attack method against NR-IQA models. Extensive experiments reveal that all the evaluated NR-IQA models are vulnerable to the proposed attack method. And the generated perturbations are not transferable, enabling them to serve the investigation of specialities of disparate IQA models.
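
下面是一段示意代码(非论文官方实现),用最朴素的"只依赖查询的随机搜索"说明该黑盒攻击的优化目标:在 L∞ 失真约束内最大化扰动前后质量分数的偏差。论文实际使用的优化器与双向损失细节此处未复现,score_fn 接口为本文作者的假设。

```python
# 示意性代码(非官方实现):黑盒 NR-IQA 攻击目标的随机搜索示意
import torch

def blackbox_attack(x, score_fn, eps=4 / 255, steps=200, sigma=1 / 255):
    """x: [1, C, H, W] 原图;score_fn(img) -> float 为只能查询的 NR-IQA 模型(假设接口)。"""
    base = score_fn(x)
    delta = torch.zeros_like(x)
    best_dev = 0.0
    for _ in range(steps):
        # 在当前扰动附近随机提议一个新扰动,并裁剪到 L_inf 球与合法像素范围内
        cand = (delta + sigma * torch.randn_like(x)).clamp(-eps, eps)
        cand_img = (x + cand).clamp(0, 1)
        dev = abs(score_fn(cand_img) - base)  # 质量分数偏差(攻击目标)
        if dev > best_dev:                    # 只接受使偏差增大的提议
            best_dev, delta = dev, cand
    return (x + delta).clamp(0, 1), best_dev
```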


== diffusion policy@diffusion formulation@diffusion model ==

标题: BS-Diff: Effective Bone Suppression Using Conditional Diffusion Models from Chest X-Ray Images

作者: Zhanghao Chen, Yifei Sun, Wenjian Qin

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2311.15328v3

GitHub: https://github.com/Benny0323/BS-Diff

中文摘要: 胸部X光片(CXR)常被用作肺部筛查的低剂量检查手段。然而,由于约75%的肺部区域与骨骼重叠,CXR的效用受到一定限制,这进而妨碍了疾病的检测与诊断。作为补救措施,骨抑制技术被引入。目前临床使用的双能减影成像技术需要昂贵的设备,并使受试者暴露于较高辐射。为规避这些问题,已有研究提出基于深度学习的图像生成算法,但现有方法在生成高质量图像和捕捉纹理细节(尤其是肺部血管)方面仍有不足。为此,本文提出了一种新的骨抑制框架BS-Diff,它包括一个采用U-Net架构的条件扩散模型,以及一个结合自编码器的简单增强模块。我们提出的网络不仅能生成具有高骨抑制率的软组织图像,还能捕捉精细的图像细节。此外,我们整理了2010年以来规模最大的数据集,包含由附属医院收集的120名患者的高清、高分辨率配对CXR与软组织图像。大量实验、对比分析、消融研究和临床评估表明,所提出的BS-Diff在多个指标上优于多种骨抑制模型。我们的代码可在https://github.com/Benny0323/BS-Diff获取。

摘要: Chest X-rays (CXRs) are commonly utilized as a low-dose modality for lung screening. Nonetheless, the efficacy of CXRs is somewhat impeded, given that approximately 75% of the lung area overlaps with bone, which in turn hampers the detection and diagnosis of diseases. As a remedial measure, bone suppression techniques have been introduced. The current dual-energy subtraction imaging technique in the clinic requires costly equipment and subjects being exposed to high radiation. To circumvent these issues, deep learning-based image generation algorithms have been proposed. However, existing methods fall short in terms of producing high-quality images and capturing texture details, particularly with pulmonary vessels. To address these issues, this paper proposes a new bone suppression framework, termed BS-Diff, that comprises a conditional diffusion model equipped with a U-Net architecture and a simple enhancement module to incorporate an autoencoder. Our proposed network cannot only generate soft tissue images with a high bone suppression rate but also possesses the capability to capture fine image details. Additionally, we compiled the largest dataset since 2010, including data from 120 patients with high-definition, high-resolution paired CXRs and soft tissue images collected by our affiliated hospital. Extensive experiments, comparative analyses, ablation studies, and clinical evaluations indicate that the proposed BS-Diff outperforms several bone-suppression models across multiple metrics. Our code can be accessed at https://github.com/Benny0323/BS-Diff.


标题: Data-Free Model Stealing Attack Based on Denoising Diffusion Probabilistic Model

作者: Guofeng Gao, Xiaodong Wang, Zhiqiang Wei

PubTime: 2023-08

Downlink: https://ieeexplore.ieee.org/document/10448559/

Journal: 2023 IEEE Smart World Congress (SWC)

中文摘要: 无数据模型窃取(MS)攻击使用合成样本查询目标模型,并训练替代模型拟合目标模型的预测,从而避免对模型开发者所用真实数据集的强依赖。然而,现有的无数据MS攻击方法在为高精度MS攻击生成高质量查询样本方面仍有很大差距。在本文中,我们构造了经DDPM优化的生成器来生成数据,其中设计了一个类残差网络的结构来融合数据以合成查询样本。我们的方法进一步提升了合成查询样本的数量和质量,并有效减少了对目标模型的查询次数。结果表明,与最先进的方法相比,所提出的方法取得了更优的性能。

摘要: Data-free model stealing (MS) attacks use synthetic samples to query a target model and train a substitute model to fit the target model’s predictions, avoiding strong dependence on real datasets used by model developers. However, the existing data-free MS attack methods still have a big gap in generating high-quality query samples for high-precision MS attacks. In this paper, we construct the DDPM-optimized generator to generate data, in which a residual network-like structure is designed to fuse data to synthesize query samples. Our method further improves the quantity and quality of synthetic query samples, and effectively reduces the number of queries to the target model. The results show that the proposed method achieves superior performance compared to state-of-the-art methods.
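
下面用一段示意代码(非论文官方实现)勾勒无数据模型窃取的基本训练循环:生成器合成查询样本、查询黑盒目标模型取得软标签、替代模型拟合这些预测。论文中经 DDPM 优化的生成器与类残差融合结构此处未复现,各模型的接口均为本文作者的假设。

```python
# 示意性代码(非官方实现):无数据模型窃取的单步训练示意
import torch
import torch.nn.functional as F

def steal_step(generator, target_model, substitute, sub_opt, batch=64, z_dim=128):
    # 1) 合成一批查询样本(generator(z) -> 图像,接口为假设)
    z = torch.randn(batch, z_dim)
    queries = generator(z)

    # 2) 查询目标模型,得到其预测分布(只用输出,不访问其参数与训练数据)
    with torch.no_grad():
        target_probs = F.softmax(target_model(queries), dim=-1)

    # 3) 让替代模型拟合目标模型的预测(KL 散度作为拟合损失)
    sub_logits = substitute(queries.detach())
    loss = F.kl_div(F.log_softmax(sub_logits, dim=-1), target_probs, reduction="batchmean")
    sub_opt.zero_grad()
    loss.backward()
    sub_opt.step()
    return loss.item()
```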


标题: CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

作者: Hao-Yang Peng, Jia-Peng Zhang, Meng-Hao Guo

PubTime: 2024-02-28

Downlink: http://arxiv.org/abs/2402.17214v2

中文摘要: 在数字内容创作领域,从单张图像生成高质量的3D角色具有挑战性,尤其是考虑到各种身体姿态的复杂性以及自遮挡和姿态歧义等问题。在本文中,我们提出了CharacterGen,一个用于高效生成3D角色的框架。CharacterGen引入了精简的生成管线以及一个以图像为条件的多视图扩散模型,该模型能在保留输入图像关键属性的同时,把输入姿态有效地规范化为标准姿态,从而应对多样姿态带来的挑战。我们方法的另一个核心组件是基于Transformer的可泛化稀疏视图重建模型,用于从多视图图像构建精细的3D模型。我们还采用纹理反投影策略来生成高质量的纹理贴图。此外,我们整理了一个以多种姿态和视角渲染的动漫角色数据集,用于训练和评估模型。我们的方法经过了定量与定性实验的充分评估,展示了其在生成具有高质量形状与纹理的3D角色方面的能力,可直接用于绑定(rigging)和动画等下游应用。

摘要: In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.


标题: Transparent Image Layer Diffusion using Latent Transparency

作者: Lvmin Zhang, Maneesh Agrawala

PubTime: 2024-02-28

Downlink: http://arxiv.org/abs/2402.17113v2

中文摘要: 我们提出了LayerDiffusion,一种使大规模预训练潜在扩散模型能够生成透明图像的方法。该方法既可以生成单张透明图像,也可以生成多个透明图层。其核心是学习一种"潜在透明度(latent transparency)",将alpha通道的透明度编码进预训练潜在扩散模型的潜在流形中:把新增的透明度调控为一个潜在偏移,对预训练模型原有的潜在分布只做最小改动,从而保持大型扩散模型的生产级质量。这样,任何潜在扩散模型都可以通过在调整后的潜在空间上微调而转变为透明图像生成器。我们使用人在回路(human-in-the-loop)收集方案得到的100万对透明图像层来训练模型。我们展示了潜在透明度可以应用于不同的开源图像生成器,或适配到各种条件控制系统,以实现前景/背景条件图层生成、联合图层生成、图层内容的结构控制等应用。一项用户研究发现,在绝大多数情况下(97%),用户更喜欢我们原生生成的透明内容,而不是以往的临时方案(如先生成再抠图);用户还报告说,我们生成的透明图像在质量上可与Adobe Stock等真实商业透明素材相媲美。

摘要: We present LayerDiffusion, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a “latent transparency” that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.


标题: On the Evaluation of Generative Models in Distributed Learning Tasks

作者: Zixiao Wang, Farzan Farnia, Zhenghao Lin

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2310.11714v3

中文摘要: 深度生成模型(包括生成对抗网络GAN和扩散模型)的评估在文献中已有广泛研究。然而,现有评估方法主要针对训练数据存放在单个客户端的集中式学习问题,而生成模型的许多应用涉及分布式学习设置,例如联邦学习场景,其中训练数据由多个客户端收集并分布在多个客户端之间。本文研究了异构数据分布下分布式学习任务中生成模型的评估。首先,我们关注Fréchet Inception距离(FID),并考虑如下两种跨客户端的FID聚合分数:1)FID-avg,即各客户端单独FID分数的平均值;2)FID-all,即训练好的模型到包含所有客户端数据的合并数据集的FID距离。我们证明,按FID-all与按FID-avg得到的模型排名可能不一致,从而两种聚合分数可能指向不同的最优生成模型。接下来,我们考虑核Inception距离(KID),并类似地定义KID-avg和KID-all两种聚合方式。与FID的情形不同,我们证明KID-all与KID-avg会给出相同的生成模型排名。我们在标准图像数据集和训练方案上进行了多组数值实验,以支持我们关于分布式学习问题中生成模型评估的理论发现。

摘要: The evaluation of deep generative models including generative adversarial networks (GANs) and diffusion models has been extensively studied in the literature. While the existing evaluation methods mainly target a centralized learning problem with training data stored by a single client, many applications of generative models concern distributed learning settings, e.g. the federated learning scenario, where training data are collected by and distributed among several clients. In this paper, we study the evaluation of generative models in distributed learning tasks with heterogeneous data distributions. First, we focus on the Fréchet inception distance (FID) and consider the following FID-based aggregate scores over the clients: 1) FID-avg as the mean of clients’ individual FID scores, 2) FID-all as the FID distance of the trained model to the collective dataset containing all clients’ data. We prove that the model rankings according to the FID-all and FID-avg scores could be inconsistent, which can lead to different optimal generative models according to the two aggregate scores. Next, we consider the kernel inception distance (KID) and similarly define the KID-avg and KID-all aggregations. Unlike the FID case, we prove that KID-all and KID-avg result in the same rankings of generative models. We perform several numerical experiments on standard image datasets and training schemes to support our theoretical findings on the evaluation of generative models in distributed learning problems.
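
按摘要中的定义,两种聚合分数可以写成如下形式(记号为本文作者补充:设共有 $n$ 个客户端,第 $i$ 个客户端的数据集为 $D_i$,生成模型为 $G$):

$$\mathrm{FID\text{-}avg}(G)=\frac{1}{n}\sum_{i=1}^{n}\mathrm{FID}(G, D_i),\qquad \mathrm{FID\text{-}all}(G)=\mathrm{FID}\Big(G,\ \bigcup_{i=1}^{n} D_i\Big).$$

KID-avg 与 KID-all 的定义同理,将 FID 换为 KID 即可。论文的结论是:FID-all 与 FID-avg 可能给出不一致的模型排名,而 KID-all 与 KID-avg 总是给出一致的排名。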


== VLN ==

标题: Active propulsion noise shaping for multi-rotor aircraft localization

作者: Gabriele Serussi, Tamir Shor, Tom Hirshberg

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.17289v2

中文摘要: 多旋翼空中自主飞行器(MAV)主要依靠视觉进行导航。然而,视觉定位与里程计技术在光照不足或阳光直射时表现不佳,且视场有限、易受遮挡。在许多场景下,声学感知可以作为视觉的补充甚至替代手段,并且还具有系统成本和能耗更低的额外优势,这对微型飞行器尤为重要。本文提出主动控制并塑造旋翼产生的推进噪声,使其服务于定位任务,而不是把它当作有害干扰。我们提出了一种用于已知环境中基于自噪声定位的神经网络架构,并表明将其与时变转子相位调制的学习联合训练,可以实现准确而稳健的定位。我们在计算开销可接受的二维声学环境MAV转子噪声仿真中评估了所提方法,该仿真与转子压力场的真实录音相拟合。

摘要: Multi-rotor aerial autonomous vehicles (MAVs) primarily rely on vision for navigation purposes. However, visual localization and odometry techniques suffer from poor performance in low or direct sunlight, a limited field of view, and vulnerability to occlusions. Acoustic sensing can serve as a complementary or even alternative modality for vision in many situations, and it also has the added benefits of lower system cost and energy footprint, which is especially important for micro aircraft. This paper proposes actively controlling and shaping the aircraft propulsion noise generated by the rotors to benefit localization tasks, rather than considering it a harmful nuisance. We present a neural network architecture for selfnoise-based localization in a known environment. We show that training it simultaneously with learning time-varying rotor phase modulation achieves accurate and robust localization. The proposed methods are evaluated using a computationally affordable simulation of MAV rotor noise in 2D acoustic environments that is fitted to real recordings of rotor pressure fields.


