[晓理紫] Daily Paper Digest (with summaries, source code or project links): Large Models and Diffusion Models

晓理紫

Category column: Latest Papers and Conference News. Tags: redis, database, artificial intelligence, deep learning

Article link: https://blog.csdn.net/u011573853/article/details/136357347

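Domain-specific paper subscription

Follow {晓理紫|小李子} for daily paper updates. If this interests you, please forward it to classmates who may need it. Thank you for your support.

If you find this helpful, please follow me and I will push the latest papers to you on time every day.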

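Categories:

Large language models (LLM)
Vision models (VLM)
Diffusion models
Vision-language navigation (VLN)
Reinforcement learning (RL)
Imitation learning (IL)
Robotics
Open vocabulary, detection and segmentation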

== chatgpt@large language model @LLM ==

Title: SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

Authors: Shuangrui Ding, Zihan Liu, Xiaoyi Dong

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.17645v1

Project: https://pjlab-songcomposer.github.io/

GitHub: https://github.com/pjlab-songcomposer/songcomposer

Summary: SongComposer is an LLM for song composition that reads and writes melodies and lyrics in a symbolic song representation rather than as quantized audio. A tuple format aligns each lyric with three note attributes (pitch, duration, rest duration). The model is pretrained on the SongCompose-PT dataset, instruction-tuned on 10K QA pairs, and reported to outperform advanced LLMs such as GPT-4 on lyric-to-melody, melody-to-lyric, song continuation, and text-to-song tasks.

Abstract: We present SongComposer, an innovative LLM designed for song composition. It could understand and generate melodies and lyrics in symbolic song representations, by leveraging the capability of LLM. Existing music-related LLM treated the music as quantized audio signals, while such implicit encoding leads to inefficient encoding and poor flexibility. In contrast, we resort to symbolic song representation, the mature and efficient way humans designed for music, and enable LLM to explicitly compose songs like humans. In practice, we design a novel tuple design to format lyric and three note attributes (pitch, duration, and rest duration) in the melody, which guarantees the correct LLM understanding of musical symbols and realizes precise alignment between lyrics and melody. To impart basic music understanding to LLM, we carefully collected SongCompose-PT, a large-scale song pretraining dataset that includes lyrics, melodies, and paired lyrics-melodies in either Chinese or English. After adequate pre-training, 10K carefully crafted QA pairs are used to empower the LLM with the instruction-following capability and solve diverse tasks. With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation, outperforming advanced LLMs like GPT-4.
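The tuple idea in the SongComposer abstract (one lyric unit aligned with pitch, duration, and rest duration) can be pictured with a small sketch. This is not the paper's actual token format: the `<lyric|pitch|duration|rest>` syntax, the `NoteTuple` name, the note-name pitch encoding, and the millisecond units are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteTuple:
    lyric: str     # syllable or word sung on this note (no spaces or "|", by assumption)
    pitch: str     # note name such as "C4" (encoding is an assumption)
    duration: int  # note length in milliseconds (unit is an assumption)
    rest: int      # silence after the note, in milliseconds


def encode(notes):
    """Serialize aligned lyric/note tuples into one text line an LLM could read."""
    return " ".join(f"<{n.lyric}|{n.pitch}|{n.duration}|{n.rest}>" for n in notes)


def decode(text):
    """Parse the text line back into tuples (inverse of encode)."""
    notes = []
    for token in text.split():
        lyric, pitch, duration, rest = token.strip("<>").split("|")
        notes.append(NoteTuple(lyric, pitch, int(duration), int(rest)))
    return notes


phrase = [NoteTuple("twin", "C4", 500, 0), NoteTuple("kle", "C4", 500, 100)]
line = encode(phrase)  # "<twin|C4|500|0> <kle|C4|500|100>"
```

Because every lyric unit travels in the same tuple as its note attributes, lyric and melody stay aligned by construction, which is the property the abstract emphasizes.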

Title: Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Authors: Xiao Liu, Zirui Wu, Xueqing Wu

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.17644v1

Project: https://xxxiaol.github.io/QRData/

GitHub: https://github.com/xxxiaol/QRData

Summary: QRData is a benchmark of 411 questions with accompanying data sheets that tests LLMs on statistical and causal reasoning over real-world data, complemented by an auxiliary set of 290 text-only questions (QRText). The strongest model, GPT-4, reaches 58% accuracy, and the best open-source model, Deepseek-coder-instruct, reaches 37%.

Abstract: Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models’ capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models’ quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which has a large room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle in using causal knowledge and provided data simultaneously. Code and data are in https://github.com/xxxiaol/QRData.

Title: Language Agents as Optimizable Graphs

Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.16823v2

Project: https://gptswarm.org

GitHub: https://github.com/metauto-ai/gptswarm

Summary: The paper describes LLM-based agents as computational graphs whose nodes process multimodal data or query LLMs and whose edges carry information between operations. Automatic graph optimizers refine node-level prompts (node optimization) and change graph connectivity to improve agent orchestration (edge optimization).

Abstract: Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent collaboration (where edges connect operations of different agents). Our novel automatic graph optimizers (1) refine node-level LLM prompts (node optimization) and (2) improve agent orchestration by changing graph connectivity (edge optimization). Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found at https://github.com/metauto-ai/gptswarm.
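The "agents as computational graphs" framing can be sketched with a tiny DAG executor. This is a minimal illustration, not the GPTSwarm API: `run_graph`, the node names, and the string-building stub functions (which stand in for real LLM calls) are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(nodes, edges, source_input):
    """Execute a DAG of operations.

    nodes: dict mapping a node name to a function(list_of_inputs) -> output
    edges: list of (src, dst) pairs describing the information flow
    Nodes without incoming edges receive [source_input].
    """
    preds = {name: [] for name in nodes}
    for src, dst in edges:
        preds[dst].append(src)
    outputs = {}
    # static_order() yields every node after all of its predecessors
    for name in TopologicalSorter(preds).static_order():
        inputs = [outputs[p] for p in preds[name]] or [source_input]
        outputs[name] = nodes[name](inputs)
    return outputs


# Stub operations standing in for LLM queries (hypothetical)
nodes = {
    "draft": lambda xs: f"draft({xs[0]})",
    "critic": lambda xs: f"critique({xs[0]})",
    "final": lambda xs: "final(" + ", ".join(xs) + ")",
}
edges = [("draft", "critic"), ("draft", "final"), ("critic", "final")]
result = run_graph(nodes, edges, "question")

# Changing connectivity changes the orchestration: drop the critic -> final edge
pruned = run_graph(nodes, [("draft", "critic"), ("draft", "final")], "question")
```

In this picture, node optimization would correspond to rewriting what a node function does (for example, its prompt), and edge optimization to searching over the `edges` list, as the last two lines hint.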

Title: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Authors: Mantas Mazeika, Long Phan, Xuwang Yin

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.04249v2

Project: https://www.harmbench.org

GitHub: https://github.com/centerforaisafety/HarmBench

Summary: HarmBench is a standardized evaluation framework for automated red teaming. The authors use it to compare 18 red teaming methods against 33 target LLMs and defenses, and they introduce an efficient adversarial training method that improves LLM robustness across a wide range of attacks.

Abstract: Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

Title: Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View

Authors: Jintian Zhang, Xin Xu, Ningyu Zhang

PubTime: 2024-02-26

Downlink: http://arxiv.org/abs/2310.02124v2

Project: https://zjunlp.github.io/project/MachineSoM

GitHub: https://github.com/zjunlp/MachineSoM

Summary: The paper builds four "societies" of LLM agents, in which each agent has a trait (easy-going or overconfident) and collaborates through a thinking pattern (debate or reflection). On three benchmark datasets, some collaboration strategies outperform previous top-tier approaches while using fewer API tokens, and the agents show human-like social behaviors such as conformity and consensus reaching.

Abstract: As Natural Language Processing (NLP) systems are increasingly employed in intricate social environments, a pressing query emerges: Can these NLP systems mirror human-esque collaborative intelligence, in a multi-agent society consisting of multiple large language models (LLMs)? This paper probes the collaboration mechanisms among contemporary NLP systems by melding practical experiments with theoretical insights. We fabricate four unique "societies" comprised of LLM agents, where each agent is characterized by a specific "trait" (easy-going or overconfident) and engages in collaboration with a distinct "thinking pattern" (debate or reflection). Through evaluating these multi-agent societies on three benchmark datasets, we discern that certain collaborative strategies not only outshine previous top-tier approaches, but also optimize efficiency (using fewer API tokens). Moreover, our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity and consensus reaching, mirroring foundational social psychology theories. In conclusion, we integrate insights from social psychology to contextualize the collaboration of LLM agents, inspiring further investigations into the collaboration mechanism for LLMs. We commit to sharing our code and datasets (https://github.com/zjunlp/MachineSoM), hoping to catalyze further research in this promising avenue.

Title: LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

Authors: Fahim Dalvi, Maram Hasanain, Sabri Boughorbel

PubTime: 2024-02-26

Downlink: http://arxiv.org/abs/2308.04945v2

Project: https://youtu.be/9cC2m_abk3A

GitHub: https://github.com/qcri/LLMeBench/

Abstract: The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implements most standard evaluation metrics. It supports in-context learning with zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/) and a video demonstrating the framework is available online (https://youtu.be/9cC2m_abk3A).

== CLIP@ViT @ VLM @ visual model ==

Title: Robot at the Mirror: Learning to Imitate via Associating Self-supervised Models

Authors: Andrej Lucny, Kristina Malinovska, Igor Farkas

PubTime: 2024-02-26

Downlink: http://arxiv.org/abs/2311.13226v2

Project: https://link.springer.com/chapter/10.1007/978-3-031-44207-0_39

GitHub: https://github.com/andylucny/learningImitation/tree/main/mirror

Summary: The authors build a custom model by associating ready-made self-supervised models instead of training or fine-tuning them. A humanoid robot looking at a mirror learns to detect the 3D pose of its own body by mapping the latent spaces of a visual model and a posture model, using an association of feature-vector pairs implemented like the key-value mechanism of transformer models.

Abstract: We introduce an approach to building a custom model from ready-made self-supervised models via their associating instead of training and fine-tuning. We demonstrate it with an example of a humanoid robot looking at the mirror and learning to detect the 3D pose of its own body from the image it perceives. To build our model, we first obtain features from the visual input and the postures of the robot’s body via models prepared before the robot’s operation. Then, we map their corresponding latent spaces by a sample-efficient robot’s self-exploration at the mirror. In this way, the robot builds the solicited 3D pose detector, which quality is immediately perfect on the acquired samples instead of obtaining the quality gradually. The mapping, which employs associating the pairs of feature vectors, is then implemented in the same way as the key-value mechanism of the famous transformer models. Finally, deploying our model for imitation to a simulated robot allows us to study, tune up, and systematically evaluate its hyperparameters without the involvement of the human counterpart, advancing our previous research.
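The key-value association mentioned in the abstract can be illustrated with softmax attention over stored pairs: visual features act as keys, posture features as values, and a new visual feature retrieves a weighted mix of stored postures. This is a generic sketch under assumptions, not the authors' code; the `associate` function, the `sharpness` parameter, and the toy two-dimensional vectors are hypothetical, and keys are assumed to be unit length.

```python
import math


def associate(query, keys, values, sharpness=50.0):
    """Return the attention-weighted mix of stored values for a query.

    Scores are scaled dot products between the query and each stored key;
    a large sharpness makes retrieval of a stored key nearly exact.
    """
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Two stored (visual feature, posture feature) pairs collected "at the mirror"
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]

pose = associate([1.0, 0.0], keys, values)  # query equals the first stored key
```

Storing a new pair requires no gradient steps, which is consistent with the abstract's remark that quality on the acquired samples is available immediately rather than gradually.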

Title: CLIP-LSTM: Fused Model for Dynamic Hand Gesture Recognition

Authors: Reena Tripathi, Bindu Verma

PubTime: 2023-12

Downlink: https://ieeexplore.ieee.org/document/10440820/

Journal: 2023 IEEE 20th India Council International Conference (INDICON)

Summary: CLIP extracts features from each gesture video and a BLSTM classifies the dynamic hand gesture, which avoids explicit hand detection and tracking under varying illumination. The reported accuracy is 97% on the CHG dataset and 86% on the LISA dataset.

Abstract: The computer vision field continues to find dynamic hand gesture detection to be an intriguing subject. The algorithm is unable to determine accurately whether a gesture begins or ends in a video feed therefore, recognizing dynamic hand movements in real time is challenging. Real-time dynamic hand gesture detection has several applications, and numerous researchers are working on it. In this paper, we have used the CLIP model to extract the features of hand gestures. Then the extracted features passed into the BLSTM model to classify the dynamic hand gestures. Using the CLIP model for feature extraction overcomes the problem of hand detection and tracking. The various illumination makes hand detection and tracking challenging and CLIP is used to extract the features of each video. We conduct an experiment with fewer parameters on a challenging dataset. Experimental results on the CHG and LISA dataset with 97% accuracy on CHG and 86% accuracy on LISA datasets shows that our proposed model outperforms the state-of-the-art methods (SOTA).

Title: Modeling the Relationship between Perisaccadic Neural Responses and Location Information

Authors: Geyu Weng, Amir Akbarian, Behrad Noudoost

PubTime: 2022-11

Downlink: https://ieeexplore.ieee.org/document/10051903/

Journal: 2022 56th Asilomar Conference on Signals, Systems, and Computers

Summary: A modeling framework characterizes the fast dynamics of neuronal responses across saccades and quantifies how perisaccadic response dynamics contribute to the readout of location information. It is applied to responses recorded from the visual cortex of nonhuman primates during a visually guided saccade task.

Abstract: Eye movements are essential for the brain to collect visual information from the environment. Visual images projected on the retina change abruptly during rapid ballistic eye movements (saccades), but our perception of the visual world is continuous. To generate a stable visual perception, the spatiotemporal sensitivity of visual neurons needs to change quickly prior to and during saccades. This study uses a modeling framework to characterize the fast dynamics of neuronal responses across saccades, thereby quantifying the contribution of perisaccadic response dynamics to the readout of location information during saccades. We apply this approach to neuronal responses recorded from the visual cortex of nonhuman primates during a visually-guided saccade task with visual stimulations. Using the model-predicted responses and a classification method, we measure the spatial discriminability performance of neurons at pre-saccadic and post-saccadic receptive field locations. Characterizing the readout of perisaccadic spatial information and its precise time course can provide insights into how neurons integrate spatial information across saccades to generate a continuous visual experience.

Title: Context Relation Fusion Model for Visual Question Answering

Authors: Haotian Zhang, Wei Wu

PubTime: 2022-10

Downlink: https://ieeexplore.ieee.org/document/9897563/

Journal: 2022 IEEE International Conference on Image Processing (ICIP)

Summary: The Context Relation Fusion Model (CRFM) produces comprehensive contextual features so that a VQA model separates language priors into "good" language context and "bad" language bias more carefully than global features allow. It combines a Visual Relation Fusion Model (VRFM), a Question Relation Fusion Model (QRFM), and an Attended Features Fusion Model (AFFM), and achieves state-of-the-art performance on the VQA-CP v2 dataset.

Abstract: Traditional VQA models tend to rely on language priors as a shortcut to answer questions and neglect visual information. To solve this problem, the latest approaches divide language priors into “good” language context and “bad” language bias through global features to benefit the language context and suppress the language bias. However, language priors cannot be meticulously divided by global features. In this paper, we propose a novel Context Relation Fusion Model (CRFM), which produces comprehensive contextual features forcing the VQA model to more carefully distinguish language priors into “good” language context and “bad” language bias. Specifically, we utilize the Visual Relation Fusion Model (VRFM) and Question Relation Fusion Model (QRFM) to learn local critical contextual information and then perform information enhancement through the Attended Features Fusion Model (AFFM). Experiments show that our CRFM achieves state-of-the-art performance on the VQA-CP v2 dataset.

Title: Revisiting Natural Scene Statistical Modeling Using Deep Features for Opinion-Unaware Image Quality Assessment

Authors: Saeed Mahmoudpour, Peter Schelkens

PubTime: 2022-10

Downlink: https://ieeexplore.ieee.org/document/9898064/

Journal: 2022 IEEE International Conference on Image Processing (ICIP)

Project: https://gitlab.com/saeedmp/dni

Summary: For opinion-unaware no-reference (OU-NR) image quality assessment (IQA), the paper builds an image naturalness model from convolutional neural network features and refines it with criteria inspired by the human visual system (HVS). The resulting model shows higher performance and generalizability across distortion types and image contents.

Abstract: Opinion-unaware no-reference (OU-NR) methods for image quality assessment (IQA) are of great interest since they can predict visual quality independent of a reference image and knowledge of human quality opinions. Models of image naturalness trained on a corpus of pristine images have shown potential for developing OU-NR methods. However, the extracted features may not match the preferences of the human visual system (HVS). This paper aims to utilize the features of convolutional neural networks to achieve a richer representation of the naturalness space. In addition, the IQA processing steps from training to quality measurement are revisited and the naturalness model is improved by incorporating HVS-inspired criteria. Experimental results show the higher performance and generalizability of the naturalness model, constructed using HVS-aligned deep features, under different distortion types and image contents. The source code of the quality index is available at https://gitlab.com/saeedmp/dni.

Title: Sub-word Level Lip Reading With Visual Attention

Authors: K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

PubTime: 2022-06

Downlink: https://ieeexplore.ieee.org/document/9878368/

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Summary: The paper proposes attention-based pooling of visual speech representations, sub-word units for lip reading, and a Visual Speech Detection (VSD) model trained on top of the lip reading network. The best model reaches a 22.6% word error rate on LRS2, and the VSD model surpasses all visual-only baselines on the AVA-ActiveSpeaker benchmark.

Abstract: The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper, we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
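The 22.6% figure above is a word error rate (WER), which is commonly computed as the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below shows that standard computation; the function name and the example sentences are illustrative, it assumes a non-empty reference, and it is not taken from the paper's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing (deletion)
                curr[j - 1] + 1,         # extra hypothesis word (insertion)
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.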

== diffusion policy@diffusion formulation@diffusion model ==

Title: Diffusion Model as Representation Learner

Authors: Xingyi Yang, Xinchao Wang

PubTime: 2023-10

Downlink: https://ieeexplore.ieee.org/document/10377906/

Journal: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

GitHub: https://github.com/Adamdad/Repfusion

Summary: RepFusion is a knowledge transfer paradigm that extracts representations from off-the-shelf diffusion probabilistic models (DPMs) at different time steps and uses them to supervise student networks, with the optimal time step chosen through reinforcement learning. It is evaluated on image classification, semantic segmentation, and landmark detection benchmarks.

Abstract: Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. Despite its promises, the learned representations of pre-trained DPMs, however, have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs, and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance the representation learning with regularizing model capacity. To this end, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at https://github.com/Adamdad/Repfusion.

Title: The diffusion of electric vehicles in Italy as a means to tackle main environmental issues

Authors: Simone Franzò, Federico Frattini, Vito Manfredi Latilla

PubTime: 2017-04

Downlink: https://ieeexplore.ieee.org/document/7935890/

Journal: 2017 Twelfth International Conference on Ecological Vehicles and Renewable Energies (EVER)

Abstract: Today there is a huge debate about the diffusion of electric vehicles (EV) to reduce air pollution and the countries’ dependence on fossil fuels. The aim of this work is to estimate the environmental impact that the diffusion of EV may have in Italy. Notwithstanding the penetration of EV in the Italian market is still negligible, policy makers consider the massive diffusion of EV as a valid answer to tackle pollution issues (especially at city level). The scenario analysis conducted in this work shows how the diffusion of EVs would dramatically reduce CO₂ emissions, with particular reference to some Italian regions which offer a higher margin for intervention in terms of number of traditional vehicles that can be replaced with EV.

Title: Prospect of the next-generation digital content industry: Three perspective approach to the user acceptance of the Realistic content technology

Authors: Hyungjin Park, Hyenyoung Yoon, Junseok Hwang

PubTime: 2016-02

Downlink: https://ieeexplore.ieee.org/document/7423515/

Journal: 2016 18th International Conference on Advanced Communication Technology (ICACT)

Summary: The study combines factors from three perspectives with the technology acceptance model to analyze user acceptance of realistic content technology (virtual reality, augmented reality, and hologram), using a survey of users in South Korea (N=429) and structural equation modeling (SEM). Flow and Spatiality significantly influence perceived usefulness, Interaction and Display influence perceived ease of use, and privacy risk is named as the most significant barrier to use.

Abstract: We are now facing a new age of digital content industry caused by diffusion of wearable device. The Realistic content, which consists of virtual reality, augmented reality and hologram technologies. It had been predicted as the future digital content technology by many institutions and researchers. However, there were very limited researchers who tried to predict the possibility of realistic content by analysing the user acceptance of realistic content technology. To analyse the intention of user acceptance, factors from three different perspectives were integrated with technology acceptance model to improve the reliability and validity of the research model and to give better results of analysis. Survey had been conducted by users in South Korea (N=429) who aware the existence of realistic content technology. Analysis was made based on structural equation modeling (SEM) method. The result of factor analysis showed that Flow and Spatiality have significant influence to Perceived usefulness. Interaction and Display have significant influence to the Perceived ease of use. Meanwhile users pointed out the Privacy risk as the most significant risk that avoid users to use the realistic content.

Title: Multi-agent simulator of incentive influence on PV adoption

Authors: Andrea Borghesi, Michela Milano

PubTime: 2014-10

Downlink: https://ieeexplore.ieee.org/document/7016446/

Journal: 2014 International Conference on Renewable Energy Research and Application (ICRERA)

Summary: An agent-based simulator studies how national incentives affect the diffusion of photovoltaic plants in Italy. It is being tuned on real data to help policy makers forecast future adoption.

Abstract: Renewable energy technologies have benefited to varying extent from the support of incentive programmes introduced in the industrialised countries over the last 20 years. Understanding the impact of incentive schemas on the adoption of renewable energy sources is a crucial aspect for policy makers. We study in this paper the impact of national incentives on photovoltaic plant diffusion in Italy by devising an agent-based simulator which is currently being tuned on real data and will help policy makers to forecast future adoptions.

Title: Deciding on optimal assistance policies in haptic shared control tasks

Authors: Javier Corredor, Jorge Sofrony, Angelika Peer

PubTime: 2014-06

Downlink: https://ieeexplore.ieee.org/document/6907243/

Journal: 2014 IEEE International Conference on Robotics and Automation (ICRA)

Summary: A haptic assistant uses a gain-scheduled impedance controller to improve task performance and human-machine interaction in a tracking task with environmental uncertainties. The assistance strategy uses the Drift-Diffusion Model from cognitive science as its decision-making model.

Abstract: This paper presents a haptic assistant that enhances task performance and human-machine interaction via a gain-scheduled impedance controller. The assistance strategy proposed builds on decision-making studies and models first proposed in the field of cognitive science and combines these models with a gain-scheduled impedance control technique in order to enhance human machine interaction in a tracking task with environmental uncertainties. This paper explores the Drift-Diffusion Model as decision making model and proposes an adaptive impedance control strategy that enhances both, task performance and human-machine interaction.
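For readers unfamiliar with the Drift-Diffusion Model named in the abstract, the sketch below simulates its textbook form: evidence drifts toward one of two decision boundaries while being perturbed by Gaussian noise, and a decision is made when a boundary is crossed. The function name, parameter values, and step size are assumptions chosen for illustration; how the paper couples this model to the impedance controller is not reproduced here.

```python
import math
import random


def simulate_ddm(drift, threshold=1.0, noise=1.0, dt=0.01, max_steps=10_000, rng=None):
    """Accumulate noisy evidence until it crosses +threshold or -threshold.

    Returns (choice, decision_time): choice is +1 or -1 for the boundary hit,
    or 0 if no boundary is reached within max_steps.
    """
    rng = rng or random.Random()
    evidence = 0.0
    for step in range(1, max_steps + 1):
        # Euler-Maruyama step: deterministic drift plus Gaussian diffusion
        evidence += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if evidence >= threshold:
            return 1, step * dt
        if evidence <= -threshold:
            return -1, step * dt
    return 0, max_steps * dt


# Without noise the evidence rises linearly and hits the upper boundary near t = 1
choice, seconds = simulate_ddm(drift=1.0, noise=0.0)

# With noise, the outcome and timing vary from trial to trial
trials = [simulate_ddm(drift=0.5, rng=random.Random(seed)) for seed in range(20)]
```

A larger drift produces faster and more consistent decisions, while a higher threshold trades speed for accuracy.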

Title: Innovation pattern analysis

Authors: Claudia Diamantini, Laura Genga, Domenico Potena

PubTime: 2013-05

Downlink: https://ieeexplore.ieee.org/document/6567301/

Journal: 2013 International Conference on Collaboration Technologies and Systems (CTS)

Summary: The paper proposes a methodology for discovering collaboration patterns in innovation projects from data that enterprises already collect daily, such as ERP logs, versioning systems, email lists, and file timestamps. Activity traces are integrated into a knowledge base, translated into process schemas, and mined with hierarchical clustering to extract frequent subprocesses arranged at different levels of abstraction.

Abstract: The evolution of innovation management in last decades was strongly influenced and led by the theory of the “Open Innovation” introduced by Chesbrough [1], and has become one of the hottest topic in business Literature. In the current economical scenario an increasingly number of organizations decide to adopt a more open approach in their innovation policy, trying to establish more or less strong relations with external partners, directly involving them in innovative projects. Consequently the collaborative work is gaining a growing importance in innovation practices of organizations, since the success or failure of innovative projects is often strictly related to results of collaborative tasks. Therefore, to support innovation processes of an organization one can investigate and improve its collaboration practices, with the aim to discover the best ones, i.e. those that maximize the success probability of organizations innovative projects. However, this kind of analysis is often prevented by the lack of real world data, mainly due to the limited diffusion of innovation management systems capable to collect innovation activities traces. Nevertheless, the daily activities of an enterprise, both internal and external, are almost completely performed by software systems. Both explicitly and implicitly, these systems keep track of users activities, e.g. ERP logs, versioning systems, list of emails, file timestamps, and so forth. In the present work we propose a methodology aimed to discover relevant collaboration patterns based on real data daily collected by enterprises, with the aim of providing business users with a better understanding of the dynamics of the interactions among members of collaborating groups. Our idea is firstly to collect any kind of data produced during the collaborative development of an innovation project, then to integrate them into a unique knowledge base storing traces of enterprise activities. Through preprocessing analysis, such traces are translated into process schemas, that can be considered as a representation of collaborative innovation processes in the organization, on which we can perform pattern discovery. To this aim we consider hierarchical clustering, which is capable to extracts frequent subprocesses representing common collaboration patterns and to arrange them in a hierarchy with different level of abstractions. The rest of this work is organized in two sections, the former aimed to describe the main ideas of the methodology, the latter to sketch out future extensions we plan to conduct.

== Visual Navigation@VLN @ Visual Language Navigation ==

标题: A Novel Paradigm of Indoor Navigation System using Li-Fi Technology

作者: P Srinithi, S Kalpanadevi, P Rekha

PubTime: 2023-12

Downlink: https://ieeexplore.ieee.org/document/10405220/

Journal: 2023 2nd International Conference on Automation, Computing and Renewable Systems (ICACRS)

中文摘要: 根据世界卫生组织（WHO）2022年的一项调查，全球有近22亿人患有视觉障碍，而且这一数字还将增长，尤其是随着全球人口老龄化。当这些人访问不熟悉的室内位置时，他们面临许多困难。虽然室内导航对视力正常的人来说很方便，但对于视力受损的人来说，拥有日常活动的辅助工具是必不可少的。开发电子旅行辅助设备是为了探测障碍物，并通过声音信号引导用户进行简单的导航。为了应对这些挑战，开发了一种利用Li-Fi技术的室内导航系统。这项技术使用LED灯泡进行数据传输，并在不受射频噪声影响的免费、免许可频谱上运行。Li-Fi提供高速、低成本的无线通信，高度安全，不易被拦截。此外，与Wi-Fi相比，它提供了更大的带宽。为了提高用户的便利性，该系统集成了用于障碍物检测的超声波传感器和用于监控用户角速度的加速度计。这种集成使该系统非常适合在不熟悉的室内环境中指导视障人士。

摘要: According to a 2022 survey by the World Health Organization (WHO), nearly 2.2 billion people across the globe are living with visual impairments, and this number is poised to grow, especially as the global population ages. When these individuals visit unfamiliar indoor locations, they face many difficulties. While navigation indoors is convenient for sighted individuals, it is essential for the visually impaired to have aiding tools for their daily activities. Electronic Travel Aids (ETAs) were developed to detect obstacles and guide users through acoustic signals for simple navigation. To address these challenges, an indoor navigation system utilizing Li-Fi technology was developed. This technology uses LED light bulbs for data transmission and operates on a free, unlicensed spectrum that is unaffected by RF noise. Li-Fi provides high-speed, low-cost wireless communication, is highly secure, and cannot be easily intercepted. Additionally, it offers a larger bandwidth compared to Wi-Fi. To enhance user convenience, the system is integrated with an ultrasonic sensor for obstacle detection and an accelerometer to monitor the user’s angular velocity. This integration makes the system well-suited for guiding visually impaired individuals in unfamiliar indoor environments.

标题: Research on Adaptive Navigation System of Mountain Orchard Plant Protection Unmanned Aerial Vehicle Based on Simultaneous Localization and Mapping and Global Navigation Satellite System Fusion

作者: Xuan Ouyang, Xujun Liu, Xiangkai Xu

PubTime: 2022-12

Downlink: https://ieeexplore.ieee.org/document/10044337/

Journal: 2022 2nd International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA)

中文摘要: 中国西部地形以山地、高原和盆地为主，东部以平原和丘陵为主。山区占全国总面积的2/3。由于地形条件的限制，果园经营机械化程度不高，果农劳动强度大，成本高。果农对果园机械化管理和保护的需求非常迫切。本文针对四旋翼无人机开发了地形自适应起落架，分析了地形识别算法，并结合山地果园植保无人机的作业要求，采用作业线提取法规划最优作业路径；根据果树分布的特点，研究了结合视觉同步定位与建图（SLAM）技术和全球导航卫星系统（GNSS）技术的无人机导航控制方法。综合以上方法，构建适合山地果园四旋翼植保无人机作业的自适应导航控制系统，具有很大的研究和经济价值。

摘要: China’s western terrain is dominated by mountains, plateaus and basins, and the east is dominated by plains and hills. Mountainous areas occupy 2/3 of the total national area. Due to the restriction of terrain conditions, the degree of mechanization in orchard management is not high, and the labor intensity and cost borne by fruit farmers are high. Fruit farmers urgently want mechanized orchard management and protection. This work develops a terrain-adaptive landing gear for a quadrotor unmanned aerial vehicle, analyzes a terrain recognition algorithm, and, based on the operational requirements of plant protection drones in mountain orchards, adopts an operation-line extraction method to plan the optimal operating path. According to the distribution characteristics of the fruit trees, it studies an unmanned aerial vehicle navigation control method that combines visual simultaneous localization and mapping technology with global navigation satellite system technology. Combining the above methods to build an adaptive navigation control system suitable for quadrotor plant protection unmanned aerial vehicle operation in mountain orchards is of great research and economic value.

标题: DC-VINS: Dynamic Camera Visual Inertial Navigation System with Online Calibration

作者: Jason Rebello, Chunshang Li, Steven L. Waslander

PubTime: 2021-10

Downlink: https://ieeexplore.ieee.org/document/9607726/

Journal: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

中文摘要: 视觉惯性（VI）传感器组合由于其低成本、有限的功耗和互补的传感能力，在各种自动驾驶和航空导航应用中变得无处不在。然而，当前的VI传感器配置假设相机和IMU之间存在静态刚性变换，因而无法独立于IMU运动来操纵相机的视点，而这在特征分布不均匀的情况下以及对于高速率动态运动是很重要的。正如在大多数商用无人机上看到的那样，云台稳定相机在SLAM中的使用有限，因为无法解决紧耦合传感器融合所需的IMU和相机之间的时变外参标定。在本文中，我们提出了安装在致动机构上的动态相机与安装在载体机身上的IMU之间的在线外参标定，并将其集成到视觉里程计流程中。此外，我们对标定参数进行了退化分析，由此得到标定中所用致动机构的一种新参数化。我们将该标定方法构建到VINS-Fusion软件包中，并表明我们能够准确地在线恢复标定参数，同时将相机视点操纵到特征丰富的区域，从而在340米的平均轨迹长度上实现0.26米的平均RMSE，比使用静态相机的传统视觉惯性流程低31.45%。

摘要: Visual-inertial (VI) sensor combinations are becoming ubiquitous in a variety of autonomous driving and aerial navigation applications due to their low cost, limited power consumption and complementary sensing capabilities. However, current VI sensor configurations assume a static rigid transformation between the camera and IMU, precluding manipulating the viewpoint of the camera independent of IMU movement which is important in situations with uneven feature distribution and for high-rate dynamic motions. Gimbal stabilized cameras, as seen on most commercially available drones, have seen limited use in SLAM due to the inability to resolve the time-varying extrinsic calibration between the IMU and camera needed in tight sensor fusion. In this paper, we present the online extrinsic calibration between a dynamic camera mounted to an actuated mechanism and an IMU mounted to the body of the vehicle integrated into a Visual Odometry pipeline. In addition, we provide a degeneracy analysis of the calibration parameters leading to a novel parameterization of the actuated mechanism used in the calibration. We build our calibration into the VINS-Fusion package and show that we are able to accurately recover the calibration parameters online while manipulating the viewpoint of the camera to feature rich areas thereby achieving an average RMSE error of 0.26m over an average trajectory length of 340m, 31.45% lower than a traditional visual inertial pipeline with a static camera.
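
To put the reported DC-VINS numbers in context, the short sketch below derives two quantities that the abstract does not state directly: the RMSE as a fraction of trajectory length, and the static-camera baseline RMSE implied by the "31.45% lower" figure. Both are back-of-the-envelope estimates that assume the averages quoted in the abstract can be combined this way; the function names are hypothetical and do not come from the paper or from VINS-Fusion.

```python
def relative_drift(rmse_m: float, trajectory_m: float) -> float:
    """Return RMSE as a fraction of the trajectory length."""
    return rmse_m / trajectory_m


def implied_baseline(rmse_m: float, reduction: float) -> float:
    """Return the baseline RMSE, given that rmse_m is `reduction` (a fraction) lower than it."""
    return rmse_m / (1.0 - reduction)


# Figures quoted in the abstract: 0.26 m average RMSE, 340 m average trajectory,
# 31.45% lower than the static-camera pipeline.
drift = relative_drift(0.26, 340.0)
baseline = implied_baseline(0.26, 0.3145)

print(f"RMSE relative to trajectory length: {drift:.4%}")
print(f"Implied static-camera baseline RMSE: {baseline:.3f} m")
```

Under these assumptions the error is well below 0.1% of the distance travelled, and the implied baseline is roughly 0.38 m; the paper itself should be consulted for the per-sequence values.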

标题: LOVINS: Lightweight Omnidirectional Visual-Inertial Navigation System

作者: Bo Gao, Dongjia Wang, Baowang Lian

PubTime: 2021-08

Downlink: https://ieeexplore.ieee.org/document/9564577/

Journal: 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)

中文摘要: 视觉惯性导航系统（VINS）是自主定位和导航的通用系统，由摄像机和惯性测量单元（IMU）组成。然而，由于尺寸和成本的限制，系统可能在一些计算资源有限的平台上仅使用廉价、低性能的传感器或处理器，因此在算法鲁棒性和计算效率方面存在许多挑战。为此，我们开发了一种轻型全向视觉惯性导航系统（LOVINS），这是一种结合了宽视场（FOV）相机和IMU的导航系统。为了限制计算复杂度，在系统前端采用直接方法初始化系统并跟踪非关键帧进行姿态估计，后端采用基于特征的方法跟踪关键帧进行非线性优化。在后端，采用滑动窗口进行非线性优化，采用边际化的方法固定关键帧数，保证稀疏性，适当减少系统数据冗余。在TUM VI基准测试上的实验表明，与其他现有方法相比，由于宽视场摄像机和帧跟踪策略的优势，LOVINS在精度和鲁棒性方面具有更高的性能，尤其是在实时性方面。

摘要: A visual-inertial navigation system (VINS), which consists of a camera and an inertial measurement unit (IMU), is a common system for autonomous positioning and navigation. However, due to size and cost constraints, on some platforms with limited computing resources the system may only be able to use cheap, low-performance sensors or processors, which poses many challenges in terms of algorithm robustness and computational efficiency. For this reason, we developed a lightweight omnidirectional visual-inertial navigation system (LOVINS), a navigation system that incorporates a wide field-of-view (FOV) camera and an IMU. To limit the computational complexity, at the front-end of the system a direct method is used to initialize the system and track non-keyframes for pose estimation, while a feature-based method is used to track keyframes for back-end nonlinear optimization. At the back-end, a sliding window is used for nonlinear optimization, and marginalization is adopted to fix the number of keyframes and preserve sparsity, thus reducing the system's data redundancy. Experiments on the TUM VI benchmark demonstrate that, compared with other state-of-the-art methods, LOVINS achieves higher accuracy and robustness, especially in real-time operation, due to the advantages of the wide-FOV camera and the frame tracking strategy.

标题: Development of multi-sensor information fusion and AGV navigation system

作者: Shengguo Zhou, Guanghe Cheng, Qinglong Meng

PubTime: 2020-06

Downlink: https://ieeexplore.ieee.org/document/9084687/

Journal: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC)

摘要: The homing truck, or AGV (Automated Guided Vehicle), is a transfer vehicle equipped with an electromagnetic or optical automatic guiding device that moves along guided routes and provides safety protection and a variety of transfer functions. The AGV belongs to the Wheeled Mobile Robot (WMR) category [1]. In industrial applications, an AGV is powered by a battery, has its driving route controlled by a computer, and completes the handling of goods through inertial navigation, magnetic navigation, electromagnetic navigation, lidar navigation, visual navigation and other navigation methods. On the basis of summarizing previous research results, this paper briefly reviews the overall research status and points out the research problems encountered in current exploration and the future research trend, in order to provide a reference for further research in this field.

标题: Hybrid IMU-Aided Approach for Optimized Visual Odometry

作者: Ahmed Mahmoud, Mohamed M. Atia

PubTime: 2019-11

Downlink: https://ieeexplore.ieee.org/document/8969460/

Journal: 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

摘要: Autonomous navigation of unmanned vehicles in GPS-denied environments is a challenging problem, especially for small ground vehicles and micro aerial vehicles (MAVs), which are characterized by their small payload, short battery lifetime and limited processing resources. Stereo vision positioning has been introduced as a scale-free positioning technique, but it is computationally expensive. Monocular vision systems aided by an inertial measurement unit (IMU) are more computationally efficient, but they suffer from IMU random biases and scale errors. In this paper, we propose a hybrid visual-inertial odometry solution that minimizes the computation load by dividing the mission into two interchangeable stages. The first is a stereo vision stage, in which a loosely coupled integration between the stereo cameras and the IMU is performed. In this stage, an extended Kalman filter (EKF) is used to automatically and dynamically estimate the IMU biases. Once the IMU is calibrated, a monocular stage is activated, in which the system is downgraded to a single camera that obtains the motion scale from the calibrated IMU. The proposed solution has been tested using the popular IMU-enabled ZED-Mini tracking camera. We compared our stereo vision solution against the IMU-aided monocular solution, and the results showed accurate positioning with the advantage of less computation. Further analysis is provided in which we compare our solution with the built-in solutions of the ZED Mini camera and the Intel Realsense T265 tracking camera.
