[Paper Express] 2025 Week 07 (Robotics/Embodied AI/LLM) - CSDN Blog

Article link: https://blog.csdn.net/maizousidemao/article/details/147531056

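The Stochastic Parrot on LLM’s Shoulder: A Summative Assessment of Physical Concept Understanding
- English abstract
- Chinese abstract
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
- English abstract
- Chinese abstract
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
- English abstract
- Chinese abstract
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- English abstract
- Chinese abstract
Expect the Unexpected: FailSafe Long Context QA for Finance
- English abstract
- Chinese abstract
Goku: Flow Based Video Generative Foundation Models
- English abstract
- Chinese abstract
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
- English abstract
- Chinese abstract
Competitive Programming with Large Reasoning Models
- English abstract
- Chinese abstract
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- English abstract
- Chinese abstract
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
- English abstract
- Chinese abstract
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
- English abstract
- Chinese abstract
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
- English abstract
- Chinese abstract
Fast Video Generation with Sliding Tile Attention
- English abstract
- Chinese abstract
TransMLA: Multi-head Latent Attention Is All You Need
- English abstract
- Chinese abstract
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
- English abstract
- Chinese abstract
Distillation Scaling Laws
- English abstract
- Chinese abstract
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
- English abstract
- Chinese abstract
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
- English abstract
- Chinese abstract
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
- English abstract
- Chinese abstract
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
- English abstract
- Chinese abstract
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
- English abstract
- Chinese abstract
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
- English abstract
- Chinese abstract
Retrieval-augmented Large Language Models for Financial Time Series Forecasting
- English abstract
- Chinese abstract
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- English abstract
- Chinese abstract
The Curse of Depth in Large Language Models
- English abstract
- Chinese abstract
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
- English abstract
- Chinese abstract
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
- English abstract
- Chinese abstract
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
- English abstract
- Chinese abstract
Magic 1-For-1: Generating One Minute Video Clips within One Minute
- English abstract
- Chinese abstract
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
- English abstract
- Chinese abstract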

The Stochastic Parrot on LLM’s Shoulder: A Summative Assessment of Physical Concept Understanding

Title: The Stochastic Parrot on LLM’s Shoulder: A Summative Assessment of Physical Concept Understanding
Authors: Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
Date: 2025-02-13
Paper link: https://arxiv.org/pdf/2502.08946
Project link: https://physico-benchmark.github.io

English abstract

In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon, application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same formatted data added little to their performance.

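Chinese abstract (translated)

We systematically investigate a widely asked question: do LLMs really understand what they say? This relates to the more familiar term "stochastic parrot". To this end, we propose a summative assessment over PhysiCo, a carefully designed physical concept understanding task. The task mitigates memorization by using grid-format inputs that abstractly describe physical phenomena. The grids represent different levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on the task shows that: (1) state-of-the-art LLMs, including GPT-4o, o1, and Gemini 2.0 Flash Thinking, lag behind humans by about 40%; (2) the stochastic parrot phenomenon is present in LLMs, since they fail on the grid task yet can describe and recognize the same concepts well in natural language; (3) the task challenges LLMs because of its intrinsic difficulty rather than the unfamiliar grid format, since in-context learning and fine-tuning on data in the same format add little to their performance.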

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Title: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Date: 2025-02-10
Paper link: https://arxiv.org/pdf/2502.06703
Project link: https://ryanliu112.github.io/compute-optimal-tts

English abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.

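Chinese abstract (translated)

Test-time scaling (TTS) is an important method for improving the performance of large language models (LLMs) by using additional computation during inference. However, current studies do not systematically analyze how policy models, process reward models (PRMs), and problem difficulty affect TTS, and this gap limits the understanding and practical use of TTS methods. In this paper we focus on two core questions: (1) What is the best way to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve LLM performance on complex tasks, and can smaller language models outperform larger ones this way? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we observe that: (1) the compute-optimal TTS strategy depends heavily on the choice of policy model, PRM, and problem difficulty; (2) with our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher inference efficiency. These findings show the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.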
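To make the idea of spending extra inference compute concrete, here is a minimal sketch of best-of-N sampling, one common form of test-time scaling. It is not claimed to be the paper's compute-optimal strategy. The names `best_of_n`, `toy_policy`, and `toy_reward` are hypothetical stand-ins for a policy model and a reward model.

```python
import random

def best_of_n(generate, score, prompt, n, seed=0):
    """Sample n candidates from the policy and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Hypothetical stand-ins: a noisy "policy" for 17 + 25 and a "reward model"
# that prefers answers closer to the true sum.
def toy_policy(prompt, rng):
    return 42 + rng.choice([-2, -1, 0, 1, 2])

def toy_reward(prompt, answer):
    return -abs(answer - 42)

if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, best_of_n(toy_policy, toy_reward, "17 + 25 = ?", n))
```

A larger `n` costs more generation calls but can only keep or improve the selected score, which is the trade-off the abstract asks how to tune per policy model, PRM, and difficulty level.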

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Title: InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Authors: Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang
Date: 2025-02-13
Paper link: https://arxiv.org/pdf/2502.08910

English abstract

In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel, and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU – 3x larger – without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.

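Chinese abstract (translated)

In modern large language models (LLMs), handling very long contexts poses significant challenges because it slows inference and increases memory cost. In addition, most existing pre-trained LLMs fail to generalize beyond their original training sequence length. To enable efficient and practical long-context use, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also generalizes to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns of the LLM. Furthermore, we offload the key-value cache to host memory during inference, which significantly reduces GPU memory pressure. As a result, InfiniteHiP can process up to 3 million tokens on a single L40s 48GB GPU (3x larger) without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1-million-token context without additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.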

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Title: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Date: 2025-02-07
Paper link: https://arxiv.org/pdf/2502.05171

English abstract

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

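Chinese abstract (translated)

We study a novel language model architecture that can scale test-time computation by reasoning implicitly in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test time. This contrasts with mainstream reasoning models, which scale up compute by producing more tokens. Unlike chain-of-thought approaches, our approach requires no specialized training data, can work with small context windows, and can capture types of reasoning that are not easily expressed in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.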
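The core mechanism in the abstract, iterating one shared block to reach arbitrary depth at test time, can be illustrated with a toy sketch. Everything here is an assumption for illustration: `recurrent_block` uses fixed 0.5 mixing weights and a `tanh` instead of learned transformer layers, and `latent_forward` is a hypothetical name.

```python
import math

def recurrent_block(state, embedded_input):
    """One pass through the shared block: mix the latent state with the input.

    The fixed 0.5 weights are hypothetical; a real model would use learned layers.
    """
    return [math.tanh(0.5 * s + 0.5 * x) for s, x in zip(state, embedded_input)]

def latent_forward(embedded_input, num_iterations):
    """Unroll the same block num_iterations times, starting from a zero state."""
    state = [0.0] * len(embedded_input)
    for _ in range(num_iterations):
        state = recurrent_block(state, embedded_input)
    return state

if __name__ == "__main__":
    x = [0.2, -1.0, 0.7]
    for depth in (1, 4, 16):
        print(depth, [round(v, 4) for v in latent_forward(x, depth)])
```

The parameter count is the same at every depth; only `num_iterations` changes, which is how compute is scaled without emitting additional tokens.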

Expect the Unexpected: FailSafe Long Context QA for Finance

Title: Expect the Unexpected: FailSafe Long Context QA for Finance
Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh
Date: 2025-02-10
Paper link: https://arxiv.org/pdf/2502.06329

English abstract

We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA

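Chinese abstract (translated)

We propose FailSafeQA, a new long-context financial benchmark designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query so that it varies in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate uploads of degraded, irrelevant, and empty documents. We use the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but had difficulty sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results show that even high-performing models have significant room for improvement, and they highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA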

Goku: Flow Based Video Generative Foundation Models

Title: Goku: Flow Based Video Generative Foundation Models
Authors: Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Date: 2025-02-07
Paper link: https://arxiv.org/pdf/2502.04896

English abstract

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.

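Chinese abstract (translated)

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models that uses rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements that enable high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models show superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe this work provides valuable insights and practical advances for the research community in developing joint image-and-video generation models.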

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Title: SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
Date: 2025-02-10
Paper link: https://arxiv.org/pdf/2502.06394
Project link: https://s-nlp.github.io/synthdetoxm/

English abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets have superior performance to those trained on the human-annotated MultiParaDetox dataset even in data limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.

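Chinese abstract (translated)

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for generating multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments show that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.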

Competitive Programming with Large Reasoning Models

Title: Competitive Programming with Large Reasoning Models
Authors: OpenAI, Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou
Date: 2025-02-03
Paper link: https://arxiv.org/pdf/2502.06807

English abstract

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.

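Chinese abstract (translated)

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. In addition, we compare two general-purpose reasoning models, OpenAI o1 and an early checkpoint of o3, with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains such as competitive programming.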

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Title: VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Date: 2025-02-07
Paper link: https://arxiv.org/pdf/2502.05173

English abstract

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.

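Chinese abstract (translated)

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, extending 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first presents a comprehensive analysis that identifies four key characteristics essential for effectively adapting RoPE to video, which prior work has not fully considered. As part of the analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors to V-NIAH. The V-NIAH-D task shows that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, whose 3D structure is designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.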
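For readers unfamiliar with the 1D RoPE that the abstract starts from, the following sketch shows the standard rotation and the property that makes it attractive: the dot product of a rotated query and key depends only on their relative offset. This is plain 1D RoPE, not VideoRoPE's 3D layout; `rope_rotate` and `dot` are illustrative names, and `base=10000.0` is the commonly used default, assumed here.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each consecutive pair of dimensions rotates at its own frequency.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 2.0]
    k = [0.2, -1.0, 0.7, 0.4]
    # Both pairs are 2 positions apart, so the two scores agree up to rounding.
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

Later pairs rotate more slowly (lower frequency); the abstract's "low-frequency temporal allocation" concerns which of these frequencies are assigned to the time axis.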

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

Title: Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Authors: Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
Date: 2025-02-10
Paper link: https://arxiv.org/pdf/2502.06781

English abstract

Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning (RL) and the long chain of thoughts. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research: https://github.com/InternLM/OREAL.

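Chinese abstract (translated)

Reasoning abilities, especially the ability to solve complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as OpenAI's o-series models, have made remarkable progress on reasoning tasks. However, the complete technical details remain undisclosed, and the only techniques believed with certainty to be used are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit achievable through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties caused by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we also apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model obtains 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research: https://github.com/InternLM/OREAL.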

Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Title: Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
Authors: Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie
Date: 2025-02-12
Paper link: https://arxiv.org/pdf/2502.08127

English abstract

Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, by CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning with one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and codes are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.

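Chinese abstract (translated)

Recent advances in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements such as CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, using CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and, on average, even Llama3-70B-Instruct and Llama3.1-70B-Instruct. Our results highlight the need for domain-specific adaptation in financial tasks and point to future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.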

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Title: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07346

English abstract

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models(LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.

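Chinese abstract (translated)

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs) we emphasize proficiency in instruction following, reasoning, long-context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address this disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows fair comparison of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample in all tasks after the data has been machine-translated from English into 16 other languages. In addition, we present a novel translation challenge that stems from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged simply by scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform and provides a promising test bed for promoting the development of multilingual language models. The dataset and code are publicly accessible.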

Fast Video Generation with Sliding Tile Attention

Title: Fast Video Generation with Sliding Tile Attention
Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
Date: 2025-02-06
Paper link: https://arxiv.org/pdf/2502.04507

English abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost – when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.

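Chinese abstract (translated)

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation but suffer from prohibitive compute cost: when generating just a 5-second 720P video, attention alone takes 800 of the 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA builds on the observation that attention scores in pretrained video diffusion models concentrate predominantly within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates the redundancy of full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile by tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Specifically, STA accelerates attention by 2.8x to 17x over FlashAttention-2 (FA2) and by 1.6x to 10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation and without any training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.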
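The latency figures in the abstract can be turned into end-to-end speedups with a few lines of arithmetic. The inputs below are the numbers quoted above; the final "upper bound" line additionally assumes that the non-attention work stays fixed at 145 s, which the abstract does not state.

```python
# Figures quoted in the abstract (seconds, HunyuanVideo, 5-second 720P clip).
FULL_ATTENTION_TOTAL_S = 945.0
FULL_ATTENTION_ATTN_S = 800.0
STA_TRAINING_FREE_S = 685.0
STA_FINETUNED_S = 268.0

def speedup(before_s, after_s):
    """End-to-end speedup factor when latency drops from before_s to after_s."""
    return before_s / after_s

attention_share = FULL_ATTENTION_ATTN_S / FULL_ATTENTION_TOTAL_S
non_attention_s = FULL_ATTENTION_TOTAL_S - FULL_ATTENTION_ATTN_S

print(f"attention share of runtime: {attention_share:.1%}")                               # 84.7%
print(f"training-free STA: {speedup(FULL_ATTENTION_TOTAL_S, STA_TRAINING_FREE_S):.2f}x")  # 1.38x
print(f"finetuned STA: {speedup(FULL_ATTENTION_TOTAL_S, STA_FINETUNED_S):.2f}x")          # 3.53x
# Assumption: non-attention work is unchanged, so this is the ceiling if attention were free.
print(f"upper bound: {speedup(FULL_ATTENTION_TOTAL_S, non_attention_s):.2f}x")            # 6.52x
```

Because attention accounts for roughly 85% of the baseline runtime, even a large attention-only speedup translates into a smaller end-to-end gain, which is why the abstract reports both kernel-level and end-to-end numbers.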

TransMLA: Multi-head Latent Attention Is All You Need

Title: TransMLA: Multi-head Latent Attention Is All You Need
Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07864

English abstract

Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce **TransMLA**, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.

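Chinese abstract (translated)

Modern large language models (LLMs) often run into communication bottlenecks on current hardware rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, so that compressed latent KV states can be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA uses an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA with the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in the transformed models, enabling more efficient distillation of DeepSeek R1.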
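One way to see why GQA fits inside the MLA form is that repeating each cached KV head across its query group is itself a linear up-projection of the cached state. The sketch below checks this on a tiny example in pure Python. It is an illustration of that observation only, not the paper's conversion procedure; `replication_matrix` and `gqa_repeat` are hypothetical helper names.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def replication_matrix(n_kv_heads, group_size, head_dim):
    """Up-projection that copies each KV head to every query head in its group."""
    rows = n_kv_heads * head_dim
    cols = n_kv_heads * group_size * head_dim
    r = [[0.0] * cols for _ in range(rows)]
    for h in range(n_kv_heads):
        for g in range(group_size):
            for d in range(head_dim):
                r[h * head_dim + d][(h * group_size + g) * head_dim + d] = 1.0
    return r

def gqa_repeat(keys, n_kv_heads, group_size, head_dim):
    """Reference GQA behaviour: repeat each KV head's slice group_size times."""
    out = []
    for row in keys:
        new_row = []
        for h in range(n_kv_heads):
            head = row[h * head_dim:(h + 1) * head_dim]
            new_row.extend(head * group_size)
        out.append(new_row)
    return out

if __name__ == "__main__":
    cached = [[1.0, 2.0, 3.0, 4.0]]  # one token, 2 KV heads of dimension 2
    up = replication_matrix(n_kv_heads=2, group_size=2, head_dim=2)
    print(matmul(cached, up) == gqa_repeat(cached, 2, 2, 2))
```

The cached state keeps its original width (the same KV cache overhead), while the up-projection is a fixed 0/1 matrix; replacing it with a trainable matrix of the same shape is what leaves room for added expressiveness.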

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Title: CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07316
Project link: https://codei-o.github.io/

English abstract

Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives – like logic flow planning, state-space searching, decision tree traversal, and modular decomposition – while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.

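Chinese abstract (translated)

Reasoning is a fundamental capability of large language models. While prior research focuses predominantly on enhancing narrow skills such as math or code generation, improving performance on many other reasoning tasks remains challenging because training data is sparse and fragmented. To address this issue, we propose CodeI/O, a novel approach that systematically condenses the diverse reasoning patterns inherently embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs or outputs, given code and test cases, entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives such as logic flow planning, state-space searching, decision tree traversal, and modular decomposition, while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results show that CodeI/O leads to consistent improvements across symbolic, scientific, logic, math and numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.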
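The verification step described in the abstract, matching ground-truth outputs or re-executing the code with predicted inputs, can be sketched as follows. The function `solve` and both verifier names are hypothetical examples; in the paper's setting the predictions would come from a model's chain-of-thought response.

```python
def solve(numbers, target):
    """Example function whose input/output behaviour a model is asked to predict."""
    seen = set()
    for x in numbers:
        if target - x in seen:
            return True
        seen.add(x)
    return False

def verify_output_prediction(func, inputs, predicted_output):
    """Output prediction: run the code on the known inputs and compare results."""
    return func(**inputs) == predicted_output

def verify_input_prediction(func, predicted_inputs, expected_output):
    """Input prediction: re-execute the code on the predicted inputs.

    Any input that reproduces the expected output counts as correct,
    so there can be many valid answers.
    """
    return func(**predicted_inputs) == expected_output

if __name__ == "__main__":
    print(verify_output_prediction(solve, {"numbers": [3, 9, 4], "target": 13}, True))
    print(verify_input_prediction(solve, {"numbers": [1, 2], "target": 3}, True))
```

Because both checks are just program executions, a failed prediction can be fed back for another attempt, which is the kind of multi-turn revision the abstract mentions for CodeI/O++.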

Distillation Scaling Laws

Title: Distillation Scaling Laws
Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
Date: 2025-02-12
Paper link: https://arxiv.org/pdf/2502.08606

English abstract

We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.

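Chinese abstract (translated)

We present a distillation scaling law that estimates the performance of a distilled model from the compute budget and how it is split between student and teacher. This lowers the risk of using distillation at scale, because compute can now be allocated across teacher and student to maximize student performance. We give compute-optimal recipes for two cases: (1) a teacher already exists, or (2) a teacher must be trained. When many students are to be distilled or a teacher already exists, distillation beats supervised pretraining up to a compute level that grows predictably with student size. When only one student is to be distilled and the teacher must also be trained, supervised learning is the better choice. The large-scale study also provides insights that improve the understanding of distillation and inform experimental design.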

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Title: Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Authors: Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu
Date: 2025-02-12
Paper link: https://arxiv.org/pdf/2502.08590
Project link: https://bujiazi.github.io/light-a-video.github.io/

English abstract

Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video’s appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.

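Chinese abstract (translated)

Image relighting models, driven by large-scale datasets and pre-trained diffusion models, can now impose consistent lighting. Video relighting still lags behind, mainly because training is expensive and diverse, high-quality video relighting datasets are scarce. Applying an image relighting model frame by frame produces inconsistent light sources and inconsistent relit appearance, which shows up as flicker. Light-A-Video is a training-free approach to temporally smooth video relighting, adapted from image relighting models, with two key techniques. The Consistent Light Attention (CLA) module strengthens cross-frame interaction inside the self-attention layers to stabilize the generated background light source. The Progressive Light Fusion (PLF) strategy, based on the physical principle of light transport independence, linearly blends the source video's appearance with the relit appearance so that illumination changes smoothly over time. Experiments show improved temporal consistency of the relit video with image quality maintained. Project page: https://bujiazi.github.io/light-a-video.github.io/.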
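
The linear blending behind Progressive Light Fusion can be illustrated with a toy example. The abstract does not state the schedule or where in the denoising loop the blend is applied, so the linear `fusion_weight` schedule, the function names, and the three-pixel frames below are all assumptions.

```python
def fusion_weight(step: int, total_steps: int) -> float:
    """Assumed linear schedule: 0.0 at the first step, 1.0 at the last."""
    if total_steps < 2:
        return 1.0
    return step / (total_steps - 1)


def blend(source, relit, weight):
    """Per-pixel linear blend: (1 - w) * source + w * relit."""
    return [(1.0 - weight) * s + weight * r for s, r in zip(source, relit)]


source_frame = [0.2, 0.4, 0.6]  # toy 3-pixel frame from the source video
relit_frame = [0.8, 0.6, 0.4]   # same pixels after relighting

# The blend target moves gradually from the source look to the relit look.
targets = [
    blend(source_frame, relit_frame, fusion_weight(t, 5)) for t in range(5)
]
```

Because the weight changes a little at each step, the target never jumps abruptly between the two appearances, which is the intuition behind the smooth transitions the abstract describes.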

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Title: Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
Authors: Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun
Date: 2025-02-12
Paper link: https://arxiv.org/pdf/2502.08690

English abstract

Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.

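Chinese abstract (translated)

Large text encoders in text-to-image (T2I) diffusion models perform very well at turning text prompts into high-quality images. Unlike denoising modules, which run many iterative steps, a text encoder needs only one forward pass to produce text embeddings. Yet although text encoders contribute little to total inference time and floating-point operations (FLOPs), they need far more memory, up to eight times that of the denoising modules. Skip and Re-use layers (Skrr) is a simple, effective pruning strategy designed for text encoders in T2I diffusion models. It exploits the redundancy in transformer blocks by selectively skipping or reusing certain layers in a way tailored to T2I tasks, cutting memory use without hurting performance. Experiments show that Skrr keeps image quality comparable to the original model even at high sparsity, outperforms existing blockwise pruning methods, and reaches state-of-the-art memory efficiency while preserving FID, CLIP, DreamSim, and GenEval scores.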
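
A toy sketch of what skipping and reusing layers means at inference time. The abstract does not say how Skrr chooses which layers to skip or reuse, so the hand-written `plan`, the action names, and the integer "layers" below are assumptions made purely for illustration.

```python
def run_encoder(layers, plan, hidden):
    """Apply layers under a skip/reuse plan.

    plan[i] is "keep" (run layer i), "skip" (identity), or
    "reuse" (run the most recently kept layer again instead of layer i).
    """
    last_kept = None
    for layer, action in zip(layers, plan):
        if action == "keep":
            hidden = layer(hidden)
            last_kept = layer
        elif action == "reuse" and last_kept is not None:
            hidden = last_kept(hidden)
        # "skip" (or "reuse" with nothing kept yet) leaves hidden unchanged
    return hidden


def stored_layers(plan):
    """Only kept layers need their weights in memory."""
    return sum(1 for action in plan if action == "keep")


# Stand-ins for transformer blocks: simple functions on a number.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]
plan = ["keep", "reuse", "skip", "keep"]

out = run_encoder(layers, plan, 1)   # (1 + 1) + 1 = 3, then 3 * 10 = 30
kept = stored_layers(plan)           # 2 of 4 layers stay in memory
```

Reuse keeps the depth of computation close to the original while storing fewer weights, which is where the memory saving comes from.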

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07870

English abstract

Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a human-improved test set of 3000 samples across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.

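Chinese abstract (translated)

Text-conditioned image generation has drawn considerable attention in recent years, and models now handle increasingly long and detailed text prompts. Dense, intricate text appears in everyday settings such as advertisements, infographics, and signage, where combining text and visuals is essential for conveying complex information. Generating images that contain long-form text nevertheless remains difficult, largely because existing datasets focus on shorter, simpler text. TextAtlas5M is a dataset designed to evaluate long-text rendering in text-conditioned image generation. It contains 5 million generated and collected long-text images across diverse data types. The authors also curate TextAtlasEval, a human-improved test set of 3,000 samples across 3 data domains. Evaluations indicate that TextAtlasEval is challenging even for the most advanced proprietary models (e.g. GPT4o with DallE-3), and open-source models show an even larger gap. The authors position TextAtlas5M as a dataset for training and evaluating future text-conditioned image generation models.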

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Title: QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
Date: 2025-02-07
Paper link: https://arxiv.org/pdf/2502.05003

English abstract

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the “optimal” bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bits weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the “true” (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.

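Chinese abstract (translated)

One way to reduce the large cost of LLMs is to train or deploy them with quantized or sparse representations. Post-training compression is popular, but whether training directly over such representations, i.e. Quantization-Aware Training (QAT), can give more accurate compressed models is still an open question. For example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which QAT stays accuracy-competitive with standard FP16/BF16 precision at 8-bit weights and activations. QuEST advances this state of the art: it is Pareto-competitive with FP16, giving better accuracy at lower model size, while training with weights and activations at 4 bits or fewer, and it permits stable training with 1-bit weights and activations. It improves two aspects of QAT: (1) accurate, fast quantization of the (continuous) weight and activation distributions via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator that explicitly minimizes the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show stable scaling laws across the full range of hardware-supported precisions, with an extension to sparse representations. GPU kernel support shows that the resulting models run efficiently. Code: https://github.com/IST-DASLab/QuEST.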
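
The first ingredient, quantizing after a Hadamard transform with an MSE-fitted scale, can be sketched in plain Python. This is one reading of the abstract, not the QuEST implementation: the grid search in `best_scale`, the half-integer quantization grid, and all function names are assumptions, and the trust gradient estimator is not covered.

```python
import math


def hadamard_transform(values):
    """Normalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


def quantize(values, scale, bits):
    """Snap to a symmetric grid of 2**bits levels at (k + 0.5) * scale."""
    half = 2 ** bits / 2
    out = []
    for v in values:
        q = math.floor(v / scale) + 0.5
        q = max(-half + 0.5, min(half - 0.5, q))  # clip to the grid range
        out.append(q * scale)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_scale(values, bits, candidates=200):
    """Assumed fitting rule: grid search for the scale with the lowest MSE."""
    max_abs = max(abs(v) for v in values) or 1.0
    scales = [max_abs * 2 * (i + 1) / candidates for i in range(candidates)]
    return min(scales, key=lambda s: mse(values, quantize(values, s, bits)))


weights = [0.9, -0.1, 0.05, 0.02, -0.03, 0.01, 0.0, -0.02]  # one outlier
rotated = hadamard_transform(weights)      # spreads the outlier across entries
scale = best_scale(rotated, bits=4)
# The normalized transform is its own inverse, so applying it again decodes.
restored = hadamard_transform(quantize(rotated, scale, 4))
error = mse(weights, restored)
```

The transform is used here because it spreads a single large value across all coordinates, which makes a uniform grid a better fit than it would be for the raw vector.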

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai
Date: 2025-02-12
Paper link: https://arxiv.org/pdf/2502.08639
Project link: https://cinemaster-dev.github.io/

English abstract

In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals–comprising rendered depth maps, camera trajectories and object class labels–serve as the guidance for a text-to-video diffusion model, ensuring to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.

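Chinese abstract (translated)

CineMaster is a framework for 3D-aware, controllable text-to-video generation. The goal is to give users control comparable to that of a professional film director: precise placement of objects in the scene, flexible manipulation of objects and camera in 3D space, and intuitive layout control over the rendered frames. It works in two stages. In the first, an interactive workflow lets users build 3D-aware conditional signals by positioning object bounding boxes and defining camera movements in 3D space. In the second, these control signals (rendered depth maps, camera trajectories, and object class labels) guide a text-to-video diffusion model toward the video the user intends. To address the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, the authors build an automated annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Qualitative and quantitative experiments show that CineMaster significantly outperforms existing methods. Project page: https://cinemaster-dev.github.io/.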

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Title: TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Authors: Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao
Date: 2025-02-10
Paper link: https://arxiv.org/pdf/2502.06608
Project link: https://yg256li.github.io/TripoSG-Page/

English abstract

Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.

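Chinese abstract (translated)

Advances in diffusion techniques have pushed image and video generation to unprecedented quality and accelerated the deployment of generative AI. 3D shape generation has lagged behind, held back by limited 3D data scale, the complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches struggle with output quality, generalization, and alignment with input conditions. TripoSG is a streamlined shape diffusion paradigm that generates high-fidelity 3D meshes corresponding precisely to input images. It comprises: (1) a large-scale rectified flow transformer for 3D shape generation, trained on extensive high-quality data; (2) a hybrid supervised training strategy for the 3D VAE that combines SDF, normal, and eikonal losses for high-quality 3D reconstruction; (3) a data processing pipeline that produces 2 million high-quality 3D samples, highlighting the rules that govern data quality and quantity when training 3D generative models. Experiments validate each component, and together they give state-of-the-art 3D shape generation, with enhanced detail from high-resolution capabilities, high fidelity to input images, and strong generalization across image styles and contents. The authors state that they will make the model publicly available.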
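
Of the three VAE losses named in the abstract, the eikonal term is the least self-explanatory: a true signed distance function (SDF) has a gradient of unit length, so the loss penalizes deviation of the gradient norm from 1. The sketch below uses the standard form, mean of (|grad f| - 1)^2, with finite differences on a toy sphere; how TripoSG samples points or weights this term is not stated in the abstract, so those details are assumptions.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Exact signed distance to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def gradient_norm(sdf, p, eps=1e-4):
    """Length of the SDF gradient at p, by central finite differences."""
    grads = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grads.append((sdf(hi) - sdf(lo)) / (2 * eps))
    return math.sqrt(sum(g * g for g in grads))


def eikonal_loss(sdf, points):
    """Mean squared deviation of the gradient norm from 1."""
    return sum((gradient_norm(sdf, p) - 1.0) ** 2 for p in points) / len(points)


points = [(0.5, 0.2, -0.1), (1.5, 0.0, 0.3), (-0.7, 0.9, 0.4)]

loss_true_sdf = eikonal_loss(sphere_sdf, points)                    # close to 0
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), points)   # close to 1
```

The scaled field has the same zero level set (the same surface) but is not a valid distance function, and the eikonal term is what tells the two apart.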

Retrieval-augmented Large Language Models for Financial Time Series Forecasting

Title: Retrieval-augmented Large Language Models for Financial Time Series Forecasting
Authors: Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie
Date: 2025-02-09
Paper link: https://arxiv.org/pdf/2502.05878

English abstract

Stock movement prediction, a fundamental task in financial time-series forecasting, requires identifying and retrieving critical influencing factors from vast amounts of time-series data. However, existing text-trained or numeric similarity-based retrieval methods fall short in handling complex financial analysis. To address this, we propose the first retrieval-augmented generation (RAG) framework for financial time-series forecasting, featuring three key innovations: a fine-tuned 1B parameter large language model (StockLLM) as the backbone, a novel candidate selection method leveraging LLM feedback, and a training objective that maximizes similarity between queries and historically significant sequences. This enables our retriever, FinSeer, to uncover meaningful patterns while minimizing noise in complex financial data. We also construct new datasets integrating financial indicators and historical stock prices to train FinSeer and ensure robust evaluation. Experimental results demonstrate that our RAG framework outperforms bare StockLLM and random retrieval, highlighting its effectiveness, while FinSeer surpasses existing retrieval methods, achieving an 8% higher accuracy on BIGDATA22 and retrieving more impactful sequences. This work underscores the importance of tailored retrieval models in financial forecasting and provides a novel framework for future research.

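Chinese abstract (translated)

Stock movement prediction, a fundamental task in financial time-series forecasting, requires identifying and retrieving the critical influencing factors from large volumes of time-series data. Existing retrieval methods, whether text-trained or based on numeric similarity, fall short on complex financial analysis. The authors propose the first retrieval-augmented generation (RAG) framework for financial time-series forecasting, with three innovations: a fine-tuned 1B-parameter LLM (StockLLM) as the backbone, a candidate selection method that uses LLM feedback, and a training objective that maximizes similarity between queries and historically significant sequences. The resulting retriever, FinSeer, uncovers meaningful patterns while minimizing noise in complex financial data. New datasets that integrate financial indicators and historical stock prices are built to train FinSeer and support robust evaluation. The RAG framework outperforms bare StockLLM and random retrieval, and FinSeer surpasses existing retrieval methods, with 8% higher accuracy on BIGDATA22 and more impactful retrieved sequences.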
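
The retrieval step of such a framework can be pictured as ranking candidate sequences by similarity to the query. The sketch below is only a stand-in: FinSeer uses learned embeddings trained with LLM feedback, whereas this example compares raw return vectors with cosine similarity, and the ticker names and numbers are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, candidates, k=2):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical daily returns; in practice these would be learned embeddings.
query = [0.01, 0.02, -0.01, 0.03]
candidates = {
    "AAA_2021-03": [0.02, 0.04, -0.02, 0.06],    # same shape, larger moves
    "BBB_2020-11": [-0.01, -0.02, 0.01, -0.03],  # mirror image of the query
    "CCC_2022-07": [0.03, 0.0, 0.0, 0.01],       # loosely related
}

top = retrieve(query, candidates, k=2)
```

The retrieved sequences would then be placed in the forecasting model's prompt alongside the query; that step is omitted here.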

LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

Title: LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07374

English abstract

Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive to the proprietary o1-preview model’s score of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy compared to training with fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper of our previous released Sky-T1-32B-Preview model. Codes are available at https://github.com/NovaSky-AI/SkyThought.

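Chinese abstract (translated)

Large reasoning models (LRMs) solve complex reasoning problems by following long chains of thought (Long CoT) that include reflection, backtracking, and self-validation. The training techniques and data needed to elicit Long CoT are still poorly understood. This work finds that an LLM can learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With only 17k Long CoT training samples, Qwen2.5-32B-Instruct improves markedly on a range of math and coding benchmarks, reaching 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's 44.6% and 59.1%. More importantly, the structure of the Long CoT is critical to learning, while the content of individual reasoning steps matters little. Content perturbations, such as training on incorrect samples or removing reasoning keywords, barely affect performance. Structural changes that break logical consistency, such as shuffling or deleting reasoning steps, significantly reduce accuracy. For example, a model trained on Long CoT samples with incorrect answers is only 3.2% less accurate than one trained on fully correct samples. This is the academic paper for the previously released Sky-T1-32B-Preview model. Code: https://github.com/NovaSky-AI/SkyThought.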
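
The two kinds of perturbation contrasted in the abstract can be made concrete on a toy chain of thought. The functions below are illustrative only: the step granularity, the `fraction` parameter, and the keyword list are assumptions, not the paper's exact procedure.

```python
import random


def shuffle_steps(steps, fraction, seed=0):
    """Structural perturbation: permute a random subset of step positions."""
    rng = random.Random(seed)
    count = int(len(steps) * fraction)
    positions = rng.sample(range(len(steps)), count)
    permuted = positions[:]
    rng.shuffle(permuted)
    out = list(steps)
    for src, dst in zip(positions, permuted):
        out[dst] = steps[src]
    return out


def delete_steps(steps, fraction, seed=0):
    """Structural perturbation: drop a random subset of steps."""
    rng = random.Random(seed)
    count = int(len(steps) * fraction)
    drop = set(rng.sample(range(len(steps)), count))
    return [s for i, s in enumerate(steps) if i not in drop]


def remove_keywords(steps, keywords=("Wait", "Alternatively")):
    """Content perturbation: strip reflection keywords, keep the step order."""
    cleaned = []
    for step in steps:
        for word in keywords:
            step = step.replace(word + ", ", "").replace(word + " ", "")
        cleaned.append(step)
    return cleaned


steps = [
    "Let x be the unknown number.",
    "The problem gives 2x + 3 = 11.",
    "Wait, check the sign: the constant is +3.",
    "Subtract 3 from both sides: 2x = 8.",
    "Alternatively, divide first: x + 1.5 = 5.5.",
    "So x = 4.",
]

shuffled = shuffle_steps(steps, fraction=0.5)
shortened = delete_steps(steps, fraction=0.5)
no_keywords = remove_keywords(steps)
```

Only the first two functions change the logical order or completeness of the chain, which is the kind of change the abstract reports as harmful.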

The Curse of Depth in Large Language Models

Title: The Curse of Depth in Large Language Models
Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Date: 2025-02-09
Paper link: https://arxiv.org/pdf/2502.05795

English abstract

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models(LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.

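Chinese abstract (translated)

This paper introduces the Curse of Depth, a concept that highlights, explains, and addresses the recent observation that nearly half the layers in modern LLMs are less effective than expected. The authors first confirm the phenomenon across popular LLM families such as Llama, Mistral, DeepSeek, and Qwen. Theoretical and empirical analysis traces the ineffectiveness of deep layers to the widespread use of Pre-Layer Normalization (Pre-LN). Pre-LN stabilizes Transformer training, but its output variance grows exponentially with depth, which causes the derivative of the deep Transformer blocks to be an identity matrix, so those blocks contribute little to training. The proposed fix, LayerNorm Scaling, scales the variance of the layer normalization output inversely by the square root of its depth. This simple change curbs the output variance explosion in deeper layers and improves their contribution. Experiments on models from 130M to 1B parameters show significantly better pre-training performance than Pre-LN, and the gain carries over to supervised fine-tuning.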
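
A minimal numeric sketch of LayerNorm Scaling, under one reading of the abstract: the layer normalization output at layer index l (starting at 1) is multiplied by 1/sqrt(l). That multiplier divides the output variance by l; the abstract's wording is ambiguous on whether the factor applies to the output or to the variance, so treat the exact factor as an assumption. Learnable gain and bias are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization without learnable gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_layer_norm(x, layer_index):
    """LayerNorm output multiplied by 1/sqrt(layer_index), index from 1."""
    factor = 1.0 / math.sqrt(layer_index)
    return [factor * v for v in layer_norm(x)]


def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


x = [1.0, 2.0, 3.0, 4.0]
var_layer_1 = variance(scaled_layer_norm(x, 1))    # about 1
var_layer_16 = variance(scaled_layer_norm(x, 16))  # about 1/16
```

Deeper layers therefore emit smaller-variance outputs, which counteracts the growth in variance that the abstract attributes to Pre-LN.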

Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Title: Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
Authors: Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh
Date: 2025-02-09
Paper link: https://arxiv.org/pdf/2502.06060

English abstract

Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent’s goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model’s listening skills by training them to predict information about the environment based on discussions, and we simultaneously improve a model’s speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/

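Chinese abstract (translated)

Natural-language communication is a powerful tool in multi-agent settings: it lets independent agents share information under partial observability and allows zero-shot coordination with humans. Most prior work either relies on training with large amounts of human demonstrations or cannot produce natural, useful communication strategies. This work trains language models to hold productive natural-language discussions about their environment without any human demonstrations. The communication problem is split into listening and speaking. The key idea is to use the agent's goal of predicting useful information about the world as a dense reward signal that guides communication. Listening is improved by training the model to predict information about the environment from discussions; speaking is improved at the same time with multi-agent reinforcement learning that rewards messages according to their influence on other agents. The testbed is an embodied social deduction game based on Among Us, where the key question is the identity of an adversarial imposter. The technique produces emergent behaviors such as accusing suspects and providing evidence, and it doubles win rates compared with standard RL. Code and models: https://socialdeductionllm.github.io/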

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Title: SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Date: 2025-02-13
Paper link: https://arxiv.org/pdf/2502.09604

English abstract

We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.

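Chinese abstract (translated)

SelfCite is a self-supervised approach that aligns LLMs to produce high-quality, fine-grained, sentence-level citations for the statements in their responses. Rather than relying only on costly, labor-intensive annotations, it uses a reward signal that the LLM itself provides through context ablation. If a citation is necessary, removing the cited text from the context should prevent the same response; if it is sufficient, keeping only the cited text should preserve the same response. The reward can guide an inference-time best-of-N sampling strategy to improve citation quality significantly, and it can be used in preference optimization to fine-tune the model directly. On the LongBench-Cite benchmark, across five long-form question answering tasks, SelfCite raises citation F1 by up to 5.3 points.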
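
The context-ablation idea can be sketched as a reward with a necessity term and a sufficiency term. Everything concrete here is assumed for illustration: in the paper the scores come from the LLM's own probabilities, whereas `toy_log_prob` is a crude word-overlap stand-in, and summing the two terms is an assumption about how they are combined.

```python
def selfcite_reward(log_prob, response, context, cited):
    """Context-ablation reward for one candidate citation.

    log_prob(response, sentences) is a hypothetical scorer that returns
    log p(response | sentences).
    """
    full = log_prob(response, context)
    without_cited = log_prob(response, [s for s in context if s not in cited])
    only_cited = log_prob(response, [s for s in context if s in cited])
    necessity = full - without_cited  # large if removing the citation hurts
    sufficiency = only_cited - full   # near zero if the citation alone suffices
    return necessity + sufficiency


def toy_log_prob(response, sentences):
    """Stand-in scorer: minus the number of response words left unsupported."""
    seen = set(" ".join(sentences).lower().split())
    missing = [w for w in response.lower().split() if w not in seen]
    return -float(len(missing))


context = [
    "the bridge opened in 1932",
    "it spans the harbour",
    "tickets cost ten dollars",
]
response = "the bridge opened in 1932"

good = selfcite_reward(toy_log_prob, response, context,
                       cited=["the bridge opened in 1932"])
bad = selfcite_reward(toy_log_prob, response, context,
                      cited=["tickets cost ten dollars"])
```

With best-of-N sampling, a reward like this would be computed for each of N candidate citation sets and the highest-scoring one kept.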

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
Date: 2025-02-13
Paper link: https://arxiv.org/pdf/2502.09560
Project link: https://embodiedbench.github.io

English abstract

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.

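Chinese abstract (translated)

Using multi-modal large language models (MLLMs) to build embodied agents is a promising route to real-world tasks. Language-centric embodied agents have received substantial attention, but MLLM-based embodied agents remain underexplored for lack of a comprehensive evaluation framework. EmbodiedBench is an extensive benchmark for vision-driven embodied agents. It features (1) 1,128 diverse test tasks across four environments, from high-level semantic tasks (e.g. household) to low-level tasks involving atomic actions (e.g. navigation and manipulation), and (2) six curated subsets that evaluate essential agent capabilities such as commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. The authors evaluate 13 leading proprietary and open-source MLLMs. MLLMs excel at high-level tasks but struggle with low-level manipulation, where the best model, GPT-4o, averages only 28.9%. Code: https://embodiedbench.github.io.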

Magic 1-For-1: Generating One Minute Video Clips within One Minute

Title: Magic 1-For-1: Generating One Minute Video Clips within One Minute
Authors: Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou
Date: 2025-02-11
Paper link: https://arxiv.org/pdf/2502.07701

English abstract

In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorithm, the image-to-video task is indeed easier to converge over the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speed up by applying an adversarial step distillation, and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second for generating 1 second video clips on average. We conduct a series of preliminary explorations to find out the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.

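Chinese abstract (translated)

This technical report presents Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is to factorize text-to-video generation into two easier tasks for diffusion step distillation: text-to-image generation and image-to-video generation. The authors verify that, under the same optimization algorithm, the image-to-video task converges more easily than the text-to-video task. They also explore a set of optimization tricks that reduce the computational cost of training image-to-video (I2V) models in three ways: (1) faster convergence through multi-modal prior condition injection; (2) lower inference latency through adversarial step distillation; (3) lower inference memory cost through parameter sparsification. With these techniques, a 5-second video clip can be generated within 3 seconds. With a test-time sliding window, a minute-long video can be generated within one minute, with significantly improved visual quality and motion dynamics, at an average of less than 1 second per second of video. Code and model weights: https://github.com/DA-Group-PKU/Magic-1-For-1.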
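
A back-of-the-envelope sketch of how a test-time sliding window could turn 5-second clips into a one-minute video. The abstract gives only the clip length (5 seconds) and the per-clip generation time (within 3 seconds); the 1-second overlap between neighbouring windows and the function name are assumptions, so the resulting estimate is illustrative rather than a reported figure.

```python
def sliding_windows(total_seconds, clip_seconds, overlap_seconds):
    """Return (start, end) windows that cover the whole video."""
    stride = clip_seconds - overlap_seconds
    if stride <= 0:
        raise ValueError("overlap must be shorter than the clip")
    windows = []
    start = 0.0
    while True:
        end = min(start + clip_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        start += stride
    return windows


# 60-second target, 5-second clips, assumed 1-second overlap.
windows = sliding_windows(60.0, 5.0, 1.0)

# Upper bound from the abstract: each 5-second clip takes at most 3 seconds.
estimated_seconds = len(windows) * 3.0
```

Under these assumptions the estimate stays below 60 seconds, which is in line with the abstract's claim of less than 1 second of compute per second of video.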

AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

Title: AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
Authors: Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Date: 2025-02-07
Paper link: https://arxiv.org/pdf/2502.05176

English abstract

Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes. See our project page for video results and the dataset at https://kkennethwu.github.io/aurafusion360/.

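Chinese abstract (translated)

3D scene inpainting matters for applications from virtual reality to architectural visualization, but existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. AuraFusion360 is a reference-based method for high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. It introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement that needs no additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. The authors also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Experiments show that AuraFusion360 significantly outperforms existing methods, with better perceptual quality and geometric accuracy maintained across dramatic viewpoint changes. Video results and the dataset: https://kkennethwu.github.io/aurafusion360/.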