Contents
- MiniMax-01: Scaling Foundation Models with Lightning Attention
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
- Tensor Product Attention Is All You Need
- Enabling Scalable Oversight via Self-Evolving Critic
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
- Towards Best Practices for Open Datasets for LLM Training
- MangaNinja: Line Art Colorization with Precise Reference Following
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
- Transformer^2: Self-adaptive LLMs
- MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
- OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
- OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
- Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
- Do generative video models learn physical principles from watching videos?
- Diffusion Adversarial Post-Training for One-Step Video Generation
- VideoAuteur: Towards Long Narrative Video Generation
- Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
- O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
MiniMax-01: Scaling Foundation Models with Lightning Attention
- Authors: MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08313
Abstract
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
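For intuition, below is a minimal sketch of the causal linear-attention recurrence that lightning-attention-style kernels compute blockwise; it omits the feature map, normalization, and IO-aware tiling of the actual implementation and is not MiniMax's optimized kernel.

```python
import numpy as np

def linear_attention_causal(Q, K, V):
    """Causal linear attention via a running key-value state.

    A fixed-size d x d state replaces the softmax-attention KV cache that
    grows with sequence length, which is what makes million-token contexts
    tractable. Didactic sketch only.
    """
    n, d = Q.shape
    kv_state = np.zeros((d, d))           # running sum of k_t v_t^T
    out = np.zeros_like(V)
    for t in range(n):
        kv_state += np.outer(K[t], V[t])  # accumulate past key-value products
        out[t] = Q[t] @ kv_state          # query attends to the whole prefix
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(linear_attention_causal(Q, K, V).shape)  # (8, 16)
```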
The Lessons of Developing Process Reward Models in Mathematical Reasoning
- Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
- Date: 2025-01-13
- Paper link: https://arxiv.org/pdf/2501.07301
Abstract
Process Reward Models (PRMs) emerge as a promising approach for process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
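One plausible reading of the consensus-filtering idea is sketched below; it is not the paper's exact pipeline. A step's MC-estimated label is kept only when an LLM judge agrees, and `mc_scores`, `judge_labels`, and the threshold are illustrative assumptions.

```python
def consensus_filter(steps, mc_scores, judge_labels, mc_threshold=0.0):
    """Keep step-level labels only where MC estimation and an LLM judge agree.

    mc_scores[i]: Monte Carlo estimate of step i (fraction of completions that
    still reach the correct answer). judge_labels[i]: LLM-as-a-judge verdict.
    """
    filtered = []
    for step, mc, judge in zip(steps, mc_scores, judge_labels):
        mc_label = mc > mc_threshold   # MC says the step can still reach the answer
        if mc_label == judge:          # retain only consensus annotations
            filtered.append((step, judge))
    return filtered

steps = ["expand (a+b)^2", "drop the cross term", "solve for x"]
# Step 3 is discarded: MC says plausible, the judge disagrees.
print(consensus_filter(steps, [0.9, 0.0, 0.7], [True, False, False]))
```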
Tensor Product Attention Is All You Need
- Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
- Date: 2025-01-11
- Paper link: https://arxiv.org/pdf/2501.06425
Abstract
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
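A minimal sketch of the contextual low-rank idea follows: cache small per-token factors and reconstruct keys (and analogously values) on the fly. Shapes, rank, and the plain normalization are illustrative assumptions, not the paper's exact configuration (which also integrates RoPE).

```python
import numpy as np

heads, head_dim, rank, seq_len, d_model = 8, 64, 2, 128, 512
rng = np.random.default_rng(0)

W_a = rng.standard_normal((d_model, rank, heads)) * 0.02     # head-side factors
W_b = rng.standard_normal((d_model, rank, head_dim)) * 0.02  # token-side factors
x = rng.standard_normal((seq_len, d_model))

a = np.einsum("td,drh->trh", x, W_a)        # (seq, rank, heads)      <- cache this
b = np.einsum("td,drk->trk", x, W_b)        # (seq, rank, head_dim)   <- and this
K = np.einsum("trh,trk->thk", a, b) / rank  # reconstructed keys (seq, heads, head_dim)

full_cache = seq_len * heads * head_dim
tpa_cache = seq_len * rank * (heads + head_dim)
print(K.shape, f"cache ~{full_cache / tpa_cache:.1f}x smaller")
```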
Enabling Scalable Oversight via Self-Evolving Critic
- Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
- Date: 2025-01-10
- Paper link: https://arxiv.org/pdf/2501.05727
Abstract
Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT's performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09732
Abstract
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and, given the complicated nature of images, the components of the framework can be chosen to suit different application scenarios.
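The simplest point in that design space is a random search over initial noises scored by a verifier. The sketch below uses toy stand-ins (`toy_sampler`, `toy_verifier`) for a real diffusion sampler and a real verifier such as a CLIP or aesthetic scorer; the paper also studies more elaborate search algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_sampler(noise):
    return np.tanh(noise)                 # stand-in for running the diffusion sampler

def toy_verifier(sample, target=0.5):
    return -abs(sample.mean() - target)   # stand-in score, higher is better

def search_noise(num_candidates=64, shape=(3, 32, 32)):
    best_score, best_sample = -np.inf, None
    for _ in range(num_candidates):       # more inference-time compute = more candidates
        noise = rng.standard_normal(shape)
        sample = toy_sampler(noise)
        score = toy_verifier(sample)
        if score > best_score:
            best_score, best_sample = score, sample
    return best_sample, best_score

sample, score = search_noise()
print(sample.shape, round(score, 4))
```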
VideoRAG: Retrieval-Augmented Generation over Video Corpus
- Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
- Date: 2025-01-10
- Paper link: https://arxiv.org/pdf/2501.05874
Abstract
Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both the visual and textual information of videos in output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
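A minimal sketch of the retrieval step, assuming a shared embedding space for queries and videos (e.g., produced by an LVLM or a video-text encoder); the embeddings here are random placeholders, and in VideoRAG the retrieved videos' visual and textual features are then passed to the generator together with the query.

```python
import numpy as np

rng = np.random.default_rng(0)
video_embeddings = rng.standard_normal((100, 256))   # 100 indexed videos (placeholder)
query_embedding = rng.standard_normal(256)           # encoded query (placeholder)

def retrieve_top_k(query, index, k=3):
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n                        # cosine similarity
    top = np.argsort(-scores)[:k]
    return top, scores[top]

ids, scores = retrieve_top_k(query_embedding, video_embeddings)
print(ids, np.round(scores, 3))
```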
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
- Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
- Date: 2025-01-10
- Paper link: https://arxiv.org/pdf/2501.06186
Abstract
Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of models' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
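For a feel of step-granular scoring, here is an illustrative scorer (not the paper's metric): each step gets a correctness score and a coherence score, and the trace score averages their product, so a correct but poorly connected step is penalized.

```python
def step_reasoning_score(correctness, coherence):
    """Average of per-step correctness x coherence; both lists are in [0, 1]."""
    assert len(correctness) == len(coherence) and correctness
    per_step = [c * h for c, h in zip(correctness, coherence)]
    return sum(per_step) / len(per_step)

# Three steps: the second is correct but only weakly follows from the first.
print(step_reasoning_score([1.0, 1.0, 0.0], [1.0, 0.4, 0.8]))  # ~0.467
```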
Towards Best Practices for Open Datasets for LLM Training
- Authors: Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, Solana Larsen, Shayne Longpre, Angela Oduor Lungati, Cullen Miller, Victor Miller, Max Ryabinin, Kathleen Siminyu, Andrew Strait, Mark Surman, Anna Tumadóttir, Maurice Weber, Rebecca Weiss, Lee White, Thomas Wolf
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08365
Abstract
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in countries like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
MangaNinja: Line Art Colorization with Precise Reference Following
- Authors: Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08332
Abstract
Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, such as cross-character colorization and multi-reference harmonization, which are beyond the reach of existing algorithms.
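A minimal sketch of patch shuffling on a reference image is shown below, assuming square patches that tile the image evenly; it illustrates the general idea of breaking global layout so the model must learn local correspondences, not MangaNinja's exact training module.

```python
import numpy as np

def shuffle_patches(image, patch=8, seed=0):
    """Randomly permute non-overlapping patches of an (H, W, C) image."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch, patch, c)
    rng = np.random.default_rng(seed)
    patches = patches[rng.permutation(gh * gw)]                 # shuffle patch order
    patches = patches.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(h, w, c)

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
print(shuffle_patches(img).shape)  # (32, 32, 3)
```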
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
- Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
- Date: 2025-01-13
- Paper link: https://arxiv.org/pdf/2501.07171
- Project link: https://minwoosun.github.io/biomedica-website/
Abstract
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
- Authors: Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, Hao Dong
- Date: 2025-01-07
- Paper link: https://arxiv.org/pdf/2501.03841
Abstract
The development of general robotic systems capable of manipulating in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLAs) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between the VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
Transformer^2: Self-adaptive LLMs
- Authors: Qi Sun, Edoardo Cetin, Yujin Tang
- Date: 2025-01-09
- Paper link: https://arxiv.org/pdf/2501.06252
Abstract
Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer^2, a novel self-adaptation framework that adapts LLMs for unseen tasks in real time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer^2 employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer^2 demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer^2 represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
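Below is a minimal sketch of adapting only the singular components of a weight matrix, in the spirit of the paper's description: decompose W once, then rescale its singular values with a task-specific "expert" vector. The vector z here is a random placeholder (in the paper it is trained with reinforcement learning and selected by a dispatch pass), and the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.02          # a frozen base weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # decomposed once, offline

def adapt(z):
    """Return an adapted weight matrix; z has one entry per singular value."""
    return U @ np.diag(S * z) @ Vt

z_task = 1.0 + 0.1 * rng.standard_normal(S.shape)  # placeholder expert vector
W_task = adapt(z_task)
print(np.linalg.norm(W_task - W))                  # small, targeted modification
```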
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
- Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
- Date: 2025-01-10
- Paper link: https://arxiv.org/pdf/2501.06282
Abstract
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interaction are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
- Authors: Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09751
Abstract
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla retrieved information tends to lack depth and utility and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
- Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
- Date: 2025-01-09
- Paper link: https://arxiv.org/pdf/2501.05510
Abstract
Temporal awareness, the ability to reason dynamically based on the timestamp at which a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for benchmarking advanced online video understanding capability. OVO-Bench evaluates the ability of video LLMs to reason about and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
- Authors: Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09686
Abstract
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" - a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier - a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09755
Abstract
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation, and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction, though the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
- Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
- Date: 2025-01-09
- Paper link: https://arxiv.org/pdf/2501.05131
Abstract
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth-map-controlled image generation and introduce a detail renderer that manipulates the attention mask in FLUX's joint attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project page: https://limuloo.github.io/3DIS/.
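A minimal sketch of a layout-driven attention mask follows: image tokens on a coarse grid may attend to an instance's text tokens only if they fall inside that instance's bounding box. This illustrates the general mechanism of masking joint attention by layout; it is not FLUX's actual attention implementation, and the grid size and box format are assumptions.

```python
import numpy as np

def layout_attention_mask(boxes, text_spans, grid=16):
    """boxes: per-instance (x0, y0, x1, y1) in [0, 1]; text_spans: per-instance
    (start, end) token indices. Returns a (grid*grid, num_text_tokens) bool mask."""
    num_text = max(end for _, end in text_spans)
    mask = np.zeros((grid * grid, num_text), dtype=bool)
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    cx, cy = (xs + 0.5) / grid, (ys + 0.5) / grid            # image-token centers
    for (x0, y0, x1, y1), (start, end) in zip(boxes, text_spans):
        inside = (cx >= x0) & (cx < x1) & (cy >= y0) & (cy < y1)
        mask[inside.reshape(-1), start:end] = True            # instance text visible only inside its box
    return mask

m = layout_attention_mask([(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],
                          [(0, 4), (4, 8)])
print(m.shape, m[:, :4].sum(), m[:, 4:].sum())  # (256, 8) 512 512
```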
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
- Authors: Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08326
Abstract
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image- and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
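A minimal sketch of the Token Mark idea, under stated assumptions: the same mark embedding is added to visual features inside the target region and appended to the text prompt embeddings, giving the model a shared handle on that region. Shapes and the random embeddings are illustrative placeholders, not Omni-RGPT's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
visual = rng.standard_normal((24, 24, d))        # spatial feature map (placeholder)
text = rng.standard_normal((10, d))              # prompt token embeddings (placeholder)
token_marks = rng.standard_normal((16, d))       # pool of learnable marks (placeholder)

def mark_region(visual, text, region_mask, mark_id):
    mark = token_marks[mark_id]
    visual = visual + region_mask[..., None] * mark   # inject the mark into the region
    text = np.concatenate([text, mark[None, :]], 0)   # reference the same mark in text
    return visual, text

mask = np.zeros((24, 24)); mask[4:12, 6:14] = 1.0     # e.g., a box prompt
v2, t2 = mark_region(visual, text, mask, mark_id=3)
print(v2.shape, t2.shape)  # (24, 24, 64) (11, 64)
```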
Do generative video models learn physical principles from watching videos?
- Authors: Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.09038
Abstract
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: do video models learn "world models" that discover laws of physics, or are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
Diffusion Adversarial Post-Training for One-Step Video Generation
- Authors: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08316
Abstract
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarially post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
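For reference, the sketch below computes the exact R1 penalty that APT approximates: the squared norm of the discriminator's gradient with respect to real inputs, added to the adversarial loss to stabilize training. The tiny discriminator and the 10.0 weight are placeholders, not the paper's architecture or hyperparameters.

```python
import torch

# Placeholder discriminator over flattened features; stand-in for the real critic.
disc = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1))

def r1_penalty(real):
    """E[ ||grad_x D(x)||^2 ] over a batch of real samples."""
    real = real.detach().requires_grad_(True)
    logits = disc(real)
    (grads,) = torch.autograd.grad(logits.sum(), real, create_graph=True)
    return grads.pow(2).flatten(1).sum(dim=1).mean()

real_batch = torch.randn(8, 128)          # stand-in for real video/image features
loss_r1 = 10.0 * r1_penalty(real_batch)   # illustrative weighting
loss_r1.backward()
print(float(loss_r1))
```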
VideoAuteur: Towards Long Narrative Video Generation
- Authors: Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang
- Date: 2025-01-10
- Paper link: https://arxiv.org/pdf/2501.06173
Abstract
Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
- Authors: Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov
- Date: 2025-01-12
- Paper link: https://arxiv.org/pdf/2501.06751
Abstract
Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model's architecture (cross- or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.
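In the same spirit as such causal analyses, the sketch below swaps only the padding-position embeddings of a prompt with those of an empty prompt and measures how much a downstream function changes. The encoder outputs and generator here are random toy placeholders, not a real T2I pipeline or the paper's exact intervention.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 77, 32                                  # CLIP-style fixed prompt length
prompt_emb = rng.standard_normal((seq_len, d))       # encoded prompt (with padding)
empty_emb = rng.standard_normal((seq_len, d))        # encoded empty prompt
num_real_tokens = 9                                  # remaining positions are padding

W = rng.standard_normal((d, 3))
def toy_generate(cond):                              # stand-in for the diffusion model
    return np.tanh(cond @ W).mean(axis=0)

patched = prompt_emb.copy()
patched[num_real_tokens:] = empty_emb[num_real_tokens:]  # intervene on padding only

delta = np.linalg.norm(toy_generate(prompt_emb) - toy_generate(patched))
print(f"output shift from padding intervention: {delta:.4f}")
```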
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
- Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
- Date: 2025-01-15
- Paper link: https://arxiv.org/pdf/2501.08828
Abstract
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
O1 Replication Journey – Part 3: Inference-time Scaling for Medical Reasoning
- Authors: Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
- Date: 2025-01-11
- Paper link: https://arxiv.org/pdf/2501.06458
Abstract
Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.