Table of Contents
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Evolving Deeper LLM Thinking
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
- FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
- SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- GameFactory: Creating New Games with Generative Interactive Videos
- Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- Improving Video Generation with Human Feedback
- PaSa: An LLM Agent for Comprehensive Academic Paper Search
- Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
- TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
- InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
- Autonomy-of-Experts Models
- Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
- Reasoning Language Models: A Blueprint
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.12948
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
Evolving Deeper LLM Thinking
- Authors: Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen
- Date: 2025-01-17
- Paper link: https://arxiv.org/pdf/2501.09891
Abstract
We explore an evolutionary search strategy for scaling inference-time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine, and refine candidate responses. This approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
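The abstract describes a generate/recombine/refine loop driven by a task-specific solution evaluator. Below is a minimal, illustrative sketch of such an evolutionary inference-time search; the `llm_propose`, `llm_refine`, and `evaluate` callables are placeholder assumptions, not the paper's released code.
```python
import random

def mind_evolution_sketch(task, llm_propose, llm_refine, evaluate,
                          population=8, generations=10):
    """Evolutionary inference-time search: generate, refine, recombine.

    llm_propose(task) -> candidate; llm_refine(task, parents, feedback) -> candidate;
    evaluate(task, candidate) -> (score, textual_feedback).
    All three are stand-ins for real LLM calls and a task evaluator.
    """
    pool = [llm_propose(task) for _ in range(population)]
    for _ in range(generations):
        scored = sorted(((evaluate(task, c), c) for c in pool),
                        key=lambda x: x[0][0], reverse=True)
        best_score, best = scored[0][0][0], scored[0][1]
        if best_score >= 1.0:            # evaluator accepts the solution
            return best
        # Keep the fittest half, then refine/recombine survivors to refill the pool.
        survivors = [c for _, c in scored[: population // 2]]
        children = []
        while len(survivors) + len(children) < population:
            parents = random.sample(survivors, k=min(2, len(survivors)))
            feedback = evaluate(task, parents[0])[1]
            children.append(llm_refine(task, parents, feedback))
        pool = survivors + children
    return max(pool, key=lambda c: evaluate(task, c)[0])
```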
Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.12599
Abstract
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities, e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, and 74.9 on MathVista, matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results, e.g., 60.8 on AIME, 94.6 on MATH 500, and 47.3 on LiveCodeBench, outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
- Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen
- Date: 2025-01-20
- Paper link: https://arxiv.org/pdf/2501.11425
Abstract
Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language agents to reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recovers correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from that step, we splice the trajectory with an adjacent correct path that shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
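As a concrete picture of the splicing step described above, the sketch below (with hypothetical names, not the paper's code) builds a revision training sample by keeping the failed prefix up to the first model-identified error step and continuing with a correct sibling branch from the same search-tree parent.
```python
def splice_revision_trajectory(bad_traj, first_error_idx, good_branch):
    """Construct a reflection training sample in the spirit of Agent-R:
    keep the failed prefix up to (and including) the first error step that
    the actor model itself identifies, insert a short reflection, then
    continue with a correct sibling path sharing the same parent node.

    bad_traj / good_branch: lists of (action, observation) steps;
    first_error_idx: index produced by the actor model's own critique.
    All names here are illustrative, not the paper's API.
    """
    prefix = bad_traj[:first_error_idx + 1]          # includes the erroneous step
    reflection = ("I realize my previous action was wrong; "
                  "let me revise my plan and continue correctly.")
    return prefix + [("reflect", reflection)] + good_branch
```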
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
- Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.13106
- Project link: https://github.com/DAMO-NLP-SG/VideoLLaMA3
Abstract
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) video-centric fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos becomes more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
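The similarity-based token reduction for video inputs can be pictured with the small sketch below; the exact merging rule and threshold used by VideoLLaMA3 may differ, so treat this as an assumption-laden illustration rather than the released implementation.
```python
import torch
import torch.nn.functional as F

def prune_video_tokens(frame_tokens, threshold=0.9):
    """Drop temporally redundant vision tokens by cosine similarity.

    frame_tokens: tensor of shape (T, N, D) -- T frames, N tokens per frame.
    A token at frame t is kept only if its similarity with the token at the
    same spatial position in frame t-1 falls below `threshold`.
    """
    T, N, D = frame_tokens.shape
    kept = [frame_tokens[0]]                     # always keep the first frame
    for t in range(1, T):
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        mask = sim < threshold                   # keep only tokens that changed
        kept.append(frame_tokens[t][mask])
    return torch.cat(kept, dim=0)                # (num_kept_tokens, D)
```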
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
- Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.12380
- Project link: https://mmvu-benchmark.github.io/
Abstract
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch, with strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
- Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.12909
Abstract
Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language-agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
- Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.13200
Abstract
Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT), which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem, in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor, and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
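A toy rendering of the shared-memory idea: per-agent working memories are pooled and broadcast, and every agent cross-attends over the shared pool before updating its own memory. The layer sizes, single attention block, and GRU update are assumptions, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

class SharedMemorySketch(nn.Module):
    """Illustrative shared-memory block for decentralized agents."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update = nn.GRUCell(d_model, d_model)

    def forward(self, obs_emb, memories):
        # obs_emb, memories: (num_agents, d_model)
        num_agents = obs_emb.size(0)
        pool = memories.unsqueeze(0).expand(num_agents, -1, -1)   # shared pool, visible to all
        query = obs_emb.unsqueeze(1)                               # each agent queries with its observation
        context, _ = self.attn(query, pool, pool)                  # (num_agents, 1, d_model)
        # Each agent updates its personal memory from the attended shared context.
        return self.update(context.squeeze(1), memories)
```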
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- Authors: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.11873
Abstract
This paper revisits the implementation of the Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as N_E * sum_{i=1}^{N_E} f_i * p_i, where N_E is the total number of experts, f_i represents the frequency with which expert i is selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ a parallel training strategy, so f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating the LBL using a global-batch to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis also reveals that the global-batch LBL greatly improves the domain specialization of MoE experts.
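The sketch below spells out the loss as defined above, LBL = N_E * sum_i f_i * p_i, plus the extra all-reduce that turns micro-batch selection frequencies f_i into global-batch frequencies. It is a simplified illustration under stated assumptions, not the authors' released implementation.
```python
import torch
import torch.distributed as dist

def load_balancing_loss(gate_probs, expert_ids, num_experts, global_batch=True):
    """LBL = N_E * sum_i f_i * p_i.

    gate_probs: (num_tokens, N_E) router softmax scores (local micro-batch).
    expert_ids: (num_tokens, top_k) indices of the selected experts.
    With global_batch=True, the selection counts are synchronized across
    data-parallel ranks before computing f_i, which is the relaxation the
    abstract proposes; p_i stays local.
    """
    p = gate_probs.mean(dim=0)                                        # (N_E,) average gating score
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    if global_batch and dist.is_initialized():
        # Extra communication step: sum selection counts over all micro-batches.
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    f = counts / counts.sum()                                         # (N_E,) selection frequency
    return num_experts * torch.sum(f * p)
```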
GameFactory: Creating New Games with Generative Interactive Videos
- Authors: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
- Date: 2025-01-14
- Paper link: https://arxiv.org/pdf/2501.08325
- Project link: https://yujiwen.github.io/gamefactory/
Abstract
Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and a small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
- Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.12895
Abstract
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
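A minimal sketch of the critique-and-refine loop the abstract describes, assuming placeholder `llm` and `reward_model` callables; the actual prompts and scheduling used by TPO are in the authors' repository, not reproduced here.
```python
def test_time_preference_optimization(prompt, llm, reward_model, steps=3, width=4):
    """Iteratively turn numerical reward signals into textual critiques
    and use them to refine candidate responses at inference time."""
    candidates = [llm(prompt) for _ in range(width)]
    for _ in range(steps):
        scored = sorted(candidates, key=lambda r: reward_model(prompt, r), reverse=True)
        best, worst = scored[0], scored[-1]
        # Numerical reward -> textual reward: ask the model to explain the gap.
        critique = llm(
            "Compare these two answers and explain concretely why the first is better.\n"
            f"Prompt: {prompt}\nAnswer A: {best}\nAnswer B: {worst}"
        )
        # Refine: generate new candidates conditioned on the critique.
        candidates = [best] + [
            llm(f"{prompt}\n\nDraft answer: {best}\n"
                f"Critique to address: {critique}\nImproved answer:")
            for _ in range(width - 1)
        ]
    return max(candidates, key=lambda r: reward_model(prompt, r))
```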
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.12326
Abstract
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
Improving Video Generation with Human Feedback
- Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
- Date: 2025-01-23
- Paper link: https://arxiv.org/pdf/2501.13918
Abstract
Video generation has achieved significant advances through rectified flow techniques, but issues such as unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
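For intuition, here is a hedged sketch of what a DPO-style objective for flow models can look like, following the Diffusion-DPO recipe of using per-sample flow-matching errors as surrogate negative log-likelihoods. The exact form and constants of Flow-DPO are defined in the paper; this code is only an assumption-labeled illustration.
```python
import torch.nn.functional as F

def flow_dpo_loss(err_w, err_l, err_w_ref, err_l_ref, beta=500.0):
    """DPO-style preference loss for flow models (sketch, not the paper's equation).

    err_*: batched flow-matching errors ||v_theta(x_t, t) - u_t||^2 for the
    preferred (w) and rejected (l) videos under the trained model and a frozen
    reference model. Lower error stands in for higher implicit likelihood,
    hence the sign flips below. `beta` is an assumed temperature.
    """
    margin = -(err_w - err_w_ref) + (err_l - err_l_ref)
    return -F.logsigmoid(beta * margin).mean()
```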
PaSa: An LLM Agent for Comprehensive Academic Paper Search
- Authors: Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
- Date: 2025-01-17
- Paper link: https://arxiv.org/pdf/2501.10120
Abstract
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
- Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang
- Date: 2025-01-23
- Paper link: https://arxiv.org/pdf/2501.13629
Abstract
We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention and pre-trained on our meticulously collected system-domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of the K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impact on inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B tokens of system-domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves performance comparable to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark, AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
- Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.12224
Abstract
We present TokenVerse, a method for multi-concept personalization that leverages a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
- Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.12368
Abstract
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training, where integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction-tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.
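The second application (test-time scaling by response selection) reduces to best-of-N sampling with the reward model as the scorer, as in the sketch below; `policy` and `reward_model` are placeholder callables, not the actual InternLM-XComposer API.
```python
def best_of_n(prompt, image, policy, reward_model, n=8):
    """Sample N candidate responses and keep the one the reward model
    scores highest (hypothetical callables; illustration only)."""
    candidates = [policy(prompt, image) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_model(prompt, image, resp))
```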
Autonomy-of-Experts Models
- Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.13074
Abstract
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
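A compact sketch of the router-free selection rule described above: every expert cheaply scores each token via the norm of a low-rank internal activation, and only the top-k experts per token complete the forward pass. Dimensions, the SiLU nonlinearity, and the unweighted sum of expert outputs are illustrative assumptions, not the paper's exact parameterization.
```python
import torch
import torch.nn as nn

class AoELayerSketch(nn.Module):
    """Router-free MoE layer: experts rank themselves by activation norm."""

    def __init__(self, d_model=256, d_hidden=512, d_low=32, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The first projection is factorized as A @ B so scoring via A is cheap.
        self.A = nn.Parameter(torch.randn(n_experts, d_model, d_low) * 0.02)
        self.B = nn.Parameter(torch.randn(n_experts, d_low, d_hidden) * 0.02)
        self.W2 = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)

    def forward(self, x):                                  # x: (tokens, d_model)
        # Cheap pre-computation: low-rank activations for every expert.
        z = torch.einsum("td,edr->etr", x, self.A)         # (E, T, d_low)
        scores = z.norm(dim=-1)                            # (E, T) activation norms
        top = scores.topk(self.top_k, dim=0).indices       # (k, T) winning experts per token
        out = torch.zeros_like(x)
        for e in range(self.A.size(0)):
            mask = (top == e).any(dim=0)                   # tokens routed to expert e
            if mask.any():                                 # only winners finish the forward pass
                h = torch.nn.functional.silu(z[e][mask] @ self.B[e])
                out[mask] += h @ self.W2[e]
        return out
```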
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
- Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo
- Date: 2025-01-21
- Paper link: https://arxiv.org/pdf/2501.12202
Abstract
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model, Hunyuan3D-DiT, and a large-scale texture synthesis model, Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio, a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, texture quality, etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
- Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng
- Date: 2025-01-23
- Paper link: https://arxiv.org/pdf/2501.13926
Abstract
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
- Authors: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09775
Abstract
One of the most widely used methods to evaluate LLMs is the Multiple Choice Question (MCQ) test. MCQ benchmarks enable testing LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, called few-shots, can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option, or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of evaluating questions on a wide range of topics across seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning modifying the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities appear to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
Reasoning Language Models: A Blueprint
- Authors: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
- Date: 2025-01-20
- Paper link: https://arxiv.org/pdf/2501.11223
Abstract
Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet their high costs, proprietary nature, and complex architectures (uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs) present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models, and others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes such as LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
- Date: 2025-01-20
- Paper link: https://arxiv.org/pdf/2501.11733
Abstract
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experience. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents (Perceptor, Operator, Action Reflector, and Notetaker), which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to interact effectively with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
- Authors: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
- Date: 2025-01-16
- Paper link: https://arxiv.org/pdf/2501.09781
Abstract
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level on the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models on CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
- Authors: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao
- Date: 2025-01-22
- Paper link: https://arxiv.org/pdf/2501.12570
Abstract
Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancy. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), which aims to minimize reasoning overhead while maintaining accuracy. This fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
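To make the idea concrete, a possible shape of the RL reward is sketched below: the pre-sampling stage supplies a per-problem baseline length and accuracy, and the policy is rewarded for being shorter than that baseline without losing accuracy. The functional form and weighting are assumptions, not the paper's equation.
```python
def length_harmonizing_reward(pred_len, pred_correct, ref_len, ref_acc, lam=2.0):
    """Illustrative length-harmonizing reward (assumed form).

    pred_len / pred_correct: length and correctness of the sampled solution.
    ref_len / ref_acc: baseline average length and accuracy from pre-sampling.
    """
    length_gain = ref_len / max(pred_len, 1) - 1.0      # > 0 when shorter than baseline
    accuracy_term = float(pred_correct) - ref_acc       # penalize accuracy drops
    # Accuracy deviations are weighted more heavily so that shortening the
    # reasoning never trades away correctness (weighting is an assumption).
    return length_gain + lam * accuracy_term
```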