[Paper Digest] Week 02 of 2025 (Robotics / Embodied AI / LLM)

Contents

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

  • Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang

  • Date: 2025-01-08

  • Paper link: https://arxiv.org/pdf/2501.04519

Abstract

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
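
To make the search idea concrete, here is a minimal sketch of process-reward-guided step search, assuming hypothetical stubs `propose_steps` (the math policy SLM), `score_step` (the process preference model), and `is_terminal`; it uses a simplified beam loop rather than the paper's full MCTS rollouts.

```python
# Simplified PPM-guided step-by-step search; stubs stand in for the two SLMs.
import random

def propose_steps(partial_solution: str, n: int = 4) -> list[str]:
    # Stub for the math policy SLM: sample n candidate next reasoning steps.
    return [f"{partial_solution} -> step_candidate_{i}" for i in range(n)]

def score_step(partial_solution: str) -> float:
    # Stub for the SLM-based process preference model (PPM): higher is better.
    return random.random()

def is_terminal(partial_solution: str) -> bool:
    return partial_solution.count("->") >= 3  # toy stopping rule

def guided_search(problem: str, beam_width: int = 2) -> str:
    beam = [problem]
    while not all(is_terminal(s) for s in beam):
        candidates = [c for s in beam for c in propose_steps(s)]
        candidates.sort(key=score_step, reverse=True)  # rank partial traces by PPM
        beam = candidates[:beam_width]                 # keep the best ones
    return beam[0]

print(guided_search("Solve: 2x + 3 = 7"))
```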

Search-o1: Agentic Search-Enhanced Large Reasoning Models

  • Authors: Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou

  • Date: 2025-01-09

  • Paper link: https://arxiv.org/pdf/2501.05366

Abstract

Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at https://github.com/sunnynexus/Search-o1.
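
A minimal sketch of such an agentic retrieval loop, assuming hypothetical stubs `generate` (the LRM), `retrieve` (a search backend), and `reason_in_documents` (the refinement module); the `<search>` tag convention is an illustrative assumption, not the paper's exact prompt format.

```python
# Agentic search-augmented reasoning loop: retrieve on demand, refine, inject.
import re

def generate(context: str) -> str:
    # Stub LRM step: may emit a <search>query</search> tag when knowledge is missing.
    if "boiling point" in context and "RETRIEVED" not in context:
        return "<search>boiling point of ethanol</search>"
    return "FINAL: ethanol boils at about 78 C, below water's 100 C."

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stub retriever returning verbose documents.
    return [f"[doc {i}] long passage mentioning {query} ..." for i in range(k)]

def reason_in_documents(query: str, docs: list[str]) -> str:
    # Stub refinement module: condense retrieved docs to the needed fact.
    return f"RETRIEVED({query}): ethanol boils at ~78.37 C."

def agentic_reason(question: str, max_turns: int = 5) -> str:
    context = question
    for _ in range(max_turns):
        step = generate(context)
        match = re.search(r"<search>(.*?)</search>", step)
        if match:  # the model asked for external knowledge
            query = match.group(1)
            refined = reason_in_documents(query, retrieve(query))
            context += "\n" + refined      # inject refined knowledge, not raw docs
        else:
            return step                    # reasoning finished
    return "no answer within turn budget"

print(agentic_reason("Compare the boiling points of ethanol and water."))
```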

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

  • Authors: Jian Hu

  • Date: 2025-01-04

  • Paper link: https://arxiv.org/pdf/2501.03262

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
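
A rough sketch of a critic-free REINFORCE-style loss with PPO-like tricks (token-level KL penalty folded into the reward, ratio clipping, global advantage normalization). Tensor shapes, the reward-to-go computation, and coefficient values are illustrative assumptions rather than the paper's exact implementation.

```python
# Critic-free policy-gradient loss with PPO-style clipping and KL shaping.
import torch

def reinforce_pp_loss(logp, logp_old, logp_ref, seq_reward, mask,
                      kl_coef=0.05, clip_eps=0.2):
    # logp, logp_old, logp_ref: [B, T] per-token log-probs under the current,
    # rollout, and frozen reference policies; seq_reward: [B]; mask: [B, T].
    kl = logp_old - logp_ref                          # token-level KL estimate
    reward_tok = -kl_coef * kl                        # KL penalty as per-token reward
    reward_tok[:, -1] += seq_reward                   # sequence reward at the last token
    returns = torch.flip(torch.cumsum(torch.flip(reward_tok * mask, [1]), 1), [1])
    adv = (returns - returns[mask.bool()].mean()) / (returns[mask.bool()].std() + 1e-8)
    ratio = torch.exp(logp - logp_old)                # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -(torch.minimum(ratio * adv, clipped * adv) * mask).sum() / mask.sum()

B, T = 2, 8
mask = torch.ones(B, T)
print(float(reinforce_pp_loss(torch.randn(B, T), torch.randn(B, T),
                              torch.randn(B, T), torch.tensor([1.0, -0.5]), mask)))
```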

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

  • Authors: Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn

  • Date: 2025-01-08

  • Paper link: https://arxiv.org/pdf/2501.04682

Abstract

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.

The GAN is dead; long live the GAN! A Modern GAN Baseline

  • Authors: Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin

  • Date: 2025-01-09

  • Paper link: https://arxiv.org/pdf/2501.05441

Abstract

There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline, R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
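
A minimal sketch of a relativistic pairing GAN loss with zero-centered R1/R2 gradient penalties, in the spirit of the regularized relativistic objective described above; the toy discriminator, image size, and penalty weight are illustrative assumptions.

```python
# Relativistic pairing GAN loss with R1 (real) and R2 (fake) gradient penalties.
import torch
import torch.nn.functional as F

def rpgan_d_loss(D, real, fake, gamma=10.0):
    real = real.detach().requires_grad_(True)
    fake = fake.detach().requires_grad_(True)
    diff = D(real) - D(fake)                         # relativistic pairing
    loss = F.softplus(-diff).mean()                  # D wants real > fake
    r1 = torch.autograd.grad(D(real).sum(), real, create_graph=True)[0]
    r2 = torch.autograd.grad(D(fake).sum(), fake, create_graph=True)[0]
    penalty = (r1.pow(2).sum(dim=[1, 2, 3]).mean()
               + r2.pow(2).sum(dim=[1, 2, 3]).mean())
    return loss + 0.5 * gamma * penalty

def rpgan_g_loss(D, real, fake):
    return F.softplus(D(real) - D(fake)).mean()      # G wants fake > real

# Toy usage with a tiny conv discriminator on 3x32x32 images.
D = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, 2, 1), torch.nn.Flatten(),
                        torch.nn.Linear(8 * 16 * 16, 1))
real, fake = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
print(float(rpgan_d_loss(D, real, fake)), float(rpgan_g_loss(D, real, fake)))
```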

Agent Laboratory: Using LLM Agents as Research Assistants

  • Authors: Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum

  • Date: 2025-01-08

  • Paper link: https://arxiv.org/pdf/2501.04227

Abstract

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) the generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

Cosmos World Foundation Model Platform for Physical AI

  • Authors: NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski

  • Date: 2025-01-07

  • Paper link: https://arxiv.org/pdf/2501.03575

Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

  • Authors: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai

  • Date: 2025-01-06

  • Paper link: https://arxiv.org/pdf/2501.02976

Abstract

Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
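
A rough sketch of the dynamic-frequency idea: compare prediction and target in the Fourier domain and shift the weighting between low- and high-frequency error as a function of the diffusion step. The frequency split and the linear schedule are illustrative assumptions, not STAR's exact formulation.

```python
# Frequency-weighted reconstruction loss whose low/high balance depends on the step.
import torch

def dynamic_frequency_loss(pred, target, t, T):
    # pred, target: [B, C, H, W]; t: current diffusion step (T = total steps).
    Fp, Ft = torch.fft.fft2(pred), torch.fft.fft2(target)
    err = (Fp - Ft).abs()
    H, W = pred.shape[-2:]
    fy = torch.fft.fftfreq(H, device=pred.device).abs().view(H, 1)
    fx = torch.fft.fftfreq(W, device=pred.device).abs().view(1, W)
    low_mask = ((fy + fx) < 0.25).float()            # crude low-frequency region
    alpha = t / T                                    # illustrative linear schedule
    weight = (1 - alpha) * low_mask + alpha * (1 - low_mask)
    return (weight * err).mean()

pred, target = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(float(dynamic_frequency_loss(pred, target, t=10, T=50)))
```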

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

  • Authors: Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren

  • Date: 2025-01-03

  • Paper link: https://arxiv.org/pdf/2501.01895

Abstract

We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.

Enhancing Human-Like Responses in Large Language Models

  • Authors: Ethem Yağız Çalık, Talha Rüzgar Akkuş

  • Date: 2025-01-09

  • Paper link: https://arxiv.org/pdf/2501.05032

Abstract

This paper explores the advancements in making large language models (LLMs) more human-like. We focus on techniques that enhance natural language understanding, conversational coherence, and emotional intelligence in AI systems. The study evaluates various approaches, including fine-tuning with diverse datasets, incorporating psychological principles, and designing models that better mimic human reasoning patterns. Our findings demonstrate that these enhancements not only improve user interactions but also open new possibilities for AI applications across different domains. Future work will address the ethical implications and potential biases introduced by these human-like attributes.

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

  • Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang

  • Date: 2025-01-08

  • Paper link: https://arxiv.org/pdf/2501.04686

  • Project link: https://ursa-math.github.io/

Abstract

Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

  • Authors: Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng

  • Date: 2025-01-07

  • Paper link: https://arxiv.org/pdf/2501.03895

Abstract

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to the LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on GPU hardware with 24GB of memory.
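
A minimal sketch of the two mechanisms described above (modality pre-fusion, then query-based compression of all vision tokens into a single token), with hypothetical module names and dimensions; it is not LLaVA-Mini's actual architecture code.

```python
# Pre-fuse visual information into text tokens, then compress vision tokens to one.
import torch
import torch.nn as nn

class PreFusionCompressor(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.prefusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one learnable query

    def forward(self, text_tokens, vision_tokens):
        # Fuse visual information into the text tokens ahead of the LLM backbone.
        fused_text, _ = self.prefusion(text_tokens, vision_tokens, vision_tokens)
        # Compress all vision tokens into a single token via the learnable query.
        q = self.query.expand(vision_tokens.size(0), -1, -1)
        one_vision_token, _ = self.compress(q, vision_tokens, vision_tokens)
        # The LLM backbone then sees [one vision token; fused text tokens].
        return torch.cat([one_vision_token, fused_text + text_tokens], dim=1)

m = PreFusionCompressor()
out = m(torch.randn(2, 16, 512), torch.randn(2, 576, 512))  # 576 -> 1 vision token
print(out.shape)  # torch.Size([2, 17, 512])
```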

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

  • Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang

  • Date: 2025-01-07

  • Paper link: https://arxiv.org/pdf/2501.04001

Abstract

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

Test-time Computing: from System-1 Thinking to System-2 Thinking

  • Authors: Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang

  • Date: 2025-01-05

  • Paper link: https://arxiv.org/pdf/2501.02497

Abstract

The remarkable performance of the o1 model in complex reasoning demonstrates that test-time computing scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time computing scaling. We trace the concept of test-time computing back to System-1 models. In System-1 models, test-time computing addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time computing in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out a few possible future directions.
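
As a concrete illustration of two System-2-style strategies named above, here is a sketch of repeated sampling with a verifier (best-of-N) and a single self-correction pass; `sample_answer`, `verify`, and `revise` are hypothetical stand-ins for model calls.

```python
# Two simple test-time computing strategies: best-of-N sampling and self-correction.
import random

def sample_answer(question: str) -> str:
    return f"answer_{random.randint(0, 9)} to {question}"

def verify(question: str, answer: str) -> float:
    return random.random()            # stub verifier score in [0, 1]

def revise(question: str, answer: str) -> str:
    return answer + " (revised)"      # stub self-correction step

def best_of_n(question: str, n: int = 8) -> str:
    candidates = [sample_answer(question) for _ in range(n)]   # repeated sampling
    return max(candidates, key=lambda a: verify(question, a))  # verifier picks one

def self_correct(question: str, threshold: float = 0.7) -> str:
    answer = sample_answer(question)
    if verify(question, answer) < threshold:   # low confidence -> revise once
        answer = revise(question, answer)
    return answer

print(best_of_n("What is 17 * 24?"))
print(self_correct("What is 17 * 24?"))
```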

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

  • Date: 2025-01-03

  • Paper link: https://arxiv.org/pdf/2501.01957

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

  • Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang

  • Date: 2025-01-06

  • Paper link: https://arxiv.org/pdf/2501.03226

Abstract

Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at question granularity sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel "first-try" strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain combined with MCTS.
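
A minimal sketch of step-grained example retrieval with a first-try strategy as described above: draft a step, use the draft to retrieve similar steps from a step-level example bank, then regenerate the step with those examples. The bank, similarity measure, and stubs are illustrative assumptions.

```python
# Step-grained ICL retrieval guided by a "first-try" draft of the current step.

STEP_BANK = [  # step-grained example bank: (step text, worked demonstration)
    ("isolate the variable", "e.g. 2x + 3 = 7 -> 2x = 4"),
    ("divide both sides", "e.g. 2x = 4 -> x = 2"),
    ("apply the quadratic formula", "e.g. x = (-b +/- sqrt(b^2-4ac)) / 2a"),
]

def draft_step(problem: str, solution_so_far: list[str]) -> str:
    return "isolate the variable"          # stub first-try attempt by the LLM

def similarity(a: str, b: str) -> float:
    return len(set(a.split()) & set(b.split()))  # toy lexical overlap

def generate_step(problem, solution_so_far, examples) -> str:
    return f"refined step using {len(examples)} step-level example(s)"

def boost_step(problem: str, solution_so_far: list[str], k: int = 1) -> str:
    attempt = draft_step(problem, solution_so_far)              # first try
    ranked = sorted(STEP_BANK, key=lambda e: similarity(attempt, e[0]), reverse=True)
    return generate_step(problem, solution_so_far, ranked[:k])  # guided retry

print(boost_step("Solve 2x + 3 = 7", []))
```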

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

  • Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

  • Date: 2025-01-06

  • Paper link: https://arxiv.org/pdf/2501.02955

Abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within a limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.

An Empirical Study of Autoregressive Pre-training from Videos

  • Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

  • Date: 2025-01-09

  • Paper link: https://arxiv.org/pdf/2501.05453

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
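
The training objective reduces to next-token prediction over visual tokens; the sketch below shows that setup with random tokens standing in for a video tokenizer, and with an illustrative toy model size.

```python
# Autoregressive next-token prediction over (stand-in) visual tokens.
import torch
import torch.nn as nn

vocab, dim, T = 1024, 256, 64                     # toy visual-token vocabulary
embed = nn.Embedding(vocab, dim)
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (2, T))          # stand-in for tokenized video frames
causal = nn.Transformer.generate_square_subsequent_mask(T - 1)
hidden = backbone(embed(tokens[:, :-1]), mask=causal)  # causal transformer pass
logits = head(hidden)                             # predict the next visual token
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
print(float(loss))
```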

LLM4SR: A Survey on Large Language Models for Scientific Research

  • Authors: Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du

  • Date: 2025-01-08

  • Paper link: https://arxiv.org/pdf/2501.04306

Abstract

In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

  • Authors: Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

  • Date: 2025-01-06

  • Paper link: https://arxiv.org/pdf/2501.03218

Abstract

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction with long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
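
A minimal sketch of the disentangled, asynchronous design: a lightweight perception/decision loop keeps scanning the stream and launches reactions as non-blocking tasks. The trigger rule and timings are placeholder assumptions.

```python
# Perception/decision loop that never blocks on the (slower) reaction task.
import asyncio

async def react(frame_id: int) -> None:
    await asyncio.sleep(0.3)                       # stands in for slow response decoding
    print(f"  reaction: detailed response about frame {frame_id}")

async def perceive_and_decide(num_frames: int = 10) -> None:
    pending = []
    for frame_id in range(num_frames):             # streaming video frames
        await asyncio.sleep(0.1)                   # perception keeps up with the stream
        if frame_id % 4 == 0:                      # decision: proactive trigger
            print(f"trigger at frame {frame_id}")
            pending.append(asyncio.create_task(react(frame_id)))  # non-blocking
    await asyncio.gather(*pending)                 # wait for outstanding reactions

asyncio.run(perceive_and_decide())
```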

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

  • Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

  • Date: 2025-01-03

  • Paper link: https://arxiv.org/pdf/2501.01904

Abstract

Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.

Personalized Graph-Based Retrieval for Large Language Models

  • Authors: Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed

  • Date: 2025-01-04

  • Paper link: https://arxiv.org/pdf/2501.02157

Abstract

As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.
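
A minimal sketch of graph-based personalized prompt augmentation in the spirit described above: score triples from a user-centric knowledge graph against the query and prepend the best matches to the prompt. The toy graph, scoring rule, and prompt template are illustrative assumptions.

```python
# Retrieve user-centric graph triples relevant to the query and build the prompt.

USER_GRAPH = {  # user-centric knowledge graph as (subject, relation, object) triples
    "alice": [("alice", "prefers", "concise answers"),
              ("alice", "works_on", "robot manipulation"),
              ("alice", "reviewed", "a paper on 4D Gaussian Splatting")],
}

def retrieve_user_context(user: str, query: str, k: int = 2) -> list[str]:
    def score(triple):  # toy lexical overlap between the query and the triple
        return len(set(query.lower().split()) & set(" ".join(triple).lower().split()))
    triples = sorted(USER_GRAPH.get(user, []), key=score, reverse=True)[:k]
    return [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples]

def build_prompt(user: str, query: str) -> str:
    context = retrieve_user_context(user, query)
    return "Known about the user:\n- " + "\n- ".join(context) + f"\n\nTask: {query}"

print(build_prompt("alice", "Summarize this robot manipulation paper for me."))
```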

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

  • Authors: Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan

  • Date: 2025-01-07

  • Paper link: https://arxiv.org/pdf/2501.04003

Abstract

Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.

TransPixar: Advancing Text-to-Video Generation with Transparency

  • Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen

  • Date: 2025-01-06

  • Paper link: https://arxiv.org/pdf/2501.03006

Abstract

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.

Scaling Laws for Floating Point Quantization Training

  • Authors: Xingwu Sun, Shuaipeng Li, Ruobing Xie, Weidong Han, Kan Wu, Zhen Yang, Yixing Li, An Wang, Shuai Li, Jinbao Xue, Yu Cheng, Yangyu Tao, Zhanhui Kang, Chengzhong Xu, Di Wang, Jie Jiang

  • Date: 2025-01-05

  • Paper link: https://arxiv.org/pdf/2501.02423

Abstract

Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, which pays less attention to the constituents in floating-point quantization and thus cannot fit the LLM losses in this scenario well. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the floating-point quantization training performance of LLM models. While presenting an accurate floating-point quantization unified scaling law, we also provide valuable suggestions for the community: (1) exponent bits contribute slightly more to model performance than mantissa bits, and we provide the optimal exponent-mantissa bit ratio for different bit numbers, available for future reference by hardware manufacturers; (2) we discover the formation of a critical data size in low-precision LLM training: too much training data exceeding the critical data size inversely brings degradation of LLM performance; (3) the optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4 and 8 bits.
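
To illustrate the exponent-bits vs. mantissa-bits trade-off, here is a simulated low-precision float quantizer with a configurable split (ignoring subnormals and special values); the rounding scheme is a simplifying assumption, not the paper's procedure.

```python
# Simulate quantizing a value to a float format with E exponent and M mantissa bits.
import math

def fp_quantize(x: float, exp_bits: int, man_bits: int) -> float:
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e = math.floor(math.log2(abs(x)))
    e = max(min(e, 2 ** exp_bits - 2 - bias), 1 - bias)           # clamp to normal range
    scale = 2.0 ** e
    frac = round(abs(x) / scale * 2 ** man_bits) / 2 ** man_bits  # round the significand
    frac = min(frac, 2 - 2 ** -man_bits)                          # largest representable
    return math.copysign(frac * scale, x)

x = 3.14159
for e_bits, m_bits in [(4, 3), (5, 2), (3, 4)]:   # different 8-bit splits (E4M3, E5M2, ...)
    q = fp_quantize(x, e_bits, m_bits)
    print(f"E{e_bits}M{m_bits}: {q:.5f} (abs error {abs(q - x):.5f})")
```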
