[Paper Digest] 2025 Week 05 (Robotics/Embodied AI/LLM)

Table of Contents

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

  • Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

  • Date: 2025-01-28

  • Paper link: https://arxiv.org/pdf/2501.17161

Abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model’s underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL’s superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
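As a concrete illustration of an outcome-based reward, a minimal sketch for a GeneralPoints-style arithmetic card game might look as follows. The target value of 24, the integer card encoding, and the exact validity rules are assumptions for illustration, not the paper’s specification.

```python
# A minimal sketch of an outcome-based reward for a GeneralPoints-style
# arithmetic card game: the reward depends only on whether the model's
# final expression reaches the target, not on how the model got there.
# Target value, card encoding, and validity rules are illustrative assumptions.
import ast

def outcome_reward(expression: str, cards: list, target: int = 24) -> float:
    """Return 1.0 iff `expression` uses each card exactly once and equals `target`."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return 0.0
    # Collect the integer literals actually used in the expression.
    used = sorted(
        node.value for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, int)
    )
    if used != sorted(cards):        # every card must be used exactly once
        return 0.0
    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0
```

Because the reward checks only the final outcome, any reasoning trace that ends in a valid expression is reinforced equally, which is the property the paper contrasts with SFT’s token-level imitation.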


GuardReasoner: Towards Reasoning-based LLM Safeguards

  • Authors: Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi

  • Date: 2025-01-30

  • Paper link: https://arxiv.org/pdf/2501.18492

Abstract

As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner: https://github.com/yueliu1999/GuardReasoner/.


Humanity’s Last Exam

  • Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpala, Zachary Giboney, Gashaw M. Goshu, Joan of Arc Xavier, Sarah-Jane Crowson, Mohinder Maheshbhai Naiya, Noah Burns, Lennart Finke, Zerui Cheng, Hyunwoo Park, Francesco Fournier-Facio, John Wydallis, Mark Nandor, Ankit Singh, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Darling Duclosel, Jungbae Nam, Jennifer Zampese, Ryan G. Hoerr, Aras Bacho, Gautier Abou Loume, Abdallah Galal, Hangrui Cao, Alexis C Garretson, Damien Sileo, Qiuyu Ren, Doru Cojoc, Pavel Arkhipov, Usman Qazi, Lianghui Li, Sumeet Motwani, Christian Schroeder de Witt, Edwin Taylor, Johannes Veith, Eric Singer, Taylor D. Hartman, Paolo Rissone, Jaehyeok Jin, Jack Wei Lun Shi, Chris G. Willcocks, Joshua Robinson, Aleksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze-An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy, Dakotah Martinez, Ben Pageler, Nick Crispino, Dimitri Zvonkine, Natanael Wildner Fraga, Saeed Soori, Ori Press, Henry Tang, Julian Salazar, Sean R. Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, T. 
Ryan Rogers, Wenjin Zhang, Bikun Li, Jinzhou Yang, Arun Rao, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Subrata Mishra, Ariel Ghislain Kemogne Kamdoum, Tobias Kreiman, Tad Hogg, Alvin Jin, Carlo Bosio, Gongbo Sun, Brian P Coppola, Tim Tarver, Haline Heidinger, Rafael Sayous, Stefan Ivanov, Joseph M Cavanagh, Jiawei Shen, Joseph Marvin Imperial, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Ali Dehghan, Andres Algaba, Brecht Verbeken, David Noever, Ragavendran P V, Lisa Schut, Ilia Sucholutsky, Evgenii Zheltonozhskii, Derek Lim, Richard Stanley, Shankar Sivarajan, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Jennifer Sandlin, Anmol Sahu, Yuzheng Hu, Sara Fish, Nasser Heydari, Archimedes Apronti, Kaivalya Rawal, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Jeremy Nguyen, Daniil S. Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Alan Goldfarb, Sergey Ivanov, Rafał Poświata, Chenguang Wang, Daofeng Li, Donato Crisostomi, Andrea Achilleos, Benjamin Myklebust, Archan Sen, David Perrella, Nurdin Kaparov, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias Magoulas, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Veit Elser, Kanu Priya Agarwal, Victor Efren Guadarrama Vilchis, Immo Klose, Christoph Demian, Ujjwala Anantheswaran, Adam Zweiger, Guglielmo Albani, Jeffery Li, Nicolas Daans, Maksim Radionov, Václav Rozhoň, Ziqiao Ma, Christian Stump, Mohammed Berkani, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Marco Piccardo, Ferenc Jeanplong, Niv Cohen, Josef Tkadlec, Paul Rosu, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Aline Menezes, Arkil Patel, Zixuan Wang, Jamie Tucker-Foltz, Jack Stade, Tom Goertzen, Fereshteh Kazemi, 
Jeremiah Milbauer, John Arnold Ambay, Abhishek Shukla, Yan Carlos Leyva Labrador, Alan Givré, Hew Wolff, Vivien Rossbach, Muhammad Fayez Aziz, Younesse Kaddar, Yanxu Chen, Robin Zhang, Jiayi Pan, Antonio Terpin, Niklas Muennighoff, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Adam Jones, Jainam Shah, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Andrew Ho, Shaul Barkan, Jiaqi Wang, Martin Stehberger, Egor Kretov, Kaustubh Sridhar, Zienab EL-Wasif, Anji Zhang, Daniel Pyda, Joanna Tam, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, Daniel Bugas, David Aldous, Jesyin Lai, Shannon Coleman, Mohsen Bahaloo, Jiangnan Xu, Sangwon Lee, Sandy Zhao, Ning Tang, Michael K. Cohen, Micah Carroll, Orr Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong, Allison Tee, Jérémy Andréoletti, Orion Weller, Raghav Singhal, Gang Zhang, Alexander Ivanov, Seri Khoury, Hamid Mostaghimi, Kunvar Thaman, Qijia Chen, Tran Quoc Khánh, Jacob Loader, Stefano Cavalleri, Hannah Szlyk, Zachary Brown, Jonathan Roberts, William Alley, Kunyang Sun, Ryan Stendall, Max Lamparth, Anka Reuel, Ting Wang, Hanmeng Xu, Sreenivas Goud Raparthi, Pablo Hernández-Cámara, Freddie Martin, Dmitry Malishev, Thomas Preu, Tomek Korbak, Marcus Abramovitch, Dominic Williamson, Ziye Chen, Biró Bálint, M Saiful Bari, Peyman Kassani, Zihao Wang, Behzad Ansarinejad, Laxman Prasad Goswami, Yewen Sun, Hossam Elgnainy, Daniel Tordera, George Balabanian, Earth Anderson, Lynna Kvistad, Alejandro José Moyano, Rajat Maheshwari, Ahmad Sakor, Murat Eron, Isaac C. McAlister, Javier Gimenez, Innocent Enyekwe, Andrew Favre D. 
O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Ronald Clark, Sherwin Abdoli, Tim Santens, Khalida Meer, Harrison K Wang, Kalyan Ramakrishnan, Evan Chen, Alessandro Tomasiello, G. Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Niels Mündler, Avi Semler, Emma Rodman, Jacob Drori, Carl J Fossum, Milind Jagota, Ronak Pradeep, Honglu Fan, Tej Shah, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Carter Harris, Jason Gross, Ilya Gusev, Asankhaya Sharma, Shashank Agnihotri, Pavel Zhelnov, Siranut Usawasutsakorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Đuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Hossein Shahrtash, Edson Oliveira, Joseph W. Jackson, Daniel Espinosa Gonzalez, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Emilien Duc, Bita Golshani, David Stap, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Lukas Lewark, Mátyás Vincze, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Jiang Muzhen, Fredrik Ekström, Angela Hammon, Oam Patel, Nicolas Remy, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Claire Sparrow, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Mike Battaglia, Mohammad Maghsoudimehrabani, Hieu Hoang, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Stephen Mensah, Nathan Andre, Anton Peristyy, Chris Harjadi, Himanshu Gupta, Stephen Malina, Samuel Albanie, Will Cai, Mustafa Mehkary, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Jasdeep Sidhu, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Brian Weber, Harsh Kumar, Tong Jiang, Arunim Agarwal, Chiara Ceconello, Warren S. 
Vaz, Chao Zhuang, Haon Park, Andrew R. Tawfeek, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham, Kang Yong Loh, Joshua Robinson, Shreen Gul, Gunjan Chhablani, Zhehang Du, Adrian Cosma, Colin White, Robin Riblet, Prajvi Saxena, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Shiv Halasyamani, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Renas Bacho, Vincent Ginis, Aleksandr Maksapetyan, Florencia de la Rosa, Xiuyu Li, Guillaume Malod, Leon Lang, Julien Laurendeau, Fatimah Adesanya, Julien Portier, Lawrence Hollom, Victor Souza, Yuchen Anna Zhou, Yiğit Yalın, Gbenga Daniel Obikoya, Luca Arnaboldi, Rai, Filippo Bigi, Kaniuar Bacho, Pierre Clavier, Gabriel Recchia, Mara Popescu, Nikita Shulga, Ngefor Mildred Tanwie, Thomas C. H. Lux, Ben Rank, Colin Ni, Alesia Yakimchyk, Huanxu, Liu, Olle Häggström, Emil Verkama, Himanshu Narayan, Hans Gundlach, Leonor Brito-Santana, Brian Amaro, Vivek Vajipey, Rynaa Grover, Yiyang Fan, Gabriel Poesia Reis e Silva, Linwei Xin, Yosi Kratish, Jakub Łucki, Wen-Ding Li, Justin Xu, Kevin Joseph Scaria, Freddie Vargus, Farzad Habibi, Long, Lian, Emanuele Rodolà, Jules Robins, Vincent Cheng, Declan Grabb, Ida Bosio, Tony Fruhauff, Ido Akov, Eve J. Y. Lo, Hao Qi, Xi Jiang, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Yibo Jiang, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Muhammad Rehan Siddiqi, Alon Ragoler, Justin Tan, Deepakkumar Patil, Rebeka Plecnik, Aaron Kirtland, Roselynn Grace Montecillo, Stephane Durand, Omer Faruk Bodur, Zahra Adoul, Mohamed Zekry, Guillaume Douville, Ali Karakoc, Tania C. B. 
Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Glen Sherman, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Gözdenur Demir, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Hsiaoyun Milliron, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Ashley Cartwright, Daphiny Pottmaier, Omid Taheri, David Outevsky, Stanley Stepanic, Samuel Perry, Luke Askew, Raúl Adrián Huerta Rodríguez, Abdelkader Dendane, Sam Ali, Ricardo Lorena, Krishnamurthy Iyer, Sk Md Salauddin, Murat Islam, Juan Gonzalez, Josh Ducey, Russell Campbell, Maja Somrak, Vasilios Mavroudis, Eric Vergo, Juehang Qin, Benjámin Borbás, Eric Chu, Jack Lindsey, Anil Radhakrishnan, Antoine Jallon, I. M. J. McInnis, Alex Hoover, Sören Möller, Song Bian, John Lai, Tejal Patwardhan, Summer Yue, Alexandr Wang, Dan Hendrycks

  • Date: 2025-01-24

  • Paper link: https://arxiv.org/pdf/2501.14249

Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.


Qwen2.5-1M Technical Report

  • Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang

  • Date: 2025-01-26

  • Paper link: https://arxiv.org/pdf/2501.15383

Abstract

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.
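The chunked prefill optimization can be sketched in miniature: rather than attending over the full million-token prompt at once, the prompt is processed in fixed-size chunks, each attending to the KV cache accumulated from earlier chunks, so peak activation memory stays bounded. The chunk size and the `attend` stand-in below are illustrative assumptions, not Qwen’s actual inference engine.

```python
# A toy sketch of chunked prefill: process the prompt chunk by chunk,
# carrying the KV cache forward so each chunk can attend to everything
# before it. `attend(chunk, kv_cache)` is a stand-in for a real
# transformer forward pass and returns the new KV entries for the chunk.

def chunked_prefill(tokens, chunk_size, attend):
    """Build the full KV cache for `tokens` in chunks of `chunk_size`."""
    kv_cache = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Peak per-step activation work is bounded by len(chunk).
        kv_cache.extend(attend(chunk, kv_cache))
    return kv_cache
```

With `chunk_size` fixed, attention still sees the whole prefix through the cache, while per-chunk activation memory no longer grows with prompt length.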


Baichuan-Omni-1.5 Technical Report

  • Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen

  • Date: 2025-01-26

  • Paper link: https://arxiv.org/pdf/2501.15368

Abstract

We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.


Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

  • Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu

  • Date: 2025-01-30

  • Paper link: https://arxiv.org/pdf/2501.18585

Abstract

Large language models (LLMs) such as OpenAI’s o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
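The TIP idea, penalizing tokens that tend to open a new line of reasoning during decoding, can be sketched as a simple logit adjustment. The trigger-token list, penalty magnitude, and duration below are illustrative assumptions, not the paper’s hyperparameters.

```python
# A minimal sketch of a TIP-style thought-switching penalty: during the
# first `duration` decoding steps, tokens that typically start a new
# thought (e.g., "Alternatively", "Wait") get their logits reduced,
# discouraging premature switches. Penalty value and duration are
# illustrative assumptions.

def apply_tip(logits, step, switch_token_ids, penalty=3.0, duration=300):
    """Return a copy of `logits` with thought-switch tokens penalized
    while `step` < `duration`; later steps are left untouched."""
    adjusted = list(logits)
    if step < duration:
        for tid in switch_token_ids:
            adjusted[tid] -= penalty
    return adjusted
```

Because the adjustment happens purely at decoding time, no fine-tuning of the model is required, matching the abstract’s claim.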


Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Abstract

Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct on most benchmarks, both of which use over 2M samples. Ablation studies show that CFT is robust to the source of noisy responses and the teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative to advance the reasoning of language models.
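The (input=[query; noisy response], output=critique) format can be sketched as a small data-assembly helper. The prompt template and field names are assumptions for illustration; in the paper, the critiques themselves come from GPT-4o as the teacher.

```python
# A sketch of assembling one CFT training example in the
# (input=[query; noisy response], output=critique) format.
# The prompt wording is an illustrative assumption, not the paper's template.

def make_cft_example(query: str, noisy_response: str, critique: str) -> dict:
    prompt = (
        f"Question:\n{query}\n\n"
        f"Candidate solution:\n{noisy_response}\n\n"
        "Critique the candidate solution, pointing out any errors:"
    )
    # Unlike SFT, the supervision target is the critique, not a correct answer.
    return {"input": prompt, "output": critique}
```

The contrast with SFT is entirely in the target: the model is graded on producing the critique, never on reproducing the noisy response.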


Chain-of-Retrieval Augmented Generation

Abstract

This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model’s test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
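The chain-of-retrieval loop can be sketched as follows: the model alternately retrieves, inspects the evidence gathered so far, and reformulates the next sub-query, under a test-time cap on chain length. `retrieve`, `reformulate`, and `answer` are stand-ins for the trained components, and the control flow is an illustrative reading of the abstract, not the paper’s exact algorithm.

```python
# A toy sketch of CoRAG-style chained retrieval: multiple retrieval steps
# with query reformulation between them, instead of a single retrieval
# before generation. `max_steps` caps the chain length, one of the
# test-time compute knobs the abstract mentions.

def corag_answer(question, retrieve, reformulate, answer, max_steps=4):
    query, chain = question, []
    for _ in range(max_steps):
        passages = retrieve(query)
        chain.append((query, passages))
        next_query = reformulate(question, chain)
        if next_query is None:          # model decides it has enough evidence
            break
        query = next_query
    return answer(question, chain)
```

Scaling `max_steps` (and sampling several chains) is how test-time compute would be traded for answer quality in this sketch.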


Optimizing Large Language Model Training Using FP4 Quantization

  • Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

  • Date: 2025-01-28

  • Paper link: https://arxiv.org/pdf/2501.17116

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
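In simplified numeric form, the outlier-clamping idea combined with rounding to a tiny FP4-like value grid can be sketched as below. The E2M1-style grid and the clamping quantile are illustrative assumptions; the paper’s framework also includes the differentiable quantization estimator and compensation terms, which are omitted here.

```python
# A numeric sketch of two of the ideas above, simplified:
# (1) clamp activation outliers to a quantile threshold, then
# (2) round each value to the nearest entry of a tiny FP4-like grid.
# The E2M1-style grid and 0.99 clamping quantile are illustrative assumptions.

# Positive half of an FP4 (E2M1-style) representable grid, mirrored for negatives.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values, clamp_ratio=0.99):
    """Clamp outliers, scale into the grid's range, round to the nearest level."""
    ordered = sorted(abs(v) for v in values)
    clamp = ordered[int(clamp_ratio * (len(ordered) - 1))] or 1.0
    scale = clamp / FP4_GRID[-1]                  # map clamp -> largest level
    out = []
    for v in values:
        x = max(-clamp, min(clamp, v)) / scale    # clamp, then normalize
        level = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
        out.append(level * scale if v >= 0 else -level * scale)
    return out
```

Clamping before scaling is what keeps a single huge outlier from collapsing all other values onto the grid’s zero entry, the failure mode the abstract calls activation collapse.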


Atla Selene Mini: A General Purpose Evaluation Model

  • Authors: Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park

  • Date: 2025-01-27

  • Paper link: https://arxiv.org/pdf/2501.17195

Abstract

We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B) and Ollama to encourage widespread community adoption.


RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

  • Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin

  • Date: 2025-01-24

  • Paper link: https://arxiv.org/pdf/2501.14492

Abstract

Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.


Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

  • Authors: Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc’Aurelio Ranzato, Paul Barham

  • Date: 2025-01-30

  • Paper link: https://arxiv.org/pdf/2501.18512

Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameter models and reach similar quality as before, while reducing the required bandwidth by two orders of magnitude.
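The first improvement, synchronizing parameter shards in sequence rather than all at once, can be sketched in miniature; the overlap with compute and the quantization of exchanged data are omitted. The round-robin shard assignment below is an illustrative assumption.

```python
# A toy sketch of streaming shard synchronization: each outer step averages
# only one shard of parameters across workers, so peak bandwidth drops by
# roughly the number of shards. The round-robin shard assignment (by
# parameter index) is an illustrative assumption.

def sync_shard(workers, shard_index, num_shards):
    """Average shard `shard_index` of each worker's parameter list in place."""
    for i, _ in enumerate(workers[0]):
        if i % num_shards != shard_index:
            continue                      # this parameter waits for its turn
        mean = sum(w[i] for w in workers) / len(workers)
        for w in workers:
            w[i] = mean
```

Calling `sync_shard` with `shard_index = step % num_shards` at each outer step spreads the full synchronization over `num_shards` steps instead of one burst.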


Towards General-Purpose Model-Free Reinforcement Learning

Abstract

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.


Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
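One hedged way to picture a decoupled, multi-gram input vocabulary is a hashed bigram embedding added to the ordinary unigram embedding, leaving the output vocabulary unchanged. The hashing trick, table sizes, and additive combination here are assumptions for illustration, not the paper’s construction.

```python
# A toy sketch of an over-tokenized *input* side: each position is embedded
# as unigram(t_i) + bigram(t_{i-1}, t_i), with bigrams hashed into a larger
# table. The output vocabulary (and softmax) would stay unigram-sized.
# Table sizes and the hashing trick are illustrative assumptions.
import random

random.seed(0)
DIM, UNIGRAMS, BIGRAM_BUCKETS = 4, 100, 1000
uni_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(UNIGRAMS)]
bi_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BIGRAM_BUCKETS)]

def embed(tokens):
    """Embed each position as unigram(t_i) + hashed bigram(t_{i-1}, t_i)."""
    out = []
    for i, t in enumerate(tokens):
        vec = list(uni_table[t])
        if i > 0:
            bucket = hash((tokens[i - 1], t)) % BIGRAM_BUCKETS
            vec = [a + b for a, b in zip(vec, bi_table[bucket])]
        out.append(vec)
    return out
```

Because only an embedding lookup is added, the larger input vocabulary enlarges the tables but adds essentially no compute per token, consistent with the "no additional cost" claim.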


Redundancy Principles for MLLMs Benchmarks

  • 作者: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai

  • 日期: 2025-01-20

  • 论文链接: https://arxiv.org/pdf/2501.13953

摘要

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) redundancy of benchmark capability dimensions, 2) redundancy in the number of test questions, and 3) cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
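
One simple way to make cross-benchmark redundancy concrete (a sketch of the general idea, not necessarily the paper's exact metric) is to correlate model scores across two benchmarks: if they rank models almost identically, the second benchmark adds little information. The scores below are invented for demonstration:

```python
# Hedged illustration: quantifying cross-benchmark redundancy as the
# correlation of model scores. The accuracies below are hypothetical.
import statistics

# Hypothetical accuracy of five models on two benchmarks.
bench_a = [62.0, 55.5, 71.2, 48.9, 66.3]
bench_b = [60.1, 54.0, 70.5, 50.2, 64.8]

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson(bench_a, bench_b)
print(f"score correlation: {r:.3f}")  # values near 1.0 suggest redundancy
```

Running the same computation over every benchmark pair yields a redundancy matrix, which is the kind of quantitative view the paper argues benchmark builders should consult.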


RL + Transformer = A General-Purpose Problem Solver

摘要

What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before, an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.
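
The ICRL loop can be sketched in a few lines: the model's context accumulates (observation, action, reward) tuples across episodes, so later episodes improve on earlier ones without any weight update. Everything below is a toy stand-in; `policy` is a deterministic stub playing the role of the fine-tuned transformer:

```python
# Minimal sketch of an in-context RL control loop. The policy is a stub,
# not a transformer; the single-step bandit "environment" is hypothetical.
def policy(context, observation):
    """Stub for the model: exploit actions that earned reward in context."""
    rewarded = [a for (_, a, r) in context if r > 0]
    if rewarded:
        return rewarded[-1]
    tried = {a for (_, a, _) in context}
    for a in (0, 1, 2):  # otherwise try an action not yet in context
        if a not in tried:
            return a
    return 0

def run_episode(context, target_action=2):
    obs = 0
    action = policy(context, obs)
    reward = 1 if action == target_action else 0
    context.append((obs, action, reward))  # history persists across episodes
    return reward

context = []  # the growing "prompt": no weight updates, only more context
returns = [run_episode(context) for _ in range(10)]
print(returns)  # → [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```

The rising return curve, produced purely by conditioning on past episodes, is the behavior the abstract calls iteratively improving on its own solutions.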


ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

摘要

Hybrid models combining quadratic and subquadratic attention in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. To further investigate expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.


o3-mini vs DeepSeek-R1: Which One is Safer?

摘要

The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and for LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths, and automated program repair, at an apparently lower execution cost. However, LLMs must also adhere to an important qualitative property: their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety, and cost. In this paper, we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1,260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so to only 1.19%.
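
As a quick sanity check on the reported rates (plain arithmetic, not ASTRAL code): with 1,260 unsafe prompts per model, the percentages correspond to roughly 151 unsafe answers for DeepSeek-R1 versus about 15 for o3-mini:

```python
# Back-of-envelope check of the abstract's unsafe-answer rates.
total_prompts = 1260
unsafe_r1 = round(total_prompts * 0.1198)  # 11.98% reported for DeepSeek-R1
unsafe_o3 = round(total_prompts * 0.0119)  # 1.19% reported for o3-mini
print(unsafe_r1, unsafe_o3)  # → 151 15
```

That is an order-of-magnitude gap in unsafe responses over the same test suite, which is what grounds the paper's "highly unsafe as compared to" conclusion.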


Large Language Models Think Too Fast To Explore Effectively

摘要

Large Language Models have developed many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show that most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans, who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented in earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, which hinders effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.


MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

  • 作者: Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

  • 日期: 2025-01-30

  • 论文链接: https://arxiv.org/pdf/2501.18362

摘要

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets: Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.


DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

摘要

Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptation of numerous image-generation techniques to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.


WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

摘要

Large language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generation models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples, and code are available at https://github.com/penfever/wildchat-50m.


Exploring the sustainable scaling of AI dilemma: A projective study of corporations’ AI environmental impacts

  • 作者: Clément Desroches, Martin Chauvin, Louis Ladan, Caroline Vateau, Simon Gosset, Philippe Cordier

  • 日期: 2025-01-24

  • 论文链接: https://arxiv.org/pdf/2501.14334

摘要

The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity of major providers hinders companies' ability to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without requiring extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4,600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high-adoption scenario, driven by widespread adoption of generative AI and agents associated with increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.
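
To get intuition for the 24.4x figure, assume (an assumption of this sketch, not a claim from the paper) that the growth compounds evenly over the roughly six years from 2024 to 2030; the implied annual growth rate is then:

```python
# Back-of-envelope: implied compound annual growth behind a 24.4x rise
# over 6 years. The even-compounding assumption is this sketch's, not the paper's.
factor, years = 24.4, 6
annual_growth = factor ** (1 / years) - 1
print(f"implied compound annual growth: {annual_growth:.1%}")
```

A sustained growth rate of roughly 70% per year illustrates why the paper argues that isolated efficiency gains in hardware or grids cannot offset the projected demand on their own.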


SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

  • 作者: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han

  • 日期: 2025-01-30

  • 论文链接: https://arxiv.org/pdf/2501.18427

摘要

This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: a depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: a block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: a repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
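
The repeated-sampling idea in innovation (3) follows the generic best-of-N pattern: draw several candidates and keep the one a scorer prefers. The generator and scorer below are toy stand-ins, not SANA-1.5 components:

```python
# Generic best-of-N sketch of inference-time scaling via repeated sampling.
# generate/score are hypothetical placeholders for a diffusion sampler and
# an alignment scorer.
import random

def score(candidate):
    return candidate  # toy scorer: the value itself (higher is better)

def best_of_n(candidates):
    """Keep whichever candidate the scorer prefers."""
    return max(candidates, key=score)

random.seed(42)
samples = [random.random() for _ in range(16)]  # 16 toy "generations"
pick = best_of_n(samples)
print(f"first sample: {samples[0]:.3f}, best of 16: {pick:.3f}")
```

Spending 16x the inference compute can only match or improve on a single draw under this selection rule, which is how a smaller model buys back quality at inference time.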


Open Problems in Mechanistic Interpretability

  • 作者: Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath

  • 日期: 2025-01-27

  • 论文链接: https://arxiv.org/pdf/2501.16496

摘要

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.


PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

摘要

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world, likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle this shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.


Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

  • 作者: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

  • 日期: 2025-01-27

  • 论文链接: https://arxiv.org/pdf/2501.15907

摘要

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that captures spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

