LLM Benchmarks: "BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese": Translation and Interpretation
Overview: This paper introduces BrowseComp-ZH, a benchmark for evaluating the web browsing ability of large language models (LLMs) in the Chinese web environment. Built through reverse design and rigorous quality control, BrowseComp-ZH comprises 289 high-difficulty, multi-constraint Chinese web-browsing questions for assessing LLMs' browsing and reasoning abilities. Experimental results show that most LLMs perform poorly on the benchmark, with only a handful exceeding 20% accuracy, underscoring the challenge of effective information retrieval and reasoning in the Chinese web environment. The benchmark provides an important reference for evaluating the real-world capabilities of LLMs.
>> Background and Pain Points
● Existing benchmarks for web browsing ability focus mainly on English, overlooking the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems, most notably Chinese.
● Directly translating English benchmarks into Chinese does not yield a meaningful evaluation framework, because it ignores the retrieval and reasoning challenges unique to the Chinese web: information scattered across heterogeneous platforms, inconsistent naming conventions, search engines that fail to reliably index deep or domain-specific content, and the linguistic properties of Chinese itself (e.g., implicit referents, idiomatic expressions, and context-dependent syntax).
>> Proposed Solution
● The paper proposes BrowseComp-ZH, a benchmark purpose-built to evaluate LLM agents in the Chinese web environment.
● The dataset is constructed by reverse design: starting from a short, objective, and easily verifiable answer, annotators craft questions that require multi-step reasoning and information integration to answer.
● Two-stage quality control is applied: Stage 1 screens out questions that are easily answered by keyword search; Stage 2 ensures answer uniqueness by having human annotators review the answers produced by several top AI agents (a minimal code sketch of the Stage-1 filter follows this list).
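As a rough illustration of the Stage-1 filter, here is a minimal Python sketch (not the authors' code; `passes_stage1` and the stubbed engine are hypothetical). A candidate question survives only if none of the search engines surfaces its gold answer on the first results page:

```python
from typing import Callable, Iterable

def passes_stage1(
    question: str,
    answer: str,
    engines: Iterable[Callable[[str], str]],
) -> bool:
    """Stage-1 retrievability screen: keep a candidate question only if
    no engine surfaces the gold answer on its first results page.
    Each element of `engines` maps a query string to the concatenated
    text of that engine's first-page results."""
    for first_page_of in engines:
        if answer in first_page_of(question):
            return False  # answer directly retrievable by keyword search
    return True

# Toy usage with a stubbed engine; real screening would query the three
# widely-used search engines employed by the annotators.
stub_engine = lambda query: "unrelated first-page snippets"
print(passes_stage1("multi-constraint question", "1997-07-01",
                    [stub_engine] * 3))  # -> True: the question is kept
```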
>> Core Approach
● Topic and answer selection: expert annotators choose topics from 11 predefined domains and select objective, specific, and easily verifiable factual answers.
● Reverse question design: based on the chosen answer, annotators craft questions that require multi-step reasoning and information integration, ensuring the answer cannot be directly retrieved through mainstream search engines.
● Quality control: a two-stage protocol ensures question difficulty and answer uniqueness.
● Model evaluation: more than 20 state-of-the-art language models and agentic search systems are benchmarked on BrowseComp-ZH, measuring both accuracy and calibration error (see the sketch after this list).
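The evaluation reports accuracy alongside calibration error. As a sketch of the standard expected-calibration-error computation (the paper's exact grading and confidence-elicitation protocol may differ; the `Prediction` type and binning scheme below are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Prediction:
    correct: bool      # graded answer correctness
    confidence: float  # model's self-reported confidence in [0, 1]

def accuracy(preds: List[Prediction]) -> float:
    return sum(p.correct for p in preds) / len(preds)

def calibration_error(preds: List[Prediction], n_bins: int = 10) -> float:
    """Expected calibration error: bin predictions by stated confidence,
    then average |bin accuracy - mean bin confidence|, weighted by bin size."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        bin_acc = sum(p.correct for p in b) / len(b)
        bin_conf = sum(p.confidence for p in b) / len(b)
        ece += (len(b) / len(preds)) * abs(bin_acc - bin_conf)
    return ece

# Example: two graded predictions.
preds = [Prediction(True, 0.9), Prediction(False, 0.8)]
print(f"accuracy={accuracy(preds):.2f}, ECE={calibration_error(preds):.2f}")
```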
>> Strengths
● The first benchmark of LLM web browsing ability targeting the Chinese web environment.
● Challenging question design requiring multi-step reasoning and information integration.
● Answers are unique and easily verifiable.
● Covers a wide range of LLM types and AI search products.
>> Conclusions and Takeaways
● Most standalone LLMs perform poorly on BrowseComp-ZH, with accuracy typically below 10%.
● Models with strong reasoning capabilities perform better.
● Agentic systems that incorporate external retrieval mechanisms tend to achieve higher scores.
● Not all systems benefit from retrieval integration.
● BrowseComp-ZH not only challenges models' browsing and reasoning abilities but also exposes their limitations in processing, evaluating, and aligning retrieved information with internal representations.
"BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese": Translation and Interpretation
Link | Paper: [2504.19314] BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese |
Date | April 27, 2025; latest version: May 11, 2025 |
Authors | Hong Kong University of Science and Technology, Peking University, Zhipu AI, Alibaba Group, Zhejiang University, et al. |
Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems, most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on the proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies but also sophisticated reasoning and information reconciliation, capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at this https URL.
Figure 1: Two data samples from BrowseComp-ZH with their English translations.
1. Introduction
As large language models (LLMs) evolve from static knowledge repositories to dynamic agents capable of using external tools, tasks that involve web browsing have emerged as a critical lens through which to evaluate their real-world reasoning and information-seeking capabilities (Xi et al., 2025). By interacting with search engines and navigating live web content, LLMs can augment their internal knowledge with up-to-date external evidence, retrieve context-specific information, and perform multi-hop reasoning across heterogeneous sources. This browsing ability extends the temporal scope of LLMs while enabling them to tackle questions that lie beyond the reach of pretraining, such as time-sensitive facts or obscure entity relations that require targeted retrieval.

While an increasing number of studies (Li et al., 2023a; Fernández-Pichel et al., 2024; Fan et al., 2024; Vu et al., 2023; Lai et al., 2025; He et al., 2024; Lai et al., 2024; Xiong et al., 2024) demonstrate that web browsing greatly improves LLM performance on downstream tasks, there remains a surprising lack of direct evaluation of browsing capabilities themselves, i.e., the ability of LLMs to effectively retrieve, filter, and reason over information from the web. This evaluation is crucial for assessing the true web-browsing competence of LLMs and understanding their potential to tackle real-world tasks that require dynamic information retrieval. To address this gap, Wei et al. (2025) introduced a benchmark of reverse-designed, evidence-grounded queries that challenge English-language agents to search and reason over difficult-to-access information.
Nonetheless, Wei et al. (2025) operates primarily within the English-language web, missing the linguistic, structural, and cultural complexities inherent to other language environments. In particular, the Chinese web poses unique challenges for retrieval and reasoning: information is scattered across heterogeneous platforms (e.g., Baidu Baike, Zhihu, government portals), naming conventions are inconsistent, and search engines often fail to reliably index deep or domain-specific content. Moreover, the linguistic properties of Chinese, such as implicit referents, idiomatic expressions, and context-dependent syntax, frequently break standard keyword-based retrieval paths.

Crucially, directly translating English browsing benchmarks into Chinese does not yield a meaningful evaluation framework. This approach fails on several fronts: structural and idiomatic differences render many translated queries unnatural or ineffective in actual search; canonical information pathways in English (e.g., Wikipedia, IMDb) lack equivalents or exhibit vastly different structures on the Chinese web; and translated queries often collapse into keyword matches, trivializing the intended challenge. The Chinese web also presents its own set of retrieval obstacles: fragmented and platform-specific content, inconsistent naming conventions, and linguistic traits such as ellipsis, cultural references, and implicit reasoning that undermine linear search strategies. As a result, existing English-centric benchmarks struggle to generalize, leaving a critical gap in assessing real-world browsing capabilities in non-English environments. We argue that web-based benchmarks must be natively constructed within the Chinese information ecosystem, where the search logic, content structure, and linguistic context authentically reflect the challenges faced by agents operating in Chinese-language settings.
To fill this gap, we introduce BrowseComp-ZH, the first benchmark specifically designed to evaluate web-enabled LLMs in the Chinese information environment. Mirroring the philosophy of the original BrowseComp, each item is created by reverse design: expert annotators (all holding at least a master's degree and possessing both LLM expertise and topical domain knowledge) begin with a single, factual answer and craft a multi-constraint query that is hard to retrieve yet trivially verifiable. BrowseComp-ZH extends the original framework in two crucial directions: (1) Native Chinese construction. All queries, evidence chains, and browsing steps are authored in Chinese and fully localized. Every candidate query is verified on three widely-used search engines and is accepted only if the answer is not surfaced in the first-page results of any engine. (2) Two-stage quality control. Stage 1 screens for keyword retrievability as above. Stage 2 ensures the uniqueness of the answers: in a human-in-the-loop process, multiple top-performing AI agents first reason over each question and provide answers, and annotators then manually verify whether any additional answers satisfy the question's constraints.

The final BrowseComp-ZH dataset consists of 289 complex questions, each with multiple constraints and a unique answer, spanning 11 diverse domains, including film, technology, law, and medicine. These questions require multi-step reasoning and are difficult to answer directly through search engines, making them ideal for evaluating the ability of Chinese LLM agents to conduct multi-hop retrieval, perform factual reasoning, and integrate online information. Fig. 1 illustrates two representative examples drawn from BrowseComp-ZH. Based on this dataset, we benchmark the performance of more than 20 systems, encompassing open-source models (e.g., DeepSeek R1 (Guo et al., 2025), Qwen-2.5-72B-Instruct (Yang et al., 2024)), closed-source APIs (e.g., GPT-4o (OpenAI, 2024a), Claude-3.7 (Anthropic, 2025), Gemini-2.5-Pro (Google, 2024a)), and AI search agents (e.g., DeepResearch (OpenAI, 2025a), DeepSeek (DeepSeek, 2025), Perplexity (Perplexity, 2025), and Doubao (ByteDance, 2025)).
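To make the Stage-2 protocol concrete, here is a minimal sketch (assumed function names, not from the paper). Several strong agents answer each question, and only answers that disagree with the gold answer are queued for review; the final uniqueness judgment is made by human annotators, not by code:

```python
from typing import Callable, List

def stage2_flagged_answers(
    question: str,
    gold_answer: str,
    agents: List[Callable[[str], str]],
) -> List[str]:
    """Collect answers from several top-performing AI agents that differ
    from the gold answer. Annotators then manually check each flagged
    answer against the question's constraints; if an alternative also
    satisfies every constraint, the question is revised or discarded."""
    flagged: List[str] = []
    for agent in agents:
        answer = agent(question).strip()
        if answer != gold_answer and answer not in flagged:
            flagged.append(answer)
    return flagged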
Our evaluation offers a nuanced characterization of model capabilities across different system designs. Based on our analysis, we highlight several key findings:
1. Naive large language models, irrespective of parameter scale or training data size, exhibit consistently poor performance on the benchmark, with accuracies often remaining below 10%. Representative examples include Qwen2.5-72B-Instruct, Llama 4, and GPT-4o.
2. Models endowed with inherent reasoning capabilities exhibit consistent improvements in accuracy. For example, DeepSeek-R1 surpasses DeepSeek-V3 by 14.5%, while Claude-3.7-Sonnet outperforms Claude-3.5 by 12.2%.
3. Agentic systems augmented with external retrieval mechanisms tend to achieve comparatively higher scores. Notably, OpenAI's DeepResearch and Doubao (Deep Search) demonstrate that well-orchestrated retrieval and reasoning pipelines can substantially enhance performance, achieving accuracies of 42.9% and 26.0%, respectively.
4. While well-designed retrieval pipelines can significantly improve performance, not all systems benefit from retrieval integration. For instance, enabling web search for DeepSeek-R1 results in a substantial decline in performance, with accuracy dropping from 23.2% in the direct-answer (no web access) setting to 7.6% when web search is enabled.
These findings suggest that BrowseComp-ZH not only challenges models' browsing and reasoning capabilities but also exposes their limitations in processing, evaluating, and aligning retrieved information with internal representations. Detailed performance statistics across model categories are summarized in Table 1.
Conclusion
This study introduces BrowseComp-ZH, the first benchmark specifically designed to evaluate the web browsing and reasoning capabilities of large language models (LLMs) in the Chinese information environment. Inspired by the BrowseComp benchmark, we construct challenging question-answer pairs that require multi-hop retrieval, information filtering, and logical reasoning to derive concise, factual answers. To ensure the high difficulty of each question and the uniqueness of each answer, we implement a rigorous two-stage quality control pipeline that includes a three-engine keyword validation process and human-in-the-loop verification, ensuring that answers are both difficult to retrieve and unambiguous.

We construct 289 high-quality samples across 11 diverse topics, including Film & TV, History, Technology, Medicine, and more. Using these tasks, we conduct extensive evaluations of over 20 systems, encompassing open-source LLMs, closed-source APIs, and AI search products. As shown in Tab. 1, most standalone LLMs, such as GPT-4o, Qwen2.5-72B, and Llama-4, achieve limited accuracy, highlighting the difficulty of the benchmark. Models with stronger reasoning abilities, such as O1 (29.1%) and Gemini-2.5-Pro (27.3%), demonstrate substantial improvements, underscoring the critical role of reasoning in complex question answering. AI search products employing multi-turn retrieval, including DeepResearch (42.9%) and Doubao (Deep Search) (26.0%), further outperform purely parametric models, illustrating the effectiveness of test-time scaling through iterative retrieval. These results reflect the inherent challenges of the Chinese web environment, where fragmented information and inconsistent indexing complicate single-shot search strategies.
Limitations. Despite the innovations in BrowseComp-ZH's design and quality control, several limitations remain. First, the current dataset is relatively small; increasing both the sample size and the diversity of question types would improve its representativeness. Second, although we apply rigorous validation to ensure answer uniqueness, it cannot be fully guaranteed. In particular, the dynamic nature of the web means that factual answers may evolve or become inconsistent over time, making stability and reproducibility an ongoing challenge.
Future Work. In future work, we plan to incorporate a broader range of questions to enable more comprehensive and accurate evaluations of the models. We will also conduct an in-depth analysis of their reasoning mechanisms and search strategies. Furthermore, additional case studies will be carried out to examine failure cases across different models. Finally, we aim to explore methods to further improve models' browsing and reasoning capabilities, such as leveraging post-training techniques like reinforcement learning.