The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Introduction: The Dawn of AI Transforming Scientific Research

The scientific method is a cornerstone of human civilization; it has driven countless technological breakthroughs and improved quality of life. Yet the traditional research process is constrained by researchers' creativity, background knowledge, and limited time. In recent years, the rapid progress of artificial intelligence, and of large language models (LLMs) in particular, has opened new prospects for automating scientific research.

The AI Scientist: An Automated Engine for Scientific Discovery

This article introduces The AI Scientist, the first fully automated framework for scientific discovery, capable of conducting research independently and presenting its findings in paper form. The AI Scientist can generate novel research ideas, write code, run experiments, visualize results, write up complete scientific papers, and finally evaluate its output through a simulated peer-review process.

The AI Scientist's Workflow: From Idea to Paper

The workflow consists of three main stages: (1) idea generation, (2) experiment iteration, and (3) paper write-up.

  1. Idea Generation: Sparking Creative Insight

Given a research direction and a starter code base, The AI Scientist first "brainstorms" a set of novel research directions. It iteratively builds an archive of ideas, each containing a description, an experiment execution plan, and self-assessed scores for interestingness, novelty, and feasibility. The AI Scientist uses the Semantic Scholar API and web search to filter out ideas too similar to existing literature, ensuring the novelty of its research.
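The idea-archive loop described above can be sketched as follows. This is a minimal illustration only: `propose_idea` and `count_similar_papers` are hypothetical stand-ins for the LLM brainstorming call and the Semantic Scholar / web-search novelty check, neither of which is specified in detail here.

```python
# Minimal sketch of the idea-generation loop (not the actual implementation).

def propose_idea(topic, archive):
    """Stand-in for an LLM 'brainstorming' call conditioned on past ideas."""
    return {
        "name": f"{topic}-idea-{len(archive)}",
        "description": "...",        # free-text description of the idea
        "experiment_plan": "...",    # how the idea would be tested
        "interestingness": 7,        # self-assessed scores (e.g. 1-10)
        "novelty": 8,
        "feasibility": 6,
    }

def count_similar_papers(idea):
    """Stand-in for a literature search; returns # of near-duplicate papers."""
    return 0

def generate_ideas(topic, n_ideas=5, max_similar=0):
    """Iteratively grow an idea archive, discarding non-novel proposals."""
    archive = []
    while len(archive) < n_ideas:
        idea = propose_idea(topic, archive)
        # Filter out ideas that overlap too much with existing literature.
        if count_similar_papers(idea) > max_similar:
            continue
        archive.append(idea)
    return archive
```

In the real system the scores and the plan come from the model itself; here they are fixed placeholders so the control flow is visible.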

  2. Experiment Iteration: The Crucible for Testing Hypotheses

Once a research direction is fixed, The AI Scientist uses the coding assistant Aider to plan and execute the experiments. Aider modifies the code according to the experiment plan, runs the experiments, and collects the results. If an experiment fails or times out, Aider attempts to repair the code and re-run it, up to four times. After the experiments complete, Aider writes up notes on the results and, based on these notes, edits the plotting script to produce the figures the paper needs.
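The run-and-repair behaviour described above (retry on failure, up to four attempts) can be sketched as a small loop. `run_experiment` and `ask_assistant_to_fix` are hypothetical stand-ins; the source names Aider but does not show its API.

```python
# Sketch of the experiment retry loop: run, and on failure hand the error
# back to the coding assistant for a repair, up to MAX_ATTEMPTS times.

MAX_ATTEMPTS = 4

def run_with_retries(run_experiment, ask_assistant_to_fix):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return run_experiment()      # returns the collected results
        except Exception as err:
            if attempt == MAX_ATTEMPTS:
                raise                    # give up after the final attempt
            ask_assistant_to_fix(err)    # e.g. feed the traceback to Aider
```

A timeout would be handled the same way, by treating it as one more failure that consumes an attempt.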

  3. Paper Write-Up: Turning Research Results into Text

Finally, from the collected experimental results and figures, The AI Scientist automatically writes a complete scientific paper in the format of a standard machine-learning conference submission. The write-up proceeds in the following steps:

* **Per-section text generation:** Using the experiment notes and figures, The AI Scientist fills in a blank conference-paper template section by section: introduction, background, method, experimental setup, results, and conclusion.
* **Web search for references:** It queries the Semantic Scholar API for relevant literature and automatically adds the references to the paper.
* **Refinement:** It polishes the generated draft, removing duplicated information and tightening the paper's argumentation.
* **Compilation:** It compiles the generated LaTeX into a PDF, using a LaTeX linter and Aider to automatically fix compilation errors.
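The final compile-and-fix step can be sketched as a bounded loop. `compile_latex` and `fix_errors` are hypothetical stand-ins for the LaTeX toolchain/linter and the Aider repair call; the bound of three rounds is an illustrative assumption, not a figure from the source.

```python
# Sketch of the compile-and-fix loop: compile the LaTeX source and, while
# the compiler reports errors, hand them back to the assistant for repair.

def compile_until_clean(source, compile_latex, fix_errors, max_rounds=3):
    for _ in range(max_rounds):
        errors = compile_latex(source)   # returns [] when compilation succeeds
        if not errors:
            return source
        source = fix_errors(source, errors)
    return source                        # best effort after max_rounds
```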

Automated Paper Reviewing: The Birth of an AI Reviewer

To assess the quality of the papers The AI Scientist produces, we designed an automated paper-review system based on GPT-4o. Following the NeurIPS review guidelines, the system scores each paper and produces a written review.
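The reviewer's structure can be sketched as: prompt a model with the guidelines plus the paper text, collect per-criterion scores, and map the overall score to a decision. `ask_model` is a hypothetical stand-in for the GPT-4o call, and the criterion names and threshold below are illustrative assumptions, not details from the source.

```python
# Sketch of an automated reviewer: per-criterion scoring plus a decision.

CRITERIA = ["soundness", "presentation", "contribution", "overall"]

def review_paper(paper_text, ask_model, accept_threshold=6):
    """Score a paper on each criterion (1-10) and derive accept/reject."""
    scores = {c: ask_model(paper_text, c) for c in CRITERIA}
    decision = "accept" if scores["overall"] >= accept_threshold else "reject"
    return scores, decision
```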

Evaluating the Automated Reviewer: Near-Human Review Performance

We evaluated the automated reviewer on the ICLR 2022 OpenReview dataset. The results show that the system approaches human-level performance on several metrics, e.g. balanced accuracy (65% vs. 66% for humans). The correlation between the automated reviewer's scores and the average human score (0.18) is higher than the score correlation between two randomly sampled human reviewers (0.14).
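For reference, the balanced accuracy reported above is the mean of per-class recall, which stays meaningful when accept/reject labels are imbalanced:

```python
# Balanced accuracy: average the recall of each class, so a majority-class
# guesser cannot score well on an imbalanced accept/reject dataset.

def balanced_accuracy(y_true, y_pred):
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```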

Experimental Results: Low-Cost, High-Quality Paper Generation

We ran extensive experiments with The AI Scientist across three machine-learning subfields: diffusion modeling, language modeling, and the "grokking" phenomenon. The results show that The AI Scientist can produce hundreds of high-quality papers within a week, at a cost of under $15 per paper.

Case Study: How The AI Scientist Investigates Diffusion Models

As an example of the workflow and its output, consider a paper generated by The AI Scientist titled "Adaptive Dual-Scale Denoising". The paper studies how to improve a diffusion model's ability to capture both global structure and local detail on 2D datasets.

  • Idea generation: The AI Scientist proposed a novel idea: add two branches to the standard denoising network, handling global and local features respectively.
  • Experiment iteration: It used Aider to modify the code, implemented the idea, and ran the experiments.
  • Paper write-up: From the experimental results, it automatically produced an 11-page paper with figures and all the standard sections.
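The two-branch idea from the case study can be sketched at the level of how the branch outputs are combined. The branch internals and the blending weight are illustrative assumptions; only the global/local split described above comes from the source.

```python
# Sketch of dual-scale denoising: two branches process the same input,
# and a weight blends their outputs element-wise.

def dual_scale_denoise(x, global_branch, local_branch, w=0.5):
    """Blend a global-structure branch with a local-detail branch."""
    g = global_branch(x)   # e.g. operates on a downscaled view of x
    d = local_branch(x)    # e.g. operates on the full-resolution view
    return [w * gi + (1 - w) * di for gi, di in zip(g, d)]
```

In a trained model the weight would itself be learned (hence "adaptive"); a fixed scalar is used here only to keep the sketch self-contained.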

Limitations: Human Guidance and Oversight Are Still Needed

Although The AI Scientist can produce high-quality papers, it still has several limitations, for example:

  • Unexplained design choices: It may fail to justify some of the design choices in its algorithms.
  • Errors in experimental details: It may misreport experimental details, such as the hardware or software versions used.
  • Over-interpretation of results: It may over-interpret its experimental results, for instance describing negative results as "improvements".
  • Too few references: Its generated papers may contain relatively few references.

Ethical Considerations: Potential Risks of The AI Scientist

The development of The AI Scientist also raises ethical concerns, for example:

  • Impact on the peer-review process: Automatically generated papers could increase reviewer workload and degrade the quality of peer review.
  • Potential for misuse: The AI Scientist could be used to conduct unethical research or to generate harmful software.

Outlook: Future Directions for The AI Scientist

Going forward, The AI Scientist could develop in the following directions:

  • Integrating vision capabilities: enabling it to better handle figures and images.
  • Incorporating human feedback: allowing human researchers to correct and improve its output.
  • Extending to other sciences: applying it to fields such as biology, chemistry, and materials science.

Conclusion: The AI Scientist Will Accelerate Scientific Progress

The advent of The AI Scientist marks an important step for artificial intelligence in scientific research. By automating the process of scientific discovery, The AI Scientist promises to accelerate scientific progress and to offer new ideas and methods for tackling humanity's major challenges.
