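Materials science has benefited greatly from advances in machine learning (ML) and deep learning (DL). These techniques have transformed the prediction of molecular properties and changed traditional computational approaches. As indispensable tools for data-driven materials science, ML/DL techniques continue to improve the accuracy and speed of property prediction.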
Fig. 1 Overview of extrapolative prediction of molecular property based on the range of molecular properties and the diversity of molecular structures.
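However, ML/DL techniques still face a fundamental contradiction rooted in their inherent difficulty with extrapolation, that is, with predicting beyond the available data. The primary goal of data-driven materials exploration is to identify high-performance molecules and materials that do not yet appear in databases. ML/DL models must therefore be able to infer unknown data from existing data alone.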
Fig. 2 Model description used for the benchmark.
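Materials datasets, however, usually consist of small sets of experimental results and are therefore inevitably biased. It is crucial to determine whether ML/DL models can overcome these biases and extrapolate molecular properties effectively.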
Fig. 3 Evaluation methods for assessing interpolation and extrapolative performance.
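Hajime Shimakawa et al., from the Department of Electrical Engineering and Information Systems, School of Engineering, University of Tokyo, Japan, presented a comprehensive benchmark for assessing extrapolative performance across 12 organic molecular properties. Their large-scale benchmark shows that conventional ML models degrade markedly outside the training distribution of property range and molecular structure, particularly for small-data properties.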
Fig. 4 Evaluation results of the interpolation test using all data points of each dataset and extrapolation tests of property range and molecular structure (cluster) at data size for interpolation Nin = 200 (50 for EBD) with RMSE relative to σall, where σall represents the standard deviation of each dataset as listed in Table 1.
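The metric named in the Fig. 4 caption, RMSE relative to σall, and a property-range extrapolation test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `relative_rmse` and `property_range_split`, the choice to hold out the top 20% of property values, the use of the population standard deviation for σall, and the toy data are all hypothetical.

```python
import math
import statistics


def relative_rmse(y_true, y_pred, sigma_all):
    """RMSE divided by the standard deviation of the full dataset (sigma_all)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / sigma_all


def property_range_split(y, holdout_fraction=0.2):
    """Return (train_idx, test_idx); the test set holds the largest property values.

    Assumption: extrapolation is probed by holding out the upper tail of the
    property distribution. The exact protocol in the paper may differ.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    n_test = max(1, int(round(len(y) * holdout_fraction)))
    return order[:-n_test], order[-n_test:]


# Toy property values (hypothetical, not from any dataset in the paper).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
sigma_all = statistics.pstdev(y)  # assumption: population standard deviation

train_idx, test_idx = property_range_split(y, holdout_fraction=0.2)

# Trivial baseline: always predict the mean of the training targets.
train_mean = statistics.fmean(y[i] for i in train_idx)
y_test = [y[i] for i in test_idx]
score = relative_rmse(y_test, [train_mean] * len(y_test), sigma_all)

print(f"held-out indices: {test_idx}, relative RMSE of mean baseline: {score:.3f}")
```

A relative RMSE above 1 means the model does worse on the held-out range than a predictor that always returns the mean of the full dataset, which is why a mean-of-training baseline scores poorly here.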
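To address this challenge, they introduced a quantum-mechanical (QM) descriptor dataset called QMex, together with an interactive linear regression (ILR) that includes interaction terms between the QM descriptors and categorical information on molecular structure. The QMex-based ILR achieved state-of-the-art extrapolative performance while remaining interpretable.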
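The idea of a linear model with interaction terms between descriptors and structural categories can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names (`interaction_features`, `fit_ilr`, `predict_ilr`), the use of ordinary least squares via `np.linalg.lstsq`, the single descriptor, the single binary structure flag, and the noise-free toy data are all assumptions made for illustration.

```python
import numpy as np


def interaction_features(X, C):
    """Build the design matrix [descriptors, descriptor x category products, bias].

    X: (n, d) continuous descriptors (stand-ins for QM descriptors).
    C: (n, k) binary indicators of structural categories (hypothetical).
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    # Per-sample products of every descriptor with every category flag: (n, d*k).
    inter = (X[:, :, None] * C[:, None, :]).reshape(X.shape[0], -1)
    bias = np.ones((X.shape[0], 1))
    return np.hstack([X, inter, bias])


def fit_ilr(X, C, y):
    """Fit the coefficients by ordinary least squares (an assumption)."""
    A = interaction_features(X, C)
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_ilr(coef, X, C):
    return interaction_features(X, C) @ coef


# Toy data: y = 2*x + 3*x*c + 1, so the slope depends on the category flag c.
X_train = [[0.0], [0.5], [1.0], [0.0], [0.5], [1.0]]
C_train = [[0], [0], [0], [1], [1], [1]]
y_train = [1.0, 2.0, 3.0, 1.0, 3.5, 6.0]

coef = fit_ilr(X_train, C_train, y_train)

# Query points at x = 5, well outside the training range [0, 1].
pred = predict_ilr(coef, [[5.0], [5.0]], [[0], [1]])
print(pred)
```

Each fitted coefficient is tied to one descriptor or one descriptor-category pair, so the model stays readable as a set of category-specific slopes. On this noise-free toy data the fit recovers the generating relation, so the predictions at x = 5 match it; real data would carry no such guarantee.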
Fig. 5 Ratio of models ranking within the top three for each data size Nin.
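Their benchmark results, the QMex dataset, and the proposed model are valuable for improving extrapolative prediction on small experimental datasets and for discovering new materials and molecules that surpass existing candidates. The article was recently published in npj Computational Materials 10: 11 (2024).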
Fig. 6 Model performance comparison for extrapolation tests.
Editorial Summary
Extrapolative prediction of small-data molecular property: Quantum mechanics-assisted machine learning
Materials science has greatly benefited from advancements in machine learning (ML) and deep learning (DL) techniques. These techniques have revolutionized the prediction of molecular properties and reshaped traditional computational approaches. ML/DL techniques continue to enhance the accuracy and speed of property prediction, serving as indispensable tools for data-driven materials science.
Fig. 7 Summary of ML/DL model selection for interpolation and extrapolation of molecular property prediction.
However, a fundamental contradiction persists in ML/DL techniques regarding their inherent difficulty with extrapolation, i.e., with predicting beyond the available data. The primary objective of data-driven materials exploration is to identify high-performance molecules/materials that are not yet represented in databases. Hence, ML/DL models must be able to extrapolate to unexplored data solely from the available data. Yet materials datasets often consist of small sets of experimental results, which inevitably carry biases. It is crucial to determine whether ML/DL models can overcome these biases and effectively extrapolate molecular properties.
Fig. 8 Model performance comparison between QMex-LR and QMex-ILR.
Hajime Shimakawa et al. from the Department of Electrical Engineering & Information Systems, School of Engineering, University of Tokyo, presented a comprehensive benchmark for assessing extrapolative performance across 12 organic molecular properties. Their large-scale benchmark revealed that conventional ML models exhibit marked performance degradation beyond the training distribution of property range and molecular structures, particularly for small-data properties. To address this challenge, they introduced a quantum-mechanical (QM) descriptor dataset, called QMex, and an interactive linear regression (ILR), which incorporates interaction terms between QM descriptors and categorical information pertaining to molecular structures. The QMex-based ILR achieved state-of-the-art extrapolative performance while preserving its interpretability. Their benchmark results, the QMex dataset, and the proposed model serve as valuable assets for improving extrapolative predictions with small experimental datasets and for the discovery of novel materials/molecules that surpass existing candidates. This article was recently published in npj Computational Materials 10: 11 (2024).
Original abstract and its translation
Extrapolative prediction of small-data molecular property using quantum mechanics-assisted machine learning
Hajime Shimakawa, Akiko Kumada & Masahiro Sato
Abstract
Data-driven materials science has realized a new paradigm by integrating materials domain knowledge and machine-learning (ML) techniques. However, ML-based research has often overlooked the inherent limitation in predicting unknown data: extrapolative performance, especially when dealing with small-scale experimental datasets. Here, we present a comprehensive benchmark for assessing extrapolative performance across 12 organic molecular properties. Our large-scale benchmark reveals that conventional ML models exhibit remarkable performance degradation beyond the training distribution of property range and molecular structures, particularly for small-data properties. To address this challenge, we introduce a quantum-mechanical (QM) descriptor dataset, called QMex, and an interactive linear regression (ILR), which incorporates interaction terms between QM descriptors and categorical information pertaining to molecular structures. The QMex-based ILR achieved state-of-the-art extrapolative performance while preserving its interpretability. Our benchmark results, QMex dataset, and proposed model serve as valuable assets for improving extrapolative predictions with small experimental datasets and for the discovery of novel materials/molecules that surpass existing candidates.
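Abstract (translation)
Data-driven materials science has established a new paradigm by combining materials domain knowledge with machine-learning (ML) techniques. ML-based research, however, has often overlooked an inherent limitation in predicting unknown data, namely extrapolative performance, especially with small-scale experimental datasets. Here, we propose a comprehensive benchmark for evaluating extrapolative performance across 12 organic molecular properties. Our large-scale benchmark shows that conventional ML models degrade markedly outside the training distribution of property range and molecular structure, particularly for small-data properties. To address this challenge, we introduce a quantum-mechanical (QM) descriptor dataset called QMex, together with an interactive linear regression (ILR) that includes interaction terms between QM descriptors and categorical information on molecular structure. The QMex-based ILR achieves state-of-the-art extrapolative performance while remaining interpretable. Our benchmark results, the QMex dataset, and the proposed model are valuable for improving extrapolative prediction on small experimental datasets and for discovering new materials and molecules that surpass existing candidates.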