Paper notes: Deep learning allows genome-scale prediction of Michaelis constants from structural features

Original paper: Deep learning allows genome-scale prediction of Michaelis constants from structural features | PLOS Biology

Abstract

The Michaelis constant KM describes the affinity of an enzyme for a specific substrate and is a central parameter in studies of enzyme kinetics and cellular physiology. As measurements of KM are often difficult and time-consuming, experimental estimates exist for only a minority of enzyme–substrate combinations even in model organisms. Here, we build and train an organism-independent model that successfully predicts KM values for natural enzyme–substrate combinations using machine and deep learning methods. Predictions are based on a task-specific molecular fingerprint of the substrate, generated using a graph neural network, and on a deep numerical representation of the enzyme’s amino acid sequence. We provide genome-scale KM predictions for 47 model organisms, which can be used to approximately relate metabolite concentrations to cellular physiology and to aid in the parameterization of kinetic models of cellular metabolism.

Introduction

The Michaelis constant, KM, is defined as the concentration of a substrate at which an enzyme operates at half of its maximal catalytic rate; it hence describes the affinity of an enzyme for a specific substrate. Knowledge of KM values is crucial for a quantitative understanding of enzymatic and regulatory interactions between enzymes and metabolites: It relates the intracellular concentration of a metabolite to the rate of its consumption, linking the metabolome to cellular physiology.

As experimental measurements of KM and kcat are difficult and time-consuming, no experimental estimates exist for many enzymes even in model organisms. For example, in Escherichia coli, the biochemically best characterized organism, in vitro KM measurements exist for less than 30% of natural substrates (see Methods, “Download and processing of KM values”), and turnover numbers have been measured in vitro for only about 10% of the approximately 2,000 enzymatic reactions. 

KM values, together with enzyme turnover numbers, kcat, are required for models of cellular metabolism that account for the concentrations of metabolites. The current standard approach in large-scale kinetic modeling is to estimate kinetic parameters in an optimization process. These optimizations typically attempt to estimate many more unknown parameters than they have measurements as inputs, and, hence, the resulting KM and kcat values have wide confidence ranges and show little connection to experimentally observed values. Therefore, predictions of these values from artificial intelligence, even if only up to an order of magnitude, would represent a major step toward more realistic models of cellular metabolism and could drastically increase the biological understanding provided by such models. 

Only a few previous studies have attempted to predict kinetic parameters of natural enzymatic reactions in silico. Heckmann and colleagues [5] successfully employed machine learning models to predict unknown turnover numbers for reactions in E. coli. They found that the most important predictors of kcat were the reaction flux catalyzed by the enzyme, estimated computationally through parsimonious flux balance analysis, and structural features of the catalytic site. While many E. coli kcat values could be predicted successfully with this model, active site information was not available for a sizeable fraction of enzymes [5]. Moreover, neither active site information nor reaction flux estimates are broadly available beyond a small number of model organisms, preventing the generalization of this approach.

A related problem to the prediction of KM is the prediction of drug–target interactions, an important task in drug development. Multiple approaches for the prediction of drug–target binding affinities (DTBAs) have been developed (reviewed in [8]). Most of these approaches are either similarity-based, structure-based, or feature-based. Similarity-based methods rely on the assumption that similar drugs tend to interact with similar targets; these methods use known drug–target interactions to learn a prediction function based on drug–drug and target–target similarity measures [9,10]. Structure-based models for DTBA prediction utilize information on the target protein’s 3D structure [11,12]. Neither of these 2 strategies can easily be generalized to genome-scale, organism-independent predictions, as many enzymes and substrates share only distant similarities with well-characterized molecules, and 3D structures are only available for a minority of enzymes.

In contrast to these first 2 approaches, feature-based models for drug–target interaction predictions use numerical representations of the drug and the target as the input of fully connected neural networks (FCNNs) [13–16]. The drug feature vectors are most often either SMILES representations [17], expert-crafted fingerprints [18–20], or fingerprints created with graph neural networks (GNNs) [21,22], while those of the targets are usually sequence-based representations. As this information can easily be generated for most enzymes and substrates, we here use a similar approach to develop a model for KM prediction.

An important distinction between the prediction of KM and DTBA prediction is that the former aims to predict affinities for known, natural enzyme–metabolite combinations. These affinities evolved under natural selection for the enzymes’ functions, an evolutionary process strongly constrained by the metabolite structure. In contrast, wild-type proteins did not evolve in the presence of a drug, and, hence, molecular structures are likely to contain only very limited information about the binding affinity for a target without information about the target protein. 

Despite the central role of the metabolite molecular structure for the evolved binding affinity of its consuming enzymes, important information on the affinity must also be contained in the enzyme structure and sequence. To predict KM, it would be desirable to employ detailed structural and physicochemical information on the enzyme’s substrate binding site, as done by Heckmann and colleagues for their kcat predictions in E. coli [5]. However, these sites have only been characterized for a minority of enzymes [23]. An alternative approach is to employ a multidimensional numerical representation of the entire amino acid sequence of the enzyme, as provided by UniRep [24]. UniRep vectors are based on a deep representation learning model and have been shown to retain structural, evolutionary, and biophysical information.

Here, we combine UniRep vectors of enzymes and diverse molecular fingerprints of their substrates to build a general, organism-, and reaction-independent model for the prediction of KM values, using machine and deep learning models. In the final model, we employ a 1,900-dimensional UniRep vector for the enzyme together with a task-specific molecular fingerprint of the substrate as the input of a gradient boosting model. Our model reaches a coefficient of determination of R2 = 0.53 between predicted and measured values on a test set, i.e., the model explains 53% of the variability in KM values across different, previously unseen natural enzyme–substrate combinations. In S1 Data, we provide complete KM predictions for 47 genome-scale metabolic models, including those for Homo sapiens, Mus musculus, Saccharomyces cerevisiae, and E. coli.

Results

For all wild-type enzymes in the BRENDA database, we extracted organism name, Enzyme Commission (EC) number, UniProt ID, and amino acid sequence, together with information on substrates and associated KM values. If multiple KM values existed for the same combination of substrate and enzyme amino acid sequence, we took the geometric mean. This resulted in a dataset with 11,675 complete entries, which was split into a training set (80%) and a test set only used for the final validation (20%). All KM values were log10-transformed. 
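
The aggregation and split can be sketched in a few lines of pandas; the column names below (`sequence`, `substrate`, `km`) are hypothetical stand-ins for fields of the processed BRENDA download:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table: one row per measurement, with columns
# "sequence", "substrate", and "km" (the measured Michaelis constant).
df = pd.read_csv("brenda_km_processed.csv")

# Geometric mean over repeated measurements of the same enzyme-substrate
# pair, computed as the arithmetic mean in log10 space.
df["log10_km"] = np.log10(df["km"])
data = df.groupby(["sequence", "substrate"], as_index=False)["log10_km"].mean()

# 80% training set, 20% test set used only for the final validation.
train, test = train_test_split(data, test_size=0.2, random_state=0)
```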

Predicting KM from molecular fingerprints

To train a prediction model for KM, we first had to choose a numerical representation of the substrate molecules. For each substrate in our dataset, we calculated 3 different expert-crafted molecular fingerprints, i.e., bit vectors where each bit represents a fragment of the molecule. The expert-crafted fingerprints used are extended connectivity fingerprints (ECFPs), RDKit fingerprints, and MACCS keys. We calculated them with the Python package RDKit [19] based on MDL Molfiles of the substrates (downloaded from KEGG; a Molfile lists a molecule’s atom types, atom coordinates, and bond types).

MACCS keys are 166-dimensional binary fingerprints, where each bit encodes whether a certain chemical structure is present in a molecule, e.g., whether the molecule contains a ring of size 4 or whether there are fewer than 3 oxygen atoms present in the molecule. RDKit fingerprints are generated by identifying all subgraphs in a molecule that do not exceed a particular predefined size. These subgraphs are converted into numerical values using hash functions, which are then used to indicate which bits in a 2,048-dimensional binary vector are set to 1. Finally, to calculate ECFPs, molecules are represented as graphs by interpreting the atoms as nodes and the chemical bonds as edges. Bond types and feature vectors with information about every atom are calculated (types, masses, valences, atomic numbers, atom charges, and number of attached hydrogen atoms) [18]. Afterwards, these identifiers are updated for a predefined number of steps by iteratively applying predefined functions to summarize aspects of neighboring atoms and bonds. After the iteration process, all identifiers are used as the input of a hash function to produce a binary vector with structural information about the molecule. The number of iterations and the dimension of the fingerprint can be chosen freely. We set them to the default values of 3 and 1,024, respectively; lower or higher dimensions led to inferior predictions.

To compare the information on KM contained in the different molecular fingerprints independent of protein information, we used the molecular fingerprints as the sole input to elastic nets, FCNNs, and gradient boosting models. To the fingerprints, we added the 2 features molecular weight (MW) and octanol–water partition coefficient (LogP), which were shown to be correlated with the KM value [28]. The models were then trained to predict the KM values of enzyme–substrate combinations (Fig 1A). The FCNNs consisted of an input layer with the dimension of the fingerprint (including the additional features MW and LogP), 2 hidden layers, and a 1D output layer (for more details, see Methods). Gradient boosting is a machine learning technique that creates an ensemble of many decision trees to make predictions. Elastic nets are regularized linear regression models, where the regularization coefficient is a linear combination of the L1− and L2-norm of the model parameters. For each combination of the 3 model types and the 3 fingerprints, we performed a hyperparameter optimization with 5-fold cross-validation on the training set, measuring performance through the mean squared error (MSE). For all 3 types of fingerprints, the gradient boosting model outperformed the FCNN and the elastic net (S1–S3 Tables).

The KM predictions with the gradient boosting model based solely on the substrate ECFP, MACCS keys, and RDKit molecular fingerprints showed very similar performances on the test set, with MSE = 0.83 and coefficients of determination R2 = 0.40 (Fig 2). 

Best KM predictions from metabolite fingerprints using graph neural networks and gradient boosting

Recent work has shown that superior prediction performance can be achieved through task-specific molecular fingerprints, where a deep neural network simultaneously optimizes the fingerprint and uses it to predict properties of the input. In contrast to conventional neural networks, these GNNs can process non-Euclidean inputs, such as molecular structures. This approach led to state-of-the-art performances on many biological and chemical datasets [21,22].

As an alternative to the predefined, expert-crafted molecular fingerprints, we thus also tested how well we can predict KM from a task-specific molecular fingerprint based on a GNN (Fig 1; for details, see Methods, “Architecture of the graph neural network”). As for the calculations of the ECFPs, each substrate molecule is represented as a graph by interpreting the atoms as nodes and the chemical bonds as edges, for which feature vectors are calculated from the MDL Molfiles. These are updated iteratively for a fixed number of steps, in each step applying functions with learnable parameters to summarize aspects of neighboring atoms and bonds. After the iterations, the feature vectors are pooled into 1 molecular fingerprint vector. In contrast to ECFPs, the parameters of the update functions are not fixed but are adjusted during the training of the FCNN that predicts KM from the pooled fingerprint vector (Methods). As for the predefined molecular fingerprints, we defined an extended GNN fingerprint by adding the 2 global molecular features LogP and MW to the model before the KM prediction step. 
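
As a rough illustration of one such update step and the final pooling, here is a minimal NumPy sketch. The single weight matrix, sum aggregation, and sum pooling are simplifying assumptions for illustration, not the authors’ exact architecture (see Methods, “Architecture of the graph neural network”):

```python
import numpy as np

def message_passing_step(H, E, adj, W, b):
    """One GNN update with learnable parameters W, b.
    H:   (n_atoms, d) current atom feature vectors
    E:   (n_atoms, n_atoms, d_bond) bond feature vectors
    adj: (n_atoms, n_atoms) 0/1 adjacency matrix
    W:   (d, 2*d + d_bond) weight matrix, b: (d,) bias"""
    n, d = H.shape
    H_new = np.zeros_like(H)
    for v in range(n):
        # Sum messages from all neighboring atoms and connecting bonds.
        msg = np.zeros(d + E.shape[2])
        for w in range(n):
            if adj[v, w]:
                msg += np.concatenate([H[w], E[v, w]])
        # Learnable transformation followed by a ReLU nonlinearity.
        H_new[v] = np.maximum(W @ np.concatenate([H[v], msg]) + b, 0.0)
    return H_new

def pool(H):
    # Pool the atom states into a single molecular fingerprint vector.
    return H.sum(axis=0)
```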

To compare the learned substrate representation with the 3 predefined fingerprints, we extracted the extended GNN fingerprint for every substrate in the dataset and fitted an elastic net, an FCNN, and a gradient boosting model to predict KM. As before, we performed a hyperparameter optimization with 5-fold cross-validation on the training set for all models. The gradient boosting model again achieved better results than the FCNN and the elastic net (S1–S3 Tables). The performance of our task-specific fingerprints is better than that of the predefined fingerprints, reaching an MSE = 0.80 and a coefficient of determination R2 = 0.42 on the test set, compared to an MSE = 0.83 and R2 = 0.40 for the other fingerprints (Fig 2). To compare the performances statistically, we used a one-sided Wilcoxon signed-rank test for the absolute errors of the predictions for the test set, resulting in p = 0.0080 (ECFP), p = 0.073 (RDKit), and p = 0.062 (MACCS keys). While the differences in the error distributions are only marginally statistically significant for RDKit and MACCS keys at the 5% level, these analyses support the choice of the task-specific GNN molecular fingerprint for predicting KM.
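
The corresponding significance test is straightforward to reproduce with SciPy; the error arrays below are random stand-ins for the paired absolute test-set errors of the two models:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
abs_err_gnn = rng.exponential(0.80, size=2335)   # stand-in: GNN fingerprint errors
abs_err_ecfp = rng.exponential(0.85, size=2335)  # stand-in: ECFP errors

# One-sided paired test: are the GNN-fingerprint errors systematically smaller?
stat, p_value = wilcoxon(abs_err_gnn, abs_err_ecfp, alternative="less")
print(p_value)
```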

It is noteworthy that the errors on the test set are smaller than the errors achieved during cross-validation. We found that the number of training samples has a great influence on model performance (see below, “Model performance increases linearly with training set size”). Hence, the improved performance on the test set may result from the fact that before validation on the test set, models are trained with approximately 2,000 more samples than before each cross-validation.

Effects of molecular weight and octanol–water partition coefficient

Before predicting KM from the molecular fingerprints, we added the MW and the LogP. Do these extra features contribute to improved predictions by the task-specific GNN fingerprints? To answer this question, we trained GNNs without the additional features LogP and MW, as well as with only one of those additional features. Fig 3 displays the performance of gradient boosting models that are trained to predict KM with GNN fingerprints with and without extra features, showing that the additional features have only a small effect on performance: Adding both features reduces MSE from 0.82 to 0.80, while increasing R2 from 0.41 to 0.42. The difference in model performance is not statistically significant (p = 0.13, one-sided Wilcoxon signed-rank test for the absolute errors of the predictions for the test set). This indicates that most of the information used to predict KM can be extracted from the graph of the molecule itself. However, since the addition of the 2 additional features slightly improves KM predictions on the test dataset, we include the features MW and LogP in our further analyses.

UniRep vectors as additional features

So far, we have only considered substrate-specific information. As KM values are features of specific enzyme–substrate interactions, we now need to add input features that represent enzyme properties. Important information on substrate binding affinity is contained in molecular features of the catalytic site; however, active site identities and structures are available only for a small minority of enzymes in our dataset.

We thus restrict the enzyme information utilized by the model to a deep numerical representation of the enzyme’s amino acid sequence, calculating a UniRep vector for each enzyme. UniRep vectors are 1,900-dimensional statistical representations of proteins, created with an mLSTM, a recurrent neural network architecture for sequence modeling that combines the long short-term memory and multiplicative recurrent neural network architectures. The model was trained with 24 million unlabeled amino acid sequences to predict the next amino acid in an amino acid sequence, given the previous amino acids. In this way, the mLSTM learns to store important information about the previous amino acids in a numerical vector, which can later be extracted and used as a representation of the protein. It has been shown that these representations lead to good results when used as input features in prediction tasks concerning protein stability, function, and design.

Predicting KM using substrate and enzyme information

To predict the KM value, we concatenated the 52-dimensional task-specific extended fingerprint learned with the GNN and the 1,900-dimensional UniRep vector with information about the enzyme’s amino acid sequence into a global feature vector. This vector was then used as the input for a gradient boosting model for regression in order to predict the KM value. We also trained an FCNN and an elastic net; however, predictions were substantially worse (S4–S6 Tables), consistent with the results obtained when using only the substrate fingerprints as inputs.

The gradient boosting model that combines substrate and enzyme information achieves an MSE = 0.65 on a log10-scale and results in a coefficient of determination R2 = 0.53, substantially superior to the above models based on substrate information alone. We also validate our model with an additional metric, $r_m^2$, a commonly used performance measure for quantitative structure–activity relationship (QSAR) prediction models. It is defined as $r_m^2 = r^2 \times (1 - \sqrt{r^2 - r_0^2})$, where $r^2$ and $r_0^2$ are the squared correlation coefficients with and without intercept, respectively. Our model achieves a value of $r_m^2 = 0.53$ on the test set.
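
A NumPy sketch of this metric is shown below, assuming the common QSAR convention in which $r_0^2$ is obtained from a least-squares regression forced through the origin; the paper may use a slightly different variant of this definition:

```python
import numpy as np

def rm_squared(y_true, y_pred):
    # r^2: squared Pearson correlation (regression with intercept).
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2
    # r_0^2: coefficient of determination of a regression through the origin
    # (one common QSAR convention; definitions vary slightly in the literature).
    k = np.sum(y_true * y_pred) / np.sum(y_pred ** 2)
    r0_2 = 1 - np.sum((y_true - k * y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    # Guard against a negative value under the square root.
    return r2 * (1 - np.sqrt(max(r2 - r0_2, 0.0)))
```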

Fig 4A and 4B compare the performance of the full model to models that use only substrate or only enzyme information as inputs, applied to the BRENDA test dataset (which only contains previously unseen enzyme–substrate combinations). To predict the KM value from only the enzyme UniRep vector, we again fitted a gradient boosting model, leading to MSE = 1.01 and R2 = 0.27. To predict the KM value from substrate information only, we chose the gradient boosting model with extended task-specific fingerprints as its inputs, which was used for the comparison with the other molecular fingerprints (Fig 2).

Predicting KM for an independently acquired test dataset (omitted)

Predicting KM for enzymes and substrates not represented in the training data (omitted)

Model performance increases linearly with training set size (omitted)

KM predictions for enzymatic reactions in genome-scale metabolic models (omitted)

Discussion

In conclusion, we found that Michaelis constants of enzyme–substrate pairs, KM, can be predicted through artificial intelligence with a coefficient of determination of R2 = 0.53: More than half of the variance in KM values across enzymes and organisms can be predicted from deep numerical representations of enzyme amino acid sequence and substrate molecular structure. This performance is largely organism-independent and does not require that either enzyme or substrate are covered by the dataset used for training; the good performance was confirmed using a second, independent and nonoverlapping test set from Sabio-RK (R2 = 0.49). To obtain this predictive performance, we used task-specific fingerprints of the substrate (GNN) optimized for the KM prediction, as these appear to contain more information about KM values than predefined molecular fingerprints based on expert-crafted transformations (ECFP, RDKit fingerprint, MACCS keys). The observed differences between GNNs and predefined fingerprints are in line with the results of a previous study on the prediction of chemical characteristics of small molecules [22].

Fig 4, which compares KM predictions across different input feature sets, indicates that the relevant information contained in an enzyme’s amino acid sequence may be less important for its evolved binding affinity to a natural substrate than the substrate’s molecular structure: Predictions based only on substrate structures explain almost twice as much variance in KM compared to predictions based only on enzyme representations. It is possible, though, that improved (possibly task-specific) enzyme representations will modify this picture in the future. 

A direct comparison of the prediction quality of our model to the results of Yan and colleagues [7] would not be meaningful, as the scope of their model is very different from that of ours. Yan and colleagues trained a model specific to a single enzyme–substrate pair with only 36 data points, aiming to distinguish KM values between different sequences of the same enzyme (beta-glucosidase) for the same substrate (cellobiose). However, the performance of our general model, with MSE = 0.65, compares favorably to that of the substrate-specific statistical models of Borger and colleagues [6], which resulted in an overall MSE = 1.02. 

We compare our model to 2 different models for DTBA prediction, DeepDTA and SimBoost [10,16]. These two, which were trained and tested on the same 2 datasets, achieved $r_m^2$ values ranging from 0.63 to 0.67 on test sets. This compares to $r_m^2 = 0.53$ achieved for KM predictions with our approach. It is generally difficult to compare prediction performance between models trained and tested on different datasets. Here, this difficulty is exacerbated by the different prediction targets (DTBA versus KM). Crucially, the datasets used for DTBA and KM prediction differ substantially with respect to their densities, i.e., the fraction of possible protein–ligand combinations covered by the training and test data. One of the datasets used for DTBA prediction encompasses experimental data for all possible drug–target combinations between 442 different proteins and 68 drugs (442×68 = 30,056). The second dataset contains data for approximately 25% of all possible combinations between 229 proteins and 2,111 drugs (118,254 out of 229×2,111 = 483,419). In contrast, our KM dataset features 7,001 different enzymes and 1,582 substrates but comprises only about 0.1% of their possible combinations (11,600 out of 7,001×1,582 = 11,075,582). Thus, our dataset is not only much smaller, but also has an extremely low coverage of possible protein–ligand combinations compared to the DTBA datasets used in [10,16]. As shown in Fig 5, the number of available training samples has a strong impact on model performance, and the same is likely true for the data density. Against this background, the performance of our KM prediction model could be seen as being surprisingly good. Fig 5 indicates that KM predictions can be improved substantially once more training data become available.

To provide the model with information about the enzyme, we used statistical representations of the enzyme amino acid sequence. We showed that these features provide important enzyme-specific information for the prediction of KM. It appears likely that predictions could be improved further by taking features of the enzyme active site into account—such as hydrophobicity, depth, or structural properties [5]—once such features become widely available [23]. Adding organism-specific information, such as the typical intracellular pH or temperature, may also increase model performance. 

We wish to emphasize that our model is trained to predict KM values for enzyme–substrate pairs that are known to interact as part of the natural cellular physiology, meaning that their affinity has evolved under natural selection. The model should thus be used with care when making predictions for enzyme interactions with other substrates, such as nonnatural compounds or substrates involved in moonlighting activities. In such cases, DTBA prediction models (with their higher data density) may be better suited, and estimates with our model should be regarded as a lower bound for KM that might be reached under appropriate natural selection. 

To put the performance of the current model into perspective, we consider the mean relative prediction error MRPE = 4.1, meaning that our predictions deviate from experimental estimates on average by 4.1-fold. This compares to a mean relative deviation of 3.4-fold between a single KM measurement and the geometric mean of all other measurements for the same enzyme–substrate combination in the BRENDA dataset (the geometric means of enzyme–substrate combinations were used for training the models). Part of the high variability across values in BRENDA is due to varying assay conditions in the in vitro experiments [28]. Moreover, entries in BRENDA are not free from errors; on the order of 10% of the values in the database do not correspond to values in the original papers, e.g., due to errors in unit conversion [28]. 
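
As an illustration of the conversion between log10-scale errors and fold-errors, here is a sketch that assumes MRPE denotes the geometric-mean fold-deviation; this definition is an assumption, as the paper does not spell it out here:

```python
import numpy as np

def mean_fold_error(log10_pred, log10_meas):
    # Geometric-mean fold-deviation between predictions and measurements:
    # 10 ** mean(|log10(pred) - log10(meas)|). A value of 4.1 means
    # predictions are off by 4.1-fold on average. (Assumed definition.)
    return 10 ** np.mean(np.abs(np.asarray(log10_pred) - np.asarray(log10_meas)))
```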

Especially on the background of this variation, the performance of our enzyme–substrate specific KM model appears remarkable. In contrast to previous approaches [6,7,13–16], the model requires no previous knowledge about measured KM values for the considered substrate or enzyme. Furthermore, only one general purpose model is trained, and it is not necessary to obtain training data and to fit new models for individual substrates, enzyme groups, or organisms. Once the model has been fitted, it can provide genome-scale KM predictions from existing features within minutes. We here provide such predictions for a broad set of model organisms, including mouse and human; these data can provide base estimates for unknown kinetic constants, e.g., to relate metabolomics data to cellular physiology, and can help to parameterize kinetic models of metabolism. Future work may develop similar prediction frameworks for enzyme turnover numbers (kcat), which would facilitate the completion of such parameterizations. 

Methods

Software and code availability

We implemented all code in Python [32]. We implemented the neural networks using the deep learning library TensorFlow [33] and Keras [34]. We fitted the gradient boosting models using the library XGBoost [35]. 

All datasets generated and the Python code used to produce the results (in Jupyter notebooks) are available from https://github.com/AlexanderKroll/KM_prediction. Two of the Jupyter notebooks contain all the necessary steps to download the data from BRENDA and Sabio-RK and to preprocess it. Execution of a further notebook performs training and validation of our final model. Two additional notebooks contain code to train the models with molecular fingerprints as inputs and to investigate the effect of the 2 additional features, MW and LogP, for the GNN.

Downloading and processing KM values from BRENDA

We downloaded KM values together with organism and substrate name, EC number, UniProt ID of the enzyme, and PubMed ID from the BRENDA database [25]. This resulted in a dataset with 156,387 entries. We mapped substrate names to KEGG Compound IDs via a synonym list from KEGG [26]. For all substrate names that could not be mapped to a KEGG Compound ID directly, we tried to map them first to PubChem Compound IDs via a synonym list from PubChem [36] and then mapped these IDs to KEGG Compound IDs using the web service of MBROLE [37]. We downloaded amino acid sequences for all data points via the UniProt mapping service [38] if the UniProt ID was available; otherwise, we downloaded the amino acid sequence from BRENDA via the organism name and EC number. 

We then removed (i) all duplicates (i.e., entries with identical values for KM, substrate, and amino acid sequence as another entry); (ii) all entries with non-wild-type enzymes (i.e., with a commentary field in BRENDA labeling it as mutant or recombinant); (iii) entries for nonbacterial organisms without a UniProt ID for the enzyme; and (iv) entries with substrate names that could not be mapped to a KEGG Compound ID. This resulted in a filtered set of 34,526 data points. Point (iii) was motivated by the expectation that isoenzymes are frequent in eukaryotes but rare in bacteria, such that organism name and EC number are sufficient to unambiguously identify an amino acid sequence in the vast majority of cases for bacteria but not for eukaryotes. If multiple KM values existed for 1 substrate and 1 amino acid sequence, we took the geometric mean across these values (equivalent to the arithmetic mean of the log10-transformed values). For 11,737 of these, we could find an entry for the EC number–substrate combination in the KEGG reaction database. Since we are only interested in KM values for natural substrates, we only kept these data points [28]. We log10-transformed all KM values in this dataset. We split the final dataset with 11,737 entries randomly into training data (80%) and test data (20%). We further split the training set into 5 subsets, which we used for 5-fold cross-validations for the hyperparameter optimization of the machine learning models. We used the test data to evaluate the final models after hyperparameter optimization.
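
A pandas sketch of these four filtering steps follows; all column names are hypothetical stand-ins for fields of the BRENDA download:

```python
import pandas as pd

df = pd.read_csv("brenda_km_raw.csv")  # hypothetical export of the BRENDA download

# (i) remove exact duplicates of KM value, substrate, and amino acid sequence
df = df.drop_duplicates(subset=["km", "substrate", "sequence"])

# (ii) keep wild-type enzymes: drop entries whose commentary field marks
# them as mutant or recombinant
wild_type = ~df["commentary"].str.contains("mutant|recombinant", case=False, na=False)
df = df[wild_type]

# (iii) for nonbacterial organisms, require a UniProt ID for the enzyme
df = df[df["is_bacterium"] | df["uniprot_id"].notna()]

# (iv) require a substrate successfully mapped to a KEGG Compound ID
df = df[df["kegg_compound_id"].notna()]
```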

To estimate the proportion of metabolic enzymes with KM values measured in vitro for E. coli, we mapped the E. coli KM values downloaded from BRENDA to reactions of the genome scale metabolic model iML1515 [39], which comprises over 2,700 different reactions. To do this, we extracted all enzyme–substrate combinations from the iML1515 model for which the model annotations listed an EC number for the enzyme and a KEGG Compound ID for the substrate, resulting in 2,656 enzyme–substrate combinations. For 795 of these combinations (i.e., 29.93%), we were able to find a KM value in the BRENDA database. 

Download and processing of KM values from Sabio-RK

We downloaded KM values together with the name of the organism, substrate name, EC number, UniProt ID of the enzyme, and PubMed ID from the Sabio-RK database. This resulted in a dataset with 8,375 entries. We processed this dataset in the same way as described above for the BRENDA dataset. We additionally removed all entries with a PubMed ID that was already present in the BRENDA dataset. This resulted in a final dataset with 274 entries, which we used as an additional test set for the final model for KM prediction. 

Calculation of predefined molecular fingerprints

We first represented each substrate through 3 different molecular fingerprints (ECFP, RDKit fingerprint, MACCS keys). For every substrate in the final dataset, we downloaded an MDL Molfile with 2D projections of its atoms and bonds from KEGG [26] via the KEGG Compound ID. We then used the package Chem from RDKit [19] with the Molfile as the input to calculate the 2,048-dimensional binary RDKit fingerprints [19], the 166-dimensional binary MACCS keys [20], and the 1,024-dimensional binary ECFPs [18] with a radius of 3. 
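
These three calculations map directly onto RDKit calls; a sketch, assuming a locally stored KEGG Molfile:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

# Hypothetical path to a KEGG Molfile (e.g., compound C00031).
mol = Chem.MolFromMolFile("C00031.mol")

# 1,024-dimensional binary ECFP with radius 3 (Morgan fingerprint in RDKit).
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)

# 2,048-dimensional binary RDKit (topological) fingerprint.
rdkit_fp = Chem.RDKFingerprint(mol, fpSize=2048)

# MACCS keys (RDKit returns a 167-bit vector whose bit 0 is always unset).
maccs = MACCSkeys.GenMACCSKeys(mol)
```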

Architecture of the fully connected neural network with molecular fingerprints

We used an FCNN to predict KM values using only representations of the substrates as input features. We performed a 5-fold cross-validation on the training set for each of the 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for the hyperparameter optimization. The FCNN consisted of 2 hidden layers, and we used rectified linear units (ReLUs), defined as ReLU(x) = max(x, 0), as activation functions in the hidden layers to introduce nonlinearity. We applied batch normalization [40] after each hidden layer. Additionally, we used L2-regularization in every layer to prevent overfitting. Adding dropout [41] did not improve the model performance. We optimized the model by minimizing the MSE, using stochastic gradient descent with Nesterov momentum as the optimizer. The hyperparameters (regularization factor, learning rate, learning rate decay, dimension of the hidden layers, batch size, number of training epochs, and momentum) were optimized by performing a grid search. We selected the set of hyperparameters with the lowest mean MSE during cross-validation. The results of the cross-validations and the best set of hyperparameters for each fingerprint are displayed in S1 Table.
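
A Keras sketch of this architecture follows; the hidden dimension, regularization factor, and optimizer settings are placeholders for the grid-searched values in S1 Table, not the paper's final hyperparameters:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_fcnn(input_dim, hidden_dim=256, l2=1e-4, lr=0.01, momentum=0.9):
    # Two hidden ReLU layers, each followed by batch normalization,
    # L2 regularization in every layer, one linear output for log10(KM).
    model = tf.keras.Sequential([
        layers.Dense(hidden_dim, activation="relu", input_shape=(input_dim,),
                     kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dense(hidden_dim, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dense(1, kernel_regularizer=regularizers.l2(l2)),
    ])
    # Stochastic gradient descent with Nesterov momentum, minimizing the MSE.
    sgd = tf.keras.optimizers.SGD(learning_rate=lr, momentum=momentum,
                                  nesterov=True)
    model.compile(optimizer=sgd, loss="mse")
    return model

model = build_fcnn(input_dim=1026)  # e.g., 1,024-bit ECFP plus MW and LogP
```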

Fitting of the gradient boosting models with molecular fingerprints

We used gradient boosting models to predict KM values using only representations of the substrates as input features. As for the FCNNs, we performed a 5-fold cross-validation on the training set for each of the 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for hyperparameter optimization. We fitted the models using the gradient boosting library XGBoost [35] for Python. The hyperparameters (regularization coefficients, learning rate, maximal tree depth, maximum delta step, number of training rounds, and minimum child weight) were optimized by performing a grid search. We selected the set of hyperparameters with the lowest mean MSE during cross-validation. The results are displayed in S2 Table.
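
The grid search can be sketched with XGBoost's scikit-learn interface (the paper does not state whether this wrapper or the native API was used; the grid values and data below are illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train = rng.random((1000, 1026))     # stand-in fingerprint features
y_train = rng.normal(-0.7, 1.1, 1000)  # stand-in log10(KM) targets

# Illustrative grid over the hyperparameters named in the text.
param_grid = {
    "learning_rate":    [0.01, 0.05, 0.1],
    "max_depth":        [4, 7, 10],
    "min_child_weight": [1.0, 5.0, 10.0],
    "max_delta_step":   [0.0, 2.0, 5.0],
    "reg_lambda":       [1.0, 4.0],
    "reg_alpha":        [0.0, 3.0],
    "n_estimators":     [500, 1500],
}
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
```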

Fitting of the elastic nets with molecular fingerprints

We used elastic nets to predict KM values with representations of the substrates as input features. Elastic nets are linear regression models with additional L1- and L2-penalties on the model coefficients to apply regularization. We performed 5-fold cross-validations on the training set for all 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for hyperparameter optimization. During hyperparameter optimization, the coefficients for L1-regularization and L2-regularization were optimized by performing a grid search. The models were fitted using the machine learning library scikit-learn [42] for Python. The results of the hyperparameter optimizations are displayed in S3 Table.
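
A scikit-learn sketch; note that scikit-learn parameterizes the elastic net through an overall penalty strength (alpha) and a mixing ratio (l1_ratio) rather than two separate L1/L2 coefficients, so a grid over both spans the same search space:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Reusing the stand-in X_train, y_train from the gradient boosting sketch above.
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
```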

Calculation of molecular weight (MW) and the octanol–water partition coefficient (LogP)

We calculated the 2 additional molecular features, MW and LogP, with the package Chem from RDKit [19], with the MDL Molfile of the substrate as the input.
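
Both descriptors are one-liners in RDKit, assuming the same hypothetical Molfile path as above:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromMolFile("C00031.mol")  # hypothetical substrate Molfile
mw = Descriptors.MolWt(mol)       # molecular weight
logp = Descriptors.MolLogP(mol)   # Wildman-Crippen LogP estimate
```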

Calculation of the input of the graph neural network

Graphs in GNNs are represented with tensors and matrices. To calculate the input matrices and tensors, we used the package Chem from RDKit [19] with MDL Molfiles of the substrates as inputs to calculate 8 features for every atom v (atomic number, number of bonds, charge, number of attached hydrogen atoms, mass, aromaticity, hybridization type, chirality) and 4 features for every bond between 2 atoms v and w (bond type, part of ring, stereo configuration, aromaticity). Converting these features (except for atom mass) into one-hot encoded vectors resulted in a feature vector with Fb = 10 dimensions for every bond and in a feature vector with Fa = 32 dimensions for every atom.
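
The raw atom and bond attributes can be read out with RDKit as sketched below; the one-hot encoding step (mapping each categorical attribute to a fixed-length indicator vector) is omitted for brevity:

```python
from rdkit import Chem

mol = Chem.MolFromMolFile("C00031.mol")  # hypothetical substrate Molfile

atom_features = [(atom.GetAtomicNum(),      # atomic number
                  atom.GetDegree(),         # number of bonds
                  atom.GetFormalCharge(),   # charge
                  atom.GetTotalNumHs(),     # attached hydrogen atoms
                  atom.GetMass(),           # mass (kept numeric, not one-hot)
                  atom.GetIsAromatic(),     # aromaticity
                  atom.GetHybridization(),  # hybridization type
                  atom.GetChiralTag())      # chirality
                 for atom in mol.GetAtoms()]

bond_features = [(bond.GetBondType(),       # bond type
                  bond.IsInRing(),          # part of a ring
                  bond.GetStereo(),         # stereo configuration
                  bond.GetIsAromatic())     # aromaticity
                 for bond in mol.GetBonds()]
```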

Architecture of the graph neural network

In addition to the predefined fingerprints, we also used a GNN to represent the substrate molecules. We first give a brief overview of such GNNs before detailing our analysis.

UniRep vectors

To obtain a 1,900-dimensional UniRep vector for every amino acid sequence in the dataset, we used Python code that is a simplified and modified version of the original code from the George Church group [24] and contains the already trained UniRep model (available from https://github.com/EngqvistLab/UniRep50). The UniRep vectors were calculated from a file in FASTA format [48] containing all amino acid sequences of our dataset.

Fitting of the gradient boosting model with substrate and enzyme information

We concatenated the task-specific substrate fingerprint $\hat{x} \in \mathbb{R}^{52}$ and the 1,900-dimensional UniRep vector with information about the enzyme’s amino acid sequence. We used the resulting 1,952-dimensional vector as the input for a gradient boosting model for regression, which we trained to predict the KM value. We set the maximal tree depth to 7, the minimum child weight to 10.6, the maximum delta step to 4.24, the learning rate to 0.012, the regularization coefficient λ to 3.8, and the regularization coefficient α to 3.1. We trained the model for 1,381 iterations. The hyperparameters (regularization coefficients, learning rate, maximal tree depth, maximum delta step, number of training iterations, and minimum child weight) were optimized by performing a grid search during a 5-fold cross-validation on the training set. We selected the set of hyperparameters with the lowest mean MSE during cross-validation.
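
With the hyperparameters quoted above, the final model can be sketched using XGBoost's scikit-learn wrapper (whether the authors used this wrapper or the native training API is not stated; the input arrays are random stand-ins):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((9000, 1952))     # stand-in: 52-dim fingerprint + 1,900-dim UniRep
y = rng.normal(-0.7, 1.1, 9000)  # stand-in: log10(KM) targets

model = XGBRegressor(
    n_estimators=1381,      # number of training iterations
    max_depth=7,
    min_child_weight=10.6,
    max_delta_step=4.24,
    learning_rate=0.012,
    reg_lambda=3.8,         # L2 regularization coefficient (lambda)
    reg_alpha=3.1,          # L1 regularization coefficient (alpha)
    objective="reg:squarederror",
)
model.fit(X, y)
```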

Model comparison (omitted)

Prediction of KM values for genome-scale models (omitted)
