吴恩达神经网络1-2-2_图神经网络进行药物发现-第2部分

最新推荐文章于 2024-04-25 09:49:36 发布

weixin_26746401

最新推荐文章于 2024-04-25 09:49:36 发布

阅读量713

点赞数 2

文章标签：神经网络深度学习 python java 人工智能

原文链接：https://towardsdatascience.com/drug-discovery-with-graph-neural-networks-part-2-b1b8d60180c4

版权

吴恩达神经网络1-2-2

预测毒性 (Predicting Toxicity)

目录 (Table of Contents)

Introduction
介绍
Approaching the Problem with Graph Neural Networks
图神经网络解决问题
Hands-on Part with Deepchem
Deepchem的动手部分
About Me
关于我

介绍 (Introduction)

In this article, we will cover another crucial factor that determines whether the drug can pass safety tests — toxicity. In fact, the toxicity accounts for 30% of rejected drug candidates making it one of the most important factors to consider during the drug development stage [1]. Machine learning will prove here very beneficial as it can filter out toxic drug candidates in the early stage of the drug discovery process.

在本文中，我们将介绍另一个决定药物是否可以通过安全性测试的关键因素- 毒性。实际上，毒性占被拒绝药物候选者的30％，这使其成为药物开发阶段要考虑的最重要因素之一[1]。机器学习在这里将被证明是非常有益的，因为它可以在药物发现过程的早期筛选出有毒的候选药物。

I will assume that you’ve read my previous article which explains some topics and terms that I will be using in this article :) Let’s get started!

我假设您已经阅读了上一篇文章，其中解释了本文中将使用的一些主题和术语：)让我们开始吧！

图神经网络解决问题 (Approaching the Problem with Graph Neural Networks)

The feature engineering part is pretty much the same as in part 1 of the series. To convert molecular structure into an input for GNNs, we can create molecular fingerprints, or feed it into graph neural network using adjacency matrix and feature vectors. This features can be automatically generated by external software such as RDKit or Deepchem so we don’t have to worry much about it.

特征工程部分与本系列的第1部分几乎相同。要将分子结构转换为GNN的输入，我们可以创建分子指纹，或使用邻接矩阵和特征向量将其输入到图神经网络中。此功能可由RDKit或Deepchem等外部软件自动生成，因此我们不必为此担心。

毒性 (Toxicity)

The biggest difference is in the machine learning task itself. Toxicity prediction is a classification task, in contrary to the solubility prediction which is a regression task as we might recall from the previous article. There are many different toxicity effects such as carcinogenicity, respiratory toxicity, irritation/corrosion, and others [2]. This makes it a slightly more complicated challenge to work with as we might have to cope also with the imbalanced classes.

最大的区别在于机器学习任务本身。毒性预测是分类任务，与溶解度预测相反，溶解度预测是回归任务，正如我们可能从上一篇文章中回忆的那样。有许多不同的毒性作用，例如致癌性，呼吸毒性，刺激/腐蚀等[2]。这使工作变得更加复杂，因为我们可能还必须应对不平衡的班级。

Fortunately, the toxicity datasets are often considerably bigger than the solubility counterparts. For example, the Tox21 dataset has ~12k training samples when the Delaney dataset used for solubility prediction has only ~3k training samples. This makes neural networks architectures a more promising approach to use as it can capture more hidden information.

幸运的是，毒性数据集通常比溶解度对应数据大得多。例如，当用于溶解度预测的Delaney数据集只有约3k训练样本时，Tox21数据集具有约1.2万训练样本。这使得神经网络体系结构可以捕获更多隐藏信息，因此成为一种更有希望的方法。

Tox21数据集 (Tox21 Dataset)

Tox21 dataset was created as a project challenging researchers to develop machine learning models that achieve the highest performance on the given data. It contains 12 distinct labels and each indicates a different toxicity effect. Overall, the dataset has 12,060 training samples and 647 test samples.

Tox21数据集是作为一个项目而创建的，该项目挑战研究人员开发可在给定数据上实现最高性能的机器学习模型。它包含12个不同的标签，每个标签都表示不同的毒性作用。总体而言，数据集包含12,060个训练样本和647个测试样本。

The winning approach for this challenge was DeepTox [3] which is a deep learning pipeline that utilizes chemical descriptors to predict the toxicity classes. It highly suggests that deep learning is the most effective approach and that graph neural networks have potential to achieve even higher performance.

应对这一挑战的成功方法是DeepTox [3]，它是一种深度学习管道，利用化学描述符来预测毒性等级。它强烈表明深度学习是最有效的方法，并且图神经网络有潜力获得更高的性能。

Deepchem的动手部分 (Hands-on Part with Deepchem)

Colab notebook that you can run by yourself is here.

您可以自己运行的Colab笔记本在这里。

Firstly, we import the necessary libraries. Nothing new here — we will be using Deepchem to train a GNN model on Tox21 data. The GraphConvModel is an architecture that was created by Duvenaud, et al. It uses a modified version of fingerprint algorithms to make them differentiable (so we can do a gradient update). It is one of the first GNN architectures that were designed to handle molecular structures as graphs.

首先，我们导入必要的库。 这里没什么新鲜的 -我们将使用Deepchem在Tox21数据上训练GNN模型。 GraphConvModel是Duvenaud等人创建的架构。它使用指纹算法的修改版本以使其具有差异性(因此我们可以进行梯度更新)。它是最早设计用于将分子结构作为图形处理的GNN架构之一。

# Importing required libraries and its utilities
import numpy as np


np.random.seed(123)
import tensorflow as tf


tf.random.set_seed(123)
import deepchem as dc
from deepchem.molnet import load_tox21
from deepchem.models.graph_models import GraphConvModel

Deepchem contains a convenient API to load the Tox21 for us with .load_tox21 function. We choose a featurizer as GraphConv — it will create chemical descriptors (i.e. features) to match the input requirements for our model. As this is a classification task, ROC AUC score will be used as a metric.

Deepchem包含一个方便的API，可使用来为我们加载Tox21。 load_tox21函数。我们选择一个特征化器作为GraphConv —它会创建化学描述符(即特征)以匹配模型的输入要求。由于这是分类任务，因此ROC AUC得分将用作度量。

# Tox21 is a part of Deepchem library
# so we can convieniently download it using load_tox21 function
tox21_tasks, tox21_datasets, transformers = load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = tox21_datasets




# Define metric for the model
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, 
                           np.mean, 
                           mode="classification")

The beauty of the Deepchem is that the models use Keras-like API. We can train the model with .fit function. We pass len(tox21_tasks) into model’s arguments, which is a number of labels (12 in this case). This will set the output size of the final layer as 12. We use a batch size of 32 to speed up the computation time and to specify that the model is used for the classification task. The model takes several minutes to train on the Google Colab notebooks.

Deepchem的优点在于模型使用类似Keras的API。我们可以用训练模型。拟合函数。我们将len(tox21_tasks)传递给模型的arguments ，它是许多标签(在这种情况下为12)。这会将最终层的输出大小设置为12。我们使用32的批处理大小来加快计算时间，并指定将模型用于分类任务。该模型需要几分钟才能在Google Colab笔记本上进行训练。

# Define and fit the model
model = GraphConvModel(len(tox21_tasks), 
                       batch_size=32, 
                       mode='classification')
print("Fitting the model")
model.fit(train_dataset, nb_epoch=10)

After the training is complete, we can evaluate the model. Nothing difficult here again— we can still use the Keras API for that part. The ROC AUC scores are obtained with the .evaluate function.

训练完成后，我们可以评估模型。在这里没什么困难的-我们仍然可以使用Keras API进行该部分。 ROC AUC得分是通过.evaluate函数获得的。

print("Evaluating model with ROC AUC")
train_scores = model.evaluate(train_dataset, [metric], transformers)
valid_scores = model.evaluate(valid_dataset, [metric], transformers)


print("Train scores")
print(train_scores)


print("Validation scores")
print(valid_scores)

In my case, the train ROC AUC score was higher than the validation ROC AUC score. This might indicate that model is overfitting to some molecules.

在我的案例中，火车的ROC AUC分数高于验证的ROC AUC分数。这可能表明模型对某些分子过度拟合。

You can do much more with Deepchem that. It contains several different GNN models that are as easy to use as in this tutorial. I highly suggest looking at their tutorials. For the toxicity task, they have gathered several different examples that run with different models. You can find it here.

利用Deepchem，您可以做更多的事情。它包含几种不同的GNN模型，这些模型与本教程一样易于使用。我强烈建议您看一下他们的教程。对于毒性任务，他们收集了使用不同模型运行的几个不同示例。你可以在这里找到它。

Thank you for reading the article, I hope it was useful for you!

感谢您阅读本文，希望对您有所帮助！

关于我 (About Me)

I am an MSc Artificial Intelligence student at the University of Amsterdam. In my spare time, you can find me fiddling with data or debugging my deep learning model (I swear it worked!). I also like hiking :)

我是阿姆斯特丹大学的人工智能硕士研究生。在业余时间，您会发现我不喜欢数据或调试我的深度学习模型(我发誓它能工作！)。我也喜欢远足:)

Here are my social media profiles, if you want to stay in touch with my latest articles and other useful content:

如果您想与我的最新文章和其他有用内容保持联系，这是我的社交媒体个人资料：