在 PyTorch 中使用前馈神经网络构建简单的 QSAR 模型

最新推荐文章于 2023-09-19 22:33:28 发布

发呆的比目鱼

最新推荐文章于 2023-09-19 22:33:28 发布

阅读量881

点赞数 1

分类专栏： DrugAi 文章标签： pytorch 神经网络 python

本文链接：https://blog.csdn.net/weixin_42486623/article/details/128836370

版权

DrugAi 专栏收录该内容

186 篇文章 182 订阅

订阅专栏

文章介绍了如何利用PyTorch建立一个简单的定量结构-活性关系（QSAR）模型，专注于处理ExcapeDB中的生物活性数据。首先，数据预处理涉及SMILES编码的分子结构转换为摩根指纹，然后通过标准Scaler和VarianceThreshold进行特征选择。接着，使用前馈神经网络进行模型训练，并在训练、验证和测试集上评估性能。最后，展示了如何对新分子进行预测，并提出了模型改进的可能性。

摘要由CSDN通过智能技术生成

在 PyTorch 中使用前馈神经网络构建简单的 QSAR 模型

PyTorch构建一个简单的前馈神经网络，用于构建定量的结构-活动关系模型(QSAR)。

下载数据并准备数据

ExcapeDB(https://solr.ideaconsult.net/search/excape/)是BigChem项目的一部分。它是生物活性数据的宝库，一个很好的功能是可以在设置搜索条件后从界面下载大量数据集。我下载了与SLC6A4基因pXC50值相关的活性分子。我在界面的结果中将基因符号设置为SLC6A4和Active，并以CSV格式下载数据集。下载的文件重命名为SLC6A_active_excape_export.csv。
SLC6A4是一种血清素转运体，它将血清素从突触间隙转运回神经元。它是许多抗抑郁药的目标，这可能就是为什么收集了许多关于不同化合物活性的测量结果。它是许多抗抑郁药的目标，这可能就是为什么收集了许多关于不同化合物活性的测量结果。
我将导入一些库以供以后使用，并使用pandas加载CSV文件并管理数据

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv("SLC6A4_active_excape_export.csv")
data.head(1)

加载CSV文件会给我们一个包含信息的数据框架。PyTorch中构建QSAR模型的有趣列是结构，它被编码在SMILES列中，pXC50是log10转换后的IC50或EC50值。基更好，但这就是我们的工作。我将使用RDKit将SMILES转换为分子对象并对分子进行指纹识别。因此，我们将从将SMILES转换为RDKit分子对象开始。pandastools模块包含用于创建分子列的方便函数。分子列呈现时，显示在jupyter笔记本，这是方便时，探索数据。

from rdkit import Chem, DataStructs
from rdkit.Chem import PandasTools, AllChem
PandasTools.AddMoleculeColumnToFrame(data,'SMILES','Molecule')
data[["SMILES","Molecule"]].head(1)

据我所知，ExcapeDB项目使用了Ambit工具包，一些SMILES可能无法通过RDKit进行sanitizable。我将检查一些SMILES是否不能通过在Molecule列中寻找None值来转换

data.Molecule.isna().sum()

所有smile似乎都已转换为此数据集。

矢量化分子

摩根指纹是分子指纹识别的标准，但也可以尝试其他指纹识别方法。使用RDKit计算指纹使我们能够将分子转换为可用于建模的数字向量。用一个小的辅助函数来计算是很快的，作为一个例子，我将把1号分子的指纹转换成一个小图像。

def mol2fp(mol):
    fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=4096)
    ar = np.zeros((1,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, ar)
    return ar
     
fp =mol2fp(Chem.MolFromSmiles(data.loc[1,"SMILES"]))
plt.matshow(fp.reshape((64,-1)) &amp;gt; 0)

使用pandas数据框架的apply函数可以很容易地添加带有指纹向量的新列。

data["FPs"] = data.Molecule.apply(mol2fp)

分割和转换数据

PyTorch需要在PyTorch张量上进行训练，PyTorch张量与Numpy数组类似，但具有一些额外的功能，例如传输到GPU内存的能力。但首先，我们将使用scikit learn将数据集分为训练、验证和测试。

X = np.stack(data.FPs.values)
print(X.shape)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

y = data.pXC50.values.reshape((-1,1))
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.10, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train,  test_size=0.05, random_state=42)
#Normalizing output using standard scaling
scaler = StandardScaler()
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
y_validation = scaler.transform(y_validation)

我将去掉一些方差最小的位。它们几乎可以一直开着，也可以很少使用。scikit-learn中的VarienceThreshold类就很方便。就像归一化一样，我们将只使用火车集计算方差，并使用它从所有数据集中去除相同的特征。

# We'll remove low variance features
from sklearn.feature_selection import VarianceThreshold
feature_select = VarianceThreshold(threshold=0.05)
X_train = feature_select.fit_transform(X_train)
X_validation = feature_select.transform(X_validation)
X_test = feature_select.transform(X_test)
X_train.shape

看起来大部分都很少被使用。阈值是模型整个管道的超参数，可以进行调优。我们将数据集移动到GPU并将其转换为pytorch张量。

# Let's get those arrays transfered to the GPU memory as tensors
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
# If you don't have a GPU, buy a graphics card. I have for a long time used a 1060 GTX, which is not that expensive anymore.
X_train = torch.tensor(X_train, device=device).float()
X_test = torch.tensor(X_test, device=device).float()
X_validation = torch.tensor(X_validation, device=device).float()
y_train = torch.tensor(y_train, device=device).float()
y_test = torch.tensor(y_test, device=device).float()
y_validation = torch.tensor(y_validation, device=device).float()
X_train

	
X_train.shape

Pytorch与数据集和数据加载器一起工作，在训练期间将小批量提供给模型，因此我们转换数据。很容易从已经创建的张量创建数据集。批处理大小设置为256，当使用了epoch时，trainloader将对数据进行洗牌。

from torch.utils.data import TensorDataset
train_dataset = TensorDataset(X_train, y_train)
validation_dataset = TensorDataset(X_validation, y_validation)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                          batch_size=256,
                                          shuffle=True)
validation_loader = torch.utils.data.DataLoader(dataset=validation_dataset,
                                          batch_size=256,
                                          shuffle=False)

好的，训练数据准备好了，接下来是有趣的部分:定义网络。在PyTorch中，这可以通过子类化nn.Module来实现。init函数将设置网络中所需的所有层，forward方法将定义如何使用它们，其语法与Keras函数api非常相似:tensor = layer()(tensor)。最终输出没有激活函数，因为这是一个回归模型，线性激活就可以了。

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_rate, out_size):
        super(Net, self).__init__()
        # Three layers and a output layer
        self.fc1 = nn.Linear(input_size, hidden_size)  # 1st Full-Connected Layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc_out = nn.Linear(hidden_size, out_size) # Output layer
        #Layer normalization for faster training
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ln3 = nn.LayerNorm(hidden_size)        
        #LeakyReLU will be used as the activation function
        self.activation = nn.LeakyReLU()
        #Dropout for regularization
        self.dropout = nn.Dropout(dropout_rate)
     
    def forward(self, x):# Forward pass: stacking each layer together
        # Fully connected =&amp;gt; Layer Norm =&amp;gt; LeakyReLU =&amp;gt; Dropout times 3
        out = self.fc1(x)
        out = self.ln1(out)
        out = self.activation(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.ln2(out)
        out = self.activation(out)
        out = self.dropout(out)
        out = self.fc3(out)
        out = self.ln3(out)
        out = self.activation(out)
        out = self.dropout(out)
        #Final output layer
        out = self.fc_out(out)
        return out

定义了网络类后，我们可以设置超参数并创建模型。

#Defining the hyperparameters
input_size = X_train.size()[-1]     # The input size should fit our fingerprint size
hidden_size = 1024   # The size of the hidden layer
dropout_rate = 0.80    # The dropout rate
output_size = 1        # This is just a single task, so this will be one
learning_rate = 0.001  # The learning rate for the optimizer
model = Net(input_size, hidden_size, dropout_rate, output_size)

模型可以通过继承自nn. cuda()方法移动到GPU。模块类。我还没有查找模块如何发现被设置为属性的层。

model.cuda()

对于损失函数，我将使用均方误差。Adam被选为优化器。

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

与Keras相比，训练更容易上手。这为以不同方式训练网络层提供了灵活性，但在这里，我们只是用所有数据训练它。对于每个epoch, running_loss被归零，并且从DataLoader中的列车数据中获取小批量。我们通过调零梯度开始每个mini-batch处理循环。从这一点开始进行的所有前向计算都将被记录下来，并用于计算优化网络权重所需的后向传递。在每个epoch结束时，一些统计数据被打印出来，这样就可以跟踪训练。

model.train() #Ensure the network is in "train" mode with dropouts active
epochs = 200
for e in range(epochs):
    running_loss = 0
    for fps, labels in train_loader:
        # Training pass
        optimizer.zero_grad() # Initialize the gradients, which will be recorded during the forward pa
         
        output = model(fps) #Forward pass of the mini-batch
        loss = criterion(output, labels) #Computing the loss
        loss.backward() # calculate the backward pass
        optimizer.step() # Optimize the weights
         
        running_loss += loss.item()
    else:
        if e%10 == 0:
            validation_loss = torch.mean(( y_validation - model(X_validation) )**2).item()
            print("Epoch: %3i Training loss: %0.2F Validation loss: %0.2F"%(e,(running_loss/len(train_loader)), validation_loss))

在使用模型进行预测之前，必须将其设置为评估模式，其中退出层不再活动。我们可以在训练集、验证集和外部测试集上计算预测。

model.eval() #Swith to evaluation mode, where dropout is switched off
y_pred_train = model(X_train)
y_pred_validation = model(X_validation)
y_pred_test = model(X_test)

可以使用pytorch操作符来计算均方根误差。.item()方法将单个元素张量转换为用于打印的Python标量。

torch.mean(( y_train - y_pred_train )**2).item()

torch.mean(( y_validation - y_pred_validation )**2).item()

torch.mean(( y_test - y_pred_test )**2).item()

测试集的误差实际上低于验证集，尽管两者都比火车集差得多。这表明过拟合，即使退出率设置为0.8。
最后，让我们直观地评估模型。在此之前，似乎需要将张量传输到cpu，在我们可以将其转换为numpy并将其压平之前将其从梯度中分离出来。

def flatten(tensor):
    return tensor.cpu().detach().numpy().flatten()

plt.scatter(flatten(y_pred_test), flatten(y_test), alpha=0.5, label="Test")
plt.scatter(flatten(y_pred_train), flatten(y_train), alpha=0.1, label="Train")
plt.legend()
plt.plot([-1.5, 1.5], [-1.5,1.5], c="b")

如果我们想要预测任意分子，我们必须记住在指纹上使用VarianceThreshold对象，并从前馈神经网络反向缩放预测。scikit learn对象已经适合，可以在这里使用。

def predict_smiles(smiles):
    fp =mol2fp(Chem.MolFromSmiles(smiles)).reshape(1,-1)
    fp_filtered = feature_select.transform(fp)
    fp_tensor = torch.tensor(fp_filtered, device=device).float()
    prediction = model(fp_tensor)
    #return prediction.cpu().detach().numpy()
    pXC50 = scaler.inverse_transform(prediction.cpu().detach().numpy())
    return pXC50[0][0]
predict_smiles('Cc1ccc2c(N3CCNCC3)cc(F)cc2n1')