2021SC@SDUSC
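I. Background
To better understand how each part of the source code works, we read another paper, "DeepPurpose: a Deep Learning Library for Drug-Target Interaction Prediction". The paper presents DeepPurpose, a library that wraps the key steps of DTI prediction, and walks through the purpose, inputs, and outputs of each step. Studying this paper gave us a deeper understanding of the DTI prediction process, which helps with further understanding and modifying the source code.
II. Environment Setup
1. Install PyCharm Professional
Official PyCharm download page: https://www.jetbrains.com/PyCharm/download/#section=windows
Official student license page: Free Educational Licenses - Community Support
For the student verification process, see this blog post: https://blog.csdn.net/qq_45656879/article/details/104606348
2. Connect to the remote server
1) Set up an SSH connection
2) Select the remote environment
3. Download Xshell and Xftp
Official Xshell + Xftp site: XSHELL - NetSarang Website
1) Open Xshell and create a session
2) Fill in the server connection details
4. Launch the code
1) In Xshell, first run conda activate XXX to activate the chosen environment, then run jupyter notebook to start the notebook server
2) In a local terminal, set up port forwarding
3) Open port 1234 locally and find the corresponding .ipynb file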
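The port forwarding in step 2) is commonly done with SSH local forwarding. Here is a minimal sketch that only builds the command, assuming the notebook server listens on port 8888 on the remote machine (Jupyter's default) and that local port 1234 is the one opened in step 3). The user name and host are hypothetical placeholders.

```python
import shlex

def build_tunnel_command(user, host, local_port=1234, remote_port=8888):
    """Build an `ssh -L` command that forwards local_port to remote_port.

    remote_port=8888 is an assumption (Jupyter's default port); user and
    host are placeholders for your own server account.
    """
    forward = f"{local_port}:localhost:{remote_port}"
    # -N: do not run a remote command, only forward the port
    # -L: local forwarding (local_port -> localhost:remote_port on the server)
    return ["ssh", "-N", "-L", forward, f"{user}@{host}"]

cmd = build_tunnel_command("student", "gpu-server.example.edu")
print(shlex.join(cmd))
```

Running the printed command in a local terminal and then opening http://localhost:1234 in a browser should reach the remote notebook, provided the assumed ports match your setup.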
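III. Source Code Analysis
Part 1: Reading data from the dataset
I. Purpose: import the library and suppress warning messages while the code runs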
from DeepPurpose import utils, dataset
from DeepPurpose import DTI as models
import warnings
warnings.filterwarnings("ignore")
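II. Define a custom method that reads the drugs, proteins, and affinity labels from a local copy of the dataset. The original code downloads the dataset directly from the internet, which is very slow here because the proxy server has no VPN.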
import pandas as pd
import numpy as np
import wget
from zipfile import ZipFile
from DeepPurpose.utils import *
import json
import os
def load_process_DAVIS2(path = './data', binary = False, convert_to_log = True, threshold = 30):
    print('Beginning Processing...')
    # Affinity matrix: one row per drug, one column per target (values in nM)
    affinity = pd.read_csv(path + '/DAVIS/affinity.txt', header=None, sep = ' ')
    with open(path + '/DAVIS/target_seq.txt') as f:
        target = json.load(f)
    with open(path + '/DAVIS/SMILES.txt') as f:
        drug = json.load(f)
    target = list(target.values())
    drug = list(drug.values())
    SMILES = []
    Target_seq = []
    y = []
    # Flatten the matrix into one (drug, target, affinity) record per pair
    for i in range(len(drug)):
        for j in range(len(target)):
            SMILES.append(drug[i])
            Target_seq.append(target[j])
            y.append(affinity.values[i, j])
    if binary:
        print('Default binary threshold for the binding affinity scores are 30, you can adjust it by using the "threshold" parameter')
        # Label 1 when the affinity value is below the threshold, else 0
        y = [1 if i else 0 for i in np.array(y) < threshold]
    elif convert_to_log:
        print('Default set to logspace (nM -> p) for easier regression')
        y = convert_y_unit(np.array(y), 'nM', 'p')
    print('Done!')
    return np.array(SMILES), np.array(Target_seq), np.array(y)
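The two label options in this function can be illustrated with a small worked example. The sketch below assumes the standard definition of the p scale, p = -log10(affinity in mol/L), so a value given in nM is first multiplied by 1e-9. The helper names `nm_to_p` and `binarize` are hypothetical and are not part of DeepPurpose, whose `convert_y_unit` may handle edge cases such as zeros differently. The input values are made up.

```python
import math

def nm_to_p(value_nm):
    """Convert an affinity in nM to the p scale: p = -log10(value in mol/L)."""
    return -math.log10(value_nm * 1e-9)

def binarize(values_nm, threshold=30):
    """Label a pair 1 when its affinity in nM is below the threshold, else 0."""
    return [1 if v < threshold else 0 for v in values_nm]

kd_values = [10000.0, 43.0, 1.0]   # made-up affinity values in nM
print([round(nm_to_p(v), 3) for v in kd_values])
print(binarize(kd_values))
```

On this scale a smaller nM value (stronger binding) gives a larger p value; for instance, 43 nM maps to roughly 7.37.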
III. Read the three input arrays from the dataset (drug SMILES strings, protein sequences, affinity scores)
1. Read from a local file
X_drugs, X_targets, y = dataset.read_file_training_dataset_drug_target_pairs('./toy_data/dti.txt')
print('Drug 1: ' + X_drugs[1])
print('Target 1: ' + X_targets[1])
print('Score 1: ' + str(y[1]))
Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQ
Score 1: 4.999
2. Read the dataset after downloading it to the remote server (this calls the load_process_DAVIS2 method defined in step II)
X_drugs, X_targets, y = load_process_DAVIS2(path = './data', binary = False, convert_to_log = True, threshold = 30)
print('Drug 1: ' + X_drugs[0])
print('Target 1: ' + X_targets[0])
print('Score 1: ' + str(y[0]))
Beginning Processing...
Default set to logspace (nM -> p) for easier regression
Done!
Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL
Score 1: 7.366531544420414
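Part 2: Model training plus two application examples
The workflow consists of the following six steps:
Step 1: Encoder specification
Here we choose the Morgan encoder for drugs and a CNN encoder for proteins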
# drug_encoding, target_encoding = 'MPNN', 'Conjoint_triad'
drug_encoding, target_encoding = 'Morgan', 'CNN'
The URL below lists all available protein and drug encoders, which you can mix and match as needed
https://github.com/kexinhuang12345/DeepPurpose#encodings
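To give a feel for what a sequence encoder consumes, a CNN over protein sequences needs a fixed-size numeric input. The sketch below is a generic illustration of one-hot encoding with truncation and padding. It is not DeepPurpose's actual preprocessing: the alphabet, the padding symbol, and `max_len` are illustrative assumptions, and `one_hot_sequence` is a hypothetical helper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (illustrative alphabet)
PAD = "?"                              # hypothetical padding symbol

def one_hot_sequence(seq, max_len=10):
    """Truncate or pad seq to max_len, then one-hot encode each position."""
    alphabet = AMINO_ACIDS + PAD
    index = {ch: i for i, ch in enumerate(alphabet)}
    padded = seq[:max_len].ljust(max_len, PAD)
    matrix = []
    for ch in padded:
        row = [0] * len(alphabet)
        row[index.get(ch, index[PAD])] = 1  # unknown symbols fall back to PAD
        matrix.append(row)
    return matrix

m = one_hot_sequence("MKKFFDS", max_len=10)
print(len(m), len(m[0]))  # number of positions, alphabet size
```

Every sequence ends up with the same shape regardless of its original length, which is what lets a convolutional layer process a batch of proteins together.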
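Step 2: Data encoding and split
Here we use the ```utils.data_process``` function to encode the data into the specified format. It takes the train/validation/test split fractions and a random seed, so that the same split can be reproduced. The function also supports the split methods ```cold_drug```, ```cold_protein```, and ```random```. The two cold splits partition the data by drug or by protein, so that model robustness is evaluated on drugs or proteins that were not seen during training.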
train, val, test = utils.data_process(X_drugs, X_targets, y,
                                drug_encoding, target_encoding,
                                split_method='cold_drug',frac=[0.7,0.1,0.2],
                                random_seed = 1)
train.head(1)
Partial encoding result:
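The idea behind a cold-drug split can be sketched in plain Python: the split is made over unique drugs rather than over individual pairs, so no drug in the test fold appears in training. This is a simplified illustration under that assumption, not the library's implementation; `cold_drug_split` is a hypothetical helper and the data is made up.

```python
import random

def cold_drug_split(pairs, frac=(0.7, 0.1, 0.2), seed=1):
    """Split (drug, target, y) pairs so that each drug lands in only one fold."""
    drugs = sorted({d for d, _, _ in pairs})
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(drugs)
    n_train = round(len(drugs) * frac[0])
    n_val = round(len(drugs) * frac[1])
    train_d = set(drugs[:n_train])
    val_d = set(drugs[n_train:n_train + n_val])
    train = [p for p in pairs if p[0] in train_d]
    val = [p for p in pairs if p[0] in val_d]
    # Everything left over goes to the test fold
    test = [p for p in pairs if p[0] not in train_d and p[0] not in val_d]
    return train, val, test

# 10 made-up drugs, each paired with 3 made-up proteins
pairs = [(f"drug{i}", f"prot{j}", float(i + j)) for i in range(10) for j in range(3)]
train, val, test = cold_drug_split(pairs)
print(len(train), len(val), len(test))
```

A random split would instead shuffle the pairs themselves, which lets the same drug appear in both training and test data and usually gives more optimistic scores.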
Step 3: Model configuration generation
config = utils.generate_config(drug_encoding = drug_encoding,
                         target_encoding = target_encoding,
                         cls_hidden_dims = [1024,1024,512],
                         train_epoch = 5,
                         LR = 0.001,
                         batch_size = 128,
                         hidden_dim_drug = 128,
                         mpnn_hidden_size = 128,
                         mpnn_depth = 3,
                         cnn_target_filters = [32,64,96],
                         cnn_target_kernels = [4,8,12]
                        )
Step 4: Model initialization
This uses the config generated in Step 3
model = models.model_initialize(**config)
Step 5: Model training
model.train(train, val, test)
During training the loss gradually decreases. This run uses a least-squares (mean squared error) loss function.
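A least-squares loss is usually computed as the mean squared error, MSE = (1/n) * sum((y_i - y_hat_i)^2). The sketch below works through that formula on made-up numbers; `mse_loss` is a hypothetical helper written for illustration and is not taken from the library.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [5.0, 7.0, 9.0]      # made-up affinity labels on the p scale
y_pred = [5.5, 6.5, 9.0]      # made-up model predictions
print(mse_loss(y_true, y_pred))  # (0.25 + 0.25 + 0.0) / 3
```

Because the errors are squared, a few large mistakes raise the loss much more than many small ones.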
Step 6: Predict with the trained model
Application 1: Drug repurposing
1) Load the target protein
t, t_name = dataset.load_SARS_CoV2_Protease_3CL()
print('Target Name: ' + t_name)
print('Amino Acid Sequence: '+ t)
2) Load a set of candidate drugs
r, r_name, r_pubchem_cid = dataset.load_antiviral_drugs()
print('Repurposing Drug 1 Name: ' + r_name[0])
print('Repurposing Drug 1 SMILES: ' + r[0])
print('Repurposing Drug 1 Pubchem CID: ' + str(r_pubchem_cid[0]))
3) Predict
y_pred = models.repurpose(X_repurpose = r, target = t, model = model, drug_names = r_name, target_name = t_name,
result_folder = "./result/", convert_y = True)
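Prediction result: on the log (p) scale used for training, a higher score means binding is more likely. If the output is converted back to nM, a lower value means stronger binding.
Application 2: Virtual screening
1) Download the proteins and drugs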
t, d = dataset.load_IC50_1000_Samples()
2) Predict the binding score of each pair
y_pred = models.virtual_screening(d, t, model)
Result:
Part 3: Saving the trained model
model.save_model('./tutorial_model')
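IV. Conclusion
That is all of this week's work. The steps for connecting to the remote server follow a tutorial, with some of my own understanding added. If you spot any problems, corrections are welcome!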