2021SC@SDUSC Software Engineering Application and Practice 02: Running the Source Code in a Notebook

2021SC@SDUSC

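I. Background

To better understand how each part of the source code runs, we read another paper, "DeepPurpose: a Deep Learning Library for Drug-Target Interaction Prediction". The paper presents the DeepPurpose library, which wraps the key steps of drug-target interaction (DTI) prediction and analyzes the role, input, and output of each step. Studying this paper gave us a deeper understanding of the DTI prediction process, which helps with further understanding and modifying the source code.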

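II. Environment Setup

1. Install PyCharm Professional

Official PyCharm download page: https://www.jetbrains.com/PyCharm/download/#section=windows

Official student verification page: Free Educational Licenses - Community Support

For the student verification process, see this blog post: https://blog.csdn.net/qq_45656879/article/details/104606348

2. Connect to the remote server

1) Set up an SSH connection

2) Select the remote environment

3. Download Xshell and Xftp

Official Xshell and Xftp site: XSHELL - NetSarang Website

1) Open Xshell and create a session

2) Fill in the server connection details

4. Start the code

1) In Xshell, first run conda activate XXX to activate the chosen environment, then run jupyter notebook to start the notebook server

2) In a local terminal, set up a port mapping

3) Open port 1234 and find the corresponding ipynb file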
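The port mapping in step 2) is typically an SSH local port forward. Below is a minimal sketch that only builds the command string. It assumes Jupyter listens on its default port 8888 on the server; the account name and host name are hypothetical placeholders, and 1234 is the local port mentioned above.

```python
import shlex


def build_tunnel_command(local_port, remote_port, user, host):
    """Build an `ssh -N -L` command that forwards local_port on this
    machine to remote_port on the server (-N: run no remote command)."""
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{host}"]


# Hypothetical account and host; replace them with your own.
cmd = build_tunnel_command(1234, 8888, "user", "gpu-server.example.edu")
print(shlex.join(cmd))
```

Running the printed command in the local terminal and then opening http://localhost:1234 in a browser should show the notebook file list.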

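III. Source Code Analysis

Part 1: Reading data from the dataset

A. Import the library and suppress warning messages at runtime (warnings.filterwarnings("ignore") hides warnings only; errors are still raised)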

from DeepPurpose import utils, dataset
from DeepPurpose import DTI as models
import warnings
warnings.filterwarnings("ignore")
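The difference between a suppressed warning and an error can be checked with a small standard-library sketch. The function name and the warning message are made up for illustration.

```python
import warnings

warnings.filterwarnings("ignore")


def noisy_step():
    # Emits a warning; the "ignore" filter keeps it from being displayed.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "finished"


result = noisy_step()

# Exceptions are not affected by the warnings filter.
try:
    int("not a number")
    error_raised = False
except ValueError:
    error_raised = True

print(result, error_raised)
```

The warning is silently dropped, while the ValueError still has to be handled explicitly.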

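B. Define a custom function that reads the drugs, proteins, and affinity scores from a local copy of the dataset. The original code downloads the dataset directly from the internet, which is very slow here because the server has no VPN.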

import json

import numpy as np
import pandas as pd
from DeepPurpose.utils import convert_y_unit

def load_process_DAVIS2(path = './data', binary = False, convert_to_log = True, threshold = 30):
    # Load DAVIS from local files under `path` instead of downloading it.
    print('Beginning Processing...')

    # Affinity matrix: one row per drug, one column per target
    affinity = pd.read_csv(path + '/DAVIS/affinity.txt', header=None, sep = ' ')

    # Both files are JSON dictionaries; only their values are used
    with open(path + '/DAVIS/target_seq.txt') as f:
        target = json.load(f)

    with open(path + '/DAVIS/SMILES.txt') as f:
        drug = json.load(f)

    target = list(target.values())
    drug = list(drug.values())

    SMILES = []
    Target_seq = []
    y = []

    # Flatten the matrix into one (drug, target, affinity) triple per pair
    for i in range(len(drug)):
        for j in range(len(target)):
            SMILES.append(drug[i])
            Target_seq.append(target[j])
            y.append(affinity.values[i, j])

    if binary:
        # Pairs with affinity below the threshold are labeled 1, all others 0
        print('Default binary threshold for the binding affinity scores are 30, you can adjust it by using the "threshold" parameter')
        y = [1 if i else 0 for i in np.array(y) < threshold]
    elif convert_to_log:
        print('Default set to logspace (nM -> p) for easier regression')
        y = convert_y_unit(np.array(y), 'nM', 'p')
    print('Done!')
    return np.array(SMILES), np.array(Target_seq), np.array(y)
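The file layout that this function expects can be illustrated with toy files. The sketch below is a standard-library re-creation of the pairing loop only, not the function itself; the dictionary keys, SMILES strings, sequences, and affinity values are all made up.

```python
import json
import os
import tempfile


def load_pairs(davis_dir):
    """Read the three DAVIS-style files and return (drug, target, affinity) triples."""
    with open(os.path.join(davis_dir, "SMILES.txt")) as f:
        drugs = list(json.load(f).values())
    with open(os.path.join(davis_dir, "target_seq.txt")) as f:
        targets = list(json.load(f).values())
    with open(os.path.join(davis_dir, "affinity.txt")) as f:
        affinity = [[float(v) for v in line.split()] for line in f if line.strip()]
    # One triple per cell of the drug x target matrix, drugs in the outer loop.
    return [(d, t, affinity[i][j])
            for i, d in enumerate(drugs)
            for j, t in enumerate(targets)]


# Toy directory with made-up contents: 2 drugs x 3 targets.
davis = os.path.join(tempfile.mkdtemp(), "DAVIS")
os.makedirs(davis)
with open(os.path.join(davis, "SMILES.txt"), "w") as f:
    json.dump({"d1": "CCO", "d2": "CCN"}, f)
with open(os.path.join(davis, "target_seq.txt"), "w") as f:
    json.dump({"t1": "MKK", "t2": "SAD", "t3": "GGS"}, f)
with open(os.path.join(davis, "affinity.txt"), "w") as f:
    f.write("10000 43 250\n5 10000 10000\n")

pairs = load_pairs(davis)
print(len(pairs))   # 6
print(pairs[1])     # ('CCO', 'SAD', 43.0)
```

Every drug is paired with every target, so the number of samples equals the number of drugs times the number of targets.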

C. Read the three input arrays from the dataset (drug SMILES string, protein sequence, affinity score)

1. Read from a local file

X_drugs, X_targets, y = dataset.read_file_training_dataset_drug_target_pairs('./toy_data/dti.txt')
print('Drug 1: ' + X_drugs[1])
print('Target 1: ' + X_targets[1])
print('Score 1: ' + str(y[1]))
Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQ
Score 1: 4.999

2. Read the dataset after downloading it to the remote server (this calls the load_process_DAVIS2 function defined in item B)

X_drugs, X_targets, y = load_process_DAVIS2(path = './data', binary = False, convert_to_log = True, threshold = 30)
print('Drug 1: ' + X_drugs[0])
print('Target 1: ' + X_targets[0])
print('Score 1: ' + str(y[0]))
Beginning Processing...
Default set to logspace (nM -> p) for easier regression
Done!
Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL
Score 1: 7.366531544420414
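The "nM -> p" conversion reported in the log above is a negative base-10 logarithm of the molar concentration. Below is a minimal sketch of that formula and of the binary thresholding. The helper names are hypothetical, and DeepPurpose's own convert_y_unit may differ in details such as numerical safeguards.

```python
import math


def nm_to_p(kd_nm):
    """Convert an affinity in nM to the negative log10 of its molar value."""
    return -math.log10(kd_nm * 1e-9)


def binarize(kd_nm, threshold=30):
    """Label a pair 1 when its affinity (in nM) is below the threshold, else 0."""
    return 1 if kd_nm < threshold else 0


print(round(nm_to_p(43), 4))      # 7.3665
print(round(nm_to_p(10000), 4))   # 5.0
print(binarize(5), binarize(43))  # 1 0
```

On the p scale a larger value means a smaller concentration and therefore stronger binding, which also compresses the wide nM range into a range that is easier to regress.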

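Part 2: Model training and two application examples

The workflow consists of the following seven steps:

Step 1: Encoder specification

Here the drug is encoded with Morgan fingerprints and the protein with a CNN.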

# drug_encoding, target_encoding = 'MPNN', 'Conjoint_triad'
drug_encoding, target_encoding = 'Morgan', 'CNN'

The page below lists all available drug and protein encoders, which can be combined freely:

https://github.com/kexinhuang12345/DeepPurpose#encodings

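Step 2: Data encoding and split

Here we use the ```utils.data_process``` function to encode the data into the specified format. It takes the train/validation/test split fractions and a random seed, so that the same split can be reproduced. The function also supports several split methods, namely ```cold_drug```, ```cold_protein```, and ```random```. The cold splits partition the data by drug or by protein, so that the model's robustness is evaluated on drugs or proteins it has not seen during training.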

train, val, test = utils.data_process(X_drugs, X_targets, y,
                                drug_encoding, target_encoding,
                                split_method='cold_drug',frac=[0.7,0.1,0.2],
                                random_seed = 1)
train.head(1)

Part of the encoded result:
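The idea behind the cold_drug split used above can be sketched in plain Python: the unique drugs are shuffled and divided, so that no drug appears in more than one subset. This is an illustrative assumption about the technique, not DeepPurpose's actual implementation, and the data is made up.

```python
import random


def cold_drug_split(pairs, frac=(0.7, 0.1, 0.2), seed=1):
    """Split (drug, target, y) triples so that each drug lands in exactly one subset."""
    drugs = sorted({d for d, _, _ in pairs})
    random.Random(seed).shuffle(drugs)  # a fixed seed makes the split reproducible
    n_train = round(len(drugs) * frac[0])
    n_val = round(len(drugs) * frac[1])
    train_drugs = set(drugs[:n_train])
    val_drugs = set(drugs[n_train:n_train + n_val])
    train = [p for p in pairs if p[0] in train_drugs]
    val = [p for p in pairs if p[0] in val_drugs]
    # Everything not assigned to train or validation goes to the test set.
    test = [p for p in pairs if p[0] not in train_drugs and p[0] not in val_drugs]
    return train, val, test


# Made-up data: 10 drugs x 3 targets.
pairs = [(f"drug{i}", f"target{j}", 5.0) for i in range(10) for j in range(3)]
train, val, test = cold_drug_split(pairs)
print(len(train), len(val), len(test))  # 21 3 6
```

Because the test drugs never occur in the training set, the test score reflects how well the model generalizes to new compounds rather than to new pairings of known compounds.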

Step 3: Model configuration generation

config = utils.generate_config(drug_encoding = drug_encoding,
                         target_encoding = target_encoding,
                         cls_hidden_dims = [1024,1024,512],
                         train_epoch = 5,
                         LR = 0.001,
                         batch_size = 128,
                         hidden_dim_drug = 128,
                         mpnn_hidden_size = 128,
                         mpnn_depth = 3,
                         cnn_target_filters = [32,64,96],
                         cnn_target_kernels = [4,8,12]
                        )

Step 4: Model initialization

This uses the config generated in Step 3.

model = models.model_initialize(**config)

Step 5: Model training

model.train(train, val, test)

During training the loss gradually decreases. This run uses a least-squares loss, i.e. the mean squared error (MSE).
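For reference, the mean squared error averages the squared differences between predictions and labels. A minimal sketch with made-up values (the training loop itself computes the loss on tensors inside the library):

```python
def mse_loss(y_pred, y_true):
    """Mean of the squared differences between predictions and labels."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


# Made-up affinity values on the log (p) scale.
print(mse_loss([5.0, 7.0], [5.0, 7.0]))  # 0.0
print(mse_loss([5.5, 6.5], [5.0, 7.0]))  # 0.25
```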

Step 6: Predict with the trained model

Application 1: Drug repurposing

1) Load the target protein

t, t_name = dataset.load_SARS_CoV2_Protease_3CL()
print('Target Name: ' + t_name)
print('Amino Acid Sequence: '+ t)

2) Load a set of candidate drugs

r, r_name, r_pubchem_cid = dataset.load_antiviral_drugs()
print('Repurposing Drug 1 Name: ' + r_name[0])
print('Repurposing Drug 1 SMILES: ' + r[0])
print('Repurposing Drug 1 Pubchem CID: ' + str(r_pubchem_cid[0]))

3) Predict

y_pred = models.repurpose(X_repurpose = r, target = t, model = model, drug_names = r_name, target_name = t_name,
                          result_folder = "./result/", convert_y = True)

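Prediction results: on the log (p) scale used for training, a higher score means binding is more likely. If convert_y = True converts the predictions back to nM, as the parameter name suggests, the reading is reversed and a smaller value indicates stronger binding.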
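Ranking candidates from predicted scores is a simple sort whose direction depends on the unit: on the log (p) scale a larger value means stronger predicted binding, while in nM a smaller value does. A minimal sketch with a hypothetical helper and made-up drug names and scores:

```python
def rank_candidates(names, scores, higher_is_better=True):
    """Return (name, score) pairs ordered from strongest to weakest predicted binding."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=higher_is_better)


names = ["drug_A", "drug_B", "drug_C"]  # made-up names

# Scores on the log (p) scale: larger is stronger.
p_scores = [5.2, 7.1, 6.3]
print(rank_candidates(names, p_scores)[0])  # ('drug_B', 7.1)

# Scores in nM: smaller is stronger.
nm_scores = [6309.6, 79.4, 501.2]
print(rank_candidates(names, nm_scores, higher_is_better=False)[0])  # ('drug_B', 79.4)
```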

Application 2: Virtual screening

1) Load the proteins and drugs

t, d = dataset.load_IC50_1000_Samples()

2) Predict the binding scores

y_pred = models.virtual_screening(d, t, model)

Results

Part 3: Saving the trained model

model.save_model('./tutorial_model')

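IV. Conclusion

That is all of this week's work. The section on connecting to the remote server follows a tutorial, with some of my own understanding added. If you spot any problems, corrections are welcome!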
