2021SC@SDUSC Software Engineering Application and Practice 09: Extracting Features with a GNN and Protein Sequences

见到我请过去学习

Article link: https://blog.csdn.net/m0_55985367/article/details/121712067

2021SC@SDUSC

I. Introduction

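Earlier posts already covered GNN basics, some GCN programming, and how the contact map is generated. This week analyzes the code at https://github.com/595693085/DGraphDTA. Because the code is fairly long, this post concentrates on the part that uses the contact map to extract protein and drug features.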

II. Source Code Analysis

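First, pick the CUDA device, instantiate the model and move it to that device, choose the loss function (MSE), and create the Adam optimizer over the model's parameters with learning rate LR.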

USE_CUDA = torch.cuda.is_available()
device = torch.device(cuda_name if USE_CUDA else 'cpu')
model = GNNNet()
model.to(device)
model_st = GNNNet.__name__
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

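Next comes a for loop, although it only runs once, because earlier the script sets

datasets = [['davis', 'kiba'][int(sys.argv[1])]]

so datasets holds a single name, and dataset is either davis or kiba depending on the first command-line argument.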
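
To make the indexing concrete, here is a minimal sketch of how that expression resolves. The helper name select_datasets is hypothetical; it only wraps the expression from the script so it can be called without real command-line arguments.

```python
def select_datasets(arg):
    """Mimic datasets = [['davis', 'kiba'][int(sys.argv[1])]] for one argument string."""
    # The inner list is indexed by the integer argument; the outer brackets
    # wrap the chosen name in a one-element list.
    return [['davis', 'kiba'][int(arg)]]


print(select_datasets('0'))  # ['davis']
print(select_datasets('1'))  # ['kiba']
```

Because the result always has exactly one element, the for loop over datasets executes its body once per run.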

for dataset in datasets:
    train_data, valid_data = create_dataset_for_5folds(dataset, fold)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=TRAIN_BATCH_SIZE, shuffle=True,
                                               collate_fn=collate)
    valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=TEST_BATCH_SIZE, shuffle=False,
                                               collate_fn=collate)

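This part is important, so let's start with the create_dataset_for_5folds(dataset, fold) method.

Given the input fold, the method takes one of the 5 folds as valid_fold and merges the other 4 into train_folds. Evaluating on a fold the model was not trained on is what makes overfitting visible.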

def create_dataset_for_5folds(dataset, fold=0):
    # load dataset
    dataset_path = 'data/' + dataset + '/'
    # NOTE: train_fold_setting1.txt is a JSON file; judging from how it is used below, it holds the sample indices of each of the 5 folds
    train_fold_origin = json.load(open(dataset_path + 'folds/train_fold_setting1.txt'))
    train_fold_origin = [e for e in train_fold_origin]  # for 5 folds


.......

    valid_fold = train_fold_origin[fold]  # one fold
    for i in range(len(train_fold_origin)):  # other folds
        if i != fold:
            train_folds += train_fold_origin[i]
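
The hold-out logic can be tried on toy data. This is a minimal sketch, not the repository's code: split_folds and toy_folds are hypothetical names, and the toy list stands in for the parsed fold file.

```python
def split_folds(folds, fold=0):
    """Return (train_indices, valid_indices) with folds[fold] held out."""
    valid = list(folds[fold])
    train = []
    for i, part in enumerate(folds):
        if i != fold:
            train += part  # flatten the remaining folds into one list
    return train, valid


# Toy stand-in for the parsed fold file: five folds of two sample indices each.
toy_folds = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
train_idx, valid_idx = split_folds(toy_folds, fold=1)
print(train_idx)  # [0, 1, 4, 5, 6, 7, 8, 9]
print(valid_idx)  # [2, 3]
```

Running the script once per fold value (0 to 4) would give the usual 5-fold cross-validation, with every sample used for validation exactly once.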

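Load the drug and protein dictionaries (each maps a key to a sequence: a SMILES string for drugs, an amino-acid sequence for proteins)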

    ligands = json.load(open(dataset_path + 'ligands_can.txt'), object_pairs_hook=OrderedDict)
    proteins = json.load(open(dataset_path + 'proteins.txt'), object_pairs_hook=OrderedDict)
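
The object_pairs_hook=OrderedDict argument keeps the keys in file order. That matters here because later code looks drugs and proteins up by position (drugs[rows[...]], prots[cols[...]]). A small sketch with made-up keys and sequences:

```python
import json
from collections import OrderedDict

# Toy JSON text standing in for proteins.txt; the keys and sequences are made up.
toy_text = '{"P2": "MKV", "P1": "GAS", "P3": "TTL"}'
proteins = json.loads(toy_text, object_pairs_hook=OrderedDict)

# Keys come back in file order, so position i in this list can be paired
# with column i of an affinity matrix.
prot_keys = list(proteins.keys())
print(prot_keys)  # ['P2', 'P1', 'P3']
```

On Python 3.7 and later a plain dict preserves insertion order as well, so the hook mainly makes the intent explicit.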

Build the paths of the aln files (alignments) and the pconsc4 files (contact maps) for every protein and store them in two lists

    # load contact and aln
    msa_path = 'data/' + dataset + '/aln'
    contac_path = 'data/' + dataset + '/pconsc4'
    msa_list = []
    contact_list = []
    for key in proteins:
        msa_list.append(os.path.join(msa_path, key + '.aln'))
        contact_list.append(os.path.join(contac_path, key + '.npy'))

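Store each drug's SMILES string: drugs holds the canonical isomeric SMILES regenerated by RDKit, and drug_smiles holds the string as it appears in the file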

    # smiles
    for d in ligands.keys():
        lg = Chem.MolToSmiles(Chem.MolFromSmiles(ligands[d]), isomericSmiles=True)
        drugs.append(lg)
        drug_smiles.append(ligands[d])

Store the protein sequences and keys

    # seqs
    for t in proteins.keys():
        prots.append(proteins[t])
        prot_keys.append(t)

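For the davis dataset, move the affinity to a negative log scale. The division by 1e9 suggests the raw values are in nM, so the result would be a pKd-style value in which larger means stronger binding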

    if dataset == 'davis':
        affinity = [-np.log10(y / 1e9) for y in affinity]
    affinity = np.asarray(affinity)
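
A worked example of the transform, assuming the raw davis values are dissociation constants in nM (an assumption based on the 1e9 factor, not something the snippet states). The helper name kd_nm_to_pkd is hypothetical, and math.log10 is used in place of np.log10 to keep the sketch dependency-free.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Apply the same transform as the davis branch: -log10(y / 1e9)."""
    return -math.log10(kd_nm / 1e9)


# 10000 nM is 1e-5 M, so the transformed value is about 5.
print(round(kd_nm_to_pkd(10000.0), 6))  # 5.0
# A tighter binder (smaller Kd) gets a larger value.
print(round(kd_nm_to_pkd(10.0), 6))     # 8.0
```

The log scale compresses values that span several orders of magnitude into a range that suits a regression target trained with MSE loss.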

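For Y, drop the drug-protein pairs whose affinity is nan and keep only the pairs with a recorded affinity, then select the pairs that belong to the current folds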

    for opt in opts:
        if opt == 'train':
            rows, cols = np.where(np.isnan(affinity) == False)
            rows, cols = rows[train_folds], cols[train_folds]
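
The two lines above are compact, so here is a plain-Python sketch of what they select. It assumes, as the indexing suggests, that the fold indices refer to positions in the list of non-NaN pairs. observed_pairs and the toy values are hypothetical.

```python
import math


def observed_pairs(affinity):
    """List (row, col) for every non-NaN entry, in row-major order like np.where."""
    return [(r, c)
            for r, row in enumerate(affinity)
            for c, value in enumerate(row)
            if not math.isnan(value)]


nan = float('nan')
# Toy 2 drugs x 3 proteins affinity matrix with two missing measurements.
toy_affinity = [[5.0, nan, 7.2],
                [nan, 6.1, 5.5]]
pairs = observed_pairs(toy_affinity)
print(pairs)  # [(0, 0), (0, 2), (1, 1), (1, 2)]

# Fold indices refer to positions in this list of observed pairs.
toy_train_folds = [0, 3]
print([pairs[i] for i in toy_train_folds])  # [(0, 0), (1, 2)]
```

Each selected (row, col) pair then identifies one drug (row) and one protein (col).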

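For each remaining pair, skip proteins whose contact map or alignment file is missing; otherwise collect the drug SMILES, protein sequence, protein key, and affinity into ls and append it to train_fold_entries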

            for pair_ind in range(len(rows)):
                if not valid_target(prot_keys[cols[pair_ind]], dataset):  # ensure the contact and aln files exists
                    continue
                ls = []
                ls += [drugs[rows[pair_ind]]]
                ls += [prots[cols[pair_ind]]]
                ls += [prot_keys[cols[pair_ind]]]
                ls += [affinity[rows[pair_ind], cols[pair_ind]]]
                train_fold_entries.append(ls)
                valid_train_count += 1

Extract the drug features and build a molecular graph for each SMILES string

    # create smile graph
    smile_graph = {}    
    for smile in compound_iso_smiles:
        g = smile_to_graph(smile)
        smile_graph[smile] = g

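Extract the protein features from the protein key, the contact map, and the multiple sequence alignment result. Next, let's look at the target_to_graph method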

    for key in target_key:
        if not valid_target(key, dataset):  # ensure the contact and aln files exists
            continue
        g = target_to_graph(key, proteins[key], contac_path, msa_path)
        target_graph[key] = g

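Inside target_to_graph, the contact map is filtered first: after the identity matrix is added so that every residue gets a self-loop, two residues count as being in contact only when the value is >= 0.5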

def target_to_graph(target_key, target_sequence, contact_dir, aln_dir):
    target_edge_index = []
    target_size = len(target_sequence)
    # contact_dir = 'data/' + dataset + '/pconsc4'
    contact_file = os.path.join(contact_dir, target_key + '.npy')
    contact_map = np.load(contact_file)
    contact_map += np.matrix(np.eye(contact_map.shape[0]))
    index_row, index_col = np.where(contact_map >= 0.5)

Build the protein graph's edge_index from the adjacency matrix derived from the contact map

    for i, j in zip(index_row, index_col):
        target_edge_index.append([i, j])
    target_feature = target_to_feature(target_key, target_sequence, aln_dir)
    target_edge_index = np.array(target_edge_index)
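
The thresholding and edge construction can be sketched without NumPy as follows. contact_map_to_edges and toy_map are hypothetical, and nested lists replace the array so the example runs on its own.

```python
def contact_map_to_edges(contact_map, threshold=0.5):
    """Return [i, j] pairs whose value, after adding the self-loop, reaches threshold."""
    n = len(contact_map)
    edges = []
    for i in range(n):
        for j in range(n):
            # Adding 1 on the diagonal mirrors contact_map += eye(n).
            value = contact_map[i][j] + (1.0 if i == j else 0.0)
            if value >= threshold:
                edges.append([i, j])
    return edges


toy_map = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.4],
           [0.1, 0.4, 0.0]]
print(contact_map_to_edges(toy_map))
# [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]]
```

Residues 0 and 1 are connected in both directions, residue 2 keeps only its self-loop, and the 0.4 entry falls below the threshold.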

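Next, look at target_to_feature() to see how the protein node features are obtained. It only builds the path of the .aln multiple sequence alignment file and hands it to target_feature(), which derives the features from the alignment and the sequence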

def target_to_feature(target_key, target_sequence, aln_dir):
    # aln_dir = 'data/' + dataset + '/aln'
    aln_file = os.path.join(aln_dir, target_key + '.aln')
    # if 'X' in target_sequence:
    #     print(target_key)
    feature = target_feature(aln_file, target_sequence)
    return feature

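This method is straightforward: it computes a PSSM from the .aln file and the sequence, derives the other features from the sequence alone, and concatenates the two as the node features. That completes the protein feature extraction.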

def target_feature(aln_file, pro_seq):
    pssm = PSSM_calculation(aln_file, pro_seq)
    other_feature = seq_feature(pro_seq)
    # print('target_feature')
    # print(pssm.shape)
    # print(other_feature.shape)

    # print(other_feature.shape)
    # return other_feature
    return np.concatenate((np.transpose(pssm, (1, 0)), other_feature), axis=1)
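
The final line implies a shape convention: pssm comes out as (channels, length), is transposed to (length, channels), and is then joined column-wise with other_feature, which must have one row per residue. A dependency-free sketch with toy sizes follows. The real channel counts are not shown in the snippet, so the numbers below are placeholders, and concat_node_features is a hypothetical name.

```python
def concat_node_features(pssm, other_feature):
    """Transpose pssm to (length, channels) and append other_feature column-wise."""
    pssm_t = [list(col) for col in zip(*pssm)]  # (channels, length) -> (length, channels)
    if len(pssm_t) != len(other_feature):
        raise ValueError('both inputs need one row per residue')
    return [p + list(o) for p, o in zip(pssm_t, other_feature)]


# Toy sizes: 2 PSSM channels and 3 extra features for a 4-residue sequence.
toy_pssm = [[0.1, 0.2, 0.3, 0.4],
            [0.9, 0.8, 0.7, 0.6]]
toy_other = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
features = concat_node_features(toy_pssm, toy_other)
print(len(features), len(features[0]))  # 4 5
print(features[0])  # [0.1, 0.9, 1, 0, 0]
```

Each row of the result is the feature vector of one residue, that is, one node of the protein graph.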