Computational Biology Study Notes: Code_Vector Representation of SMILES_ChemBERTa (07.16)

文大于2

Last modified: 2024-07-17 18:16:12

Category: 2024 Study Notes. Tags: deep learning, python, transformer, vscode

First published: 2024-07-17 17:57:56

Article link: https://blog.csdn.net/weixin_43213559/article/details/140498779

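While writing another post on an overall PyTorch framework for predicting the docking affinity between protein targets and small-molecule drugs, I ran into an error, so I am first reproducing the code that turns SMILES strings into vectors.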

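Reference: loong_XL's blog post,

"ChemBERTa 化合物小分子的向量表示及相似检索" (ChemBERTa: vector representations of small-molecule compounds and similarity search), CSDN blog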

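When I tried to load the model, I hit the same problem as before, so I used the same approach as I did earlier with ESM2: loading the model from local files.

Open a browser and go to the Hugging Face model hub (Hugging Face Models).

Type "DeepChem ChemBERTa-77M-MLM" into the search bar and search.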

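"77M-MLM" identifies the specific ChemBERTa variant and its pretraining task.

77M: in the ChemBERTa-2 model family this refers to the size of the pretraining dataset, roughly 77 million SMILES strings from PubChem, not to the number of parameters. A larger pretraining set exposes the model to more chemical diversity, but pretraining also costs more compute.
MLM: Masked Language Modeling, one of the core pretraining tasks of BERT. Some tokens in the input sequence are randomly masked, and the model is trained to predict the masked tokens.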
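To make the masking idea concrete, here is a minimal sketch in plain Python. It is only an illustration under simplifying assumptions: the function `mask_tokens`, the `<mask>` string, and the character-level split of the SMILES string are all hypothetical choices of mine, not how the ChemBERTa tokenizer or the real pretraining pipeline works (BERT-style pretraining also sometimes substitutes a random token or leaves the token unchanged instead of always inserting a mask).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Randomly replace tokens with a mask token.

    Returns (masked, labels): labels holds the original token at masked
    positions and None elsewhere, i.e. what the model would be asked to predict.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Character-level split, used here only for illustration
tokens = list("CC(=O)O")
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pretraining, the loss is computed only at the positions where `labels` is not None.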

Alternatively, open the URL directly: https://huggingface.co/DeepChem/ChemBERTa-77M-MLM/tree/main

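After downloading the files, put them in a folder inside your VS Code workspace.

Remember to define your own local_model_path (the path to the folder, chemBERTa_files, that holds the downloaded files).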
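As an optional sanity check before loading, the folder can be inspected with a few lines of standard-library Python. This is a sketch under assumptions: the helper name `list_model_files` is my own, and the only file it insists on is `config.json`, which `from_pretrained` needs in a local model folder; the other files in the repository are simply listed, not validated.

```python
from pathlib import Path

def list_model_files(model_dir):
    """Return the sorted file names in model_dir, or raise if it looks unusable."""
    folder = Path(model_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Model folder not found: {folder}")
    names = sorted(p.name for p in folder.iterdir() if p.is_file())
    if "config.json" not in names:
        raise FileNotFoundError(f"No config.json in {folder}")
    return names

if __name__ == "__main__":
    # Hypothetical relative path; point this at your own chemBERTa_files folder
    local_model_path = "chemBERTa_files"
    try:
        print(list_model_files(local_model_path))
    except FileNotFoundError as err:
        print(err)
```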

Then run the model-loading code.

This produced the following message: The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'RobertaTokenizer'. The class this function is called from is 'BertTokenizer'.

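According to the message, the checkpoint I am loading uses RobertaTokenizer rather than BertTokenizer, so I switched to the matching tokenizer and model classes: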

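from transformers import RobertaTokenizer, RobertaModel

# Path to the folder holding the downloaded files; replace with your own location
local_model_path = "/home/embark/rain/wenxue/chemBERTa/chemBERTa_files"

# Load the tokenizer from the local folder
tokenizer_ = RobertaTokenizer.from_pretrained(local_model_path)
# Load the model weights from the local folder
model = RobertaModel.from_pretrained(local_model_path)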

A warning still appears: Some weights of RobertaModel were not initialized from the model checkpoint at /home/embark/rain/wenxue/chemBERTa/chemBERTa_files and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

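The warning says that the checkpoint contains no values for roberta.pooler.dense.bias and roberta.pooler.dense.weight, the weights of the model's pooling layer (pooler), so they were newly (randomly) initialized. This is expected for a checkpoint saved from masked-language-model pretraining, which has no pooler. For now I simply ignore the pooler weights: the smiles_to_vector function that follows only reads last_hidden_state, which does not depend on the pooler.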
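If the warning is distracting, one option is to build the model without a pooler at all. The sketch below demonstrates this on a tiny, randomly initialized RobertaModel so that it runs without any downloaded files; the configuration sizes and token ids are hypothetical and have nothing to do with ChemBERTa. With the local checkpoint, the analogous call would presumably be `RobertaModel.from_pretrained(local_model_path, add_pooling_layer=False)`, which I have not verified against this particular checkpoint.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny random configuration for demonstration only (NOT ChemBERTa's real sizes)
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)

# add_pooling_layer=False builds the encoder without the pooler head
model = RobertaModel(config, add_pooling_layer=False)
model.eval()

# Arbitrary token ids standing in for one tokenized SMILES string
input_ids = torch.tensor([[0, 5, 6, 2]])
with torch.no_grad():
    outputs = model(input_ids=input_ids)

print(model.pooler)                     # no pooler module was created
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```

The per-token embeddings in `last_hidden_state` are still available, and those are all the mean-pooling approach needs.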

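import torch

def smiles_to_vector(seq):
    # Tokenize the SMILES string into PyTorch tensors
    inputs = tokenizer_(seq, return_tensors="pt")
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings into a single fixed-size vector
    return outputs.last_hidden_state.mean(dim=1).squeeze()

smiles = "NS(=O)(=O)c1ccc(S(=O)(=O)NCc2cccs2)s1"
smiles_vector = smiles_to_vector(smiles)
print(smiles_vector)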
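The plain mean over dim=1 is fine for a single SMILES string. If several strings of different lengths are tokenized together with padding, the padded positions would also enter the average. A minimal sketch of a mask-aware mean follows; the helper name `masked_mean_pool` and the toy tensors are my own illustration, and with the real model the inputs would be `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where attention_mask is 0."""
    # (batch, tokens) -> (batch, tokens, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Toy example: one sequence with two real tokens and one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 3.]])
```

The padding row (100, 100) does not affect the result, which is the average of the two real token rows.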