Shandong University Class of 2019, Software Engineering Application and Practice: AI-Based Peptide Drug Analysis (Part 10)

Haws001

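Category: Shandong University Class of 2019 Software Engineering Application and Practice. Tags: artificial intelligence, deep learning, natural language processing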

Article link: https://blog.csdn.net/ChloeS0/article/details/121667720

2021SC@SDUSC

AI-Based Peptide Drug Analysis

Topic: Protein Pre-trained Models (4)

Code Analysis

Prediction Section
ProtTrans/Prediction/ProtBert-BFD-Predict-SS3.ipynb

Load the necessary libraries, including Hugging Face transformers

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import re

Build the TokenClassificationPipeline from the pre-trained model and tokenizer

pipeline = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=0
)
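The call above hard-codes `device=0`, which assumes a CUDA GPU is present. A minimal sketch of a fallback follows, assuming the Hugging Face pipeline convention that `-1` selects the CPU and a non-negative integer selects a GPU index. The helper name `pipeline_device_index` is hypothetical and not part of the notebook.

```python
def pipeline_device_index(cuda_available: bool) -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    return 0 if cuda_available else -1


# Detect CUDA only if PyTorch is installed; otherwise fall back to the CPU.
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

# Hypothetical variable that could be passed as `device=` to the pipeline.
device_index = pipeline_device_index(cuda_available)
print(device_index)
```

Passing `device=device_index` instead of `device=0` would let the same cell run on a machine without a GPU, only more slowly.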

Output: the notebook shows download progress bars for five files of 769, 1676081643, 81, 112, and 86 bytes; the largest (about 1.68 GB) is presumably the model weights.

Create or load the sequences, and map the rarely occurring amino acids (U, Z, O, B) to (X)

sequences_Example = ["M G A E E E D T A I L Y P F T I S G N D R N G N F T I N F K G T P N S T N N G C I G Y S Y N G D W E K I E W E G S C D G N G N L V V E V P M S K I P A G V T S G E I Q I W W H S G D L K M T D Y K A L E H H H H H H", 
"M N K Y L F E L P Y E R S E P G W T I R S Y F D L M Y N E N R F L D A V E N I V N K E S Y I L D G I Y C N F P D M N S Y D E S E H F E G V E F A V G Y P P D E D D I V I V S E E T C F E Y V R L A C E K Y L Q L H P E D T E K V N K L L S K I P S A G H H H H H H"]

sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
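The example sequences are already written with a single space between residues. As a rough sketch, a raw sequence string without spaces could be brought into the same format as follows. `prepare_sequence` is a hypothetical helper introduced here for illustration, not part of ProtTrans.

```python
import re


def prepare_sequence(raw: str) -> str:
    """Space-separate the residues and map rare amino acids (U, Z, O, B) to X."""
    # Drop any existing whitespace and normalize to upper case.
    residues = "".join(raw.split()).upper()
    # Insert one space between consecutive residues.
    spaced = " ".join(residues)
    # Same substitution as in the notebook cell.
    return re.sub(r"[UZOB]", "X", spaced)


print(prepare_sequence("MGAEUZ"))  # M G A E X X
```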

Make predictions

pipeline(sequences_Example)

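A token-classification pipeline returns one list per input sequence, with one dictionary per token. The sketch below collapses such a list into a secondary-structure string. It assumes that each dictionary stores the predicted label under the key `entity` and that the three-state labels are single letters such as `H`, `E`, and `C`. The sample predictions are made up for illustration and are not real model output.

```python
from typing import Dict, List


def to_structure_string(token_predictions: List[Dict]) -> str:
    """Concatenate the per-token labels into one string, in token order."""
    return "".join(str(prediction["entity"]) for prediction in token_predictions)


# Invented predictions for a four-residue sequence (not real model output).
mock_prediction = [
    {"word": "M", "entity": "C", "score": 0.98},
    {"word": "G", "entity": "H", "score": 0.91},
    {"word": "A", "entity": "H", "score": 0.95},
    {"word": "E", "entity": "E", "score": 0.72},
]

print(to_structure_string(mock_prediction))  # CHHE
```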

Generate Section
ProtTrans/Generate/ProtXLNet.ipynb

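Unlike the ProtBert example above, the ProtXLNet workflow maps rare amino acids to <unk> instead of X.

Load the necessary libraries, including Hugging Face transformers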

import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer,pipeline
import re
import os
import requests
from tqdm.auto import tqdm

Load the vocabulary and the ProtXLNet model

tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)


model = XLNetLMHeadModel.from_pretrained("Rostlab/prot_xlnet")

Move the model to the GPU if one is available, and switch it to evaluation mode

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


model = model.to(device)
model = model.eval()

Create or load the sequence, and map the rarely occurring amino acids (U, Z, O, B) to (unk)

sequences_Example = "A E T C Z A O"
sequences_Example = re.sub(r"[UZOB]", "<unk>", sequences_Example)
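To make the substitution concrete, the sketch below applies the same regular expression to the example string and prints the result. It uses only the standard `re` module; the variable names are chosen for illustration.

```python
import re

# The example sequence contains two rare residues: Z and O.
sequence = "A E T C Z A O"

# Replace each rare residue with the tokenizer's unknown-token string.
mapped = re.sub(r"[UZOB]", "<unk>", sequence)

print(mapped)  # A E T C <unk> A <unk>
```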

Tokenize and encode the sequence

ids = tokenizer.encode(sequences_Example, add_special_tokens=False)
input_ids = torch.tensor(ids).unsqueeze(0).to(device)

Generate protein sequences

max_length = 100
temperature = 1.0
k = 0
p = 0.9
repetition_penalty = 1.0
num_return_sequences = 3

output_ids = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences,
    )
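With `top_k=0` and `top_p=0.9`, sampling is limited by cumulative probability rather than by a fixed number of candidates (nucleus sampling). The pure-Python sketch below illustrates the idea on a toy distribution. It is not the transformers implementation, and the function name `nucleus_filter` and the probabilities are invented for illustration.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    # Visit tokens from most to least probable.
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy next-residue distribution (made-up numbers).
toy_distribution = {"A": 0.5, "L": 0.3, "G": 0.15, "K": 0.05}
filtered = nucleus_filter(toy_distribution, top_p=0.9)
print(sorted(filtered))  # ['A', 'G', 'L']
```

In this toy case the least likely residue `K` is dropped before sampling, because the first three candidates already cover at least 90% of the probability mass.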

output_sequences = [" ".join(" ".join(tokenizer.decode(output_id)).split()) for output_id in output_ids]

print('Generated Sequences\n')
for output_sequence in output_sequences:
  print(output_sequence)
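The list comprehension above normalizes the decoded text so that residues are separated by single spaces, whether or not the decoded string already contains spaces. The sketch below reproduces that string manipulation on plain strings, without a tokenizer. `respace` is a hypothetical name used only here.

```python
def respace(decoded: str) -> str:
    """Put a space between every character, then collapse repeated whitespace."""
    return " ".join(" ".join(decoded).split())


print(respace("AETC"))     # A E T C
print(respace("A E T C"))  # A E T C
print(respace("A<unk>"))   # A < u n k >
```

As the last line shows, this step also splits any multi-character token such as `<unk>` into single characters, which is worth keeping in mind when reading the generated sequences.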
