Shandong University, Class of 2019, Software Engineering Application and Practice: AI-Based Analysis of Peptide Drugs (Part 10)

2021SC@SDUSC

AI-Based Analysis of Peptide Drugs

Topic: Protein pre-trained models (4)
Code analysis

Prediction Section
ProtTrans/Prediction/ProtBert-BFD-Predict-SS3.ipynb

Load the required libraries, including Hugging Face Transformers.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import re

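Load the TokenClassificationPipeline. Note that device=0 places the model on the first GPU; pass device=-1 to run on the CPU instead.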

pipeline = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=0
)

Output: on the first run the notebook shows download progress bars for five files, the largest being 1676081643 bytes (about 1.68 GB).

Create or load the sequences, and map the rarely occurring amino acids (U, Z, O, B) to (X).

sequences_Example = ["M G A E E E D T A I L Y P F T I S G N D R N G N F T I N F K G T P N S T N N G C I G Y S Y N G D W E K I E W E G S C D G N G N L V V E V P M S K I P A G V T S G E I Q I W W H S G D L K M T D Y K A L E H H H H H H", 
"M N K Y L F E L P Y E R S E P G W T I R S Y F D L M Y N E N R F L D A V E N I V N K E S Y I L D G I Y C N F P D M N S Y D E S E H F E G V E F A V G Y P P D E D D I V I V S E E T C F E Y V R L A C E K Y L Q L H P E D T E K V N K L L S K I P S A G H H H H H H"]

sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
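The example sequences above are already written with a space between residues. For a raw sequence string, the same preprocessing can be sketched as below. The helper name prepare_sequence is hypothetical and not part of the ProtTrans notebook; it only combines the space-separation seen in the examples with the re.sub call used above.

```python
import re

def prepare_sequence(raw_sequence: str) -> str:
    """Space-separate residues and map rare amino acids (U, Z, O, B) to X.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    # Drop any whitespace already present, so both "MKU" and "M K U" work.
    residues = "".join(raw_sequence.split())
    # Put exactly one space between residues, as in the example sequences.
    spaced = " ".join(residues)
    # Same substitution as in the notebook code.
    return re.sub(r"[UZOB]", "X", spaced)

print(prepare_sequence("MKUZAOB"))  # M K X X A X X
```

Input that is already space-separated passes through unchanged apart from the rare-residue mapping.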

Run the prediction

print(pipeline(sequences_Example))
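The pipeline returns one prediction per residue. A minimal sketch of collapsing those predictions into a three-state secondary-structure string follows. It assumes each prediction is a dict with an "entity" key holding the label (for an SS3 model, presumably one of H, E, C); the mock_output data is made up for illustration and is not real model output, and the helper name labels_to_string is hypothetical.

```python
def labels_to_string(token_predictions):
    """Join the per-residue labels of one sequence into a single string.

    Assumes each item is a dict with an "entity" key, which is how
    token-classification results are commonly shaped; check the actual
    output of your transformers version before relying on this.
    """
    return "".join(item["entity"] for item in token_predictions)

# Made-up predictions for a four-residue sequence (illustration only).
mock_output = [
    {"word": "M", "entity": "C", "score": 0.98},
    {"word": "G", "entity": "C", "score": 0.95},
    {"word": "A", "entity": "H", "score": 0.91},
    {"word": "E", "entity": "H", "score": 0.93},
]

print(labels_to_string(mock_output))  # CCHH
```

When a list of sequences is passed to the pipeline, the result is expected to be a list with one such list per sequence, so the helper would be applied to each element in turn.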

Output
[Figure: prediction output]

Generate Section
ProtTrans/Generate/ProtXLNet.ipynb

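Compared with the ProtBert example above, the difference for the XLNet model is that rare amino acids are mapped to <unk> instead of X.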

Load the required libraries, including Hugging Face Transformers.

import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer, pipeline
import re
import os
import requests
from tqdm.auto import tqdm

Load the vocabulary and the ProtXLNet model.

tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)


model = XLNetLMHeadModel.from_pretrained("Rostlab/prot_xlnet")

Move the model to the GPU if one is available.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


model = model.to(device)
model = model.eval()

Create or load the sequence, and map the rarely occurring amino acids (U, Z, O, B) to <unk>.

sequences_Example = "A E T C Z A O"
sequences_Example = re.sub(r"[UZOB]", "<unk>", sequences_Example) 

Tokenize and encode

ids = tokenizer.encode(sequences_Example, add_special_tokens=False)
input_ids = torch.tensor(ids).unsqueeze(0).to(device)

Generate protein sequences

max_length = 100
temperature = 1.0
k = 0
p = 0.9
repetition_penalty = 1.0
num_return_sequences = 3

output_ids = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences,
    )
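With these settings, top_k=0 switches off top-k filtering, and temperature=1.0 and repetition_penalty=1.0 leave the scores unchanged, so sampling is governed by top_p=0.9 (nucleus sampling). The idea can be sketched in plain Python as below. This is a simplified illustration, not the transformers implementation, and the function names and the toy distribution are assumptions made for the example.

```python
def top_p_filter(probs, p):
    """Return the indices kept by nucleus (top-p) filtering.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches p.
    """
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def renormalize(probs, kept):
    """Rescale the kept probabilities so that they sum to 1."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over four tokens (illustration only).
toy_probs = [0.5, 0.3, 0.15, 0.05]
kept = top_p_filter(toy_probs, 0.9)
print(kept)  # [0, 1, 2]
print(renormalize(toy_probs, kept))
```

The least likely token is dropped and the next residue is sampled from the remaining three, which keeps generation varied while avoiding the low-probability tail.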

output_sequences = [" ".join(" ".join(tokenizer.decode(output_id)).split()) for output_id in output_ids]

print('Generated Sequences\n')
for output_sequence in output_sequences:
  print(output_sequence)
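The nested join in the decoding step normalizes spacing: the inner join puts a space between every character of the decoded string, and split() followed by the outer join collapses any runs of whitespace. A small sketch on stand-in strings (not real tokenizer output) shows the effect; the helper name respace is hypothetical.

```python
def respace(decoded: str) -> str:
    """Apply the same join/split/join expression used in the decoding step."""
    return " ".join(" ".join(decoded).split())

# Stand-in strings for what tokenizer.decode might return.
print(respace("AETC"))     # A E T C
print(respace("A E T C"))  # A E T C
print(respace("<unk>"))    # < u n k >
```

As the last line shows, any special token left in the decoded text would be split character by character, so it may be worth checking whether the decoded strings contain such tokens before applying the expression.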
