2021SC@SDUSC
AI-based analysis of peptide drugs
Topic: protein pre-trained models (4)
Code analysis
Prediction Section
ProtTrans/Prediction/ProtBert-BFD-Predict-SS3.ipynb
Load the required libraries, including Hugging Face transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import re
Build the TokenClassificationPipeline (device=0 selects the first GPU)
pipeline = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=0
)
Output: on the first run, the notebook shows download progress bars for the model configuration, the weights (about 1.68 GB), and the tokenizer files.
Create or load the sequences, and map the rarely occurring amino acids (U, Z, O, B) to (X)
sequences_Example = ["M G A E E E D T A I L Y P F T I S G N D R N G N F T I N F K G T P N S T N N G C I G Y S Y N G D W E K I E W E G S C D G N G N L V V E V P M S K I P A G V T S G E I Q I W W H S G D L K M T D Y K A L E H H H H H H",
"M N K Y L F E L P Y E R S E P G W T I R S Y F D L M Y N E N R F L D A V E N I V N K E S Y I L D G I Y C N F P D M N S Y D E S E H F E G V E F A V G Y P P D E D D I V I V S E E T C F E Y V R L A C E K Y L Q L H P E D T E K V N K L L S K I P S A G H H H H H H"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
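The example sequences above are already written with one space between residues. As a minimal sketch, assuming the raw input is a plain string without spaces, the same preparation could be wrapped in a helper. The name `prepare_sequence` and the uppercasing step are illustrative assumptions, not part of the ProtTrans notebook.

```python
import re


def prepare_sequence(raw_sequence):
    """Hypothetical helper: normalize one raw protein sequence for the model."""
    # Drop any whitespace and uppercase the residues (uppercasing is an assumption).
    compact = "".join(raw_sequence.split()).upper()
    # Map the rare amino acids U, Z, O, B to X, as in the notebook.
    compact = re.sub(r"[UZOB]", "X", compact)
    # Separate residues with single spaces, matching the example sequences.
    return " ".join(compact)


print(prepare_sequence("MGAEUZ"))
```

Because the helper strips existing whitespace first, it also accepts input that is already space-separated.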
Run the prediction
print(pipeline(sequences_Example))
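Output: for each input sequence, the pipeline returns a list with one entry per residue, giving the predicted three-state secondary structure label and its score.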
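A TokenClassificationPipeline result is a list of dictionaries per sequence, which is awkward to read next to the input. The sketch below collapses one such list into a secondary structure string. It assumes each entry carries the keys `entity` and `index`; the helper name `labels_to_string` is hypothetical, and the sample entries are made up for illustration and are not real model output.

```python
def labels_to_string(residue_predictions):
    """Hypothetical helper: join per-residue labels into one string."""
    # Sort by token position so the labels follow the residue order.
    ordered = sorted(residue_predictions, key=lambda item: item["index"])
    return "".join(item["entity"] for item in ordered)


# Made-up entries shaped like pipeline output; the scores are not real results.
mock_prediction = [
    {"word": "M", "score": 0.9, "entity": "C", "index": 1},
    {"word": "G", "score": 0.8, "entity": "C", "index": 2},
    {"word": "A", "score": 0.7, "entity": "H", "index": 3},
]

print(labels_to_string(mock_prediction))
```

With real output, the helper would be applied to each element of the list returned by the pipeline call.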
Generate Section
ProtTrans/Generate/ProtXLNet.ipynb
Unlike the ProtBert workflow above, the XLNet model maps rare amino acids to (unk) instead of (X)
Load the required libraries, including Hugging Face transformers
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer, pipeline
import re
import os
import requests
from tqdm.auto import tqdm
Load the vocabulary and the ProtXLNet model
tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)
model = XLNetLMHeadModel.from_pretrained("Rostlab/prot_xlnet")
Move the model to the GPU if one is available, and switch it to evaluation mode
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model = model.eval()
Create or load the sequence, and map the rarely occurring amino acids (U, Z, O, B) to (unk)
sequences_Example = "A E T C Z A O"
sequences_Example = re.sub(r"[UZOB]", "<unk>", sequences_Example)
Tokenize and encode the sequence
ids = tokenizer.encode(sequences_Example, add_special_tokens=False)
input_ids = torch.tensor(ids).unsqueeze(0).to(device)
Generate protein sequences
max_length = 100
temperature = 1.0
k = 0
p = 0.9
repetition_penalty = 1.0
num_return_sequences = 3
output_ids = model.generate(
    input_ids=input_ids,
    max_length=max_length,
    temperature=temperature,
    top_k=k,
    top_p=p,
    repetition_penalty=repetition_penalty,
    do_sample=True,
    num_return_sequences=num_return_sequences,
)
output_sequences = [" ".join(" ".join(tokenizer.decode(output_id)).split()) for output_id in output_ids]
print('Generated Sequences\n')
for output_sequence in output_sequences:
    print(output_sequence)
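In the generation call, `top_k=0` turns off top-k filtering, while `temperature=1.0` and `repetition_penalty=1.0` leave the predicted distribution unchanged, so `top_p=0.9` (nucleus sampling) is the setting that actually restricts which tokens can be sampled. The sketch below illustrates the idea on a toy distribution. It is not the transformers implementation; the function name `nucleus_filter` and the probabilities are invented for illustration.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-residue distribution (invented values, not model output).
toy_probs = {"A": 0.5, "L": 0.3, "G": 0.15, "W": 0.05}
filtered = nucleus_filter(toy_probs, top_p=0.9)
print(sorted(filtered))
```

Here the least likely token is dropped because the three most likely ones already cover at least 90% of the probability mass; sampling then draws only from the remaining tokens.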