Predicting Catalytic Reaction Yields in Chemistry: Modeling SMILES with a Transformer (Datawhale AI Summer Camp)

In the previous section we studied RNNs and saw their shortcomings when handling long sequences. In this section we explore how the Transformer model addresses those problems.

Introduction to the Transformer Model

Why the Transformer Was Created

Earlier sequence-to-sequence approaches have some inherent limitations:

  • Recurrent neural networks: all preceding context is compressed into a single hidden vector, so as the sequence grows longer, early context is gradually forgotten.
  • Convolutional neural networks: the limited receptive field is an inherent weakness for long text; many stacked convolution layers are needed to cover a long sequence.
Advantages of the Transformer

The Transformer models global dependencies in a sequence entirely through attention. This structure is highly parallelizable, which greatly improves computational efficiency.

The Basic Transformer Architecture

The basic Transformer architecture consists of the following components:

  1. Embedding layer: maps each token to a vector representation. Unlike an RNN, the Transformer adds a positional encoding (Positional Encoding) to the token embeddings.

    The positional encoding is usually built from sine and cosine functions of different frequencies:

    $$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

  2. Self-attention layer: self-attention introduces queries (Query), keys (Key), and values (Value). It is computed as:

    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

    (A short PyTorch sketch of this formula and of the positional encoding follows this list.)

  3. Feed-forward layer: processes the input at each position with linear layers.

  4. Residual connections: mitigate the vanishing-gradient problem that can arise when optimizing very deep networks.

  5. Layer normalization (LayerNorm): keeps the inputs and outputs of each layer within a stable range.
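Below is a minimal PyTorch sketch (not part of the original post) of the two formulas above: sinusoidal positional encodings and scaled dot-product self-attention.

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # [seq_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                  # [d_model/2]
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

# Toy usage: a batch of 2 sequences, 5 tokens each, model dimension 16.
x = torch.randn(2, 5, 16) + sinusoidal_positional_encoding(5, 16)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)                              # torch.Size([2, 5, 16])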

Encoding with the Transformer Encoder

The encoder part of the Transformer encodes the input sequence into a vector, which is then passed through linear layers to output a single value. We expect that, after training, this output value is the yield of the chemical reaction.

Ways to Improve Model Performance

  1. Tune the number of epochs: longer training generally yields better performance, but it can also lead to overfitting.
  2. Adjust the model size: the hidden dimension, the number of layers, and the number of attention heads.
  3. Data cleaning and augmentation: cleaning and augmenting the data can improve the model's generalization.
  4. Learning-rate scheduling: adjust the learning rate dynamically during training.
  5. Ensembling: train several models and combine their predictions to improve stability and performance (see the averaging sketch after this list).
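As a small illustration of item 5 (a sketch under assumptions, not part of the original baseline), suppose several trained copies of the model described below are held in a list `models`; their predictions over a shared DataLoader can simply be averaged:

import torch

def ensemble_predict(models, test_loader, device):
    # Average the predictions of several already-trained models over the same test loader.
    all_preds = []
    for model in models:
        model.eval()
        preds = []
        with torch.no_grad():
            for src, _ in test_loader:
                preds.append(model(src.to(device)).cpu())
        all_preds.append(torch.cat(preds))
    return torch.stack(all_preds).mean(dim=0)  # simple mean ensemble

The function name and the `models` list are illustrative; `test_loader` is assumed to behave like the DataLoader defined in the prediction code later in this post.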

Hands-On Section

Environment Setup
!pip install torchtext
!pip install pandas
Data Processing

We define a custom tokenizer and vocabulary that split each SMILES string into tokens and replace them with their indices in the vocabulary.

import pandas as pd
from torch.utils.data import Dataset, DataLoader, Subset
from typing import List, Tuple
import re
import torch
import torch.nn as nn
import time
import torch.optim as optim

class Smiles_tokenizer():
    def __init__(self, pad_token, regex, vocab_file, max_length):
        self.pad_token = pad_token
        self.regex = regex
        self.vocab_file = vocab_file
        self.max_length = max_length

        # Load the vocabulary: one token per line, the line index is the token id.
        with open(self.vocab_file, "r") as f:
            lines = f.readlines()
        lines = [line.strip("\n") for line in lines]
        vocab_dic = {}
        for index, token in enumerate(lines):
            vocab_dic[token] = index
        self.vocab_dic = vocab_dic

    def _regex_match(self, smiles):
        # Tokenize each SMILES string with the regex; any character the pattern misses
        # falls back to a single-character token, and sequences are truncated to max_length.
        regex_string = r"(" + self.regex + r"|"
        regex_string += r".)"
        prog = re.compile(regex_string)

        tokenised = []
        for smi in smiles:
            tokens = prog.findall(smi)
            if len(tokens) > self.max_length:
                tokens = tokens[:self.max_length]
            tokenised.append(tokens)
        return tokenised
    
    def tokenize(self, smiles):
        tokens = self._regex_match(smiles)
        # Wrap each sequence with <CLS> ... <SEP>, pad to the batch maximum length,
        # then map every token to its vocabulary index.
        tokens = [["<CLS>"] + token + ["<SEP>"] for token in tokens]
        tokens = self._pad_seqs(tokens, self.pad_token)
        token_idx = self._pad_token_to_idx(tokens)
        return tokens, token_idx

    def _pad_seqs(self, seqs, pad_token):
        pad_length = max([len(seq) for seq in seqs])
        padded = [seq + ([pad_token] * (pad_length - len(seq))) for seq in seqs]
        return padded

    def _pad_token_to_idx(self, tokens):
        # Map tokens to ids; tokens not in the vocabulary are appended to it and logged to a file.
        idx_list = []
        new_vocab = []
        for token in tokens:
            tokens_idx = []
            for i in token:
                if i in self.vocab_dic.keys():
                    tokens_idx.append(self.vocab_dic[i])
                else:
                    new_vocab.append(i)
                    self.vocab_dic[i] = max(self.vocab_dic.values()) + 1
                    tokens_idx.append(self.vocab_dic[i])
            idx_list.append(tokens_idx)

        with open("../new_vocab_list.txt", "a") as f:
            for i in new_vocab:
                f.write(i)
                f.write("\n")

        return idx_list

    def _save_vocab(self, vocab_path):
        with open(vocab_path, "w") as f:
            for i in self.vocab_dic.keys():
                f.write(i)
                f.write("\n")
        print("update new vocab!")

def read_data(file_path, train=True):
    df = pd.read_csv(file_path)
    reactant1 = df["Reactant1"].tolist()
    reactant2 = df["Reactant2"].tolist()
    product = df["Product"].tolist()
    additive = df["Additive"].tolist()
    solvent = df["Solvent"].tolist()
    if train:
        react_yield = df["Yield"].tolist()
    else:
        react_yield = [0 for i in range(len(reactant1))]
    
    # Build the model input as "reactant1.reactant2>product".
    # Additive and Solvent are read above but not included in this baseline input.
    input_data_list = []
    for react1, react2, prod, addi, sol in zip(reactant1, reactant2, product, additive, solvent):
        input_info = ".".join([react1, react2])
        input_info = ">".join([input_info, prod])
        input_data_list.append(input_info)
    output = [(react, y) for react, y in zip(input_data_list, react_yield)]
    return output

class ReactionDataset(Dataset):
    def __init__(self, data: List[Tuple[List[str], float]]):
        self.data = data
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
    
def collate_fn(batch):
    REGEX = r"\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9]"
    # Tokenize the whole batch (max length 300) and return a padded id tensor plus the yield tensor.
    tokenizer = Smiles_tokenizer("<PAD>", REGEX, "../vocab_full_new.txt", 300)
    smi_list = []
    yield_list = []
    for i in batch:
        smi_list.append(i[0])
        yield_list.append(i[1])
    tokenizer_batch = torch.tensor(tokenizer.tokenize(smi_list)[1])
    yield_list = torch.tensor(yield_list)
    return tokenizer_batch, yield_list
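
As a quick sanity check of the data pipeline (not in the original post, and assuming ../dataset/round1_train_data.csv and ../vocab_full_new.txt exist at the paths used above), one collated batch can be inspected like this:

# Inspect one collated batch of the pipeline defined above.
sample_data = read_data("../dataset/round1_train_data.csv")
sample_loader = DataLoader(ReactionDataset(sample_data[:4]), batch_size=2, collate_fn=collate_fn)
src, y = next(iter(sample_loader))
print(src.shape, y.shape)  # expected: torch.Size([2, seq_len]) and torch.Size([2])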
Defining the Transformer Model
class TransformerEncoderModel(nn.Module):
    def __init__(self, input_dim, d_model, num_heads, fnn_dim, num_layers, dropout):
        super().__init__()
        # Token embedding (note: this baseline adds no explicit positional encoding).
        self.embedding = nn.Embedding(input_dim, d_model)
        self.layerNorm = nn.LayerNorm(d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, 
                                                        nhead=num_heads, 
                                                        dim_feedforward=fnn_dim,
                                                        dropout=dropout,
                                                        batch_first=True,
                                                        norm_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, 
                                                         num_layers=num_layers,
                                                         norm=self.layerNorm)
        self.dropout = nn.Dropout(dropout)
        # Regression head: map the pooled representation to a single predicted yield.
        self.lc = nn.Sequential(nn.Linear(d_model, 256),
                                nn.Sigmoid(),
                                nn.Linear(256, 96),
                                nn.Sigmoid(),
                                nn.Linear(96, 1))

    def forward(self, src):
        # src: [batch_size, seq_len] token ids
        embedded = self.dropout(self.embedding(src))
        outputs = self.transformer_encoder(embedded)
        # Use the hidden state of the first token (<CLS>) as the sequence representation.
        z = outputs[:,0,:]
        outputs = self.lc(z)
        return outputs.squeeze(-1)
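
A quick shape check (again not part of the original code) with random token ids confirms that the model returns one predicted value per sequence:

# Feed random token ids through the model and inspect the output shape.
toy_model = TransformerEncoderModel(292, 512, 4, 1024, 4, 0.2)
dummy_src = torch.randint(0, 292, (2, 50))   # a batch of 2 sequences, 50 token ids each
print(toy_model(dummy_src).shape)            # expected: torch.Size([2])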
Training the Model
def adjust_learning_rate(optimizer, epoch, start_lr):
    # Step decay: divide the learning rate by 10 every 3 epochs.
    # (Shown as an alternative; the training loop below uses ReduceLROnPlateau instead.)
    lr = start_lr * (0.1 ** (epoch // 3))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

def train():
    N = 10
    INPUT_DIM = 292
    D_MODEL = 512
    NUM_HEADS = 4
    FNN_DIM = 1024
    NUM_LAYERS = 4
    DROPOUT = 0.2
    CLIP = 1
    N_EPOCHS = 40
    LR = 1e-4
    
    start_time = time.time()
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    data = read_data("../dataset/round1_train_data.csv")
    dataset = ReactionDataset(data)
    subset_indices = list(range(N))
    subset_dataset = Subset(dataset, subset_indices)  # small debug subset (not used by the loader below)
    train_loader = DataLoader(dataset, batch_size=128, shuffle=True, collate_fn=collate_fn)

    model = TransformerEncoderModel(INPUT_DIM, D_MODEL, NUM_HEADS, FNN_DIM, NUM_LAYERS, DROPOUT)
    model = model.to(device)
    model.train()
    
    optimizer = optim.AdamW(model.parameters(), lr=LR, weight_decay=0.01)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
    criterion = nn.MSELoss()

    best_valid_loss = 10  # tracks the lowest epoch-average training loss seen so far
    for epoch in range(N_EPOCHS):
        epoch_loss = 0
        for i, (src, y) in enumerate(train_loader):
            src, y = src.to(device), y.to(device)
            optimizer.zero_grad()
            output = model(src)
            loss = criterion(output, y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
            optimizer.step()
            epoch_loss += loss.detach().item()
            
            if i % 50 == 0:
                print(f'Step: {i} | Running Train Loss: {epoch_loss/(i+1):.4f}')
                
        scheduler.step(epoch_loss / len(train_loader))
        loss_in_a_epoch = epoch_loss / len(train_loader)
        print(f'Epoch: {epoch+1:02} | Train Loss: {loss_in_a_epoch:.3f}')
        if loss_in_a_epoch < best_valid_loss:
            best_valid_loss = loss_in_a_epoch
            torch.save(model.state_dict(), '../model/transformer.pth')
    end_time = time.time()
    elapsed_time_minute = (end_time - start_time)/60
    print(f"Total running time: {elapsed_time_minute:.2f} minutes")

if __name__ == '__main__':
    train()
Generating Predictions
def predict_and_make_submit_file(model_file, output_file):
    INPUT_DIM = 292
    D_MODEL = 512
    NUM_HEADS = 4
    FNN_DIM = 1024
    NUM_LAYERS = 4
    DROPOUT = 0.2
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    test_data = read_data("../dataset/round1_test_data.csv", train=False)
    test_dataset = ReactionDataset(test_data)
    test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, collate_fn=collate_fn) 

    model = TransformerEncoderModel(INPUT_DIM, D_MODEL, NUM_HEADS, FNN_DIM, NUM_LAYERS, DROPOUT).to(device)
    # map_location makes the checkpoint loadable on CPU-only machines as well.
    model.load_state_dict(torch.load(model_file, map_location=device))
    model.eval()
    output_list = []
    for i, (src, y) in enumerate(test_loader):
        src = src.to(device)
        with torch.no_grad():
            output = model(src)
            output_list += output.detach().tolist()
    # Write the submission file: a header line followed by "test{idx},{predicted yield}".
    ans_str_lst = ['rxnid,Yield']
    for idx,y in enumerate(output_list):
        ans_str_lst.append(f'test{idx+1},{y:.4f}')
    with open(output_file,'w') as fw:
        fw.writelines('\n'.join(ans_str_lst))
    
predict_and_make_submit_file("../model/transformer.pth",
                              "../output/result.txt")

Closing Remarks

In this post we covered the basic Transformer architecture and its application to predicting chemical reaction yields. By modeling SMILES with the Transformer encoder, we can capture the features of sequence data more effectively and thereby improve predictive performance. I hope this post helps readers better understand how the Transformer is used in cheminformatics and apply what they have learned to their own projects.

If you found this post helpful, please like, bookmark, and follow me; you are also welcome to support me with a tip!

Stay tuned for my upcoming posts, where I will share more on artificial intelligence, natural language processing, and computer vision.

Thank you all for your support!
