[Reading Notes] Variant Effect Predictor: Tranception Autoregressive Prediction + the ProteinGym Benchmark Dataset

Lasgalena

Last modified: 2024-10-06 14:55:49


First published: 2024-08-20 22:41:20

Article link: https://blog.csdn.net/weixin_44728829/article/details/141370211


TL;DR:

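This series reviews a line of work from Debora Marks's group on predicting pathogenic mutations with deep generative models, including EVE, Tranception, TranceptEVE, EVEscape, and popEVE. It focuses on data sources and processing, model architecture and training, and performance evaluation with case studies.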

Model	Venue	Year	Available	Summary
EVE	Nature	2021	https://github.com/OATML-Markslab/EVE https://evemodel.org/	VAE model; takes an MSA as input and outputs an evolutionary score for each mutation (wild-type ELBO minus mutant ELBO)
Tranception	ICML	2022	https://github.com/OATML-Markslab/Tranception	Autoregressive model with fused k-mer inputs
TranceptEVE	NeurIPS	2022		Tranception + EVE, weighted average
StructSeq	NeurIPS	2023		TranceptEVE + ESM-IF1
EVEscape	Nature	2023	https://github.com/OATML-Markslab/EVEscape	EVE + biophysical features
popEVE	medRxiv	2024	https://github.com/debbiemarkslab/popEVE https://pop.evemodel.org/	EVE + ESM-1v

Completed:
[Reading Notes] Variant Effect Predictor: EVE, a Deep Generative Model for Predicting Pathogenic Mutations

Tranception

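Alignment-based models cannot predict the effects of indels (insertions/deletions). They also struggle with disordered protein regions, which either cannot be aligned or do not retrieve enough homologous sequences. They are sensitive to the hyperparameters used to build the MSA, and models trained on different MSAs share no information with one another.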

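Existing alignment-free models, however, perform comparatively worse, cannot compute the log-likelihood needed to assess mutation effects (especially for multiple mutations), and likewise cannot handle indels. Tranception is proposed to address these problems.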

Data

Training data

The training data come from UniRef100 (∼250 million protein sequences): 99% of the data (∼249 million sequences) is used for training, and 1% (∼2.5 million sequences) is set aside for validation.

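Data cleaning:

Remove all singletons at the UniRef50 level.
Exclude from both the training and validation sets every sequence that contains the uncommon amino acids pyrrolysine (O) or selenocysteine (U), or two or more consecutive indeterminate amino acids X.
The remaining indeterminate amino acids (X, B, J, Z) are kept for training and imputed at random as follows: X as any of the 20 standard amino acids, B as D (aspartic acid) or N (asparagine), J as I (isoleucine) or L (leucine), and Z as E (glutamic acid) or Q (glutamine). All sequences containing indeterminate amino acids are excluded from the validation set.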
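
The cleaning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `keep_sequence` and `impute` are hypothetical and do not come from the Tranception codebase, and the UniRef50 singleton filter is not shown.

```python
import random

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"
# Ambiguity codes and the residues each one may stand for, as listed above.
AMBIGUOUS = {"X": STANDARD_AA, "B": "DN", "J": "IL", "Z": "EQ"}


def keep_sequence(seq: str) -> bool:
    """Return False for sequences that the cleaning rules exclude (hypothetical helper)."""
    if "O" in seq or "U" in seq:  # pyrrolysine / selenocysteine
        return False
    if "XX" in seq:  # two or more consecutive X
        return False
    return True


def impute(seq: str, rng: random.Random) -> str:
    """Replace each ambiguity code with a randomly chosen compatible residue."""
    return "".join(rng.choice(AMBIGUOUS[aa]) if aa in AMBIGUOUS else aa for aa in seq)


rng = random.Random(0)
print(keep_sequence("MKTUAY"))  # False: contains U
print(keep_sequence("MKXXAY"))  # False: two consecutive X
print(impute("MKXBJZAY", rng))  # same length, only standard residues
```

Because the imputation is random, the same training sequence can be seen with different residues at its ambiguous positions across epochs.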

ProteinGym

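ProteinGym is a collection of benchmark datasets for comparing how well models predict the effects of protein mutations. It is split by mutation type (substitutions vs. indels), data source (DMS assays vs. clinical annotations), and training regime (unsupervised vs. supervised).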

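Validation set used to decide between architecture options: 10 assays covering different taxa (3 viral proteins, 4 human and other eukaryotic proteins, 3 prokaryotic proteins) and MSA depths (3 low, 4 medium, 3 high), including one assay with multiple mutants (consistent with the proportion of multiple-mutant assays in ProteinGym).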

• BLAT_ECOLX (Jacquier et al., 2013)
• CALM1_HUMAN (Weile et al., 2017)
• CCDB_ECOLI (Tripathi et al., 2016)
• DLG4_RAT (McLaughlin Jr et al., 2012)
• PA_I34A1 (Wu et al., 2015)
• Q2N0S5_9HIV1 (Haddox et al., 2018)
• RL401_YEAST (Roscoe et al., 2013)
• SPG1_STRSG (Olson et al., 2014)
• SPIKE_SARS2 (Starr et al., 2020)
• TPOR_HUMAN (Bridgford et al., 2020)

Model

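The code analysis below is based on https://github.com/OATML-Markslab/Tranception; the code in ProteinGym-proteingym-baselines-tranception is essentially the same. The public code does not include a train function, so training details cannot be discussed here. The model weights can be downloaded via the link, and the archive contains a config.json.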

Tranception builds on GPT2 and incorporates ideas from Primer (So et al., 2021) and Inception (Szegedy et al., 2014). Ablation studies demonstrate the contributions of k-mer fusion, model size, Squared ReLU, and Grouped ALiBi (B.1, Table 5).

Tranception Attention

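A k-mer is a contiguous subsequence of k elements (typically nucleotides or amino acids) and is widely used in sequence analysis. Most protein language models extract information only at the level of individual amino acids, whereas Tranception attention uses 1/3/5/7-mers as input.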

Tranception's vocabulary is still the 20 standard amino acids; k-mers are extracted with convolution kernels of different sizes:

import torch
from torch import nn


class SpatialDepthWiseConvolution(nn.Module):
    def __init__(self, head_dim: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise convolution (groups=head_dim), padded by kernel_size - 1 on both sides
        self.conv = nn.Conv1d(in_channels=head_dim, out_channels=head_dim, kernel_size=(kernel_size,), padding=(kernel_size - 1,), groups=head_dim)

    def forward(self, x: torch.Tensor):
        batch_size, heads, seq_len, head_dim = x.shape
        # Move seq_len to the last axis and merge batch and heads, as Conv1d expects
        x = x.permute(0, 1, 3, 2).contiguous()
        x = x.view(batch_size * heads, head_dim, seq_len)
        x = self.conv(x)
        # Drop the trailing kernel_size - 1 outputs so each position only sees itself and earlier positions
        if self.kernel_size > 1:
            x = x[:, :, :-(self.kernel_size - 1)]
        x = x.view(batch_size, heads, head_dim, seq_len)
        x = x.permute(0, 1, 3, 2)
        return x


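# Excerpt from the attention module's __init__: one depthwise convolution per kernel size for Q, K, and V
for kernel_idx, kernel in enumerate([3,5,7]):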
    self.query_depthwiseconv[str(kernel_idx)] = SpatialDepthWiseConvolution(self.head_dim,kernel)
    self.key_depthwiseconv[str(kernel_idx)]   = SpatialDepthWiseConvolution(self.head_dim,kernel)
    self.value_depthwiseconv[str(kernel_idx)] = SpatialDepthWiseConvolution(self.head_dim,kernel)
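
To see why padding by kernel_size - 1 on both sides and then dropping the last kernel_size - 1 outputs keeps the convolution causal, the following pure-Python sketch reproduces the same arithmetic for a single channel. The function name `causal_conv1d` and the toy weights are illustrative assumptions, not part of the Tranception codebase.

```python
def causal_conv1d(x, weights):
    """Single-channel cross-correlation that pads both ends with k-1 zeros
    and then drops the last k-1 outputs, like the forward pass above."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(x) + [0.0] * (k - 1)
    full = [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(padded) - k + 1)
    ]
    return full[: len(full) - (k - 1)] if k > 1 else full


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 1.0, 1.0]  # toy 3-mer filter: sums the current and the two previous positions
y = causal_conv1d(x, w)
print(y)  # [1.0, 3.0, 6.0, 9.0, 12.0]
```

The output has the same length as the input, and position t only depends on positions t-k+1 to t, so a k-mer summary never leaks information from future residues into the autoregressive prediction.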

Squared ReLU

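As the name suggests, Squared ReLU is the square of the ReLU function. It is applied between the two Conv1D layers (c_fc and c_proj) of the MLP block.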

def squared_relu(x):
    """
    Squared ReLU variant that is fastest with Pytorch.
    """
    x = nn.functional.relu(x)
    return x*x

class TranceptionBlockMLP(nn.Module):
    def __init__(self, intermediate_size, config):
        super().__init__()
        embed_dim = config.hidden_size
        self.c_fc = Conv1D(intermediate_size, embed_dim)
        self.c_proj = Conv1D(embed_dim, intermediate_size)
        self.act = tranception_ACT2FN[config.activation_function]
        self.dropout = nn.Dropout(config.resid_pdrop)
    
    def forward(self, hidden_states):
        hidden_states = self.c_fc(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.c_proj(hidden_states)
        hidden_states = self.dropout(hidden_states)
        return hidden_states

Grouped ALiBi

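Attention with Linear Biases (ALiBi) was proposed in an ICLR 2022 paper. ALiBi uses no learned or sinusoidal position embeddings; instead, it adds a linear bias to the attention scores based on the distance between the query and the key, which improves the model's ability to handle long sequences: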

$\operatorname{softmax}\left(\mathbf{q}_i \mathbf{K}^{\top}+m \cdot[-(i-1), \ldots,-2,-1,0]\right)$

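The slope $m$ is a head-specific scalar that stays fixed during training. In the ALiBi paper, the number of heads $n$ is always a power of 2 and the slopes start at $2^{\frac{-8}{n}}$; see the function get_slopes_power_of_2, which returns a geometric sequence of length $n$ with common ratio $2^{\frac{-8}{n}}$, running from $2^{\frac{-8}{n}}$ to $2^{-8}$.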

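Tranception follows GPT2, so $n$ is not always a power of 2. The code therefore rounds down with math.floor() to the largest power of 2 not exceeding $n$, denoted $n^{'}$. The missing slopes are filled in by taking every other element, from the front, of the sequence generated for $2 n^{'}$; the elements selected this way also have common ratio $2^{\frac{-8}{n^{'}}}$. In addition, to match the four different k-mer inputs of the attention module, Tranception introduces Grouped ALiBi: the heads are split into four groups, $m$ is computed as usual within a group, and all four groups receive the same geometric sequence.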

def get_slopes(n, mode="standard_alibi", verbose=False):
    """
    Function to compute the m constant for each attention head. Code has been adapted from the official ALiBi codebase at:
    https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py
    """
    def get_slopes_power_of_2(n):
        start = (2**(-2**-(math.log2(n)-3)))
        ratio = start
        return [start*ratio**i for i in range(n)]
    if mode == "grouped_alibi":
        n = n // 4
    if math.log2(n).is_integer():
        result = get_slopes_power_of_2(n)                   
    else:
        # Workaround when the number of heads is not a power of 2
        closest_power_of_2 = 2**math.floor(math.log2(n))  
        result = get_slopes_power_of_2(closest_power_of_2) + get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]
    if mode == "grouped_alibi":
        result = result * 4
        if verbose:
            print("ALiBi slopes: {}".format(result))
    return result
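
As a worked example of these rules, the sketch below re-implements the same logic in closed form and prints the slopes for 8 and 12 heads. The function names are hypothetical, and 12 heads is chosen only for illustration (it is the GPT-2 small head count).

```python
import math


def power_of_2_slopes(n):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    return [2.0 ** (-8.0 * (i + 1) / n) for i in range(n)]


def slopes(n, grouped=False):
    if grouped:
        n = n // 4  # each of the four groups gets n/4 heads
    if math.log2(n).is_integer():
        result = power_of_2_slopes(n)
    else:
        base = 2 ** math.floor(math.log2(n))  # largest power of 2 below n
        extra = power_of_2_slopes(2 * base)[0::2][: n - base]
        result = power_of_2_slopes(base) + extra
    return result * 4 if grouped else result


print(slopes(8))                 # 2^-1, 2^-2, ..., 2^-8
print(slopes(12)[8:])            # the four fill-in slopes 2^-0.5, 2^-1.5, 2^-2.5, 2^-3.5
print(slopes(12, grouped=True))  # [0.0625, 0.00390625, 0.25] repeated four times
```

With 12 heads in grouped mode, each group has 3 heads: two slopes come from the power-of-2 sequence for 2 heads and the third is the first element of the sequence for 4 heads.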

Usage:

maxpos = config.n_positions
attn_heads = config.n_head
# Compute the slopes m
self.slopes = torch.Tensor(get_slopes(attn_heads, mode=self.position_embedding))
# Multiply by the position index
alibi = self.slopes.unsqueeze(1).unsqueeze(1) * torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1)
alibi = alibi.view(attn_heads, 1, maxpos)

The resulting alibi is then simply added to attn_weights.

attn_weights = attn_weights + alibi[:,:,:attn_weights.size(-1)]
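
Note that the snippet above multiplies each slope by the absolute key position $j$ rather than by the relative offset $-(i-j)$ that appears in the formula. The two are interchangeable because softmax is unchanged when a constant is added to a whole row: $m \cdot j = -m(i-j) + m \cdot i$, and $m \cdot i$ is constant for a given query $i$. The sketch below checks this numerically in plain Python for one head with a causal mask; the scores and the slope are made-up toy values, and `causal_attention` is a hypothetical helper.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores, m, relative):
    """Row-wise softmax over keys 0..i with an ALiBi bias added.
    relative=True uses -m * (i - j); relative=False uses m * j."""
    probs = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # causal mask: query i only attends to keys 0..i
        if relative:
            biased = [s - m * (i - j) for j, s in enumerate(visible)]
        else:
            biased = [s + m * j for j, s in enumerate(visible)]
        probs.append(softmax(biased))
    return probs


scores = [[0.3, 0.0, 0.0], [1.2, -0.5, 0.0], [0.1, 0.7, -0.4]]  # toy q.k values
a = causal_attention(scores, m=0.25, relative=True)
b = causal_attention(scores, m=0.25, relative=False)
print(all(math.isclose(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))  # True
```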

Compared with learned position encodings, Grouped ALiBi reduces the number of parameters, converges faster during training, and yields better downstream performance.

Scoring protein sequences

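An autoregressive model can compute the log-likelihood of a full sequence, so the fitness $F_x$ of a given mutant $x^{mut}$ is computed as the log-likelihood ratio between the mutant and the wild type: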

$P(x)=\prod_{i=1}^l P\left(x_i \mid x_1, \ldots, x_{i-1}\right)=\prod_{i=1}^l P\left(x_i \mid x_{<i}\right)$

$F_x=\log \frac{P\left(x^{mut}\right)}{P\left(x^{wt}\right)}$
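
The two formulas above can be illustrated with a toy conditional distribution in place of the real network. Everything here is an assumption made for illustration: `toy_conditional` is a hypothetical stand-in for $P(x_i \mid x_{<i})$ and has nothing to do with the trained Tranception model.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_conditional(prefix):
    """Hypothetical stand-in for P(x_i | x_<i): repeats the previous residue
    with probability 0.24 and gives 0.04 to each of the other 19 residues."""
    if not prefix:
        return {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    probs = {aa: 0.04 for aa in ALPHABET}
    probs[prefix[-1]] = 0.24
    return probs


def log_likelihood(seq, conditional=toy_conditional):
    # log P(x) = sum_i log P(x_i | x_<i)
    return sum(math.log(conditional(seq[:i])[aa]) for i, aa in enumerate(seq))


def fitness(mutant, wild_type, conditional=toy_conditional):
    # F_x = log P(x_mut) - log P(x_wt)
    return log_likelihood(mutant, conditional) - log_likelihood(wild_type, conditional)


wt = "MKKLA"
print(fitness("MKRLA", wt))  # substitution at the third position
print(fitness("MKLA", wt))   # deletion: sequences of different lengths can still be compared
```

Because the score is a ratio of full-sequence likelihoods, nothing in the computation requires the mutant and the wild type to have the same length, which is what allows indels to be scored.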

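Mirrored sequences: during training, each sequence in a batch is reversed at random with probability 0.5; when scoring, the log-likelihood ratios obtained for each sequence and for its mirror image are averaged.
Scoring window: for sequences longer than the maximum context size of 1024 tokens, truncation and sliding-window averaging were observed to differ very little in average Spearman correlation on the DMS assays, so truncation is chosen for simplicity.
1. Single mutants: maximize the context available around the mutation, i.e., place the mutated position as close as possible to the middle of the 1024-token window.
2. Multiple mutants: maximize the context around the barycenter of the mutations, i.e., place the midpoint of the mutated positions as close as possible to the middle of the 1024-token window.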
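
The window selection described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `scoring_window` is a hypothetical helper, positions are 0-based, and the barycenter is assumed to be the arithmetic mean of the mutated positions.

```python
def scoring_window(mutation_positions, seq_len, max_context=1024):
    """Return (start, end) of a window of at most max_context residues,
    centred as far as possible on the mean mutated position (0-based)."""
    if seq_len <= max_context:
        return 0, seq_len  # the whole sequence fits, no truncation needed
    center = round(sum(mutation_positions) / len(mutation_positions))
    start = center - max_context // 2
    start = max(0, min(start, seq_len - max_context))  # clamp to the sequence bounds
    return start, start + max_context


print(scoring_window([100], 800))          # (0, 800)
print(scoring_window([1500], 3000))        # (988, 2012)
print(scoring_window([50], 3000))          # (0, 1024)
print(scoring_window([2000, 2900], 3000))  # (1938, 2962)
```

A single mutant is simply the special case of a one-element position list. Near either end of the protein the window is shifted rather than shrunk, so the model always sees the full 1024 positions.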

Inference-time retrieval

Retrieval at inference time is used to strengthen the autoregressive inference described in the previous section.

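Retrieve an MSA for the target sequence
1. Substitution datasets: the MSA is built only once per family.
2. Indel datasets: the MSA is customized for each sequence by removing the MSA columns that correspond to deleted positions and adding zero-filled columns at the positions where the mutant protein has insertions. At inference, inserted columns and positions with no coverage at all are ignored, and the model relies solely on its autoregressive mode to predict at those positions.
Compute the empirical amino-acid distribution at each aligned position, using pseudocounts and Laplace smoothing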

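Laplace smoothing avoids zero probabilities by adding a constant (here $\alpha=10^{-5}$) to every possible event. For example, for a sequence of length $L$ in which amino acid $i$ occurs $x_i$ times, its estimated probability is $\hat{\theta}_i=\frac{x_i+\alpha}{L+20\alpha}$.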
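
The smoothing formula can be applied to a single alignment column as in the sketch below. It rests on two assumptions that are not taken from the paper or the code: $L$ is taken to be the number of standard residues observed in the column, and gaps are simply ignored. The helper name `smoothed_distribution` is hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def smoothed_distribution(column, alpha=1e-5):
    """Laplace-smoothed amino-acid frequencies for one alignment column.
    Gaps and non-standard symbols are ignored (an assumption of this sketch)."""
    counts = Counter(aa for aa in column if aa in ALPHABET)
    total = sum(counts.values())
    denom = total + len(ALPHABET) * alpha
    return {aa: (counts[aa] + alpha) / denom for aa in ALPHABET}


column = "AAAAG-AV"  # toy MSA column: 5 A, 1 G, 1 V, 1 gap
dist = smoothed_distribution(column)
print(dist["A"], dist["G"], dist["C"])
```

Amino acids that never appear in the column, such as C here, receive a small but non-zero probability, so their logarithm stays finite.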