[Reading Summary] Variant Effect Predictor: TranceptEVE, EVEscape, popEVE

Lasgalena

Article link: https://blog.csdn.net/weixin_44728829/article/details/142727230

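TL;DR:

This series reviews a line of work from Debora Marks's lab on predicting pathogenic mutations with deep generative models: EVE, Tranception, TranceptEVE, EVEscape, and popEVE. The focus is on data sources and processing, model architecture and training, and performance evaluation with case studies.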

Model	Venue	Year	Available	Summary
EVE	Nature	2021	https://github.com/OATML-Markslab/EVE https://evemodel.org/	VAE model; takes an MSA as input and outputs an evolutionary score for each mutation (wild-type ELBO minus mutant ELBO)
Tranception	ICML	2022	https://github.com/OATML-Markslab/Tranception	Autoregressive model with fused k-mer inputs
TranceptEVE	NIPS	2022		Tranception+EVE，weighted average
StructSeq	NIPS	2023		TranceptEVE+ESM-IF1
EVEscape	Nature	2023	https://github.com/OATML-Markslab/EVEscape	EVE+Biophysical
popEVE	medRxiv	2024	https://github.com/debbiemarkslab/popEVE https://pop.evemodel.org/	EVE+ESM-1v

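Already covered:
[Reading Summary] Variant Effect Predictor: EVE, a deep generative model for predicting pathogenic mutations
[Reading Summary] Variant Effect Predictor: Tranception autoregressive prediction + the ProteinGym benchmark dataset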

TranceptEVE

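The retrieval that Tranception uses at inference time is a way of distilling information from the MSA. TranceptEVE goes one step further and uses EVE to reinforce the MSA information, combining it with autoregressive inference and retrieval inference.

Data

Evaluated on the ProteinGym benchmarks.

Model

The TranceptEVE prediction is made up of three parts: the amino acid distribution from the retrieved MSA $\log P_M$, the autoregressive prediction $\log P_T$, and the EVE prediction $\log P_E$: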

$\log P\left(x_i \mid x_{<i}\right)=\left(1-\alpha_P\right)\left[\left(1-\beta_P\right) \log P_T\left(x_i \mid x_{<i}\right)+\beta_P \log P_M\left(x_i\right)\right]+\alpha_P \log P_E\left(x_i\right)$

Bidirectional scoring is retained ($\mathrm{N} \rightarrow \mathrm{C}$ and $\mathrm{C} \rightarrow \mathrm{N}$):

$\log P(\mathbf{x})=\frac{1}{2}\left[\log P\left(\mathbf{x}_{N \rightarrow C}\right)+\log P\left(\mathbf{x}_{C \rightarrow N}\right)\right]$
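The two formulas above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function names and the toy probabilities are made up, and the sequence log-probability is assumed to be the sum of the per-position conditional log-probabilities (the usual autoregressive factorization).

```python
import numpy as np

def aggregate_log_probs(log_p_t, log_p_m, log_p_e, alpha, beta):
    """Per-position log P(x_i | x_<i): weighted mix of Tranception (T), MSA retrieval (M) and EVE (E)."""
    return (1.0 - alpha) * ((1.0 - beta) * log_p_t + beta * log_p_m) + alpha * log_p_e

def sequence_log_prob(per_position_log_probs):
    """Assumed autoregressive factorization: sum the per-position conditional log-probabilities."""
    return float(np.sum(per_position_log_probs))

def bidirectional_log_prob(log_p_n_to_c, log_p_c_to_n):
    """Average of the N->C and C->N sequence log-probabilities."""
    return 0.5 * (log_p_n_to_c + log_p_c_to_n)

# Hypothetical probabilities of the observed residue at three positions.
log_p_t_fwd = np.log([0.20, 0.10, 0.30])  # Tranception, N->C pass
log_p_t_bwd = np.log([0.25, 0.12, 0.28])  # Tranception, C->N pass
log_p_m = np.log([0.40, 0.05, 0.25])      # retrieved-MSA prior (position-wise)
log_p_e = np.log([0.30, 0.08, 0.35])      # EVE prior (position-wise)

fwd = sequence_log_prob(aggregate_log_probs(log_p_t_fwd, log_p_m, log_p_e, alpha=0.6, beta=0.3))
bwd = sequence_log_prob(aggregate_log_probs(log_p_t_bwd, log_p_m, log_p_e, alpha=0.6, beta=0.3))
score = bidirectional_log_prob(fwd, bwd)
print(score)
```

Setting $\alpha = \beta = 0$ reduces the mix to the pure autoregressive term, while $\alpha = 1$ uses EVE alone.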

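$\alpha$ and $\beta$ depend on the depth of the retrieved MSA (see the table below). If the target protein has no or very few homologs, the model relies on the autoregressive prediction; the deeper the MSA, the more weight is given to the MSA and EVE priors.

MSA depth (Nb. sequences)	$< 10$	$10^2$	$10^3$	$10^5$	$\geq 10^5$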
$\alpha$	0.0	0.3	0.6	0.7	0.8
$\beta$	0.0	0.1	0.3	0.4	0.5
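As a rough sketch, the table can be turned into a lookup function. One assumption is baked in here and is not confirmed by the table itself: the column headers $10^2$, $10^3$ and $10^5$ are read as the upper bounds of their depth bins. The names `DEPTH_UPPER_BOUNDS` and `coefficients_for_depth` are hypothetical.

```python
import bisect

# Assumption: each header is the (exclusive) upper bound of its bin; the last bin is >= 10^5.
DEPTH_UPPER_BOUNDS = [10, 10**2, 10**3, 10**5]
ALPHAS = [0.0, 0.3, 0.6, 0.7, 0.8]
BETAS = [0.0, 0.1, 0.3, 0.4, 0.5]

def coefficients_for_depth(n_sequences):
    """Return (alpha, beta) for an MSA with n_sequences sequences."""
    # bisect_right counts how many upper bounds are <= n_sequences,
    # which is exactly the index of the bin that n_sequences falls into.
    idx = bisect.bisect_right(DEPTH_UPPER_BOUNDS, n_sequences)
    return ALPHAS[idx], BETAS[idx]

for depth in (5, 50, 5000, 10**6):
    print(depth, coefficients_for_depth(depth))
```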

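Indels are handled the same way as in Tranception.

The original EVE only models well-covered regions (those with a higher-quality MSA), whereas in this work EVE scores every mutation type at every position of the target protein. TranceptEVE is also compatible with the original EVE: positions that EVE does not cover are scored from the other two log priors.

Appendix B.6 notes that a standard ensemble (averaging the log-likelihood ratios from Tranception and the delta ELBOs from EVE) performs about as well as TranceptEVE overall. However, TranceptEVE does not have to deal with the relative scaling of the Tranception and EVE scores, it retains Tranception's autoregressive nature (useful for generation in the future?) and its ability to handle indels, and it needs only one EVE model to score all mutations of a given wild type.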

EVEscape

The EVEscape score has three components:

Fitness: EVEscape adjusts the hyperparameters used by EVE for viral proteins subject to antibody escape
- sequence re-weighting in MSA (theta): 0.01, better suited to viruses
- fragment filtering (threshold_sequence_frac_gaps): keep sequences in the MSA that align to at least 50% of the target sequence.
- position filtering (threshold_focus_cols_frac_gaps): keep columns with at least 70% coverage, except for SARS-CoV-2 Spike, for which the required value is lowered to 30% in order to maximally cover experimental positions and significant pandemic sites.
  The EVE model architecture is unchanged.
Accessibility: the Weighted Contact Number (WCN), computed from the PDB file of the viral protein
Dissimilarity: the sum of the differences in charge and hydrophobicity between the mutant and the wild type
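To make the Accessibility term more concrete, here is a minimal sketch of a weighted contact number using the common inverse-squared-distance definition, $\mathrm{WCN}_i = \sum_{j \neq i} 1 / r_{ij}^2$. Which atoms EVEscape takes the coordinates from, and how it converts WCN into an accessibility value, are not stated in this post, so the function name and the toy coordinates below are illustrative only.

```python
import numpy as np

def weighted_contact_number(coords):
    """WCN_i = sum over j != i of 1 / r_ij^2, for an (n_residues, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement vectors
    sq_dist = (diff ** 2).sum(axis=-1)               # pairwise squared distances
    np.fill_diagonal(sq_dist, np.inf)                # exclude self-contacts (1/inf = 0)
    return (1.0 / sq_dist).sum(axis=1)

# Toy "structure": three residues packed together and one far away from the rest.
toy_coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
wcn = weighted_contact_number(toy_coords)
print(wcn)  # the isolated residue has by far the lowest WCN
```

A low WCN means few close neighbors, which is the sense in which it serves as a proxy for surface exposure.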

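Each term is standardized and passed through a temperature-scaled logistic function; the final EVEscape score is the log of the product of the three terms (equivalently, the sum of their logs):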

import numpy as np

def logistic(x):
    # Squash a real-valued score into (0, 1)
    return 1 / (1 + np.exp(-x))

def standardization(x):
    # Z-score across all mutations
    return (x - x.mean()) / x.std()

# Compute EVEscape scores.
# `summary`, `df_imp` and `temperatures` are defined earlier in the
# EVEscape script and are not shown in this excerpt.
summary["evescape"] = 0
summary["evescape"] += np.log(
    logistic(standardization(df_imp["evol_indices"]) / temperatures["fitness"]))
summary["evescape"] += np.log(
    logistic(standardization(df_imp["wcn_fill_r"]) / temperatures["surfacc"]))
summary["evescape"] += np.log(
    logistic(standardization(df_imp["charge_ew-hydro"]) / temperatures["exchangability"]))
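The excerpt above depends on variables defined elsewhere in the EVEscape script. A self-contained toy version may make the structure easier to see. The component values and the temperatures below are placeholders invented for illustration, not the values used by EVEscape.

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

def standardization(x):
    return (x - x.mean()) / x.std()

# Hypothetical raw component values for four mutations.
components = {
    "fitness": np.array([0.5, 2.0, -1.0, 1.5]),
    "surfacc": np.array([1.0, -0.5, 0.3, 2.0]),
    "exchangability": np.array([0.2, 1.1, 0.9, -0.4]),
}
# Placeholder temperatures; a larger temperature flattens the logistic for that term.
temperatures = {"fitness": 1.0, "surfacc": 1.0, "exchangability": 2.0}

# Each term becomes a value in (0, 1) after standardization and the logistic.
terms = {name: logistic(standardization(values) / temperatures[name])
         for name, values in components.items()}

# Sum of logs, as in the excerpt above ...
evescape = sum(np.log(t) for t in terms.values())
# ... which equals the log of the product of the three terms.
product = terms["fitness"] * terms["surfacc"] * terms["exchangability"]
print(evescape)
print(np.log(product))
```

Because the terms are multiplied, a mutation that scores close to zero on any one term gets a low overall score regardless of the other two.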

popEVE

popEVE aims to compare mutation effects across the whole proteome.

Data

Not released in the preprint.

Model

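A Gaussian process model is trained with gpytorch on the EVE and ESM-1v prediction scores (for the two-component Gaussian mixture model, see EVE). For this code, see the official tutorial and its commentary.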

import torch
import gpytorch

class GPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # Set up the GP: variational distribution, variational strategy, mean and covariance modules
        variational_distribution = gpytorch.variational.NaturalVariationalDistribution(inducing_points.size(0))
        # Use GPyTorch's variational strategy, with the inducing-point locations as learnable parameters
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super(GPModel, self).__init__(variational_strategy)

        # Zero mean function, to reduce the risk of over-predicting pathogenicity
        self.mean_module = gpytorch.means.ZeroMean()
        # Kernel: scaled RBF
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        # Return a multivariate normal distribution
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

Pólya-Gamma auxiliary-variable augmentation is used for efficient inference in Gaussian process binary classification:

class PGLikelihood(gpytorch.likelihoods._OneDimensionalLikelihood):
    """
    Florian Wenzel, Theo Galy-Fajou, Christian Donner, Marius Kloft, Manfred Opper.
    "Efficient Gaussian process classification using Pólya-Gamma data augmentation."
    """

    def expected_log_prob(self, target, input):
        # Expected log-likelihood of the labels under the GP output distribution (`input`)
        mean, variance = input.mean, input.variance
        raw_second_moment = variance + mean.pow(2)

        # Map labels from {0, 1} to {-1, +1} to match the Pólya-Gamma formulation
        target = target.to(mean.dtype).mul(2.).sub(1.)

        # Auxiliary parameter c from the second moment; detached so no gradient flows through it
        c = raw_second_moment.detach().sqrt()
        # Half of the expected Pólya-Gamma variable: E[omega] = tanh(c / 2) / (2 c)
        half_omega = 0.25 * torch.tanh(0.5 * c) / c

        res = 0.5 * target * mean - half_omega * raw_second_moment
        res = res.sum(dim=-1)
        return res

    def forward(self, function_samples):
        return torch.distributions.Bernoulli(logits=function_samples)

    def marginal(self, function_dist):
        prob_lambda = lambda function_samples: self.forward(function_samples).probs
        probs = self.quadrature(prob_lambda, function_dist)
        return torch.distributions.Bernoulli(probs=probs)
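The expression in `expected_log_prob` is easier to trust after a numerical check. The sketch below uses plain NumPy and a deterministic latent value $f$ (zero variance), and writes out the lower bound on $\log \sigma(y f)$ that the Pólya-Gamma augmentation yields, including the term that depends only on $c$ and is dropped in the class above. The function names are hypothetical, and this is a sanity check of the math rather than part of popEVE.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z)))
    return -np.logaddexp(0.0, -z)

def pg_lower_bound(y, f, c):
    """Lower bound on log sigmoid(y * f) for labels y in {-1, +1} and auxiliary parameter c > 0."""
    omega = np.tanh(0.5 * c) / (2.0 * c)            # E[omega] for omega ~ PG(1, c)
    data_term = 0.5 * y * f - 0.5 * omega * f ** 2  # the part kept in expected_log_prob
    const_term = 0.5 * omega * c ** 2 - np.log(2.0 * np.cosh(0.5 * c))  # depends on c only
    return data_term + const_term

f = np.array([-3.0, -1.0, -0.2, 0.5, 2.0])  # latent function values (non-zero)
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # labels in {-1, +1}

exact = log_sigmoid(y * f)
tight = pg_lower_bound(y, f, np.abs(f))           # c = |f|: the bound is exact
loose = pg_lower_bound(y, f, np.full_like(f, 1.5))  # any other c: a valid but looser bound
print(np.max(np.abs(tight - exact)))
print(np.all(loose <= exact + 1e-12))
```

With $c$ set to the square root of the second moment, as in the class above, the bound is as tight as possible, and since the $c$-only term is detached from the graph it can be omitted without changing the gradients.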