ProteinMPNN Application Paper: Reading Notes

Paper link: Improving Protein Expression, Stability, and Function with ProteinMPNN


  1. Natural proteins are highly optimized for function but are often difficult to produce at a scale suitable for biotechnological applications due to poor expression in heterologous systems, limited solubility, and sensitivity to temperature.

  2. Evolution has optimized function over stability in most natural proteins; as a result, they often exhibit poor solubility, thermostability, and expression in heterologous systems, all of which reduce the yield of functional protein.

  3. Experimental methods such as directed evolution have been extensively used to optimize desirable features in proteins but are often prohibitively resource- and labor-intensive. Computational tools have been developed to achieve the benefits of directed evolution while minimizing experimental screening.

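Problems with natural proteins in heterologous systems (wild-type E. coli vs. the human body; transgenic E. coli vs. the human body):

  1. Poor expression levels of functional protein.

  2. Physical properties of the protein itself: low solubility (in industrial settings the protein is hard to purify and to restore to an active form, and inside the host it accumulates as inclusion bodies, which are inactive and insoluble).

  3. Sensitivity to temperature (poor thermostability).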


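Terms: active site (the region of an enzyme where the substrate binds and is converted) / conserved positions (residues that vary little across homologous sequences) / substrate (the molecule an enzyme acts on)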

The design space is chosen to preserve alternative protein function by fixing the amino acid identities of residues close to the ligand and those that are highly conserved in multiple sequence alignments.

In all targets, to preserve the catalytic machinery and substrate-binding site, we fixed the amino acid identities of the first shell functional positions defined as those within 7 Å of the substrate in a ligand-bound crystal structure complex.
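The 7 Å first-shell rule can be illustrated with a small distance check. This is a minimal sketch, not the authors' code: the function name `first_shell_residues`, the input layout, and the toy coordinates are all assumptions made for illustration, and it assumes the rule means "any protein atom within the cutoff of any ligand atom".

```python
import math

def first_shell_residues(protein_atoms, ligand_atoms, cutoff=7.0):
    """Return sorted residue numbers that have at least one atom within
    `cutoff` angstroms of any ligand atom.

    protein_atoms: list of (residue_number, (x, y, z)) tuples (hypothetical layout)
    ligand_atoms:  list of (x, y, z) tuples
    """
    fixed = set()
    for resnum, coord in protein_atoms:
        if resnum in fixed:
            continue  # residue already selected by another of its atoms
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            fixed.add(resnum)
    return sorted(fixed)

# Toy coordinates, not taken from any real structure.
protein_atoms = [
    (10, (0.0, 0.0, 0.0)),
    (10, (1.5, 0.0, 0.0)),
    (11, (6.0, 0.0, 0.0)),
    (12, (20.0, 0.0, 0.0)),
]
ligand_atoms = [(0.0, 3.0, 0.0), (0.0, 4.0, 0.0)]

fixed_positions = first_shell_residues(protein_atoms, ligand_atoms)
print(fixed_positions)  # [10, 11]
```

In practice the coordinates would come from a parsed ligand-bound crystal structure, and the resulting list would be the set of positions kept at their native identity during sequence design.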

With the design space selected, we generated sequences with ProteinMPNN, predicted the structures with AlphaFold2, and filtered by the predicted local distance difference test score (pLDDT) and Cα root-mean-square deviation (RMSD) to the input structure.
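The filtering step can be sketched as a simple threshold test on the two metrics. This is an illustrative sketch only: the `Design` class, the function name, and the example values are hypothetical, and the default thresholds are the ones reported for the myoglobin designs in these notes (pLDDT > 85.0, Cα RMSD < 1.0 Å), which need not apply to other targets.

```python
from dataclasses import dataclass

@dataclass
class Design:
    name: str
    plddt: float    # mean pLDDT of the predicted structure (0-100)
    ca_rmsd: float  # C-alpha RMSD to the input structure, in angstroms

def passes_filter(design, min_plddt=85.0, max_rmsd=1.0):
    """Keep a design only if it is confidently predicted AND close to the input backbone."""
    return design.plddt > min_plddt and design.ca_rmsd < max_rmsd

# Made-up metric values for illustration; not data from the paper.
designs = [
    Design("design_01", 91.2, 0.6),
    Design("design_02", 88.0, 1.4),   # confident, but backbone drifted
    Design("design_03", 72.5, 0.8),   # close backbone, but low confidence
    Design("design_04", 86.1, 0.9),
]

selected = [d.name for d in designs if passes_filter(d)]
print(selected)  # ['design_01', 'design_04']
```

Both conditions must hold, so a sequence that folds confidently into a different backbone is rejected just like a low-confidence prediction.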


We chose as model systems one of the first proteins whose structure was solved, the oxygen storage protein myoglobin, and the widely used protease from tobacco etch virus (TEV).

1. Design of Myoglobin Variants with Increased Stability.

Usage: Myoglobin binds heme to carry oxygen in mammalian muscle tissue, and has relevance in clinical applications as a biomarker, as a versatile platform for biocatalytic applications, and in food science as an ingredient in artificial meat products. The globin superfamily, of which myoglobin is a member, has a fold made up of eight alpha helical regions, with diversity in the termini and two loop regions flanking the heme-binding pocket.

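Terms: binding site (where the ligand contacts the protein) / heme (the iron-containing porphyrin cofactor) / inpainted regions (backbone segments rebuilt by the inpainting model)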


1. Experiment process

We applied the ProteinMPNN design protocol described above using a crystal structure of human myoglobin, nMb (PDB: 3RGK). To preserve the oxygen storage function, we fixed the identities of 17 positions located around the heme ligand in the heme-bound structure.

results: Sixty sequences were generated with ProteinMPNN and evaluated for their likelihood to recapitulate the myoglobin backbone coordinates using AlphaFold2 single-sequence predictions (see Supporting Information). Eight of the designs did so with high confidence (pLDDT > 85.0 and Cα RMSD < 1.0 Å; analogous single-sequence prediction of the native sequence yielded pLDDT = 50.6 and Cα RMSD = 7.5 Å). Four designs with close structural agreement in the heme-binding region were selected for experimental testing.

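  1. pLDDT (predicted Local Distance Difference Test): a confidence metric for predicted protein structures.

    1. It is based on the Local Distance Difference Test (lDDT) and gives a per-residue (local) confidence estimate for the predicted structure.

    2. pLDDT ranges from 0 to 100 and is the model's own estimate of how well the prediction would agree with an experimental structure; higher values indicate a more reliable prediction.

  2. Cα RMSD (root-mean-square deviation of Cα atoms): a measure of the overall structural difference between two protein structures.

    1. It is the root-mean-square deviation between the positions of corresponding Cα atoms in the two structures, normally computed after superposing them.

    2. It is usually reported in ångströms (Å); the smaller the Cα RMSD, the more similar the two structures.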
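Both metrics reduce to short calculations once the inputs are available. The sketch below is a simplified illustration with hypothetical function names: it assumes the two Cα coordinate lists are already superposed and in the same residue order (no alignment step is performed), and it assumes the single pLDDT number quoted for a design is the mean over residues.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of C-alpha (x, y, z) coordinates.
    Assumes the structures have already been superposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def mean_plddt(per_residue_plddt):
    """Average the per-residue pLDDT values into one score for the whole model."""
    return sum(per_residue_plddt) / len(per_residue_plddt)

# Toy coordinates: the first and last atoms are each displaced by 1 angstrom.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 0.0, 0.0), (7.6, -1.0, 0.0)]

rmsd = ca_rmsd(reference, predicted)
print(round(rmsd, 3))                  # 0.816
print(mean_plddt([90.0, 80.0, 70.0]))  # 80.0
```

A real comparison would first superpose the predicted model onto the input structure (for example with a Kabsch fit), since RMSD computed on unaligned coordinates mostly reflects the arbitrary placement of the two models.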


In myoglobin, we performed a limited backbone redesign to further stabilize the structure.

We also explored the limited backbone redesign of poorly ordered regions to attempt to further stabilize the protein.

We selected these less-conserved loop regions for backbone remodeling with RoseTTAFold joint inpainting. We generated two distinct sets of designs with structural remodeling: one with the region joining helices E and F redesigned and one additionally including the CD-loop region.

From these remodeled backbones, we again performed sequence design with ProteinMPNN, with the heme-binding site kept fixed as described above.


(b) SEC traces of 20 designed myoglobin variants.
(c) Soluble yield of myoglobin designs and native myoglobin (nMb, represented as a red dashed line).

Thirteen of the twenty designs had higher levels (up to a 4.1-fold increase) of total soluble protein yield compared to that of native myoglobin. All 20 designs had similar heme-binding spectra to native myoglobin, with agreement in the Soret maximum (407–413 nm vs 409 nm in native) and Q-band features (500, 537, 582, and 630 nm), suggesting the preservation of the native heme-binding mechanism.


2. Design of TEV Protease Variants with Improved Stability and Catalytic Activity.

For TEV protease, we used evolutionary information to further identify residues critical to activity.

TEVd (PDB: 1LVM) input structure with positions fixed during redesign highlighted. Active site residues surrounding the substrate (blue), 50% of the most highly conserved residues (yellow), and catalytic residues (pink) are highlighted. Inset shows a zoomed-in view of the active site region.

1. Experiment process

We ranked each amino acid identity at each position by the degree of conservation in the sequence alignment and varied the percentage of these most highly conserved residues to fix during sequence redesign between 30 and 70%. We generated four distinct sets of designs that fixed the amino acid identities of just the active site residues or the active site residues and 30, 50, and 70% of the most conserved residues in the TEV family (Figure 3A, see Supporting Information).
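The idea of fixing the top N% most conserved positions can be sketched as follows. The excerpt does not say which conservation score was used, so this example assumes a simple one (the frequency of the most common residue in each alignment column, ignoring gaps). The function names, the 0-based position indexing, and the toy alignment are all hypothetical.

```python
from collections import Counter

def conservation_scores(msa):
    """Per-column frequency of the most common residue (gaps ignored).
    `msa` is a list of equal-length aligned sequences."""
    scores = []
    for col in range(len(msa[0])):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no conservation signal
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores

def positions_to_fix(msa, fraction, active_site=()):
    """0-based positions to hold fixed: the active site residues plus the
    top `fraction` of columns ranked by conservation (ties broken by index)."""
    scores = conservation_scores(msa)
    n_fix = round(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(set(ranked[:n_fix]) | set(active_site))

# Toy alignment for illustration; not real TEV family sequences.
msa = [
    "MKTAYL",
    "MKSAYV",
    "MRTAFL",
    "MKTGYI",
]

print(conservation_scores(msa))                      # [1.0, 0.75, 0.75, 0.75, 0.75, 0.5]
print(positions_to_fix(msa, 0.5, active_site=(5,)))  # [0, 1, 2, 5]
print(positions_to_fix(msa, 0.0, active_site=(5,)))  # [5]
```

Calling `positions_to_fix` with fractions of 0.0, 0.3, 0.5, and 0.7 mirrors the four design sets described above: active site only, then active site plus 30, 50, and 70% of the most conserved positions.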

results: A total of 144 sequences were generated with ProteinMPNN, which were all predicted with high confidence to fold to the TEV structure by AlphaFold2 (pLDDT > 87.5; native TEV is predicted with pLDDT = 90) and possess 55 to 85% sequence identity to the parent sequence.
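Sequence identity to the parent is a position-by-position comparison. The sketch below assumes the design and parent have the same length, which holds when sequences are designed onto a fixed backbone, so no alignment step is needed. The function name and the two toy sequences are made up for illustration.

```python
def sequence_identity(parent, design):
    """Percent of positions at which two equal-length sequences share the same residue."""
    if len(parent) != len(design) or not parent:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(parent, design))
    return 100.0 * matches / len(parent)

# Toy 10-residue sequences differing at 3 positions; not real TEV sequences.
parent = "ACDEFGHIKL"
design = "ACDQFGHLKV"

print(sequence_identity(parent, design))  # 70.0
```

For designs made on remodeled backbones with insertions or deletions, the two sequences would need to be aligned first and identity computed over the aligned columns.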

129 of 144 designs exhibited higher levels of soluble expression than TEVd (TEVd average yield = 1 mg/L culture, design average yield = 20.1 mg/L culture).

Designs made with no evolutionary constraints had improved soluble expression over the parent but were not active on the peptide substrate, while designs with the highest activities were designed with the top 50% most conserved residues fixed.





