[Paper reading] Data-driven regularization lowers the size barrier of cryo-EM structure determination


Nature Methods (2024)

Abstract

(Background) Macromolecular structure determination by electron cryo-microscopy (cryo-EM) is limited by the alignment of noisy images of individual particles. Because smaller particles have weaker signals, alignment errors impose size limitations on its applicability.

(Theory) Here, we explore how image alignment is improved by the application of deep learning to exploit prior knowledge about biological macromolecular structures that would otherwise be difficult to express mathematically.

(Method) We train a denoising convolutional neural network on pairs of half-set reconstructions from the Electron Microscopy Data Bank (EMDB) and use this denoiser as an alternative to a commonly used smoothness prior. We demonstrate that this approach, which we call Blush regularization, yields better reconstructions than do existing algorithms, in particular for data with low signal-to-noise ratios.

(Results) The reconstruction of a protein–nucleic acid complex with a molecular weight of 40 kDa, which was previously intractable, illustrates that denoising neural networks will expand the applicability of cryo-EM structure determination for a wide range of biological macromolecules.

Main

(Current state of the problem) Despite rapid progress in cryo-EM technology in the past decade [1], many biological macromolecules of interest are still too small to allow reliable structure determination.
To limit the damage that electrons cause to biological structures of interest, cryo-EM images are taken using low doses of electron radiation, leading to high levels of experimental noise. The noise in the images impedes their alignment, resulting in an ill-posed optimization problem in which many reconstructions (which might be noisy or artifactual) are equally probable, given the data.
The ill-posedness of the reconstruction imposes a minimum size barrier for cryo-EM structure determination, because smaller complexes yield images with lower signal-to-noise ratios. Although this barrier has been overcome in experiments involving the formation of complexes between small targets and other proteins [2], the formation of sufficiently rigid complexes is often difficult.
Here we explore a computational method that lowers the size barrier for existing cryo-EM datasets.

Even for ill-posed reconstruction problems, the correct solution can still be identified through the incorporation of prior knowledge. Most cryo-EM structures are calculated using explicit regularization of a likelihood function in Fourier space, which assumes cryo-EM reconstructions are smooth in real space [3,4,5].
Although we know much more about the structures of biological macromolecules beyond just the fact that their density varies smoothly, it has been difficult to incorporate richer sources of prior knowledge into the optimization process.
Denoising convolutional neural networks can incorporate complex prior knowledge into an iterative optimization process [6]. By training a denoising network on simulated pairs of noisy and ground-truth images, we have previously provided proof of principle that prior knowledge about protein structures can be exploited to improve cryo-EM structure determination [7].
However, we also observed problems with overfitting and the hallucination of protein-like features in the resulting reconstructions. Moreover, because experimental cryo-EM structures often comprise regions of well-ordered proteins and nucleic acid domains alongside less structured regions, including, for example, membrane patches or flexible domains, it was not clear how ground-truth pairs for experimental cryo-EM data could be generated.

Here, we demonstrate how a pre-trained denoising convolutional neural network, trained and deployed in an application-specific manner inspired by the noise2noise approach [8] (Fig. 1 and Methods), can improve cryo-EM structure determination using experimental data.
Through this approach, which we call Blush regularization, we improve reconstructions across a variety of existing cryo-EM datasets, including one for a protein–nucleic acid complex that was too small for analysis using existing methods.

Fig. 1: Schematic illustration of Blush regularization and slices of example volumes.

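a, Training procedure, showing the two channels for the two half-maps and the output of the recurrent denoiser (pink), from which the mean squared error (L2) loss is calculated.
b, Iterative reconstruction with spectral trailing. Each half-map is reconstructed separately. In each iteration, the FSC is used to estimate a cut-off frequency ($\rho$), which is then used to low-pass filter the denoised output. The final output is not passed through the denoiser but is subjected to a Wiener filter, as in the baseline reconstruction.
c, Denoising U-net architecture, consisting of five consecutive encoder blocks and one convolutional block, followed by five consecutive decoder blocks. SiLU, sigmoid linear unit; BatchNorm, batch normalization.
d,e, Slices through the maps before (left) and after (right) a single application of the denoiser in the last iteration of the PfCRT (d) and spliceosome (e) reconstructions.
f,g, Slices through the baseline reconstruction (left) and the map after Blush regularization (right) for the FIA (f) and the Aca2–RNA complex (g). Scale bars, 30 Å.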

Results

Blush regularization improves reconstruction without overfitting

We first tested Blush regularization on a cryo-EM dataset (EMPIAR-10330) [9] for the Plasmodium falciparum chloroquine resistance transporter (PfCRT) [10]. This dataset has been used as a standard to demonstrate the performance of several approaches in reducing overfitting during cryo-EM refinement [11,12]. Standard refinement using regularized likelihood optimization in RELION, which we refer to as the baseline, yields an overall resolution of 3.8 Å for this dataset.

Application of Blush regularization (Fig. 2) yielded an overall resolution estimate of 3.4 Å. In the last iteration, spectral trailing, a heuristic method that prevents overfitting by limiting the spatial frequency at which information from the denoiser is used (Methods), was applied with a cut-off at 3.5 Å. Compared with the baseline reconstruction, local resolution improved for most regions of the map, with a corresponding increase in visible side-chain densities. The improvement in resolution, as measured by half-map Fourier shell correlation (FSC), was confirmed by FSCs between both maps and the atomic model that was deposited for this dataset (Protein Data Bank (PDB): 6UKJ). Throughout this paper, FSCs between the map and atomic model were calculated using Servalcat [13]. We also assessed the relative quality of both maps by application of our automated model-building software ModelAngelo [14], which generated a model with 84% completeness in the baseline map and 97% completeness in the Blush map. Model completeness is defined as the percentage of residues that match the reference model with a Cα distance of 3 Å or less.
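The completeness metric defined above can be illustrated with a small sketch. It assumes that "matching" means only that some built Cα atom lies within 3 Å of the reference Cα; the text does not spell out how residues are paired or whether residue identity is also checked, so the function name `model_completeness` and the nearest-neighbor matching rule are illustrative assumptions, not the procedure used in the paper.

```python
import math


def model_completeness(reference_ca, built_ca, max_dist=3.0):
    """Percentage of reference residues with a built C-alpha within max_dist (in angstroms).

    Assumption: a reference residue counts as matched if ANY built C-alpha
    lies within max_dist of it; residue identity is not checked here.
    """
    if not reference_ca:
        raise ValueError("reference model has no residues")
    matched = sum(
        1
        for ref in reference_ca
        if any(math.dist(ref, built) <= max_dist for built in built_ca)
    )
    return 100.0 * matched / len(reference_ca)


# Toy coordinates (angstroms): four reference residues, three built residues.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
built = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (20.0, 0.0, 0.0)]

print(model_completeness(reference, built))  # 50.0
```

In this toy case two of the four reference residues have a built Cα within 3 Å, so the completeness is 50%.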

Fig. 2: Single-particle reconstruction of the PfCRT dataset.

  • a, Maps colored by local resolution, comparing the baseline reconstruction (left) and the reconstruction after Blush regularization (right).
  • b, Automated atomic modeling by ModelAngelo for the baseline (left) and Blush (right) maps. Colored by chain.
  • c, FSCs between the masked maps and deposited model (PDB: 6UKJ).
  • d, Solvent-corrected half-map FSCs. Both plots show FSCs for Blush (purple), Blush without spectral trailing (pink) and baseline (black). The dashed pink line shows the solvent-corrected half-map FSC for Blush without spectral trailing when applied to data with phase randomization beyond 4-Å resolution.


To assess the potential for overfitting by the denoiser, we also performed a phase-randomization test [15]. We applied Blush regularization without spectral trailing for refinement of the PfCRT dataset with phase randomization beyond 4-Å resolution. Although spectral trailing was not used, no overfitting was observed. Switching off spectral trailing led to a marginal improvement in the quality of reconstruction, as quantified by the FSC between the map and the atomic model (Fig. 2d). These results indicate that the denoiser can prevent overfitting for this dataset, even without spectral trailing. In general, we still recommend running Blush regularization with spectral trailing, because the benefits of switching it off are small and overfitting could be more prominent for other datasets. Consequently, in the following sections, we present results obtained only using spectral trailing.

Blush expands the applicability of cryo-EM reconstruction

We subsequently assessed the broader applicability of Blush regularization by applying it to four types of structures and refinement methods.

First, we tested Blush regularization on a small membrane protein, Ste2, which is a dimeric G-protein-coupled receptor (GPCR) [16] (Fig. 3 and Extended Data Table 1). Full-length monomeric Ste2 has a molecular weight of 47.85 kDa, which includes a long disordered carboxy-terminal tail that comprises 125 amino acids. The total mass of the ordered dimeric Ste2 that contributes to alignment is roughly 67 kDa, most of which lies embedded in a detergent micelle.

Fig. 3: Single-particle reconstruction of the Ste2 dataset.
a,b, Reconstructions of the Ste2 dataset, colored by local resolution, comparing the baseline reconstruction (a) and the reconstruction after Blush regularization (b).
c,d, Automated atomic modeling by ModelAngelo, using the baseline (c) and Blush (d) maps.
e, Solvent-corrected half-map FSCs.

The dataset used was acquired from a similar complex to that in PDB entry 7QB9, reported in ref. 16, but with different biochemical conditions affecting the stability of the structure. Alignment of images of Ste2 is difficult because few protein features extend from the smooth detergent micelle. Baseline reconstruction yielded a map with an overall resolution of 3.8 Å, with limited densities for side chains. Application of Blush regularization led to a structure with an overall resolution of 3.4 Å. Spectral trailing ensured that no information from the denoiser was inserted beyond 3.7-Å resolution. Compared with the baseline reconstruction, the density of the transmembrane helices is improved. Loops at the top and bottom of the structure are still relatively poorly resolved, probably owing to molecular flexibility. In agreement with the visibility of improved side-chain densities and local resolution estimates, the completeness of models built by ModelAngelo in these maps improved from 19% to 43%.

Second, we evaluated the performance of Blush regularization in multi-body refinement [17], in which partial signal subtraction is used to align independently moving domains within a larger complex. Reconstructions from subtracted images were included in the training set for the denoiser. Moreover, signal subtraction reduces the amount of signal in each image, placing stringent limitations on the minimal size of domains that can be aligned. We applied Blush regularization in multi-body refinement of a publicly available dataset (EMPIAR-10180) for the Saccharomyces cerevisiae pre-catalytic spliceosomal B complex [18] (Fig. 4). Using four bodies, one each for the core, the foot, the helicase and the SF3b regions, Blush regularization improved the quality of reconstructions of all domains compared with baseline multi-body refinement, as measured by local resolution, half-map FSCs and FSCs with the reference atomic model (PDB: 5NRL). The improvements in resolution were largest in the helicase and SF3b regions, which are the most flexible and thus the hardest to reconstruct. The improvements in resolution were reflected by automated model building in ModelAngelo, which increased model completeness of the entire complex from 32% to 48%. In particular, the model completeness for the SF3b region was improved from 3% to 29%.

Fig. 4: Multi-body reconstruction of the spliceosome dataset.
a,b, Combined maps of the individual bodies, colored by local resolution, comparing the baseline reconstruction (a) and the reconstruction after Blush regularization (b).
c, FSCs between the masked maps of each body and the corresponding region in the deposited model (PDB: 5NRL).
d, Solvent-corrected half-map FSCs for the individual bodies. In c and d, dashed and solid lines correspond to baseline and Blush maps, respectively. FSCs are shown for each body: core (light gray), foot (dark gray), helicase (purple) and SF3b (pink).
e, Completeness of atomic models built by ModelAngelo for each body, using baseline (gray) and Blush (pink) maps.
f, Gold-standard half-map resolutions of each body for baseline (gray) and Blush (pink) maps.


Third, we assessed the performance of Blush regularization for a biological assembly that was different from the types of structures that the denoiser was trained on: the first intermediate amyloid (FIA) that forms during the in vitro assembly of recombinant tau (residues 297–391) [19]. This dataset is also publicly available (EMPIAR-11720). Unlike any of the structures in the training set, the FIA has helical symmetry (Fig. 5). It is an amyloid filament, with parallel β-strands repeating every 4.7 Å in the direction of the helical axis. Besides deviating from the types of structures in the training set, the FIA is also one of the smallest amyloid structures solved to date, with only 15 ordered residues in each of two opposing β-sheets. Baseline helical refinement yielded a 5.0-Å-resolution map, in which the density for β-strands along the helical axis was not separated, and no atomic model could be built. Blush regularization improved the resolution to 2.8 Å, and ModelAngelo built all 15 ordered residues in the resulting map.

Fig. 5: Helical reconstruction of the FIA, colored by local resolution, for the baseline.
a,b, Maps of the baseline reconstruction (a) and after application of Blush regularization (b).
c, Automated atomic modeling by ModelAngelo, comprising tau residues 302–316.
d, Solvent-corrected half-map FSCs of the reconstructed maps.

Fourth, we applied Blush to the small anti-CRISPR associated protein 2 (Aca2) bound to RNA, which has a total molecular weight of 40 kDa (Fig. 6 and Extended Data Table 1). Using different classification and refinement strategies in baseline RELION and CryoSPARC, we could not obtain a reliable reconstruction. Although an initial model generated using the standard VDAM algorithm in RELION [20] suffered from anisotropy, the first three-dimensional (3D) classification using Blush regularization resulted in one class with recognizable protein features. Similar 3D classifications without Blush regularization did not yield recognizable protein features. Refinement of the corresponding class yielded a better initial model for a second 3D classification, from which a single class was selected for subsequent CTF refinement [21] and particle polishing [21]. A 3D classification was performed without alignment, followed by a final 3D refinement. Blush regularization was used for all 3D classifications with alignment and 3D refinements. The resolution of the final map was 2.5 Å, with ModelAngelo successfully building 97% of the protein sequence and 33 out of 42 nucleotides.

Fig. 6: Single-particle reconstruction of the Aca2–RNA complex with a molecular weight of 40 kDa.

a, Local resolution of the reconstruction with Blush regularization.
b, Automated atomic model assignment with ModelAngelo.
c, Detailed view of an α-helical segment in the reconstructed map and refined atomic model.
d, The solid line shows the FSC to a reference atomic model, and the dashed line shows the half-map FSC. The solvent-corrected resolution is 2.5 Å, using a spectral trailing cut-off at 3.0 Å.
e, Processing pipeline from initial model to final reconstruction. Numbers indicate the number of particles assigned to each map. Purple squares indicate reconstructions using Blush regularization.

Discussion

Methods

Rationale

The noise2noise framework [8] facilitates the training of a denoising convolutional neural network in the absence of explicit access to ground-truth images. Instead, it relies on pairs of noisy images to extract information about their shared signal. Here, we present an application-specific approach that incorporates this aspect from the noise2noise framework. We trained a denoiser on a set of 422 pairs of noisy half-maps that we downloaded from the EMDB [30]. We selected only entries with reported resolutions higher than 4 Å for which both unfiltered half-maps were deposited. Maps with obvious artifacts, for example those associated with overfitting, and maps of a structure that was already present in the training set were eliminated during manual curation.

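We tailored the data augmentation and the training of the denoiser for its incorporation into the iterative expectation-maximization algorithm for cryo-EM reconstruction. All half-map pairs, $x_i^{(k)}$ with $k \in \{0, 1\}$, were re-scaled to a uniform voxel size of 1.5 Å and augmented by generating new pairs $y_i^k$ and $\overline{y}_i^k$.

(Equation image: definitions of $y_i^k$ and $\overline{y}_i^k$.)

Here, $e$ is random colored noise, $M_i \in [0,1]^N$ is a smooth mask encapsulating the molecule of interest, $\odot$ denotes voxel-wise multiplication and $h(\cdot)$ is a low-pass filter to 15 Å.
$H_{C,A}[\cdot]$ applies an anisotropic Gaussian filter with covariance matrix $C$ and an affine transformation $A$ comprising rotations and translations, crops to patches of $64^3$ voxels and standardizes the voxel values. Random assignment of $C$, $\overline{C}$, $A$, $e$ and $r$ achieves the data augmentation.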

By using a range of resolution cut-offs for $C$ and $\overline{C}$, the denoiser explicitly learns to handle maps with varying resolutions. This is necessary for its application inside the iterative expectation-maximization algorithm, which typically starts at relatively low resolutions and gradually progresses to higher resolutions. Although using a lower resolution cut-off for $C$ than for $\overline{C}$ could have produced a network that enhances the resolution of the half-maps, similar to deblurring networks [31], we opted not to do so to minimize the risk of hallucinations in high-resolution features.

Using different degrees of anisotropy in $C$ and $\overline{C}$, the denoiser learns to deal with the artifacts that arise from non-uniform orientational distributions, and random orientations and affine transformations in $A$ lead to invariance with respect to rotations, translations and intensity scale. Although initial versions of our training protocol did not include masks, we observed that the resulting networks would learn to smoothen densities in disordered regions, such as the solvent or detergent micelles, which would improve image alignments. To amplify these effects, we then implemented the supervised masking approach with $M_i$ and $h(\cdot)$. By filling disordered regions with a 15-Å-resolution low-pass filtered version of the map, as opposed to a straightforward voxel-wise multiplication with the mask $M_i$, higher density values in regions with disordered molecules, such as detergent micelles, are maintained.
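The difference between the two masking strategies in the preceding paragraph can be sketched on a toy list of voxel values. The blend $M_i \odot x + (1 - M_i) \odot h(x)$ used below is an inference from the prose, because the defining equation is not available in this text. The low-pass filter itself is not implemented: `x_lowpass` is assumed to be a precomputed 15-Å filtered copy of the map, and both function names are hypothetical.

```python
def blend_with_lowpass(x, x_lowpass, mask):
    """Voxel-wise M * x + (1 - M) * h(x) on flat lists of voxel values.

    Assumption: this blend is inferred from the text; x_lowpass stands in
    for h(x), a low-pass filtered copy of x computed elsewhere.
    """
    if not (len(x) == len(x_lowpass) == len(mask)):
        raise ValueError("x, x_lowpass and mask must have the same length")
    return [m * v + (1.0 - m) * lp for v, lp, m in zip(x, x_lowpass, mask)]


def multiply_by_mask(x, mask):
    """Plain voxel-wise masking, M * x, shown for comparison."""
    if len(x) != len(mask):
        raise ValueError("x and mask must have the same length")
    return [m * v for v, m in zip(x, mask)]


# Three voxels: inside the mask, on its soft edge, and outside (e.g. micelle).
x = [4.0, 2.0, 1.0]
x_lowpass = [3.0, 2.5, 1.5]
mask = [1.0, 0.5, 0.0]

print(blend_with_lowpass(x, x_lowpass, mask))  # [4.0, 2.25, 1.5]
print(multiply_by_mask(x, mask))               # [4.0, 1.0, 0.0]
```

Outside the mask, the blended target keeps the smoothed density (1.5) where plain multiplication would set it to zero, which is the behavior the paragraph describes for detergent micelles.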

By re-scaling all maps to a common voxel size of 1.5 Å, and then cropping maps to patches of $64^3$ voxels, the network can be trained on and applied to maps of any size. To apply the denoiser to maps that are larger than one patch, overlapping patches can be denoised independently.
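As a rough illustration of the patch scheme, the sketch below computes the box size after re-scaling to 1.5 Å voxels and the start offsets of overlapping 64-voxel patches along one axis. The overlap of 16 voxels and both function names are assumptions made for this example; the text does not state how much the patches overlap or how the overlapping outputs are merged.

```python
def rescaled_size(n_voxels, voxel_size, target_voxel_size=1.5):
    """Box size (in voxels) after re-scaling to the target voxel size."""
    return round(n_voxels * voxel_size / target_voxel_size)


def patch_starts(size, patch=64, overlap=16):
    """Start offsets of overlapping patches covering one axis of length `size`.

    Assumption: the overlap value is illustrative only. A map smaller than
    one patch yields a single offset and would need padding.
    """
    if not 0 <= overlap < patch:
        raise ValueError("overlap must satisfy 0 <= overlap < patch")
    if size <= patch:
        return [0]
    stride = patch - overlap
    starts = list(range(0, size - patch, stride))
    starts.append(size - patch)  # last patch sits flush with the map edge
    return starts


box = rescaled_size(200, 1.2)  # a 200-voxel box at 1.2 angstroms per voxel
print(box)                     # 160
print(patch_starts(box))       # [0, 48, 96]
```

The same offsets would be used along each of the three axes, so a $160^3$ map is covered by $3^3 = 27$ patches in this example.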

Training the denoiser

Our denoiser ($f_\theta$) consists of a U-net with approximately 13 million trainable parameters ($\theta$) (Fig. 1). It is trained using residual learning [32] and with a dropout rate of 50% (ref. [33]). Instance normalization [34] is used to handle small mini-batches ($B$), with $b = 8$ samples from the training dataset, during training. We minimize the following loss:

(Equation image: the training loss.)

where $R_r[f_\theta, y]$ returns the output of the denoiser $f_\theta$ after recursively calling it $r \in \{0, \ldots, 5\}$ times with $y_i^k$ as the initial input. This enables the denoiser to recognize and suppress artifacts brought about by its repeated usage, thereby limiting the amplification of artifacts in the reconstruction that are introduced by the denoiser during subsequent iterations of the expectation-maximization algorithm [7].
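The recursive operator $R_r$ can be sketched in a few lines. This is a toy illustration only: the denoiser is a stand-in function on a list of numbers, the mean squared error follows the L2 loss mentioned in the Fig. 1 caption, and the choice of target and the uniform sampling of $r$ are assumptions, since the loss equation itself is not available in this text.

```python
import random


def apply_recursively(denoiser, y, r):
    """R_r[f, y]: feed y through the denoiser r times (r = 0 returns y unchanged)."""
    out = y
    for _ in range(r):
        out = denoiser(out)
    return out


def mse(a, b):
    """Mean squared error between two equally long lists of voxel values."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def recursive_loss(denoiser, noisy, target, r):
    """Assumed form of the per-sample loss: MSE between R_r[f, noisy] and target."""
    return mse(apply_recursively(denoiser, noisy, r), target)


def toy_denoiser(values):
    """Hypothetical stand-in for the U-net: halves every voxel value."""
    return [0.5 * v for v in values]


noisy = [4.0, 8.0]
target = [1.0, 2.0]

print(recursive_loss(toy_denoiser, noisy, target, r=0))  # 22.5
print(recursive_loss(toy_denoiser, noisy, target, r=2))  # 0.0

# During training, r would be drawn anew for each sample (assumed uniform here).
r = random.randint(0, 5)
print(0 <= r <= 5)  # True
```

Because the loss is evaluated on the output of repeated calls, a network trained this way is penalized for artifacts that only appear after several passes, which is the motivation given in the paragraph above.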

Training for 950,000 steps took six days using a single Nvidia A100 GPU.

Iterative denoising with spectral trailing

We refer to the application of our pre-trained denoiser within the iterative expectation-maximization algorithm as Blush regularization. In our original work, with simulated data, we incorporated the denoiser into the L2 regularization in the M-step, on the basis of the approximation that the prior function is ‘close’ to a Gaussian [7]. In this work, we do not make formal claims about the role of the denoiser within a Bayesian framework. Instead, our approach is motivated by empirical observations.

Although one effect of the denoiser is that it tends to dampen Fourier components at higher spatial frequencies, the amount by which it does so is not well defined. Therefore, we use a heuristic method, here referred to as spectral trailing, to prevent overfitting in 3D autorefinement and multi-body refinement. First, we calculate the FSC between two independently refined half-maps before the denoiser is applied, and determine the $\rho$ value at which the solvent-corrected FSC drops below 0.143. We then apply the denoiser to both half-maps and subsequently apply a low-pass filter at a spatial frequency that is two Fourier shells (each shell is one Fourier voxel wide) lower than $\rho$. If $\rho$ exceeds the Nyquist frequency of the denoiser, here set to 3 Å, the remaining Fourier shells at higher frequencies are populated with the reconstruction from the standard regularization in Fourier space. The resulting denoised, low-pass-filtered maps are then used as references for alignment in the next iteration. The denoiser is not applied to the output of the final refinement step.
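The cut-off selection in spectral trailing can be sketched as follows, assuming the solvent-corrected FSC is already available as a list with one value per Fourier shell. The function names, the toy FSC curve and the treatment of $\rho$ as the first shell below the threshold are assumptions for illustration. The FSC calculation, the low-pass filter and the fallback to the standard reconstruction beyond the denoiser's Nyquist limit are not implemented.

```python
def first_shell_below(fsc, threshold=0.143):
    """Index of the first Fourier shell whose FSC is below the threshold.

    Returns len(fsc) if the curve never drops below the threshold.
    """
    for shell, value in enumerate(fsc):
        if value < threshold:
            return shell
    return len(fsc)


def trailing_cutoff_shell(fsc, trailing_shells=2, threshold=0.143):
    """Low-pass cut-off shell: `trailing_shells` shells below rho, at least shell 1."""
    rho = first_shell_below(fsc, threshold)
    return max(rho - trailing_shells, 1)


def shell_to_resolution(shell, box_size, voxel_size):
    """Resolution in angstroms for a shell index (shell 0 is the constant term)."""
    if shell <= 0:
        raise ValueError("shell index must be positive")
    return box_size * voxel_size / shell


# Toy FSC curve for a 64-voxel box at 1.5 angstroms per voxel (illustrative values).
fsc = [1.0, 0.99, 0.95, 0.8, 0.5, 0.2, 0.1, 0.05]

rho = first_shell_below(fsc)
cutoff = trailing_cutoff_shell(fsc)
print(rho, shell_to_resolution(rho, 64, 1.5))        # 6 16.0
print(cutoff, shell_to_resolution(cutoff, 64, 1.5))  # 4 24.0
```

With 1.5 Å voxels the Nyquist limit is 2 × 1.5 Å = 3 Å, which matches the limit quoted in the text for the denoiser.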

Blush regularization has been implemented in the open-source software RELION-5, using a combination of C++ and PyTorch. It can be used for 3D classification, multi-body refinement and 3D autorefinement jobs, including those for particles with point-group or helical symmetry. For 3D classification for data that are separated into independent half-sets, the filtered map from the regularized likelihood approach is used as input for the denoiser. No additional low-pass filtering is applied. In this job type, the denoiser is also applied in the last iteration.

Data availability

Code availability

Blush regularization has been implemented in the open-source software RELION-5, which is distributed for free under the GPLv2 license and can be downloaded from https://github.com/3dem/relion.

Additionally, code used in the training procedure of the Blush denoiser model is available at https://github.com/dkimanius/blush-training.

References
