【论文阅读】一种高性能SurfaceNets离散等值线算法

本文链接：https://blog.csdn.net/m0_50910915/article/details/141357174

A High-Performance SurfaceNets Discrete Isocontouring Algorithm

在这里插入图片描述

ABSTRACT

Isocontouring is one of the most widely used visualization techniques. However, many popular contouring algorithms were created prior to the advent of ubiquitous parallel approaches, such as multicore, shared memory computing systems. With increasing data sizes and computational loads, it is essential to reimagine such algorithms to leverage the increased computing capabilities available today. To this end we have redesigned the SurfaceNets algorithm, a powerful technique which is often employed to isocontour non-continuous, discrete, volumetric scalar fields such as segmentation label maps. Label maps are ubiquitous to medical computing and biological analysis, used in applications ranging from anatomical atlas creation to brain connectomics. This novel Parallel SurfaceNets algorithm has been redesigned using concepts from the high-performance Flying Edges continuous isocontouring algorrithm. It consists of two basic steps, surface extraction followed by constrained smoothing, parallelized over volume edges and employing a double-buffering smoothing approach to guarantee determinism. The algorithm can extract and smooth multiple segmented objects in a single execution, producing a polygonal (triangular/quadrilateral) mesh with points and polygons fully shared between neighboring objects. Performance is typically one to two orders of magnitude faster than the current sequential algorithms for discrete isosurface extraction on small core-count commodity CPU hardware. We demonstrate the effectiveness of the algorithm on five different datasets including human torso and brain atlases, mouse brain segmentation, and electron microscopy connectomics. The software is currently available under a permissive, open source license in the VTK visualization system.

等值面是最广泛使用的可视化技术之一。然而，许多流行的等值面算法是在普遍存在的并行方法（例如多核、共享内存计算系统）出现之前创建的。随着数据规模和计算负载的增加，必须重新构想此类算法以利用当今的计算能力。为此，我们重新设计了 SurfaceNets 算法，这是一种强大的技术，通常用于非连续、离散、体积标量场的表面重建，例如分割标签图。标签图在医学计算和生物分析中无处不在，其应用范围从解剖图谱创建到大脑连接组学。这种新颖的 Parallel SurfaceNets 算法使用高性能 Flying Edges 连续等轮廓算法的概念进行了重新设计。它由两个基本步骤组成，表面提取，然后是约束平滑，在体积边上并行化，并采用双缓冲平滑方法来保证确定性。该算法可以在一次执行中提取和平滑多个分段对象，生成多边形（三角形/四边形）网格，其中相邻对象之间完全共享点和多边形。在小核心数商用 CPU 硬件上进行离散等值面提取时，性能通常比当前的顺序算法快一到两个数量级。我们在五个不同的数据集上展示了该算法的有效性。该软件目前可在 VTK 可视化系统中以宽松的开源许可证使用。

1 INTRODUCTION

…
Various approaches to improving performance have been proposed over the years, culminating with the recent introduction of the Flying Edges (FE) isocontouring algorithm [13], which addresses many of the computational deficiencies of Marching Cubes, providing high-speed, scalable performance.

多年来，人们提出了各种提高性能的方法，最近推出的 Flying Edges (FE) 等值面算法[13]解决了 Marching Cubes 算法在计算方面的许多不足，提供了高速、可扩展的性能。

Another important driver in the computational environment is the increasing specialization and sophistication of scientific and engineering workflows. Again returning to Marching Cubes, which was initially designed to isocontour continuous scalar fields (continuous in the sense that scalar values are assumed to vary linearly across voxel edges), it was repurposed to contour discrete scalar field such as segmentation label maps. While this adaptation was certainly useful, as the demands on discrete isocontouring have grown, the MC adaptation has become increasingly problematic; for example, it produces surface meshes with serious deficiencies such as duplicated output polygons between adjacent labeled regions, so that boundaries between extracted objects are not fully shared. Further, to extract multiple isocontours, multiple passes are required for each contour value (i.e., for each segmentation label). Discrete Marching Cubes also produces “voxelized” surfaces which clearly reflect the resolution of the volumetric grid. Typically an additional smoothing pass is necessary to produce surfaces that better represent the actual object geometry.

计算环境的另一个重要驱动因素是科学和工程工作流程日益专业化和复杂化。回到 Marching Cubes ，它最初被设计用于连续标量场（连续是指假设标量值在体素边上线性变化）的等值面提取，它被重新用于重建离散标量场，例如分割标签图。随着对离散等轮廓的需求的增长，MC 的问题越来越凸显；例如，它生成的表面网格存在严重缺陷：相邻标记区域之间存在重复的输出多边形，因此提取的对象之间的边界不能完全共享。此外，为了提取多个等值面，每个等值面（即每个分割标签）都需要重新运行。Discrete Marching Cubes 还产生“体素化”表面，可以清楚地反映体积网格的分辨率。通常需要额外的平滑步骤来生成更好地代表实际对象几何形状的表面。

SurfaceNets (SN) was developed in part to address the limitation of MC for discrete isocontouring, and to better support medical computing workflows such as anatomical atlas generation [5, 10]. As described by Frisken [3, 4], at a high level the algorithm consists of two steps: 1) extraction of a voxelized surface, followed by 2) a smoothing of the surface (Fig. 2). The surface extraction employs a dual contouring approach; that is, rather than generating points by intersecting voxel edges with an isocontour value, the extraction process generates at most one point at the center of a voxel. A point is generated when any one of its twelve edges intersects object boundaries, i.e., when the edge endpoints lie in separate labelled regions (including the background region). A quadrilateral polygon is also produced for each intersected voxel edge, connecting the voxel’s center point to three other voxels sharing the intersected edge. At the same time, a smoothing stencil is defined consisting of up to six edges connecting voxel face neighbors which also contain a generated point. Note that all the isosurfaces (i.e., labeled regions) are simultaneously extracted in one execution of the algorithm. Once the surface is extracted, an iterative, constrained Laplacian smoothing process is performed using the stencils, in a process that constrains points to lie within the voxel from which they were generated. Typically after smoothing, the generated quadrilateral polygons are triangulated as the resulting surface is not planar. Also, as [4] describes, the smoothing stencils are often modified to move points along boundary surfaces or edges, thereby improving the quality of the smoothing process.

开发SurfaceNets (SN) 的部分原因是为了解决 MC 对离散等值面的限制，并更好地支持医学计算工作流程，例如解剖图谱生成 [5, 10]。正如 Frisken [3, 4] 所描述的，该算法在较高层次上由两个步骤组成：1) 提取体素化表面， 2) 平滑表面（图 2）。表面提取采用双轮廓方法；也就是说，提取过程不是通过将体素边与等值面相交来生成点，而是在体素的中心最多生成一个点。当其十二条边中的任何一条与对象边界相交时，即当边的端点（体素顶点好理解点）位于单独的标记区域（包括背景区域）时，就会生成一个点。还为每个相交的体素边生成四边形多边形，将体素的中心点连接到共享相交的边的其他三个体素。同时，定义了一个平滑模板，该模板由最多六个连接体素面邻居的边组成，这些邻居也包含生成的点。请注意，在算法的一次执行中同时提取所有等值面（即标记区域）。提取表面后，将使用模板执行迭代的约束拉普拉斯平滑过程，该过程将点限制在生成点的体素内。通常在平滑之后，生成的四边形多边形会被三角化，因为生成的表面不是平面的。此外，正如文献[4]所述，平滑模版经常会被修改，以沿着边界表面或边移动点，从而提高平滑处理的质量。

在这里插入图片描述
Figure 2: Overview of the SurfaceNets algorithm. A surface mesh encapsulating separate objects (i.e., segmented regions) is extracted from a labeled volume. Adjacent objects share boundary points and cells. The surface is then smoothed using a constrained Laplacian approach.

图 2：SurfaceNets 算法概述。从标记体积中提取封装单独对象（即分段区域）的表面网格。相邻对象共享边界点和单元。然后使用约束拉普拉斯方法对表面进行平滑处理。

In this paper, we present a novel, parallel, high-performance SurfaceNets algorithm. The algorithm borrows concepts from Flying Edges, including a multi-pass approach that processes data across volume edges, edge trimming to reduce computational load, and point and polygon generation without bottleneck functions such as incremental memory allocation and coincident point merging. In the following we describe the algorithm, offer important implementation details, and characterize performance on five representative datasets.

在本文中，我们提出了一种新颖的、并行的、高性能的 SurfaceNets 算法。该算法借鉴了 Flying Edges 的概念，包括跨体积边处理数据的多步方法、边修剪以减少计算负载，以及无瓶颈（例如增量内存分配和重合点合并）的点和多边形生成。下面我们描述该算法，提供重要的实现细节，并描述五个代表性数据集的性能特征。

2 ALGORITHM

In this section we begin by providing a high-level overview of the SurfaceNets algorithm, including introducing descriptive notation. Key algorithmic features of the Parallel SurfaceNets (PSN) algorithm are then described.

在本节中，我们首先提供 SurfaceNets 算法的高级概述，包括介绍描述性符号。然后描述并行 SurfaceNets (PSN) 算法的关键算法特征。

2.1 Notation

The SurfaceNets algorithm operates on structured grids with voxel scalar values arranged on a topologically regular lattice of data points (i.e., a data volume) with scalar values $s_{i, j,k}$ . Voxel cells $v_{i, j,k}$ are defined from the eight adjacent points associated with scalar label values $s_{i±(0,1), j±(0,1),k±(0,1)}$ . Volume edges $E_{i, j,k}$ run in the row $E_{ j,k}$ , column $E_{i, k}$ , and stack $E_{i, j}$ directions; and are composed of the union of voxel cell edges $e_{i, j,k}$ . For example the volume $x$ -edges are defined by $E_{j,k} = \cup_{i}e_{i, j,k}$ .

SurfaceNets 算法在结构化网格上运行，体素标量值排列在数据点（即数据体）的拓扑规则网格上，标量值为 $s_{i, j,k}$ 。体素单元 $v_{i, j,k}$ 由与标量标签值 $s_{i±(0,1), j±(0,1),k±(0,1)}$ 相关的八个相邻点定义。体积边 $E_{i, j,k}$ 沿行 $E_{ j,k}$ 、列 $E_{i, k}$ 和堆 $E_{i, j}$ 方向运行；由体素单元边 $e_{i, j,k}$ 的并集组成。例如， $x$ 边由 $E_{j,k} = \cup_{i}e_{i, j,k}$ 定义。

The purpose of the algorithm is to generate an approximation to the set of isocontours $Q_i$ with isovalues $q_i$ such that $q_i: Q_i(q_i) = {p | F(p) = q_i}$ where $F (p)$ maps the point $p ∈ R^N$ to a real-valued number. The resulting approximation surfaces $S_i$ are continuous, piecewise linear surfaces such as a triangle mesh (in 3D). Note also that we say that $Q_i$ intersects $e_{i, j,k}$ when some $p$ lies along the voxel cell edge $e_{i, j,k}$ .

该算法的目的是生成等值线 $Q_i$ 的近似值，使得 $q_i : Q_i(q_i) = {p | F(p) = q_i}$ 其中 $F (p)$ 将点 $p ∈ R^N$ 映射到实数值。生成的近似曲面 $S_i$ 是连续的分段线性曲面，例如三角形网格（3D 中）。另请注意，当某些 $p$ 位于体素单元边 $e_{i, j,k}$ 时，我们说 $Q_i$ 与 $e_{i, j,k}$ 相交。

This formulation presumes that multiple isosurfaces $Q_i$ are simultaneously extracted, consistent with a volume defined by multiple discrete label values (i.e., a label map $L$ , or segmentation mask). No presumption is made regarding the continuity of the scalar field; in fact in general the field is not continuous since the labels are typically disjoint. Consequently many isocontouring algorithms such as Marching Cubes which assume continuous (e.g., linear) variation across voxel edges, are not applicable to this problem, at least not without some assumption on continuity (e.g., voxel edges are intersected at their centerpoint). We presume that $L$ is a set consisting of multiple labeled regions $l_i$ , and $s_{i, j,k} ∈ L$ , with the special definition of a background label $B$ consisting of any $s_{i, j,k} \notin L$ . Note to produce isocontours we choose some non-empty set $q_i ∈ L$ . This produces labeled objects which approximate the isosurfaces $Q_i$ .

该公式假设同时提取多个等值面 $Q_i$ ，与由多个离散标签值（即标签图 $L$ 或分割掩模）定义的体积一致。没有对标量场的连续性做出任何假设；事实上，一般来说，场不是连续的，因为标签通常是不相交的。因此，许多等轮廓算法（例如假设体素边连续（例如线性）变化的 Marching Cubes）不适用于此问题，至少在没有连续性假设的情况下（例如，体素边在其中心点相交）。我们假设 $L$ 是由多个标记区域 $l_i$ 组成的集合，并且 $s_{i, j,k} ∈ L$ ，并且特殊定义了由 $s_{i , j,k} \notin L$ 组成的背景标签 $B$ 。请注意，为了生成等值线，我们选择一些非空集合 $q_i ∈ L$ 。这会生成近似等值面 $Q_i$ 的标记对象。

Voxel cell rows $V_{i, j,k}$ consist of all voxel cells $V_{i, j,k}$ touching both $E_{i, j,k}$ and $E_{i+1, j+1,k+1}$ . So for example, an $x$ row edge $E_{j,k} = \cup_{i}e_{ j,k}$ and $V_{ij,k} = \cup_{i}v_{i, j,k}$ . Each $v_{i, j,k}$ has an associated cell axes $t_{i, j,k}$ (i.e., the voxel triad) which is composed with the three cell edges emanating from the voxel origin located at $s_{i, j,k}$ in the positive row $x$ , column $y$ , and stack $z$ directions. Refer to Fig. 3.

体素单元行 $V_{i, j,k}$ 由同时接触 $E_{i, j,k}$ 和 $E_{i+1, j+1,k+1}$ 的所有体素单元 $V_{i, j,k}$ 组成。因此，举例来说，一个 $x$ 行边 $E_{j,k} = \cup_{i}e_{ j,k}$ 和 $V_{ij,k} = \cup_{i}v_{i,j,k}$ 。每个 $v_{i, j,k}$ 都有一个相关的单元轴 $t_{i,j,k}$ （即体素三元组），它是由从位于 $s_{i, j,k}$ 的体素原点向正行 $x$ 、列 $y$ 和堆 $z$ 方向发出的三条单元边组成的。参见图 3。
在这里插入图片描述

Figure 3: Depiction of parallel SurfaceNets concepts. The voxel cell $v_{i, j,k}$ with edge $e_{i, j,k}$ form volume rows $V_{ j,k}$ and volume edges $E_{ j,k}$ . A voxel cell triad $t_{i, j,k}$ is associated with each $v_{i, j,k}$ . Scalar label values $s_{i, j,k}$ are associated with each point in the volumetric lattice.

图 3：并行 SurfaceNets 符号描述。体素单元 $v_{i, j,k}$ 与边 $e_{i, j,k}$ 形成体积行 $V_{ j,k}$ 和体积边 $E_{ j,k}$ 。体素单元三元组 $t_{i, j,k}$ 与每个 $v_{i, j,k}$ 相关联。标量标签值 $s_{i, j,k}$ 与体素中的每个点相关联。

2.2 SurfaceNets

SurfaceNets is an elegantly simple algorithm well suited to the extraction of labeled objects from volumetric objects. Conceptually, the twelve edges defining a voxel cell are intersected against the set $L$ . An edge intersection is detected if for the two edge endpoints with scalar values $s_0 and s_1, {s_0}\neq{s_1} and (s_0\neq B or s_1\neq{B})$ (i.e., if $s_0 = s_1 = B, or s_0 = s_1$ then no edge intersection occurs).

SurfaceNets 是一种优雅而简单的算法，非常适合从体素化对象中提取标注物体。从概念上讲，定义体素单元的十二条边与 $L$ 集相交。如果两个边端点的标量值 $s_0 and s_1, {s_0}\neq{s_1} and (s_0\neq B or s_1\neq{B})$ 则检测到相交。（若 $s_0 = s_1 = B, or s_0 = s_1$ 那么就没有相交）。

Once the edges of a voxel cell are intersected, a 12-bit edge case $c^e_{i, j,k}$ can be generated to represent the intersection state (see Section 3.3). If any of the cell’s twelve edges is intersected, then a single point is generated in the center of the voxel cell. Further, for each intersected edge of $c^e_{i, j,k}$ a quadrilateral polygon is generated orthogonal to and which bisects the edge. The quadrilateral connects the generated voxel cell center point with points located in the neighboring voxel cells sharing the intersected edge. Thus, depending on which edges are intersected, the generated point may be directly connected to up to six face neighboring cells, and up to 12 quadrilaterals may be generated (Fig. 10). This connectivity defines the smoothing stencil which is the local network of edges connecting each generated point to its voxel face neighbors which must necessarily also contain a generated point. Note that the smoothing stencil is implicitly defined by the edge case: for every intersected edge, two voxel cell face neighbors are necessarily connected. Thus a face case $c^f_{i, j,k}$ can be generated from the edge case and represented by a 6-bit value.

一旦体素单元的边相交，就可以生成 12 位边案例 $c^e_{i, j,k}$ 来表示相交状态（参见第 3.3 节）。如果体素单元的十二条边中的任何一个边发生相交，则在体素单元的中心生成一个点。此外，对于 $c^e_{i, j,k}$ 的每个相交边，生成一个正交于并平分该边的四边形多边形。四边形将生成的体素单元中心点与位于共享相交边的相邻体素单元中的点连接起来。因此，根据相交的边，生成的点可以直接连接到最多六个面的相邻单元，并且可以生成最多12个四边形（图10）。这种连接性定义了平滑模板，它是将每个生成点连接到其体素面邻居的局部边网络，这些邻居也必须包含生成点。请注意，平滑模板是由边案例隐式定义的：对于每个相交的边，两个体素单元面邻居必须连接。因此，面案例 $c^f_{i, j,k}$ 可以从边案例生成并由 6 位值表示。

在这里插入图片描述
Figure 10: A comparison of voxel cell subdivision when all eight cell vertices lie in different regions. Discrete MC (left) generates 12 shared points (in red), and eight duplicated triangles. SurfaceNets generates a single interior point (in red), and 12 shared quads connecting neighboring voxels.

图 10：当所有八个单元顶点位于不同区域时体素单元细分的比较。Discrete MC（左）生成 12 个共享点（红色）和 8 个重复三角形。 SurfaceNets 生成单个内部点（红色）和连接相邻体素的 12 个共享四边形。

After processing all voxel cells, a quadrilateral surface mesh is generated. The mesh partitions the volume into regions (or objects) containing voxels with identical labels $l_i$ . Typically this mesh is nonmanifold since a voxel cell may be partitioned into eight separate regions with differing label values at each of its eight vertices. Note however that adjacent regions share boundary quadrilaterals and points, and each quadrilateral can be encoded with the label of the two regions on either side of it (a region adjacency two-tuple indicating the labeled regions on either side of it). At this point in the algorithm, the mesh appears “voxelized” consistent with its origins as a dual surface generation process. In the second major step of the SurfaceNets algorithm, the mesh is smoothed to remove sharp features and provide a pleasing visual result.

处理完所有体素单元后，生成四边形表面网格。网格将体积划分为包含具有相同标签 $l_i$ 的体素的区域（或对象）。通常，该网格是非流形的，因为体素单元可以划分为八个单独的区域，其八个顶点中的每个顶点具有不同的标签值。但请注意，相邻区域共享边界四边形和点，并且每个四边形都可以使用其两侧的两个区域的标签进行编码（区域邻接二元组指示其两侧的标记区域）。此时，在算法中，网格显示为“体素化”，与其作为双表面生成过程的跟源一致。在 SurfaceNets 算法的第二个主要步骤中，网格被平滑以去除尖锐特征并提供令人愉悦的视觉结果。

The smoothing process employs a modified Laplacian approach. For each point $p_i$ its associated smoothing stencil is used to iteratively move $p_i$ towards the averaged center point of its $N$ connected points $p_j$ . A convergence factor $λ$ controls the amount of motion in a given iteration:

平滑过程采用改进的拉普拉斯方法。对于每个点 $p_i$ ，其关联的平滑模板用于迭代地将 $p_i$ 移向其 $N$ 连接点 $Δp_j$ 的平均中心点。收敛因子 $λ$ 控制给定迭代中的运动量：
在这里插入图片描述
The total motion of $p_i$ is constrained relative to its generating voxel cell. The constraints can be defined as a fraction of the hexahedral voxel cell, or more simply as a constraint sphere centered at the voxel center point. Finally, the smoothing stencils can be modified in such a way as to improve the smoothed mesh. For example, the motion of points along sharp edges can be constrained to the edge by judicious modification of the smoothing stencils (see [4] for further details).

$p_i$ 的总运动相对于其生成的体素单元受到约束。约束可以定义为六面体体素单元的一部分，或更简单地定义为以体素中心点为中心的约束球体。最后，可以修改平滑模板以改进平滑网格。例如，通过明智地修改平滑模板，可将尖锐边缘上的点的运动限制在边上（有关更多详细信息，请参阅[4]）。

2.3 Surface Extraction

Highly-organized structured data such as a volume lends itself to a variety of parallel approaches. Our method takes advantage of cache locality by processing data in the fastest varying data direction (i.e., along the volume $x$ edges); separating the processing into simple, independent computational passes along volume edges; reducing overall computation by performing a small amount of initial extra work (i.e., determine computational edge trimming); eliminating incremental memory allocation; and avoiding duplicate processing (such as determining voxel edge intersections multiple times on shared edges).

高度组织的结构化数据（例如体素）适合多种并行方法。我们的方法通过在最快变化的数据方向（即沿着体素 $x$ 边）处理数据来利用缓存局部性；将处理分成沿体积边的简单、独立的计算步骤；通过执行少量初始额外工作（即确定计算边修剪起始位置）来减少总体计算；消除增量内存分配；并避免重复处理（例如在共享边上多次确定体素边交点）。

The surface extraction portion of the algorithm is implemented as four passes across the data. The first three passes count the number of output points and triangles, bound the contour extent with computational edge trim values, and produce the metadata necessary to generate output in the fourth pass. In this last and final pass, output points, quadrilateral surface primitives, smoothing stencils, and the region adjacency two-tuple are generated and directly written into pre-allocated output arrays without the need for mutexes or locks.

该算法的表面提取部分通过四次数据传递来实现。前三步计算输出点和三角形的数量，用计算的边修剪值限制轮廓范围，并生成在第四遍中生成输出所需的元数据。在最后一步中，生成输出点、四边形表面基元、平滑模板和区域邻接二元组，并将其直接写入预先分配的输出数组中，而不需要互斥或锁。

Note that the voxel triad $t_{i, j,k}$ is a core concept to the algorithm described below. It carries five important bits of information about its associated voxel cell $v_{i, j,k}$ . These are: the classification of the scalar $s_{i, j,k}$ (inside a labeled region, or background); whether the three $x, y, z$ voxel edges emanating from the cell origin intersect some $Q_i$ ; and whether the associated voxel cell produces an output point. Triads can be computed independently in parallel and then combined to produce an edge case number as described previously and shown in Fig. 4. Moreover, the triad encodes whether its $x - y - z$ axes should produce output quadrilaterals, and generate an output point. When combined, the generation of quadrilaterals from the triads eliminates multiple edge intersection operations, and produces an output mesh without duplicate faces.

请注意，体素三元组 $t_{i, j,k}$ 是下面描述的算法的核心概念。它携带有关其关联体素单元 $v_{i, j,k}$ 的五个重要信息。这些是：标量 $s_{i, j,k}$ 的分类（在标记区域或背景内）；从单元原点发出的三个 $x, y, z$ 体素边是否与某些 $Q_i$ 相交；以及关联的体素单元是否产生输出点。三元组可以并行独立计算，然后组合起来产生一个边案例数，如图 4 所示。此外，三元组编码其 $x - y - z$ 轴是否应产生输出四边形，并生成输出观点。组合时，从三元组生成四边形消除了多个边相交操作，并生成没有重复面的输出网格。

在这里插入图片描述
Figure 4: The voxel cell edge case is determined by combining the eight voxel triads located at the eight vertices of the voxel cell. Only portions of the triads (bolded arrows) are used to determine the edges intersected by the labeled regions.

图 4：体素单元边案例是通过组合位于体素单元八个顶点的八个体素三元组来确定的。仅使用三元组的部分（粗体箭头）来确定与标记区域相交的边。

In the following, we describe each of the four passes of surface extraction. While these passes are described in terms of processing $x$ edges, it is possible to process $y$ or $z$ edges instead. However, in typical data layouts, increasing $x$ corresponds to contiguous memory locations, thereby reducing cache misses.

下面，我们将逐一介绍表面提取的四个步骤。虽然这些处理过程是以处理 $x$ 边来描述的，但也可以用处理 $y$ 或 $z$ 边来代替。不过，在典型的数据布局中，增加 $x$ 可对应连续的内存位置，从而减少缓存丢失。

Pass 1: Process volume $x$ edges $E_{j,k}.$ For each volume $x$ edge $E_{j,k}$ , all voxel $x$ edges (i.e., e j,k) composing $E_{j,k}$ are visited and marked when their interval intersects Qi. The left and right trim positions $x^L_{j,k}$ and $x^R_{j,k}$ are noted, as well as the number of $x$ intersections along $E_{j,k}$ . Each $E_{j,k}$ is processed independently and in parallel. (Note that the trim position $x^L_{j,k}$ indicates where Qi first intersects $E_{j,k}$ on its left side; and $x^R_{j,k}$ bounds where Q last intersects $E_{j,k}$ on its right side. Additional details describing computational edge trimming will be provided shortly.) Classification information about each triad is gathered (whether the voxel value $s_{i, j,k} \in L$ or a background voxel $s_{i, j,k} \in B$ ).

第 1 步：处理体素 $x$ 边 $E_{j,k}$ 。 对于每个体积 $x$ 边 $E_{j,k}$ ，组成 $E_{j,k}$ 的所有体素 $x$ 边（即 e j,k）在其间隔与 Qi 相交时被访问并标记。记录了左右修剪位置 $x^L_{j,k}$ 和 $x^R_{j,k}$ ，以及沿 $E_{j,k}$ 的 $x$ 交点数量。每个 $E_{j,k}$ 都是独立并行处理的。（请注意，修剪位置 $x^L_{j,k}$ 表示 Qi 在其左侧首先与 $E_{j,k}$ 相交的位置；而 $x^R_{j,k}$ 则在 Q 最后相交的位置处边界 $E_{j,k}$ 在其右侧。稍后将提供描述计算边修剪的其他详细信息。）收集有关每个三元组的分类信息（无论体素值 $s_{i, j,k} \in L$ ）或背景体素 $s_{i, j,k} \in B$ )。

Pass 2: Process $y$ and $z$ voxel-cell edges. For each volume cell row $V_{j,k}$ , the triads between the trim interval $[x^L_{j,k}$ , $x^R_{j,k})$ are visited. The $y$ and $z$ triad edges are intersected against $Q_i$ . Depending on the outcome of this classification process, the trim interval may be adjusted to produce a final trim interval $\overline{x}^L _{jk}, \overline{x}^R_{jk})$ . Classification information gathered in Pass 1 is used to accelerate these intersection checks. For example, for a voxel edge with endpoints $s_0, s_1)$ , if $s_0, s_1∈ B$ then no edge intersection exists.

第 2 步：处理 $y$ 和 $z$ 体素单元边。 对于每个体积单元行 $V_{j,k}$ ，访问修剪间隔 $[x^L_{j,k}$ , $x^R_{j,k})$ 之间的三元组。 $y$ 和 $z$ 三重轴边与 $Q_i$ 相交。根据此分类过程的结果，可以调整修剪间隔以产生最终修剪间隔 $\overline{x}^L _{jk}, \overline{x}^R_{jk})$ 。第 1 步中收集的分类信息用于加速这些交点检查。例如，对于具有端点 $s_0, s_1)$ 的体素边，如果 $s_0, s_1∈ B$ 则不存在边交集。

Pass 3: Configure output. Each voxel row $V_{j,k}$ is processed independently and in parallel. For each $v_{i, j,k}$ in a row, the voxel edge case number $c_{i, j,k}$ is computed by combining eight voxel triads forming the voxel cell (Section 3.3). If $c_{i, j,k} = 0$ , then no point is generated interior to the associated cell $v_{i, j,k}$ . Otherwise, a point and smoothing stencil are produced. If any of $t_{i, j,k}$ edges are intersected, a quadrilateral polygon is produced. The number of output points, quadrilaterals, and smoothing stencil edges is recorded as per edge metadata. Once all $V_{j,k}$ are processed, the metadata is updated via a prefix sum operation. This produces an indexing which maps the input data into output geometric primitives and stencils. To complete this pass, memory is precisely allocated for the output—no additional allocation is needed during the output generation process of Pass 4.

第 3 步：配置输出。 每个体素行 $V_{j,k}$ 均独立且并行处理。对于一行中的每个 $v_{i, j,k}$ ，体素边编号 $c_{i, j,k}$ 是通过组合形成体素单元的八个体素三元组来计算的（第 3.3 节）。如果 $c_{i, j,k} = 0$ ，则在关联单元格 $v_{i, j,k}$ 内部不会生成任何点。否则，将生成点和平滑模板。如果任何 $t_{i, j,k}$ 与边相交，则会生成四边形。输出点、四边形和平滑模板边的数量根据边元数据进行记录。一旦处理完所有 $V_{j,k}$ ，元数据就会通过前缀和运算进行更新。这会产生一个索引，将输入数据映射到输出几何图元和模板。为了完成此遍，为输出精确分配了内存——在第 4 步的输出生成过程中不需要额外的分配。

Pass 4: Generate output. In the final pass, each $V_{j,k}$ is processed in parallel by forward iteration (Section 3.6) over the $t_{i, j,k}$ within the trim interval, producing center points and stencils in voxel cells as appropriate. Quadrilaterals are also output (if necessary) as each voxel triad indicates.

第 4 步：生成输出。 在最后一步中，每个 $V_{j,k}$ 通过修剪内的 $t_{i, j,k}$ 上的前向迭代（第 3.6 节）并行处理间隔，根据需要在体素单元中生成中心点和模板。四边形也会按照每个体素三元组的指示输出（如有必要）。

The trim interval plays an important role in the algorithm as it is used to rapidly skip data. Not only can the beginning and ending portions of $V_{j,k}$ be skipped, but entire rows and slices of data as well. The triads $t_{i, j,k}$ serve two important functions. First, they decouple computations into independent passes along volume edges. It is only when voxel cell information such as edge cases $c^e_{i, j,k}$ are needed that are they combined using fast bit-wise operations. Secondly, the $t_{i, j,k}$ control the generation of quadrilateral polygons. By combining the quadrilaterals generated from the eight $t_{i, j,k}$ associated with the voxel cell vertices, the complete surface net is generated for each $v_{i, j,k}$ . Using the $c_{i, j,k}$ and $t_{i, j,k}$ to control primitive generation eliminates the need to merge coincident points and/or polygons, with points, quadrilaterals, and stencil edges generated only once and written directly into the previously allocated output arrays (Pass 3) without memory write contention.

修剪间隔在算法中起着重要作用，因为它用于快速跳过数据。不仅可以跳过 $V_{j,k}$ 的开始和结束部分，还可以跳过整个数据行和切片。三元组 $t_{i, j,k}$ 有两个重要功能。首先，它们将计算解耦为沿着体积边的独立通道。仅当需要诸如边案例 $c^e_{i, j,k}$ 之类的体素单元信息时，才使用快速按位运算将它们组合起来。其次， $t_{i,j,k}$ 控制四边形的生成。通过组合由与体素单元顶点相关的八个 $t_{i, j,k}$ 生成的四边形，为每个 $v_{i, j,k}$ 生成完整的表面网络。使用 $c_{i, j,k}$ 和 $t_{i, j,k}$ 来控制图元生成，无需合并重合点和/或多边形，点、四边形和模板边仅生成一次并直接写入先前分配的输出数组（第 3 步），无需内存写入争用。

2.4 Smoothing

Once the surface has been extracted, and smoothing stencils defined, the second smoothing step of the Parallel SurfaceNets algorithm proceeds. To implement Equation 1 in parallel requires approaches that avoid data races. While mutex or spin/speculative locks could be used, double-buffering avoids these slower mechanisms and ensures determinism. Basically, two arrays of points $P_0$ and $P_1$ are created. Initializing $P_0$ with the output points of the surface extraction process, we used these points to compute $P_1$ . Next, we swap $P_0$ and $P_1$ and repeat (i.e., double buffer). Alternating swap / compute is used until the desired result is achieved (typically a modest number of iterations 25 is used, although the algorithm is fast enough that the number of iterations and convergence factors can be adjusted interactively).

提取表面并定义平滑模板后，将继续进行 Parallel SurfaceNets 算法的第二个平滑步骤。要并行实现公式 1，需要采用避免数据争用的方法。虽然可以使用互斥锁或自旋/推测锁，但双缓冲可以避免这些较慢的机制并确保确定性。基本上，创建了两个点数组 $P_0$ 和 $P_1$ 。使用表面提取过程的输出点初始化 $P_0$ ，我们使用这些点来计算 $P_1$ 。接下来，我们交换 $P_0$ 和 $P_1$ 并重复（即双缓冲区）。使用交替的交换/计算，直到达到期望的结果（通常使用适度的迭代次数 25，尽管该算法足够快以至于可以交互地调整迭代次数和收敛因子）。

2.5 Surface Triangulation

Smoothing inevitably causes the mesh quadrilaterals to become non-planar. A final step in the algorithm is to triangulate the mesh. This can be trivially performed in parallel by generating two triangles from each quadrilateral. A variety of methods can be used to select the choice of tessellating diagonal: greedy, shortest diagonal, minimum area, and most co-planar are typical. The choice of triangulation method makes a modest impact on performance and appearance.

平滑处理不可避免地会导致网格四边形变得不平整。算法的最后一步是对网格进行三角化处理。这一步可以通过从每个四边形生成两个三角形来并行执行。选择网格对角线的方法有多种：典型的有贪婪对角线、最短对角线、最小面积和最共平面等。三角剖分方法的选择对性能和外观的影响不大。

3 IMPLEMENTATION DETAILS

In the following section we highlight and discuss some of the implementation details of the Parallel SurfaceNets algorithm.

3.1 Scalar Classification

As described previously, the classification of the scalar $s_{i, j,k}$ (inside a labeled region, or background) is retained by the voxel triad $t_{i, j,k}$ . In situations where the number of segmentation labels is large, scalar classification may significantly impact overall algorithm performance. We use specialized processes for classification as a function of the number of labels $N$ contained in $L$ . A simple cache of last used label in $L$ , and the most recent background label in $B$ greatly speeds classification. For a single label $N = 1$ , a simple comparison with the cache returns set membership (O(1) query time). For intermediate numbers of $L$ not in cache, label classification proceeds by comparison against a vector of $N$ values (search time O(N) where $N$ is small. For large $N$ , comparison is made against a set of values providing O(log(N)) query complexity.

如前所述，标量 $s_{i, j,k}$ （标记区域或背景内）的分类由体素三元组 $t_{i, j,k}$ 保留。在分割标签数量很大的情况下，标量分类可能会显着影响整体算法性能。我们使用专门的过程进行分类，作为 $L$ 中包含的标签 $N$ 数量的函数。 $L$ 中最后使用的标签和 $B$ 中最近的背景标签的简单缓存极大地加快了分类速度。对于单个标签 $N = 1$ ，与缓存的简单比较会返回集合的成员（O(1) 查询时间）。对于不在缓存中的中间数 $L$ ，标签分类通过与 $N$ 值的向量进行比较来进行（搜索时间 O(N)，其中 $N$ 较小。对于较大的 $N$ ，则与集合进行比较提供 O(log(N)) 查询复杂度的值。

3.2 Volume Metadata

Volume metadata are used to cache intermediate results, limit the total amount of computation, and support the bookkeeping necessary to manage the generation of output. Our algorithm implementation requires two auxiliary metadata: the voxel triads, and the volume $x$ edge metadata. One voxel triad per voxel value is created (using an 8-bit unsigned char) which as described previously carry 5-bits of classification information about each voxel. To simplify implementation at the boundary, the volume of triads is padded out by one triad in each of the $+ / - x, y, an d z$ directions. Thus a volume of dimensions $M \times N \times O$ , allocates a block of triad data of dimensions $(M + 2) \times (N + 2) \times (O + 2)$ . The volume $x$ edge metadata carries information about the primitives produced by each row of voxel cells $V_{j,k}$ : initially the number of points, the number of quadrilaterals, and the number of smoothing stencil edges is represented, but after the prefix sum in Pass 3 the volume $x$ edge metadata contains ids that control where the output is written in Pass 4. In addition, the metadata encodes the edge trim: the starting and ending voxels along $V_{j,k}$ that contribute to the output. These five values are allocated for each padded volume $x$ edge using a 2D array of size $(N + 2) \times (O + 2)$ .

体元数据用于缓存中间结果，限制计算总量，并支持管理输出生成所需的记录。我们的算法实现需要两个辅助元数据：体素三元组和体积 $x$ 边元数据。每个体素值创建一个体素三元组（使用 8 位无符号字符），如前所述，每个体素三元组携带 5 位分类信息。为了简化边界的执行，三元组的体积在 $+ / - x 、 y 和 z$ 方向各填充了一个三元组。因此，一个尺寸为 $M \times N \times O$ 的体积分配了一个尺寸为 $(M + 2) \times (N + 2) \times (O + 2)$ 的三元组数据块。 $x$ 边元数据包含由每行体素单元 $V_{j,k}$ 生成的基元信息：最初表示点数、四边形数和平滑模版边数，但在第 3 步前缀求和后， $x$ 边元数据包含控制第 4 步输出写入位置的 id。此外，元数据还对边修剪进行了编码：有助于输出的沿 $V_{j,k}$ 的起点和终点体素。使用大小为 $(N + 2) \times (O + 2)$ 的二维数组为每个填充体素 $x$ 边分配这五个值。

3.3 Generating the Edge and Stencil Face Cases

The voxel cell edge case $c^e_{i, j,k}$ can be easily computed from the voxel triads associated with eight voxel cell points (actually only seven triads are needed, and of those seven only portions of the triads). Recall that each triad carries information indicating whether the $x, y, z$ triad edges intersect some isosurface $Q_i$ . These data can be combined on the fly with simple bit-operations to generate a 12-bit edge case. Fig. 4 shows how portions of the triads are extracted and combined to produce the edge case. Similarly, a 6-bit stencil face case $e^f_{i, j,k}$ can be computed from the edge case. For each intersected edge, the two voxel cell faces using that edge are selected. This selection process is repeated for each edge to set bits corresponding to the six faces of the voxel cell.

体素单元边案例 $c^e_{i, j,k}$ 可以很容易地从与八个体素单元点相关的体素三元组计算出来（实际上只需要七个三元组，并且这七个三元组仅是三元组的一部分）。回想一下，每个三重轴都带有指示 $x, y, z$ 三轴边是否与某个等值面 $Q_i$ 相交的信息。这些数据可以通过简单的位操作即时组合以生成 12 位边案例。图 4 显示了如何提取和组合三元组的部分以产生边案例。类似地，可以根据边案例计算 6 位模板面案例 $e^f_{i, j,k}$ 。对于每个相交的边，选择使用该边的两个体素单元面。对每条边重复此选择过程，以设置与体素单元的六个面相对应的位。

3.4 Edge Trim

Computational edge trimming is the process of pre-computing information or metadata in such a way as to reduce the need for subsequent computation, thereby reducing the total computational load. In the parallel SurfaceNets algorithm, computational trimming is used to significantly reduce the total effort necessary to extract the surface mesh. In typical application, large portions of the volume consist of background voxel values, especially towards the boundaries of the volume. Thus it is possible to rapidly skip portions of rows, entire rows, and even entire data slices while computing, significantly improving algorithm performance. As described previously, computational trimming is represented by the two tuple edge trim $x^L, x^R)$ which are simply the left and right indices along each $E_{j,k}$ which may produce output primitives (points, quadrilaterals, and stencils).

计算边修剪是预先计算信息或元数据的过程，以减少后续计算的需要，从而减少总计算负载。在 PSN 算法中，使用计算修剪来显着减少提取表面网格所需的总工作量。在典型应用中，体积的大部分由背景体素值（尤其是在体积的边界）组成。这样就可以在计算时快速跳过部分行、整行甚至整个数据切片，显着提高算法性能。如前所述，计算修剪由两个元组边修剪 $x^L, x^R)$ 表示，它们只是沿着每个 $E_{j,k}$ 的左右索引，可以产生输出基元（点、四边形和模板）。

Computational trimming is determined over the course of the first two passes. In the first pass, evaluation of cell $x$ edges indicate the first and last voxels $v_{i, j,k}$ on $V_{j,k}$ which intersect the set $Q_i$ . These determine the initial edge trim $x^L, x^R)$ . In the second pass, the cell y- and z-edges are evaluated which may adjust the initial edge trim to produce the adjusted trim $\overline{x}L, \overline{x}R$ . Note that there are certain pathological cases where the initial $x$ edge trim is empty, and contours intersect only the $y$ and/or$ z$cell edges. While the concept of edge trim is somewhat similar to run-length encoding, the approach described here only eliminates computation at the left and right side of the voxel edges $E_{j,k}$ – interior runs of empty computation are not tracked. This approach is used to minimize memory overhead and performance impacts as compared to full run-length encoding, while still providing significant savings in many applications.

计算修剪是在前两步的过程中确定的。在第一步中，在单元格 $x$ 边估计 $V_{j,k}$ 与集合 $Q_i$ 相交的第一个和最后一个体素 $v_{i, j,k}$ 。这些决定了初始边修剪 $x^L, x^R)$ 。在第二遍中，计算单元 $y$ 和 $z$ 边，这可以调整初始边修剪以产生调整后的修剪 $\overline{x}L, \overline{x}R$ 。请注意，在某些病态情况下，初始 $x$ 边修剪为空，并且轮廓仅与 $y$ 和/或 $z$ cell 边相交。虽然边修剪的概念有点类似于行程编码，但此处描述的方法仅消除了体素边 $E_{j,k}$ 左侧和右侧的计算 - 不跟踪内部的空计算。与完整行程编码相比，此方法用于最大限度地减少内存开销和性能影响，同时仍然在许多应用程序中提供显著的节省。

3.5 Triad Output Point Generation

In Pass 3 of the algorithm, the eight $t_{i, j,k}$ of a voxel cell $v_{i, j,k}$ are combined to produce an edge $c^e_{i, j,k}$ and face $c^f_{i, j,k}$ case. When $c^e_{i, j,k} \neq 0$ then a point is to be generated in the center of $v_{i, j,k}$ . Recall that the $t_{i, j,k}$ maintains a bit of information indicating whether a point is generated, so a non-zero $c^f_{i, j,k}$ can be used to set this bit (the so-called PRODUCE POINT bit). However, during threaded processing, the $t_{i, j,k}$ are being combined in neighboring voxel rows $V_{i, j,k}$ —thus setting this bit introduces a potential race condition. There are several ways to address this situation including: 1) do not set the bit (i.e., recompute $c^f_{i, j,k}$ whenever needed; 2) allocate a separate array to represent PRODUCE POINT bits; and 3) process voxel rows $V_{j,k}$ in a coordinated fashion in such a way as to avoid data races. Through numerical experimentation, we determined that a 2 × 2 checker boarding pattern of $E_{j,k}$ avoids data races with minimal impact on performance, and without the need to allocate additional memory. To perform the checker boarding, we group $E_{j,k}$ into bundles of four neighbors (Fig. 5), numbering them 0 − 3 in each bundle. Then by processing all edges numbered n in order (with four total threaded processing loops), race conditions are avoided since there is a guarantee that no $t_{i, j,k}$ will be modified (i.e., its PRODUCE POINT bit set) during processing.

在算法的第 3 步，将一个体素单元 $v_{i, j,k}$ 的八个 $t_{i, j,k}$ 结合起来，生成边 $c^e_{i, j,k}$ 和面 $c^f_{i, j,k}$ 的案例。当 $c^e_{i, j,k} \neq 0$ 时，将在 $v_{i, j,k}$ 的中心生成一个点。回想一下， $t_{i, j,k}$ 保留了一个bit信息，表明是否生成了一个点，因此非零的 $c^f_{i, j,k}$ 可以用来设置这个bit（即所谓的 PRODUCE POINT bit）。然而，在线程处理过程中， $t_{i, j,k}$ 会在相邻的体素行 $V_{i, j,k}$ 中进行组合，因此设置该bit可能会带来竞争条件。有几种方法可以解决这种情况，包括 1）不设置该bit（即在需要时重新计算 $c^f_{i,j,k}$ ）；2）分配一个单独的数组来表示 PRODUCE POINT bit；3）以协调的方式处理体素行 $V_{j,k}$ ，以避免数据竞争。通过数值实验，我们确定 $E_{j,k}$ 的 2 × 2 校验模式模式可避免数据竞争，对性能的影响最小，且无需分配额外内存。为了执行棋盘格模式，我们将 $E_{j,k}$ 分成四条相邻的边束（图 5），在每个边束中将它们编号为 0 - 3。然后按顺序处理编号为 n 的所有边（共有四个线程处理循环），这样就可以避免出现竞争，因为在处理过程中可以保证没有 $t_{i, j,k}$ 被修改（即其 PRODUCE POINT bit 被设置）。
在这里插入图片描述

Figure 5: In Pass 3, voxel cell rows $V_{j,k}$ (as viewed down the $x$ axis) are processed in a checkerboard pattern to avoid data races.

图 5：在第 3 步中，体素单元行 $V_{j,k}$ （沿 $x$ 轴向下查看）以棋盘图案进行处理，以避免数据争用。

3.6 Voxel Row Iterator

During the fourth and final pass, $V_{j,k}$ are processed to produce output points, quadrilaterals, and smoothing stencils. This process is facilitated through the use of a voxel row edge iterator. The iterator is initialized with the point ids determined in the prefix sum operation in the third pass. Due to the smoothing stencils, which connect voxels in the current row to those in neighboring rows, a 3 × 3 iteration kernel is used to advance the point ids along $V_{j,k}$ as well as its immediate voxel neighbors in the $\pm j$ and $\pm k$ directions (Fig. 6). These point ids are used to define a voxel center point, the (up to) three quadrilaterals generated by the $t_{i, j,k}$ , and the smoothing stencil edges. The performance of the row iterator is significantly improved through the use of the PRODUCE POINT bit in $t_{i, j,k}$ . As the iterator advances, each of the nine voxel triads in the iteration kernel are accessed and the PRODUCE POINT bit queried. If a bit value is set, then the point id is incremented. Subsequently when generating output, the iterator provides the appropriate point ids for producing output points, quadrilaterals, and stencils.

在第四步间， $V_{j,k}$ 被处理以生成输出点、四边形和平滑模板。通过使用体素行边迭代器可以促进此过程。迭代器使用第三遍中前缀和运算中确定的点 ID 进行初始化。由于平滑模板将当前行中的体素连接到相邻行中的体素，因此使用 3 × 3 迭代内核沿 $V_{j,k}$ 及其在 $\pm j$ 和 $\pm j$ 中的直接体素邻居推进点 id $\pm k$ 方向（图 6）。这些点 id 用于定义体素中心点、由 $t_{i, j,k}$ 生成的（最多）三个四边形以及平滑模板边。通过使用 $t_{i, j,k}$ 中的 PRODUCE POINT bit，行迭代器的性能得到显着提高。随着迭代器前进，迭代内核中的九个体素三元组中的每一个都会被访问并查询 PRODUCE POINT bit。如果设置了bit值，则点 ID 会递增。随后，在生成输出时，迭代器提供适当的点 ID 来生成输出点、四边形和模板。

在这里插入图片描述
Figure 6: Voxel row iterator for generating output points, quadrilaterals, and smoothing stencils. A potential triad x-edge quadrilateral is shown in orange.
图 6：用于生成输出点、四边形和平滑模板的体素行迭代器。潜在的三重轴 x 边四边形以橙色显示。

3.7 Smoothing Cache

The parallel SurfaceNets algorithm is often used in interactive workflows. Typically the surface extraction step is performed just once as the labeled objects to extract are known a priori. However, surface smoothing is often adjusted interactively to produce compelling results. This includes adjusting the number of smoothing iterations, and modifying the convergence factor $λ$ . To facilitate this, our implementation caches the output of the surface extraction step, so that iterative smoothing workflows do not suffer from the cost of repeated surface generation.

并行 SurfaceNets 算法经常用于交互式工作流程。通常，表面提取步骤仅执行一次，因为要提取的标记对象是先验已知的。然而，表面平滑通常是交互式调整的，以产生令人信服的结果。这包括调整平滑迭代次数，以及修改收敛因子 $λ$ 。为了实现这一点，我们的实现缓存了表面提取步骤的输出，以便迭代平滑工作流程不会受到重复表面生成的成本的影响。

3.8 Load Balancing

Many parallel isocontouring algorithms devote significant effort towards logically subdividing and load balancing the computation. Typical approaches include organizing volumes into rectangular subvolumes or using a spatial structure such as an octree to manage the flow of execution. The challenge with isocontouring is that a priori it is not known through which portion of the volume the isosurface will pass, meaning that some form of pre-processing (evaluating min-max scalar region within sub-volumes for example) is required to balance the workload across computational threads. In the Parallel SurfaceNets algorithm, the basic task is processing of an edge, in which each edge (or associated voxel row) can be processed independently. In our implementation, we chose to use the Thread Building Blocks (TBB) library [11] using the parallel for construct to loop in parallel over all edges. Behind the scenes, TBB manages a thread pool to process this list of edges, and as the workload across an edge may vary significantly, new edge tasks are stolen and assigned to the thread pool for further processing. Thus by designing the algorithm around appropriately sized computational worklets, task stealing is an effective approach to ensuring that processors remain busy.

许多并行等值线算法投入大量精力来逻辑细分和负载平衡计算。典型的方法包括将体积组织成矩形子体积或使用空间结构（例如八叉树）来管理执行流程。等值线的挑战在于，先验地不知道等值面将通过体积的哪一部分，这意味着需要某种形式的预处理（例如评估子体积内的最小-最大标量区域）来平衡跨计算线程的工作负载。在Parallel SurfaceNets算法中，基本任务是处理边，其中每条边（或关联的体素行）都可以独立处理。在我们的实现中，我们选择使用线程构建块 (TBB) 库 [11]，使用并行 for 构造在所有边上并行循环。在幕后，TBB 管理一个线程池来处理此边列表，并且由于边上的工作负载可能会有很大差异，因此新的边任务会被窃取并分配给线程池以进行进一步处理。因此，通过围绕适当大小的计算工作集设计算法，任务窃取是确保处理器保持忙碌的有效方法。

4 PERFORMANCE EVALUATION

…
MC and FE also exhibit some interesting behavior. Because FE executes a separate processing pass for each label value, its performance is significantly impacted as the number of labels increases, since each processing pass requires a complete traversal of the input volume—this is apparent from the linear behavior of FE performance as the number of labels increases Fig. 9. In fact, if the number of labels is reduced to small numbers, FE eventually becomes faster than PSN. This is because FE uses geometric reasoning to extract manifold isosurfaces; an advantage over Parallel SurfaceNets as PSN assumes non-manifold surfaces since multiple materials (up to eight) can join in a single voxel cell. Also note that the curves for PSN and MM in Fig. 9 do not track closely as might be expected. As implemented, MM performs an expensive initialization process to classify every edge in the volume, which tends to flatten its slowdown curve. In comparison, PSN uses computational trimming, so that many yand z-voxel edges do not require explicit classification. Thus the impact on PSN performance as the number of labels increase is closely related to processing those $v_{i, j,k}$ that actually produce mesh points and polygons.

MC 和 FE 也有一些有趣的表现。由于 FE 对每个标签值执行单独的处理过程，因此随着标签数量的增加，其性能会受到显着影响，因为每个处理过程都需要完整遍历输入量——这从 FE 性能的线性行为中可以明显看出，因为标签数量标签数量增加（图 9）。事实上，如果标签数量减少到很小，FE 将会比 PSN 更快。这是因为 FE 使用几何推理来提取流形等值面；而 PSN 假定的是非流形表面，多种材料（最多八种）可以连接到一个体素单元中。另请注意，图 9 中的 PSN 和 MM 曲线并未像预期的那样紧密跟踪。在实现时，MM 执行昂贵的初始化过程来对体积中的每个边进行分类，这往往会使其减速曲线变平。相比之下，PSN 使用计算修剪，因此许多 y 和 z 体素边不需要显式分类。因此，随着标签数量的增加，对 PSN 性能的影响与处理那些实际产生网格点和多边形的 $v_{i, j,k}$ 密切相关。

5 SUMMARY AND FUTURE WORK

We have developed a high-performance, scalable isocontouring algorithm for discrete, segmented scalar data. The Parallel SurfaceNets algorithm produces an output triangle mesh that encloses each segmented region. Based on evaluation across five different segmented datasets, sequential performance was greater than three other comparable workflows, while parallel performance on 16-threads with commodity hardware was consistently one to two order of magnitude faster. The PSN algorithm achieves its speed by performing independent threaded operations across volume edges, resulting in scalable performance through task-stealing, edge-based computational tasks. It eliminates sequential bottlenecks such as incremental memory insertion and centralized binning to remove duplicate points, while adopting a multi-pass approach combined with computational trimming to progressively eliminate processing in subsequent parallel passes. The algorithm produces an output mesh that fully shares points and triangles across neighboring objects. Moreover, triangles shared between objects can be annotated to indicate which objects are adjacent to each other, opening up the possibility of topological query and analysis.

我们为离散、分段标量数据开发了一种高性能、可扩展的等值线算法。 Parallel SurfaceNets 算法生成包围每个分段区域的输出三角形网格。根据对五个不同分段数据集的评估，顺序性能优于其他三个可比较的工作流程，而使用商用硬件的 16 线程上的并行性能始终快一到两个数量级。 PSN 算法通过跨体素边执行独立的线程操作来实现其速度，从而通过任务窃取、基于边的计算任务实现可扩展的性能。它消除了顺序瓶颈，例如增量内存插入和集中装箱以删除重复点，同时采用多通道方法与计算修剪相结合，以逐步消除后续并行通道中的处理。该算法生成一个输出网格，该网格在相邻对象之间完全共享点和三角形。此外，可以注释对象之间共享的三角形来指示哪些对象彼此相邻，从而开启了拓扑查询和分析的可能性。

Since large segmented volumes may exceed computer memory resources, distributed, hybrid parallel computing approaches may be required in future applications. Such hybrid approaches could combine local, shared-memory threading with distributed computing to process subvolumes from a larger input volume. As currently implemented, PSN can be employed in this manner, with special attention place on stitching output meshes together at subvolume boundaries. However, given the enormous size of meshes that can be generated by PSN in datasets such as brain connectomics, another approach may be to generate output meshes on demand as part of an interactive visualization process or data query. PSN is fast enough that subvolumes can be dynamically processed as they become visible during interaction, or in response to requests to view certain structures. Other avenues of investigation include on-the-fly decimation [14] of output polygon streams to reduce the size of the output.

由于大的分段体可能会超出计算机内存资源，因此在未来的应用中可能需要分布式混合并行计算方法。这种混合方法可以将本地共享内存线程与分布式计算相结合，以处理来自较大输入体的子体。正如目前所实现的，PSN 可以以这种方式使用，特别注意在子体边界处将输出网格缝合在一起。然而，考虑到 PSN 可以在脑连接组学等数据集中生成巨大的网格，另一种方法可能是按需生成输出网格，作为交互式可视化过程或数据查询的一部分。 PSN 足够快，当子体在交互过程中变得可见时，或者响应查看某些结构的请求时，可以动态处理子体。其他研究途径包括输出多边形流的动态抽取[14]以减小输出的大小。

An important feature of SurfaceNets is the generation of the two-tuple region adjacency information. By combining this with annotation ontology and classification, it us possible to perform topological queries such as locating objects adjacent to other structures, e.g., locate all organs adjacent to a vascular structure. This capability if of particular importance in medical applications where anatomical atlases are under construction for educational, research, and clinical applications [15]. We believe future applications may create Atlas Information Systems (AIS) as an analogue to GIS systems, with the atlas serving as a visual interface to access data associated with the various segmented objects, with further support for geometric, topological, and ontological queries.

SurfaceNets 的一个重要特征是二元组区域邻接信息的生成。通过将其与注释本体和分类相结合，我们可以执行拓扑查询，例如定位与其他结构相邻的对象，例如，定位与血管结构相邻的所有器官。这种能力在医学应用中特别重要，其中解剖图谱正在为教育、研究和临床应用而构建[15]。我们相信未来的应用程序可能会创建类似于 GIS 系统的地图集信息系统 (AIS)，其中地图集作为可视化界面来访问与各种分段对象相关的数据，并进一步支持几何、拓扑和本体查询。

In the spirit of reproducible science, we have implemented the algorithm in both 2D and 3D and contributed them to the VTK system [12] using the C++ programming language. Open source implementations are available as vtkSurfaceNets2D and vtkSurfaceNets3D. Also available in VTK are the classes vtkDiscreteMarchingCubes and vtkDiscreteFlyingEdges3D which are used to compare against the Parallel SurfaceNets algorithm. PSN smoothing is performed by the vtkConstrainedSmoothingFilter; Taubin’s algorithm [16] used with MC and FE comparisons is implemented in the vtkWindowedSincPolyDataFilter.

本着可重复科学的精神，我们在 2D 和 3D 中实现了该算法，并使用 C++ 编程语言将它们贡献给 VTK 系统 [12]。开源实现可用作 vtkSurfaceNets2D 和 vtkSurfaceNets3D。 VTK 中还提供类 vtkDiscreteMarchingCubes 和 vtkDiscreteFlyingEdges3D，它们用于与 Parallel SurfaceNets 算法进行比较。 PSN平滑由vtkConstrainedSmoothingFilter执行；与 MC 和 FE 比较一起使用的 Taubin 算法 [16] 在 vtkWindowedSincPolyDataFilter 中实现。

先到这儿。