【论文阅读】Flying edges: 可扩展的高性能等值线算法-CSDN博客

本文链接：https://blog.csdn.net/m0_50910915/article/details/141475751

Flying edges: A high-performance scalable isocontouring algorithm

在这里插入图片描述

Flying edges: A high-performance scalable isocontouring algorithm

ABSTRACT

To this end we have developed a high-performance isocontouring algorithm for structured data that is designed to be inherently scalable. Processing is performed completely independently along edges over multiple passes. This novel algorithm also employs computational trimming based on geometric reasoning to eliminate unnecessary computation, and removes the parallel bottleneck due to coincident point merging.
As a result the algorithm performs well in serial or parallel execution, and supports heterogeneous parallel computation combining data parallel and shared memory approaches. Further it is capable of processing data too large to fit entirely inside GPU memory, does not suffer additional costs due to preprocessing and search structures, and is the fastest non-preprocessed isocontouring algorithm of which we are aware on shared memory, multicore systems. The software is currently available under a permissive, open source licence in the VTK visualization system.

为此，我们开发了一种用于结构化数据的高性能等值线算法，该算法旨在本质上可扩展。加工是沿着边完全独立地进行多次处理。该算法还采用基于几何推理的计算修剪来消除不必要的计算，并消除了重合点合并造成的并行瓶颈。

因此，该算法在串行或并行执行中表现良好，并支持数据并行和共享内存方法相结合的异构并行计算。此外，它能够处理太大而无法完全放入 GPU 内存的数据，不会因预处理和搜索结构而产生额外成本，并且是我们所知的共享内存、多核系统上最快的非预处理等值线算法。

1 INTRODUCTION

The current computing era is notable for its rapidly increasing data size, and the rapid evolution of computing systems towards massively parallel architectures. The increasing resolution of sensors and computational models has driven data size since the earliest days of computing, and is now reaching the point where parallel approaches are absolutely essential to processing the resulting large data. However, despite the widespread acknowledgement that parallel methods are essential to future computational advances, many important visualization algorithms in use today were developed with serial or coarse-grained data parallel computing models in mind. Such algorithms are generally not able to take effective advantage of emerging massively parallel systems, and therefore struggle to scale with increasing data size. Towards this end, we are challenging ourselves to rethink approaches to visualization algorithms with the goal of better leveraging emerging parallel hardware and systems. In this paper we revisit isocontouring, one of the most important and well-studied visualization techniques in use today.

当前的计算时代以其数据量的快速增长以及计算系统向大规模并行架构的快速发展而闻名。自计算诞生以来，传感器和计算模型分辨率的不断提高就推动了数据规模的增长，现在已经达到了并行方法对于处理由此产生的大数据绝对必要的程度。然而，尽管人们普遍认为并行方法对于未来计算的进步至关重要，但当今使用的许多重要的可视化算法都是在考虑串行或粗粒度数据并行计算模型的情况下开发的。此类算法通常无法有效利用新兴的大规模并行系统，因此难以随着数据大小的增加而扩展。为此，我们正在挑战自己，重新思考可视化算法的方法，以更好地利用新兴的并行硬件和系统。在本文中，我们重新审视等值线算法，这是当今使用的最重要且经过充分研究的可视化技术之一。

1.1 Considerations

Modern computational hardware is moving towards massive parallelism. There are a number of reasons for this including physical constraints from power consumption and frequency scaling concerns [13, 25] which limit the speed at which a single processor can run. At the same time chip densities continue to increase, meaning that manufacturers are adding hardware in the form of additional computing cores and supporting infrastructure such as memory cache and higher-speed data buses. The end result, however, is that computing systems are undergoing dramatic change, requiring algorithmic implementations to evolve and reflect this new computational reality.

现代计算硬件正在朝着大规模并行发展。造成这种情况的原因有很多，包括功耗和频率缩放问题的物理限制 [13, 25]，这限制了单个处理器的运行速度。与此同时，芯片密度不断增加，这意味着制造商正在以额外计算核心的形式添加硬件，并支持内存缓存和更高速数据总线等基础设施。然而，最终结果是计算系统正在经历巨大的变化，需要算法实现来发展并反映这种新的计算现实。

Taking advantage of massive parallelism places particular burdens on software implementations. Typically data movement is discouraged, requiring judicious use of local memory cache and management of the memory hierarchy. Computational tasks and data structures are simplified so that they may be readily pushed onto pipelined hardware architectures. Conditional branching is discouraged as this tends to result in idle processing cores and interrupts concurrent execution; and bottlenecks due to I/O contention significantly inhibit potential speed up. In general, an algorithm performs best if it is implemented with simple computational threads (even if this means extra work is performed); executes independently across local, cached memory; and reads and writes data from pre-allocated memory locations avoiding I/O blocking whenever possible.

利用大规模并行性给软件实现带来了特殊的负担。通常不鼓励数据移动，需要明智地使用本地内存缓存和管理内存层次结构。计算任务和数据结构被简化，以便它们可以轻松地推送到流水线硬件架构上。不鼓励使用条件分支，因为这往往会导致处理核心空闲并中断并发执行； I/O 争用造成的瓶颈显着抑制了潜在的加速。一般来说，如果算法是用简单的计算线程实现的，那么它的性能最佳（即使这意味着需要执行额外的工作）；跨本地缓存内存独立执行；并从预先分配的内存位置读取和写入数据，尽可能避免 I/O 阻塞。

We believe that it is important to reimagine essential visualization algorithms with this new computing reality in mind. While efforts are underway to develop and extend programming languages which can automatically parallelize algorithms, despite decades of effort only limited success has been demonstrated [26], and today’s computational challenges require solutions now. One especially important visualization algorithm is isocontouring, which reveals features of underlying scalar fields. However, despite years of research, many implementations of this technique have not been designed in the context of massive parallelism, and are hobbled by excessive data movement and related bottlenecks, or serial execution dependencies.
For example, naive implementations of the popular Marching Cubes (MC) [18] algorithm process interior voxel edges four times, and visit interior data points eight times. In addition, merging coincident points often introduces a serial bottleneck (due to the use of a spatial or edge-based locator), and the output of the algorithm is data dependent, meaning that arrays must be dynamically allocated and that there is no clear mapping of input to output data, resulting in yet another parallel processing bottleneck.

我们认为，考虑到这一新的计算能力，重新设计基本的可视化算法非常重要。尽管人们正在努力开发和扩展可自动实现算法并行化的编程语言，但经过几十年的努力，仅取得了有限的成功[26]。等值线算法是一种特别重要的可视化算法，它能揭示底层标量场的特征。然而，尽管经过多年的研究，这项技术的许多实现方法并不是在大规模并行的背景下设计的，而是受制于过多的数据移动和相关瓶颈，或串行执行依赖性。
例如，流行的 Marching Cubes（MC）[18] 算法的简单实现要处理内部体素边四次，访问内部数据点八次。此外，合并重合点通常会带来串行瓶颈（由于使用了空间或基于边的定位器），而且算法的输出与数据相关，这意味着数组必须动态分配，输入数据与输出数据之间没有明确的映射关系，从而造成另一个并行处理瓶颈。

1.2 Motivation and Related Work

…
Despite these many advances, however, we desired to create a general algorithm amenable to parallelization on a variety of hardware systems (e.g., CPU and GPU) with the ability to handle large data, especially given that we often encounter data that is too large to fit on a single GPU (given current technology). We also wanted to investigate methods to remove parallel computing bottlenecks in MC, which while easily parallelized, suffers from the inefficiencies noted previously. Our visualization workflow is also different than what is often assumed in much of the literature. While many existing isocontouring algorithms are used in an exploratory manner to reveal the structure of scalar fields, our usual approach is to process very large data using known isovalues, with the goal of generating the corresponding isosurfaces as fast as possible, avoiding the extra costs of building and storing supplemental acceleration structures.

尽管取得了上述诸多进展，但我们仍希望创建一种通用算法，以便在各种硬件系统（如 CPU 和 GPU）上进行并行化，并能够处理大型数据，特别是考虑到我们经常遇到的数据量太大，单个 GPU 无法容纳（考虑到目前的技术）。我们还想研究如何消除 MC 中的并行计算瓶颈，虽然 MC 很容易并行化，但存在前面提到的效率低下问题。我们的可视化工作流程也与许多文献中通常假设的不同。许多现有的等值线算法都是以探索的方式揭示标量场的结构，而我们通常的方法是使用已知的等值线处理非常大的数据，目的是尽可能快地生成相应的等值面，避免构建和存储补充增加结构的额外成本。

For example, medical CT imaging techniques produce data in which known isovalues correspond to different biological structures such as skin and bone. Similarly, in computational fluid dynamics various functions such as mass density or Q-criterion are well understood and previous experience typically suggests appropriate isovalues. It is not uncommon for our datasets to exceed GPU memory, thereby incurring significant transfer overhead across the associated PCIE bus, as we often process large volumes on the order of $2048^3$ resolution with double precision scalar values (e.g., ∼9GB for $1024^3$ doubles, or ∼68GB for $2048^3$ doubles). Thus many of the exploratory isocontouring techniques, which are typically based on preprocessing the data to build a rapid search structure such as an octree [30], interval tree [7], or span space [17], are not suitable to our workflow. Such preprocessing steps may add significant time to what is a non-exploratory workflow, while also introducing complex, auxiliary data structures which often consume significant memory and/or disk resources.

例如，医学 CT 成像技术产生的数据中，已知等值对应于不同的生物结构，例如皮肤和骨骼。类似地，在计算流体动力学中，质量密度或 Q 准则等各种函数都很好理解，并且以前的经验通常会建议适当的等值。我们的数据集超出 GPU 内存的情况并不罕见，从而在相关 PCIE 总线上产生大量传输开销，因为我们经常处理具有双精度标量值的 $2048^3$ 分辨率量级的大数据量（例如， $1024^3$ 个双精度数为 ∼9GB，或者∼68GB 用于 $2048^3$ 个双精度）。因此，许多探索性等值线技术通常基于数据预处理来构建快速搜索结构，例如八叉树[30]、区间树[7]或跨度空间[17]，不适合我们的工作流程。此类预处理步骤可能会给非探索性工作流程增加大量时间，同时还引入通常消耗大量内存和/或磁盘资源的复杂辅助数据结构。

Out-of-core isosurfacing techniques [6, 29] may be problematic too. While these algorithms are designed to process data that is much larger than machine memory, the cost of I/O (e.g., writing out entire solutions) at extreme scale can be prohibitive [19, 25]. Instead, we often revert to in situ processing [2], meaning that visualization algorithms are executed alongside the running simulation or data acquisition process. The benefit of this approach is that expensive I/O can be significantly reduced by extracting key features such as isosurfaces, slices, streamlines, or other solution features, each of which is significantly smaller than the full dataset. Using such methods, researchers can intelligently instrument (and conditionally program) their simulation runs to produce meaningful feature extracts that focus in on key areas of interest, avoiding the need to write out an entire output dataset [1]. Hence simple isocontouring algorithms that can be easily executed in situ with the simulation are essential for larger data.

核外等值面技术 [6, 29] 也可能存在问题。虽然这些算法旨在处理比机器内存大得多的数据，但极端规模的 I/O（例如，写出整个解决方案）的成本可能令人望而却步 [19, 25]。相反，我们经常采用原位处理的方法 [2]，这意味着可视化算法是与运行模拟或数据采集过程一起执行的。这种方法的好处是，通过提取关键特征（例如等值面、切片、流线或其他解决方案特征）可以显着减少昂贵的 I/O，其中每个特征都比完整数据集小得多。利用这些方法，研究人员可以对模拟运行进行智能化调节（和有条件编程），以产生有意义的特征提取，并将重点放在感兴趣的关键领域上，从而避免写出整个输出数据集的需要 [1]。因此，可以通过模拟在原位轻松执行的简单等高线算法对于更大的数据至关重要。

Given these considerations, the motivating goal of this work was to develop a fast, scalable isocontouring algorithm requiring no preprocessing or additional I/O. We also challenged ourselves to develop simple, independent algorithmic tasks that would scale well on a variety of different types of massively parallel hardware.

考虑到这些因素，这项工作的目标是开发一种快速、可扩展的等值线算法，无需预处理或额外的 I/O。我们还挑战自己开发简单、独立的算法任务，这些任务可以在各种不同类型的大规模并行硬件上很好地扩展。

2 ALGORITHM

In this section we begin by providing a high-level overview of the Flying Edges (FE) algorithm, including introducing descriptive notation. Next we address key algorithmic features. Finally we address key implementation details.

在本节中，我们首先提供 Flying Edges (FE) 算法的高级概述，包括介绍描述性符号。接下来我们讨论关键的算法特征。最后我们讨论关键的实施细节。

2.1 Notation

在这里插入图片描述
Figure 1: The cell axes $a_{ijk}$ coordinate system. The algorithm processes the $a_{ijk}$ in parallel along $x$ -edges $E_{jk}$ to count intersections and the number of output triangles. First and last intersections along $E_{jk}$ are used to perform computational trimming when generating output in parallel over the grid cell rows $R_{jk}$ .

图 1：单元轴 $a_{ijk}$ 坐标系。该算法沿 $x$ 边 $E_{jk}$ 并行处理 $a_{ijk}$ 以计算交点和输出三角形的数量。沿 $E_{jk}$ 的第一个和最后一个交点用于在网格单元行 $R_{jk}$ 上并行生成输出时执行计算修剪。

The algorithm operates on a structured grid with N scalar values arranged on a topologically regular lattice of data points of dimensions $(l \times m \times n)$ with values $s_{ijk}$ . Grid cells are defined from the eight adjacent points associated with scalar values in the $i + (0, 1), j + (0, 1), k + (0, 1)$ directions. Grid edges $E_{ijk}$ run in the row $E_{jk}$ , column $E_{ik}$ , and stack $E_{ij}$ directions; and are composed of the cell edges $e_{ijk}$ ; grid cell rows $R_{ijk}$ consist of all cells $v_{ijk}$ touching both $E_{ijk}$ and $E_{(i+1)( j+1)(k+1)}$ . So for example, an x-row edge of length $n$ with $0 \leq i < n$ :

该算法在结构化网格上运行，其中 N 个标量值排列在维度为 $(l \times m \times n)$ 且值为 $s_{ijk}$ 的数据点的拓扑规则网格上。网格单元由与 $i + (0, 1), j + (0, 1), k + (0, 1)$ 方向上的标量值关联的八个相邻点定义。网格边 $E_{ijk}$ 沿行 $E_{jk}$ 、列 $E_{ik}$ 运行，在 $E_{ij}$ 方向堆叠；并由单元边 $e_{ijk}$ 组成；网格单元行 $R_{ijk}$ 由接触 $E_{ijk}$ 和 $E_{(i+1)( j+1)(k+1)}$ 的所有单元格 $v_{ijk}$ 组成。例如，长度为 $n$ 且 $0 \leq i < n$ 的 x 行边：

在这里插入图片描述

Each $v_{ijk}$ has an associated cell axes $a_{ijk}$ which refers to the three cell edges emanating from the point located at $s_{ijk}$ in the positive row x, column y, and stack z directions. Refer to Figure 1.

每个 $v_{ijk}$ 都有一个相关的单元轴 $a_{ijk}$ ，它指的是从位于 $s_{ijk}$ 的点沿正向行 x、列 y 和堆叠 z 方向发出的三条单元边。请参考图 1。

The purpose of the algorithm is to generate an approximation to the isocontour $Q$ . The isocontour is defined by an isovalue $q$ : $Q(q) = {p | F(p) = q}$ where $F (p)$ maps the point $p ∈ R^n$ to a realvalued number (i.e., the isovalue $q$ ). The resulting approximation surface $S$ is a continuous, piecewise linear surface such as a triangle mesh (in 3D). Note also that we say that $Q$ intersects $e_{ijk}$ when some p lies along the cell edge $e_{ijk}$ . We assume that the scalar value varies linearly over $e_{ijk}$ so $Q$ may intersect at most at only one point along the edge.

该算法的目的是生成等值线 $Q$ 的近似值。等值线由等值 $q$ 定义： $Q(q) = {p | F(p) = q}$ 其中 $F (p)$ 将点 $p ∈ R^n$ 映射到实数值（即等值 $q$ ）。生成的近似曲面 $S$ 是连续的分段线性曲面，例如三角形网格（3D 中）。还要注意，当某个 p 位于单元边 $e_{ijk}$ 上时，我们说 $Q$ 与 $e_{ijk}$ 相交。我们假设标量值在 $e_{ijk}$ 上线性变化，因此 $Q$ 最多只能在边上仅相交一个点。

2.2 Overview

Highly-organized structured data lends itself to a variety of parallel approaches. Our method, which is based on the independent processing of grid edges, provides a number of benefits:
• The algorithm takes advantage of cache locality by processing data in the fastest varying data direction (i.e., assuming that an $i - j - k$ grid varies fastest in the $i$ -direction, the edges oriented along the $i$ data rows or based on the notation above, the voxel cell $x$ -edges $e_{jk}$ );
• it separates data processing into simple, independent computational steps that eliminate memory write contention and most serial, synchronous operations;
• it reduces overall computation by performing a small amount of initial extra work (i.e., computational trimming) to limit the extent of subsequent computation;
• it ensures that each voxel edge is only intersected once by using the cell axes $a_{ijk}$ to control iteration along the grid edges $E_{jk}$ ;
• it eliminates the need for dynamic memory reallocation, output memory is allocated only once;
• and the algorithm eliminates the parallel bottlenecks due to point merging.
While others have used edge-based approaches to characterize the quality of isocontour meshes [10], or to ensure watertight meshes using octree edge-trees [16], we use our edge-based approach as an organizing structure for parallel computation; both as a means of traversing data as well as creating independent computational tasks.

高度组织的结构化数据适合多种并行方法。我们的方法基于网格边的独立处理，具有许多优点：

该算法通过在最快变化的数据方向上处理数据来利用缓存局部性（即，假设 $i - j - k$ 网格在 $i$ 方向上变化最快，边沿 $i$ 数据行定向，或基于上面的符号，体素单元 $x$ -edges $e_{jk}$ );
它将数据处理分成简单、独立的计算步骤，消除内存写入争用和大多数串行、同步操作；
它通过执行少量初始额外工作（即计算修剪）来限制后续计算的范围，从而减少总体计算量；
通过使用单元轴 $a_{ijk}$ 来控制沿网格边 $E_{jk}$ 的迭代，确保每个体素边仅相交一次；
无需动态内存重新分配，输出内存仅分配一次；
该算法消除了点合并带来的并行瓶颈。

虽然其他人使用基于边的方法来表征等值面网格的质量[10]，或者使用八叉树边树[16]来确保密闭的网格，但我们使用基于边的方法作为并行计算的组织结构；既可以作为遍历数据的方式，也可以作为创建独立计算任务的方式。

One of the most challenging characteristics of any isocontouring algorithm is that it produces indeterminate output, i.e., the number of output points and primitives (e.g., triangles) is not known a priori. This presents a challenge to parallel implementations since processing threads work best when output memory can be preallocated and partitioned to receive output. So for example algorithms such as MC–which on first glance seem embarrassingly parallel–suffer significant performance degradation when inserting output entities into dynamic output lists due to repeated memory reallocation. A natural way to address this challenge is to use multiple passes to first characterize the expected output by counting output points and primitives, followed by additional passes to actually generate the isocontour. The key is to minimize the effort required to configure the output. Our approach also takes advantage of the multi-pass approach to guide and dramatically reduce subsequent computation, thus the early passes can be considered as a form of preprocessing to set up later passes for optimized computation.

任何等值线算法最具挑战性的特征之一是它产生不确定的输出，即输出点和图元（例如三角形）的数量是未知的。这对并行实现提出了挑战，因为当可以预先分配和分区输出内存以接收输出时，处理线程工作得最好。因此，例如 MC 等算法，在将输出实体插入动态输出列表时，由于重复的内存重新分配，性能会显着下降。解决这一挑战的自然方法是使用多步骤，首先通过计算输出点和图元来表征预期输出，然后使用额外的步骤来实际生成等值线。关键是最大限度地减少配置输出所需的工作量。我们的方法还利用多步方法来指导并显著减少后续计算，因此早期步骤可以被视为预处理的一种形式，来设置后续步骤以实现优化计算。

The algorithm is implemented in four passes: only the first pass traverses the entirety of the data, the others operate on the resulting local metadata or locally on rows to generate intersection points and/or gradients. At a high level, these passes perform the following operations:

Traverse the grid row-by-row to compute grid $x$ -edge cases; count the number of $x$ intersections; and set the row computation trim limits.
Traverse each grid cell row, between adjusted trim limits, and count the number of $y$ - and $z$ -edge intersections on the $a_{ijk}$ , as well as the number of output triangles generated.
Sum the number of $x$ -, $y$ -, and $z$ -points created across the $a_{ijk}$ of each row, and the number of output triangles; allocate output arrays.
Traverse each grid cell row, between adjusted trim limits, and using the $a_{ijk}$ generate points and produce output triangles into the preallocated arrays.

The first three passes count the number of output points and triangles, determine where the contour proximally exists and set computational trim values, and produce the metadata necessary to generate output. In the fourth and final pass, output points and primitives are generated and directly written into pre-allocated output arrays without the need for mutexes or locking. In the following we describe each pass in more detail and then follow with a discussion of key implementation concepts.

该算法分四步实现：只有第一阶段遍历整个数据，其他阶段对生成的本地元数据或本地行进行操作以生成交点和/或梯度。在高层次上，这些过程执行以下操作：

逐行遍历网格以计算网格 $x$ 边案例；计算 $x$ 交点的数量；并设置行计算修剪限制。
遍历调整后的修剪限制之间的每个网格单元行，并计算 $a_{ijk}$ 上 $y$ 和 $z$ 边交点的数量，以及生成的输出三角形的数量。
将每行的 $a_{ijk}$ 上创建的 $x$ 、 $y$ 和 $z$ 点的数量以及输出三角形的数量相加；分配输出数组。
在调整后的修剪限制之间遍历每个网格单元行，并使用 $a_{ijk}$ 生成点并将输出三角形生成到预分配的数组中。

前三阶段计算输出点和三角形的数量，确定等值面的近端位置并设置计算修整值，以及生成输出所需的元数据。在最后阶段中，生成输出点和图元并将其直接写入预先分配的输出数组中，无需互斥或锁定。下面更详细地描述每个过程，然后讨论关键实现概念。

Pass 1: Process grid $x$ -edges $E_{jk}$ . For each grid $x$ -edge $E_{jk}$ , all cell $x$ -edge intervals (i.e., cell edges $e_{jk}$ ) composing $E_{jk}$ are visited and marked when their interval intersects $Q$ (i.e., an edge case number is computed for each $e_{jk}$ ). The left and right trim positions $xL_{jk}$ and $xR_{jk}$ are noted, as well as the number of $x$ -intersections along $E_{jk}$ . Each $E_{jk}$ is processed independently and in parallel. (Note that the trim position $xL_{jk}$ indicates where $Q$ first intersects $E_{jk}$ on its left side; and $xR_{jk}$ indicates where $Q$ last intersects $E_{jk}$ on its right side. Additional trimming details will be provided shortly.)

第 1 步：处理网格 $x$ -edges $E_{jk}$ 。 对于每个网格 $x$ -edge $E_{jk}$ ，组成 $E_{jk}$ 的所有单元格 $x$ -edge 间隔（即单元格边 $e_{jk}$ ）在它们的间隔相交时被访问并标记为 $Q$ （即，为每个 $e_{jk}$ 计算边案例数）。记录了左右修剪位置 $xL_{jk}$ 和 $xR_{jk}$ ，以及沿 $E_{jk}$ 的 $x$ 交点数量。每个 $E_{jk}$ 都是独立并行处理的。（请注意，修剪位置 $xL_{jk}$ 表示 $Q$ 在其左侧首次与 $E_{jk}$ 相交的位置； $xR_{jk}$ 表示 $Q$ 最后与 $E_{jk}$ 相交的位置稍后将提供更多修剪细节。）

Pass 2: Process grid rows $R_{jk}$ . For each grid row $R_{jk}$ , the cells $v_{ijk}$ between the adjusted trim interval $\overline{x}L _{jk}, \overline{x}R_{jk})$ are visited and their MC case numbers are computed from the edge-based case table from Pass 1 (the original $s_{ijk}$ values are not reread). Knowing the cell case value $c_{ijk}$ it is possible to perform a direct table lookup to determine whether the $y$ - and $z$ -cell axes edges $a^{y}_{ijk}$ and $a^{z}_{ijk}$ intersect the isocontour, incrementing the $y$ - and $z$ -cell intersection count as appropriate. In addition, the $c_{ijk}$ directly indicates the number of triangles generated as the contour passes through $v_{ijk}$ . Again, each $R_{jk}$ can be processed in parallel. At the conclusion of the second pass, the number of output points and triangles is known and is used to allocate the output arrays in the next pass. Note that because the adjusted trim interval may be empty, it is possible to rapidly skip data (rows and entire slices) via computational trimming (see Figure 2).
Note that while the algorithm is designed to operate across $x$ -edges due to it being the (typically) fastest varying data direction,it can be readily be recast using $y$ - or $z$ -edges. In such cases, the results remain invariant, as the generated MC cases (after combining the edge cases in the proper order) are equivalent.

第 2 步：处理网格行 $R_{jk}$ 。 对于每个网格行 $R_{jk}$ ，访问调整后的修剪区间 $\overline{x}L _{jk}, \overline{x}R_{jk})$ 之间的单元格 $v_{ijk}$ ，并且它们的 MC 案例编号是根据第 1 步基于边的案例表计算的（原始 $s_{ijk}$ 值不会重新读取）。知道单元格案例值 $c_{ijk}$ 后，可以直接查表来确定 $y$ 和 $z$ 单元格轴边 $a^{y}_{ijk}$ 和 $^{z}_{ijk}$ 与等值线相交，根据需要增加 $y$ 和 $z$ 单元相交计数。另外， $c_{ijk}$ 直接表示等值面经过 $v_{ijk}$ 时生成的三角形数量。同样，每个 $R_{jk}$ 都可以并行处理。在第二遍结束时，输出点和三角形的数量已知，并用于在下一遍中分配输出数组。请注意，由于调整后的修剪间隔可能为空，因此可以通过计算修剪快速跳过数据（行和整个切片）（见图 2）。
请注意，虽然该算法被设计为跨 $x$ 边运行，因为它是（通常）最快变化的数据方向，但它可以很容易地使用 $y$ 或 $z$ 边进行重新转换。在这种情况下，结果保持不变，因为生成的 MC 案例（以正确的顺序组合边案例后）是等效的。

在这里插入图片描述
Figure 2: A trimmed contour after the second pass (in 2D). On the right side is a metadata array that tracks information describing the interaction of the contour with the $E_{jk}$ edges, including the number of $x$ - and $y$ -edge intersections, the number of output primitives generated, and trim positions. Trimming can significantly reduce computational work.

图 2：经过第二步处理后的修剪轮廓（二维）。右侧是一个元数据数组，用于跟踪描述轮廓与 $E_{jk}$ 边相交的信息，包括 $y$ 和 $z$ 边交点的数量、生成的输出基元数量以及修剪位置。修剪可以大大减少计算工作量。

Pass 3: Configure output. At the conclusion of Pass 2, for each $R_{jk}$ the number of $x$ -, $y$ -, and $z$ -point intersections is known, as well as the number of output triangles. The third pass accumulates this information on each $E_{jk}$ , assigning starting ids for the $x$ -, $y$ -, and $z$ -points and triangles to be generated in the final pass along each $R_{jk}$ . Note that this accumulation pass can be performed in parallel as a prefix sum operation in $O (m / t + l o g (t))$ where the $m$ is the number of $E_{jk}$ and $t$ is the number of threads of execution [4].

第 3 步：配置输出。 在第 2 步结束时，对于每个 $R_{jk}$ ， $x$ 、 $y$ 和 $z$ 点交点的数量也是已知的作为输出三角形的数量。第三步是在每个 $E_{jk}$ 上积累这些信息，为 $x$ 、 $y$ 和 $z$ 点以及每个 $R_{jk}$ 上最后生成的三角形分配起始 id。注意，这个累加过程可以作为前缀求和操作并行执行，运行时间为 $O (m / t + l o g (t))$ ，其中 $m$ 是 $E_{jk}$ 的数量， $t$ 是执行线程的数量[4]。

Pass 4: Generate output across $R_{jk}$ . In the final pass, each $R_{jk}$ is processed in parallel by moving the $a_{ijk}$ along the row within the adjusted trim interval $\overline{x}L _{jk}, \overline{x}R_{jk})$ , computing $x$ 、 $y$ 和 $z$ edge intersection points on the cell axes as appropriate. This requires reading some of the $s_{ijk}$ again to compute gradients (if requested) and interpolate edge intersections with $Q$ . Triangles are also output (if necessary) as each cell is visited. Again, the adjusted trim edge interval is used to rapidly skip data, with the cell case values $c_{ijk}$ indicating which (if any) points need be generated on the $a_{ijk}$ . Because the previous pass determined the start triangle and point ids for each $R_{jk}$ , an increment operator (based in edge use table $U(c_i)$ ) is used to update point and triangle ids as each cell is processed along the row (Figure 5). This eliminates the need to merge points, and points are only generated once and written directly into the previously allocated output arrays without memory write contention. It should be noted that the generation of intersection points and gradients, and the production of triangle connectivity, can proceed independently of one another for further parallel optimization.
Assuming that the grid is of dimension $n^3$ , the algorithm complexity is driven by the initial pass over the grid values $O(n^3)$ invoking $n^2/t$ total threads of execution. The second pass traverses the $x$ −edge classifications, generally of size less than $O(n^3)$ due to computational trimming, also with $n^2/t$ total thread executions. The third pass sums the number of output primitives as described previously in $O(n^2/t + log(t))$ . The fourth and final pass processes up to $n^2/t$ threads, although trimming often reduces the total workload significantly.

第 4 步：跨 $R_{jk}$ 生成输出。 最后，通过在调整后的修剪区间 $\overline{x}L _{jk}, \overline{x}R_{jk})$ 内沿行移动 $a_{ijk}$ ，并行处理每个 $R_{jk}$ ，根据需要计算单元轴上的 $x$ 、 $y$ 和 $z$ 边交点。这需要再次读取一些 $s_{ijk}$ 来计算梯度（如果需要）并插入与 $Q$ 的边交点。访问每个单元格时还会输出三角形（如果需要）。同样，调整后的修剪边区间用于快速跳过数据，单元格案例值 $c_{ijk}$ 指示需要在 $a_{ijk}$ 上生成哪些（如果有）点。因为上一步确定了每个 $R_{jk}$ 的起始三角形和点 id，所以在处理每个单元格时，使用增量运算符（基于边使用表 $U(c_i)$ ）来更新点和三角形 id行（图 5）。这消除了合并点的需要，并且点仅生成一次并直接写入先前分配的输出数组，而没有内存写入争用。应该注意的是，交点和梯度的生成以及三角形连接的生成可以彼此独立地进行，以进行进一步的并行优化。
假设网格的维度为 $n^3$ ，算法复杂度由网格值 $O(n^3)$ 的初始传递调用 $n^2/t$ 执行线程总数决定。第二步遍历 $x$ 边分类，由于计算修剪，其大小通常小于 $O(n^3)$ ，总线程执行数也为 $n^2/t$ 。第三步是对输出基元的数量进行求和，如前所述，耗时为 $O(n^2/t+log(t))$ 。。第四步最多可处理 $n^2/t$ 个线程，尽管修剪通常会显著减少总工作负载。

在这里插入图片描述
Figure 5: Processing a cell row $R_{ijk}$ to generate output, eliminating point merging. At the beginning of each cell row, the edge metadata–determined from the prefix sum of Pass 3–contains the starting output triangle ids, and $x$ -、 $y$ - and $z$ -point ids, and is used to initialize a cell row traversal process (illustrated at the top of the figure and shown in 2D). Then to traverse to the next adjacent cell along $R_{ijk}$ the edge use table $U(c_i)$ is applied to increment the point ids appropriately. The $a_{ijk}$ are used to generate the new points, while the other points are simply referenced as they will be produced in a separate thread. Newly generated points and triangles are directly written into previously allocated memory using their output ids.

图 5：处理单元格行 $R_{ijk}$ 以生成输出，消除点合并。在每个单元行的开头，边元数据（根据第 3 步的前缀和确定）包含起始输出三角形 id 以及 $x$ 、 $y$ 和 $z$ 点 id，并用于初始化单元格行遍历过程（如图顶部所示并以 2D 形式显示）。然后，为了沿着 $R_{ijk}$ 遍历到下一个相邻单元格，应用边使用表 $U(c_i)$ 来适当增加点 id。 $a_{ijk}$ 用于生成新点，而其他点仅被引用，因为它们将在单独的线程中生成。新生成的点和三角形使用其输出 id 直接写入先前分配的内存中。

3 IMPLEMENTATION

In the following section we highlight and discuss some of the implementation features of the Flying Edges algorithm.

3.1 Edge-Based Case Tables

Voxel-based contour case tables were popularized by MC and have been extended to other topological cell types such as tetrahedra [28]. Computing a case number $c_{ijk}$ for a cell involves comparing the scalar field value at each cell vertex against the isocontour value $q:c_{ijk} =1 if s_{ijk} ≥ q;0 otherwise$ ; and then performing a shifted logical OR operation to determine a case value 0 ≤ $c_{ijk}$ < 256. In naive implementations of MC, many unnecessary accesses to and comparisons against a particular vertex scalar value $\overline{s}$ occur as the algorithm marches from cell to adjacent cell. In Flying Edges, typically $s_{ijk}$ are accessed only once unless it is necessary that a particular ̄ s is required for subsequent computation (e.g., gradient computation or edge interpolation). This is because the total number of data values $N$ is usually much greater than the number of si jk which actually contribute to the computation of the isocontour.

基于体素的等值面案例表由 MC 推广，并已扩展到其他拓扑单元类型，例如四面体 [28]。计算单元的案例编号 $c_{ijk}$ 涉及将每个单元顶点处的标量场值与等值线值进行比较 $q:c_{ijk} =1$ 如果 $s_{ijk} ≥ q$ ;否则为 0；然后执行移位逻辑或运算以确定案例值 0 ≤ $c_{ijk}$ < 256。在 MC 的简单实现中，会发生许多不必要的对特定顶点标量值 $\overline{s}$ 的访问和比较随着算法从单元格行进到相邻单元格。在 Flying Edges 中，通常仅访问一次 $s_{ijk}$ ，除非后续计算（例如，梯度计算或边插值）需要特定的 $\overline{s}$ 。这是因为数据值的总数 $N$ 通常远大于实际有助于等值线计算的 $s_{ijk}$ 的数量。

Case computation is performed in parallel along each grid edge $E_{jk}$ . The edge case for each $e_{jk}$ that compose the $E_{jk}$ is determined from the scalar values of its two end points, with $2^2$ total states possible. The resulting case value $e^c_{jk}$ indicates whether each cell x-edge intersects the isosurface, and if so (as described previously in Pass 1), the number of x-intersections is incremented by one. During subsequent processing, the four edge case values $\{e^c_{jk},e^c_{(j+1)k},e^c_{j(k+1)},e^c_{(j+1)(k+1)}\}$ can be combined to determine the cell case value $c_{ijk}$ (see Figure 3). The edge-based case table is equivalent to the standard MC table which considers the states of eight vertices as compared to four parallel cell x-edges. We use an edge-based case table because it removes computational dependencies which degrade parallel performance, allowing the computation of edge cases to proceed completely independently.

案例计算沿着每个网格边 $E_{jk}$ 并行执行。组成 $E_{jk}$ 的每个 $e_{jk}$ 的边案例是根据其两个端点的标量值确定的，总共可能有 $2^2$ 个状态。生成的 case 值 $e^c_{jk}$ 指示每个像元 x 边是否与等值面相交，如果是这样（如前面第 1 步中所述），则 x 相交的数量增加 1。在后续处理过程中，四个边案例值 $\{e^c_{jk},e^c_{(j+1)k},e^c_{j(k+1)},e^c_{(j+ 1)(k+1)}\}$ 可以组合起来确定单元格案例值 $c_{ijk}$ （见图 3）。基于边的案例表相当于标准 MC 表，它考虑八个顶点的状态与四个平行单元 x 边的比较。我们使用基于边的案例表，因为它消除了会降低并行性能的计算依赖性，从而允许边案例的计算完全独立地进行。
在这里插入图片描述
Figure 3: An edge-based case table combines four $x$ -edge cases to produce an equivalent Marching Cubes vertex-based case table. Edge cases can be computed independently in parallel and combined when necessary to generate a MC case when processing $R_{jk}$ .

图 3：基于边的案例表结合四个 $x$ 边案例，生成一个等效的基于顶点的 Marching Cubes 案例表。在处理 $R_{jk}$ 时，可以并行独立计算边案例，并在必要时合并生成 MC 案例。

Further efficiencies in the algorithm result from exploiting the implicit topological information contained in each MC case $c_i$ . Obviously, given a particular case number, the $c_i$ defines the numbers of points and triangles that will be generated, and exactly which edges are involved in the production of the output primitives. An important construct is the edge use table $U(c_i)$ which for each $c_i$ indicates which of the 12 voxel edges intersects Q. By using this information efficiently it is possible to rapidly perform operations such as counting the number of $y$ − and $z$ − edge intersections, and incrementing point and triangle ids across cell rows (during Pass 4) to generate crack-free isocontours without a spatial locator.

利用每个 MC 案例 $c_i$ 中包含的隐式拓扑信息可以进一步提高算法效率。显然，给定特定的案例编号， $c_i$ 定义将生成的点和三角形的数量，以及输出图元的生成涉及哪些边。一个重要的构造是边使用表 $U(c_i)$ ，它对于每个 $c_i$ 指示 12 个体素边中的哪一个与 Q 相交。通过有效利用这些信息，可以快速执行各种操作，如计算 y $和 z$ 交点的数量，以及在单元行中递增点和三角形 ID（在第 4 步），从而在不使用空间定位器的案例下生成无裂缝等值线。

3.2 Computational Trimming

As used here, computational trimming is the process of precomputing information or metadata in such a way as to reduce the need for subsequent computation, thereby reducing the total computational load. More specifically, in the Flying Edges algorithm, computational trimming is used to significantly reduce the total effort necessary to generate the isocontour. It is possible to rapidly skip portions of rows, entire rows, and even entire data slices while computing. This is accomplished by taking advantage of the topological characteristics of the continuous, piecewise linear contour surface $S$ , and using the first $x$ -edge interval pass to bound the computations necessary to perform later passes (including interpolation and primitive generation). In practice, computational trimming significantly accelerates algorithm performance.

此处所指的计算修剪是预先计算信息或元数据的过程，以减少后续计算的需要，从而减少总计算负载。更具体地说，在 Flying Edges 算法中，使用计算修剪来显着减少生成等值线所需的总工作量。在计算时可以快速跳过部分行、整行甚至整个数据切片。这是通过利用连续分段线性近似 $S$ 的拓扑特征，并使用第一个 $x$ 边区间来限制执行后续步骤（包括面片拟合和图元生成）所需的计算来实现的。在实践中，计算修剪显著提高了算法性能。

Basically computational trimming takes advantage of the information available from the initial pass in which $x$ -edge intervals indicate something about the location of the isocontour. For example (Figure 4), if there are no $e _{jk}$ intersections along any of the four $E_{jk}$ that bound a particular cell row $R_{jk}$ , then an isocontour exists in those cells along $R_{jk}$ if and only if it passes through some $y$ -edge (2D) and/or $z$ -edge (3D). In such circumstances, because $Q$ is continuous, a single intersection check with any $y$ -edge (in 2D) or $y - z$ cell face (3D) is enough to determine whether the isocontour intersects any part of $R_{jk}$ . A similar argument can be made over contiguous runs of cells in $R_{jk}$ in a manner reminiscent to run-length encoding. In our implementation, to reduce algorithmic complexity and memory overhead we chose to keep just two trim positions for each $E_{jk}$ : the trim position on the left side where the isocontour first intersects $E_{jk}$ ; and the trim position on the right side where the isocontour last intersects $E_{jk}$ .

本质上，计算修剪利用了初始通道中可用的信息，其中 $x$ 边间隔指示有关等值线位置的信息。例如（图 4），如果沿着绑定特定单元格行 $R_{jk}$ 的四个 $E_{jk}$ 中的任何一个都没有 $e_{jk}$ 交点，则沿着这些单元格中存在等值线 $R_{jk}$ 当且仅当它穿过某些 $y$ 边 (2D) 和/或 $z$ 边 (3D) 时。在这种情况下，由于 $Q$ 是连续的，因此与任何 $y$ 边（2D 中）或 $y - z$ 单元面（3D）进行一次相交检查就足以确定等值线是否与 $R_{jk}$ 的任何部分相交。在 $R_{jk}$ 中的连续单元上也可以用类似的方法进行论证，这让人联想到运行长度编码（run-length encoding）。在我们的实现中，为了降低算法复杂度和内存开销，我们为每个 $E_{jk}$ 只保留了两个修剪位置：左侧等值线与 $E_{jk}$ 第一次相交的修剪位置；右侧等值线与 $E_{jk}$ 最后一次相交的修剪位置。

在这里插入图片描述
Figure 4: A 2D depiction of an adjusted trim edge interval. Shown are the trim position of intersection with the isocontour $Q$ to the furthest left $\overline{x}L _{jk}$ and furthest right $\overline{x}R_{jk}$ on the cell row $R_{jk}$ . The trim position, in combination with topological reasoning about the location of the continuous $Q$ , is used to eliminate unnecessary computation.

图 4：调整后的修剪边区间的二维描述。图中显示的是单元格行 $R_{jk}$ 上最左 $\overline{x}L _{jk}$ 和最右 $\overline{x}R_{jk}$ 与等值线 $Q$ 相交的修剪位置。修剪位置与有关连续 $Q$ 位置的拓扑推理相结合，可用于消除不必要的计算。

While the first pass identifies intersected x-edges (i.e., generates edge case values) and determines trim positions along $E_{jk}$ ; subsequent passes process cell rows $R_{jk}$ which (in 3D) are controlled by the four grid edges ${E_{jk},E_{(j+1)k},E_{j(k+1)},E_{(j+ 1)(k+1)}\}$ . Thus to process cell rows, an adjusted trim interval $\overline{x}L _{jk}, \overline{x}R_{jk})$ is computed where $\overline{x}L _{jk}$ is the leftmost trim position, and $\overline{x}R_{jk}$ is the rightmost trim position along the four edges defining $R_{jk}$ . Additionally, intersection checks are made with the cell faces at the positions $\overline{x}L _{jk}$ and $\overline{x}R_{jk}$ to ensure that the isocontour is not running parallel to the four $E_{jk}$ . If any intersection is found, the adjusted trim positions are reset to the leftmost ( $\overline{x}L _{jk}=0$ ) and/or rightmost ( $\overline{x}R_{jk} = (n − 1)$ ) locations (assuming the grid $x$ -dimension is $n$ ), as the isocontour must run to the grid boundary.

第一步识别相交的 x 边（即生成边值）并确定沿 $E_{jk}$ 的修剪位置；随后的步骤处理单元行 $R_{jk}$ ，这些单元行（在三维中）由四条网格边 ${E_{jk},E_{(j+1)k},E_{j(k+1)},E_{(j+ 1)(k+1)}\}$ 控制。因此，在处理单元行时，需要计算调整后的修剪区间 $\overline{x}L _{jk}, \overline{x}R_{jk})$ ，其中 $\overline{x}L _{jk}$ 是最左边的修剪位置， $\overline{x}R_{jk}$ 是沿着 $R_{jk}$ 定义的四条边的最右边的修剪位置。此外，还要对 $\overline{x}L _{jk}$ 和 $\overline{x}R_{jk}$ 位置的单元面进行相交检查，以确保等值线不平行于四个 $E_{jk}$ 。如果发现任何交集，调整后的修剪位置将重置为最左边（ $\overline{x}L _{jk}=0$ ）和/或最右边（ $\overline{x}R_{jk}= (n - 1)$ ）的位置（假设网格的 $x$ 维度为 $n$ ），因为等值线必须延伸到网格边界。

3.3 Point Merging

In many isocontouring algorithms such as MC and its derivatives, each interior cell edge is visited four times (since in a structured grid four cells share an edge). If the edge intersects the isocontour Q, four coincident points will be generated, requiring additional work and resources to interpolate and then store the points. In implementation, it is common to use some form of spatial locator or edge-based hash table to merge these coincident points to produce a crack-free polygonal surface. While there are algorithmic concerns from the use of the locator (e.g., extra memory requirements) probably the bigger impact is due to the computational bottleneck introduced by such an approach in a parallel environment. Multiple threads feeding into a single locator requires data locking which negatively impacts performance. Some implementations address this issue by instantiating multiple locators, one in each subsetted grid block which are processed independently, which can then be merged at the conclusion of surface generation at the additional cost of a parallel, tree-based compositing operation.

在许多等值线算法（例如 MC 及其衍生算法）中，每个内部单元边都会被访问四次（因为在结构化网格中，四个体素单元共享一条边）。如果边与等值面 Q 相交，将生成四个重合点，需要额外的工作和资源来插值然后存储这些点。在实现中，通常使用某种形式的空间定位器或基于边的哈希表来合并这些重合点以产生无缝的多边形表面。虽然使用定位器存在算法问题（例如，额外的内存需求），但更大的影响可能是由于这种方法在并行环境中引入的计算瓶颈。多个线程输入单个定位器需要数据锁定，这会对性能产生负面影响。一些实现通过实例化多个定位器来解决这个问题，每个子集网格块中都有一个定位器，这些定位器是独立处理的，然后可以在表面生成结束时进行合并，但需要并行、基于树的合成操作的额外成本。
Flying Edges eliminates all of these problems. First, because edges are processed by traversal of the cell axes $a_{ijk}$ across $R_{jk}$ , each of the $x, y, z$ cell edges is interpolated only once, generating at most one point per cell edge. Secondly, because the number of triangles, and the number of $x$ -edge, $y$ -edge and $z$ -edge intersection points (and hence point ids) is known along all cell rows at the outcome of the third pass, data arrays of the correct size can be preallocated to receive the output data. Further, as cell rows $R_{jk}$ are processed in the fourth pass, the $c_{ijk}$ implicitly defines how to increment each of the point ids as traversal moves from cell to neighboring cell along $R_{jk}$ (Figure 5). Geometry (interpolated point coordinates) and topology (triangles defined by three point ids) can be computed independently and directly written to the preallocated and partitioned output. Thus a locator or hash table is not required, thereby eliminating the point merging bottleneck while producing a crack-free $S$ across multiple threads.

Flying Edges 消除了所有这些问题。首先，由于边是通过遍历 $R_{jk}$ 的单元轴 $a_{ijk}$ 来处理的，因此每个 $x 、 y 、 z$ 单元边仅插值一次，每个单元边最多生成一个点。其次，因为在第三步的结果中，沿所有单元格行的三角形数量以及 $x$ 、 $y$ 、 $z$ 边上交点（以及点 ID）的数量是已知的，可以预先分配正确大小的数据数组来接收输出数据。此外，当单元格行 $R_{jk}$ 在第四步中处理时， $c_{ijk}$ 隐式定义了当遍历沿 $R_{jk}$ 从单元格移动到相邻单元格时如何递增每个点 id （图5）。几何（插值点坐标）和拓扑（由三个点 ID 定义的三角形）可以独立计算并直接写入预分配和分区的输出。因此不需要定位器或哈希表，从而消除点合并瓶颈，同时跨多个线程生成无缝的 $S$ 。

在这里插入图片描述
Figure 5: Processing a cell row $R_{ijk}$ to generate output, eliminating point merging. At the beginning of each cell row, the edge metadata–determined from the prefix sum of Pass 3–contains the starting output triangle ids, and $x$ -, $y$ -, and $z$ -point ids, and is used to initialize a cell row traversal process (illustrated at the top of the figure and shown in 2D). Then to traverse to the next adjacent cell along $R_{ijk}$ the edge use table $U(c_i)$ is applied to increment the point ids appropriately. The $a_{ijk}$ are used to generate the new points, while the other points are simply referenced as they will be produced in a separate thread. Newly generated points and triangles are directly written into previously allocated memory using their output ids.

图 5：处理单元格行 $R_{ijk}$ 以生成输出，消除点合并。在每个单元行的开头，边元数据（根据第 3 步的前缀和确定）包含起始输出三角形 id、 $x$ 、 $y$ 和 $z$ 点 id，并使用初始化单元格行遍历过程（如图顶部所示并以 2D 形式显示）。然后，为了沿着 $R_{ijk}$ 遍历到下一个相邻单元格，应用边使用表 $U(c_i)$ 来适当增加点 id。 $a_{ijk}$ 用于生成新点，而其他点仅被引用，因为它们将在单独的线程中生成。新生成的点和三角形使用其输出 id 直接写入先前分配的内存中。

3.4 Cell Axes on the Grid Boundary

The $a_{ijk}$ are key to efficient iteration over the grid since they ensure that every cell edge is processed only one time. However, on the positive x, y, and z grid boundaries, the $a_{ijk}$ are incomplete, requiring special handling. For example, when an x-edge terminates on the positive y-z grid face, the last point along the $E_{jk}$ is not associated with any cell, and hence the $a_{ijk}$ is undefined. Conceptually we create a partial $a_{ijk}$ consisting of just the y and z edges and process them. Similar, but more complex situations occur for the $E_{jk}$ located near the x-y and x-z grid boundaries. In implementation, dealing with these special cases adds some complexity to what is a relatively simple algorithm. Alternative ways to deal with this complexity is to pad the volume on the positive grid boundaries, or skip processing the last layer of cells on the positive grid boundaries.

$a_{ijk}$ 是网格高效迭代的关键，因为它们确保每个单元边仅处理一次。但是，在正 x、y 和 z 网格边界上， $a_{ijk}$ 不完整，需要特殊处理。例如，当 x 边终止于正 y-z 网格面上时，沿 $E_{jk}$ 的最后一个点不与任何单元格关联，因此 $a_{ijk}$ 未定义。从概念上讲，我们创建一个仅由 y 和 z 边组成的部分 $a_{ijk}$ 并对其进行处理。类似但更复杂的情况发生在位于 x-y 和 x-z 网格边界附近的 $E_{jk}$ 处。在实现中，处理这些特殊情况给相对简单的算法增加了一些复杂性。处理这种复杂性的替代方法是在正网格边界上填充体积，或者跳过处理正网格边界上的最后一层单元。

3.5 Degeneracies

Degeneracies occur when one or more cell vertices take on the value of the isocontour. Much of the literature ignores degenerate situations due to the assumed rarity of such occurrences, but in practice they are common. For example, integral-valued or limited precision scalar fields are finite in extent, and/or the choice of isovalue frequently is set to key field values which produces degeneracies.
当一个或多个单元顶点具有等值线的值时，就会出现退化现象。许多文献都忽略了退化情况，认为这种情况很少发生，但实际上退化情况很常见。例如，积分值或精度有限的标量场的范围是有限的，和/或等值线的选择经常设置为关键场值，从而产生退化。

Degeneracies typically result in the generation of non-manifold isocontours, or the creation of zero length (line) or zero area (triangle) primitives. This is typically due to the isocontouring “nicking” adjoining edges, at a common vertex, to produce such degenerate primitives. If the purpose of contour generation is strictly visual display, then such situations matter little as these primitives are generally handled properly during rendering. However, if the isocontour is to be processed in an analysis pipeline (e.g., topological analysis, subsequent interior mesh generation), degeneracies can introduce difficulties into the workflow. In many implementations of MC, degenerate primitives are culled prior to insertion into dynamic output arrays. However in FE, the initial passes count the number of output points and triangles and then allocates memory for them; once degenerate primitives are identified it is not possible to modify the output.

退化通常会导致生成非等值线，或创建零长度（直线）或零面积（三角形）基元。这通常是由于等高线在共同顶点处 “切割 ”了相邻的边，从而产生了这种退化基元。如果生成等值面的目的完全是为了视觉显示，那么这种情况并不重要，因为这些基元通常会在渲染过程中得到妥善处理。但是，如果要在分析管道中处理等值面（如拓扑分析、后续的内部网格生成），退化可能会给工作流程带来困难。在许多 MC 实现中，退化基元会在插入动态输出阵列之前被剔除。然而，在 FE 中，初始传递会计算输出点和三角形的数量，然后为它们分配内存；一旦识别出退化基元，就无法修改输出。

In the algorithm outlined here, the treatment of degeneracies can be expanded through modification of the edge-based case table and the way in which it is calculated. As described previously, edge bits are set when a cell edge intersects the isosurface $Q$ using a semi-open interval check; in the expanded case table an open edge interval is used (i.e., meaning that $s_{ijk}$ are classified in one of three ways: equal to, less than, or greater than $q$ ). Thus the number of entries in a vertex-based case table for voxel cells would increase to a total size of $3^8$ entries (versus $2^8$ for standard MC). While we have experimented with such alternative case tables in 2D, our workflow is such that an expanded case table is not needed at this time (a topic for future research).

在这里概述的算法中，可以通过修改基于边的案例表及其计算方式来扩展退化处理。如前所述，当单元边与等值面 $Q$ 相交时，使用半开区间检查设置边位；在扩展案例表中，使用开放边区间（即意味着 $s_{ijk}$ 按三种方式之一进行分类：等于、小于或大于 $q$ ）。因此，体素单元的基于顶点的案例表中的条目数将增加到总大小 $3^8$ 条目（相对于标准 MC 的 $2^8$ ）。虽然我们已经在 2D 中尝试了此类替代案例表，但我们的工作流程目前不需要扩展案例表（未来研究的主题）。

3.6 Load Balancing

Many parallel isocontouring algorithms devote significant effort towards logically subdividing and load balancing the computation. Typically approaches include organizing volumes into rectangular sub-grids or using a spatial structure such as an octree to manage the flow of execution [33, 9]. The challenge with isocontouring is that a priori it is generally not possible to know through which portion of the volume the isosurface will pass, meaning that some form of pre-processing (evaluating min-max scalar region within sub-volumes for example), or task stealing is used to balance the workload across computational threads. In the Flying Edges algorithm, the basic task is processing of an edge, in which each edge (or associated cell row) can be processed completely independently. In our implementation, we chose to use the Thread Building Blocks (TBB) library using the [parallel_for] construct to loop in parallel over all edges [14]. Behind the scenes, TBB manages a thread pool to process this list of edges, and since the workload across an edge may vary significantly, new edge tasks are stolen and assigned to the thread pool in order to ensure continued processing. Note that the algorithm does not require mutexes or other locks as the output data is allocated and partitioned prior to data being written.

许多并行等值线算法投入大量精力在逻辑细分和负载平衡计算。通常的方法包括将体积组织成矩形子网格或使用八叉树等空间结构来管理执行流程 [33, 9]。等值线算法的挑战在于，先验通常不可能知道等值面将穿过体积的哪一部分，这意味着需要某种形式的预处理（例如评估子体积内的最小-最大标量区域），或者任务窃取用于平衡计算线程之间的工作负载。在Flying Edges算法中，基本任务是处理一条边，其中每条边（或相关的单元格行）可以完全独立地处理。在我们的实现中，我们选择使用线程构建块 (TBB) 库，使用 [parallel_for] 构造在所有边上并行循环 [14]。在幕后，TBB 管理一个线程池来处理此边列表，并且由于跨边的工作负载可能会有很大差异，因此新的边任务会被窃取并分配给线程池，以确保持续处理。请注意，该算法不需要互斥锁或其他锁，因为输出数据在写入数据之前就已分配和分区。

3.7 Memory Overhead

The algorithm computes and stores information in the first two passes. $X$ -edge cases are described using two bits due to the four possible states defined from the two end vertex classifications. Thus with total grid size $N$ , then approximately $2 N$ bits of additional memory are consumed to represent the x-edge cases (compared to our typical grids with a 32-bit float or integer per $s_{ijk}$ ). In addition, edge metadata is stored for each $E_{jk}$ , which consists of six integer values: the number of x-, y-, and z-cell axes edge intersections; the number of triangles produced along the associated $R_{jk}$ ; and the left and right trim positions. Assuming that the grid is of dimensions $n^3$ , then the metadata requires $6n^2$ integer values (compared to the $n^3 = N$ grid size). Thus the total memory overhead in bytes, assuming eight byte integers, is $8·6n^2 + 2·n^3/8$ .

该算法在前两步中计算并存储信息。由于从两端顶点分类定义了四种可能的状态，因此使用两bit来描述 $X$ 边案例。因此，对于总网格大小 $N$ ，则消耗大约 $2 N$ bit 的额外内存来表示 x 边案例（与每个 $s_{ijk}$ 具有 32 位浮点数或整数的典型网格相比）。此外，还为每个 $E_{jk}$ 存储边元数据，它由六个整数值组成：x、y和z单元轴边交点的数量；沿关联的 $R_{jk}$ 生成的三角形数量；以及左右修剪位置。假设网格的尺寸为 $n^3$ ，则元数据需要 $6n^2$ 个整数值（与 $n^3 = N$ 网格大小相比）。因此，假设有八个字节整数，总内存开销（以字节为单位）为 $8·6n^2 + 2·n^3/8$ 。

4 RESULTS

5 CONCLUSION AND FUTURE WORK

We have developed a high-performance, scalable isocontouring algorithm for structured scalar data. To our knowledge it is currently the fastest non-preprocessed isocontouring algorithm on threaded, multi-core CPUs available today. The development of this algorithm has been motivated by the emergence of new parallel computing models, with the recognition that many current and future algorithms require rethinking if we are to take full advantage of modern computing hardware. Each pass of the Flying Edges is independently parallelizable across edges, resulting in scalable performance through task-stealing, edge-based computational tasks. Our results demonstrate this, although the actual computational load of any isocontouring algorithm based on an indexed lookup is relatively small, requiring just a few floating point operations to interpolate intersection points along edges. As a result, for very large data, much of the computing effort involves loading and/or accessing data across the memory hierarchy. Thus with large numbers of threads, memory bounds may limit the overall scalability.

我们为结构化标量数据开发了一种高性能、可扩展的等值线算法。据我们所知，它是目前线程多核 CPU 上最快的非预处理等值线算法。该算法的开发是由新的并行计算模型的出现推动的，我们认识到如果我们要充分利用现代计算硬件，则需要重新考虑许多当前和未来的算法。 Flying Edges 的每次传递都可以跨边独立并行，从而通过任务窃取、基于边的计算任务实现可扩展的性能。我们的结果证明了这一点，尽管任何基于索引查找的等值线算法的实际计算量都相对较小，只需要一些浮点运算即可沿边插入交点。因此，对于非常大的数据，大部分计算工作涉及跨内存层次结构加载和/或访问数据。因此，对于大量线程，内存限制可能会限制整体可扩展性。
To address memory constraints, distributed, hybrid parallel approaches may perform better as separate machines (and therefore memory spaces) can process a subset of the entire volume. Along similar lines, we have envisioned a parallel, hybrid implementation that performs the initial, two FE counting passes across all distributed, grid subsets, and then communicates point and triangle id offsets to each subsetted grid. Then each machine can process its data to produce seamless isocontours without the need to merge points across distributed grids.

为了解决内存限制，分布式混合并行方法可能会表现得更好，因为单独的机器（内存空间）可以处理整个卷的子集。沿着类似的思路，我们设想了一种并行的混合实现，它在所有分布式网格子集中执行初始的两次 FE 计数，然后将点和三角形 ID 偏移量传递给每个子集网格。然后，每台机器都可以处理其数据以生成无缝等值线，而无需跨分布式网格合并点。

Another way to view the Flying Edges algorithm is that it is a form of data traversal, with geometric constraints (continuous surfaces) informing the extent to which data is processed. Similar geometric constraints exist for related visualization algorithms such as scalar-based cutting and clipping. Also, the algorithm may be readily adapted to any topologically regular grid. In the future we plan on extending the algorithm to process rectilinear and structured grids, and clipping and cutting algorithms, based on an abstraction of the edge-based traversal process, with appropriate specialization required in the way edge interpolations (i.e., point coordinates and interpolated data) are generated.

看待 Flying Edges 算法的另一种方式是，它是一种数据遍历形式，通过几何约束（连续曲面）告知数据处理的程度。相关可视化算法（例如基于标量的切割和裁剪）也存在类似的几何约束。此外，该算法可以很容易地适应任何拓扑规则的网格。将来，我们计划将算法扩展到处理直线和结构化网格，以及基于基于边的遍历过程的抽象的裁剪和切割算法，并在边插值方式（即点坐标和插值）方面需要适当的专业化数据）生成。

We have already begun implementing this algorithm on GPUs. While our initial experiments are promising, there is some uncertainty as to whether the extra work required to track computational trimming is warranted. On one hand, processing entire edges $E_{jk}$ without early termination simplifies computation and reduces the likelihood of stalling the computational cores. On the other hand, computational trimming can significantly reduce the amount of data to be visited, and consequently the total computation performed. Future numerical experiments will provide guidance, although it is possible that differing hardware may produce contrary indications based on the particulars of the computing architecture. Finally, GPU processing of large volumes requires complex methods to stream data on and off the GPU when card memory is exceeded. The speed of Flying Edges is such that with very large data, the cost of data transfer to and from the GPU often exceeds the time to actually process the data on a multi-core CPU.

我们已经开始在 GPU 上实现该算法。虽然我们最初的实验很有希望，但跟踪计算修剪所需的额外工作是否有必要还存在一些不确定性。一方面，在不提前终止的情况下处理整个边 $E_{jk}$ 简化了计算并降低了计算核心停顿的可能性。另一方面，计算修剪可以显着减少要访问的数据量，从而减少执行的总计算量。未来的数值实验将提供指导，尽管不同的硬件可能会根据计算架构的细节产生相反的指示。最后，当卡内存超出时，大容量的 GPU 处理需要复杂的方法来在 GPU 上或从 GPU 上流式传输数据。 Flying Edges 的速度如此之快，以至于对于非常大的数据，进出 GPU 的数据传输成本通常超过在多核 CPU 上实际处理数据的时间。

概括