[Paper Reading] A Ray-Unit Parallel Marching Cubes Algorithm on CPU/GPU Heterogeneous Architectures - CSDN Blog

Article link: https://blog.csdn.net/m0_50910915/article/details/142378265

rupMC: a ray-unit parallel marching cubes algorithm on CPU/GPU heterogeneous architectures

ABSTRACT
1. Introduction
2. Method
3. Experiments
4. Conclusion and future work


ARTICLE HISTORY
Received 1 November 2023
Accepted 29 February 2024

Quite a recent paper.

ABSTRACT

The marching cubes (MC) algorithm is widely used for extracting isosurfaces from volume data and 3D visualizations because of its effectiveness and robustness, but it requires extensive memory and computing time for large-scale applications. Additionally, MC isosurfaces lack topological information, making them difficult to use in some geologic applications. To overcome these limitations, this study proposes an enhanced MC using a CPU/GPU heterogeneous architecture, called the ray-unit parallel MC (rupMC) algorithm. First, ray units serve as the basic voxel for determining how the surface intersects the data, which reduces repeated computations and enhances efficiency. Then, rupMC uses multiple computing processes and threads on a CPU/GPU heterogeneous architecture to process points concurrently. Finally, the unique surface intersection indices are preserved to compose the surface triangles, and the topological surface information is directly embedded in the triangle compositions. Experiments on five stratum datasets of varying sizes demonstrated that rupMC was dozens of times faster than the serial MC implementations and about 4 times faster than a parallel DMC. rupMC demonstrated high scalability and adaptability to various CPUs/GPUs and datasets of various sizes. rupMC has remarkable capabilities for efficiently and feasibly extracting precise surface intersections and triangles, making it well-suited for large-scale and high-density applications.


1. Introduction

…
This study proposes an enhanced MC algorithm on a CPU/GPU heterogeneous architecture called ray-unit parallel MC (rupMC), which addresses the three aforementioned problems. First, a ray unit is used for the marching computations, which consists of a point and three rays along the three directions of the data composition. The lookup index for the ray unit is still created based on the state of the eight vertices of a cube and serves as a pointer in the precalculated table that provides all edge intersections for a given cube configuration. Edge intersections are preserved and calculated only once in the corresponding ray unit. Second, the algorithm is parallelized on a CPU/GPU heterogeneous architecture. The Input/Output (I/O) process is parallelized by multiple CPU processes. Once a sub-dataset is input, it is transmitted to another CPU process and computed on a GPU. Simultaneously, the next sub-dataset is input. Such pipelined task distribution saves time when reading large amounts of data. The computations are also parallelized using multiple GPU threads to further improve computational efficiency. Furthermore, an adaptive domain decomposition technique is implemented to enable the efficient processing of datasets of any size across various CPU/GPU environments with varying computing capacities. Third, based on the ray unit, the unique indices of the intersections are preserved to compose the surface triangles, and the topological information of the surface is directly embedded in the triangle compositions.


The experiments show that rupMC reduces the number of repeated final products and generates the same surface triangles with a unique intersection index. rupMC significantly reduces computation time and maintains the topological information of the surface triangles. The proposed rupMC demonstrates high scalability and adaptability to various CPUs/GPUs and datasets of various sizes. rupMC has remarkable capabilities for efficiently and feasibly extracting precise surface intersections and triangles, making it well-suited for large-scale and high-density applications.


2. Method

2.1. Introduction of the marching cubes algorithm

…
The first issue with MC is associated with the cubic unit used for basic computing. Owing to the regular arrangement of the cubes, each cube edge is shared by four adjacent cubes. Using cubes as the basic marching unit for the calculation results in four repeated computations and preservation of the surface intersections, wasting computing time and memory. Thus, a new unit for marching computing is necessary to reduce redundant computations and results.
The second issue with MC is its high I/O and computational intensity. The MC algorithm produces an accurate surface when high-density sampling point data are used but requires large reading and writing times. In particular, for some real-time visualization applications, the extensive I/O time is a significant limitation. In addition, the computations for each marching unit are similar and independent of the computations for the other cubes. Therefore, a different approach is necessary to improve the I/O and computational efficiency.
The last issue with MC is the preservation of surface intersections and triangular compositions. MC is used not only for data visualization but also for some specific spatial analyses, such as the reconstruction of raw structural topological relationships between geological bodies, which require topologic details. Therefore, an approach is required that preserves the topological information of the isosurface triangles.


Reader's note on the first issue: FE and SN are built on the same basic idea.


2.2. rupMC: a ray-unit parallel marching cubes algorithm

To address the three aforementioned issues of MC, this study proposes a ray unit parallel MC algorithm called rupMC. As shown in Figure 4, the three key improvements in rupMC are as follows.
(1) A ray unit, which consists of a pixel and three rays along the three data composition directions, is used for the marching computations. The lookup index for the ray unit is still created based on the state of the eight cube vertices and serves as a pointer in the predefined table that provides all edge intersections for a given cube configuration. Edge intersections are preserved and calculated only once in the corresponding ray units.
(2) The algorithm is parallelized using a CPU/GPU heterogeneous architecture. The I/O process is parallelized by multiple CPU processes, and the computations are parallelized on GPUs to further improve the computational efficiency. Furthermore, an adaptive domain decomposition technique is implemented to enable the efficient processing of datasets of any size across various CPU/GPU environments with varying computing capacities.
(3) Based on the ray unit, the unique indices of the intersections are preserved to compose the surface triangles, and the topological information of the surface is directly embedded in the triangle compositions.
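
The lookup index mentioned in (1) is the classic MC case index. A minimal sketch of how such an index can be built is given below; the corner ordering, the "below the isovalue" convention, and the name `cube_case_index` are assumptions made for illustration, not details taken from the rupMC source.

```python
from typing import Sequence


def cube_case_index(corner_values: Sequence[float], isovalue: float) -> int:
    """Return the 8-bit marching cubes case index for one cube.

    Bit i is set when corner i lies below the isovalue. The corner order
    is an assumption of this sketch; any fixed order works as long as the
    precomputed lookup table uses the same one.
    """
    if len(corner_values) != 8:
        raise ValueError("a cube has exactly eight corners")
    index = 0
    for bit, value in enumerate(corner_values):
        if value < isovalue:
            index |= 1 << bit
    return index


# Three simple configurations: no corner below, all below, only corner 0 below.
all_above = cube_case_index([1.0] * 8, isovalue=0.5)
all_below = cube_case_index([0.0] * 8, isovalue=0.5)
one_below = cube_case_index([0.0] + [1.0] * 7, isovalue=0.5)
```

The resulting integer (0 to 255) selects one row of the precomputed table, which lists the cube edges that the surface crosses for that configuration.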


2.2.1. Ray-unit computing units

A cube is the basic computing unit in the MC algorithm and is created from eight points, four each from two adjacent slices. Owing to the regular arrangement of the cubes, each cube edge is shared by four adjacent cubes. Thus, using cubes as the marching unit to identify and calculate edge intersections will require four repeated computations and preservations, wasting computing time and memory space. Thus, a new unit for marching computing is necessary to reduce redundant computations and results.
By simplifying the cube unit, the ray unit becomes sufficient to represent a three-dimensional dataset without overlapping edges (as shown in Figure 5). The ray unit consists of a point and three rays along the three directions of the data composition. Thus, each point has a corresponding ray unit. The lookup index for the ray unit is still created based on the state of the eight cube vertices and serves as a pointer in the predefined table that provides all edge intersections for a particular cube configuration.
In rupMC, two sets of marks are created: one for recording the count of surface triangles in the cube, and the other to mark whether there are surface intersections on each edge of the ray unit. These marks are used for the calculation and preservation of the surface intersections (details are described in Section 2.2.3).
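
To make the saving concrete, the sketch below counts edge evaluations for a cube-based march against the number of distinct grid edges, and computes the second set of marks (whether each of the three rays of a ray unit crosses the surface). The function names and the nested-list data layout are hypothetical; the paper does not describe its data structures at this level.

```python
def cube_edge_visits(nx: int, ny: int, nz: int) -> int:
    """Edge evaluations when every cube processes all 12 of its edges."""
    return 12 * (nx - 1) * (ny - 1) * (nz - 1)


def unique_edges(nx: int, ny: int, nz: int) -> int:
    """Number of distinct grid edges, i.e. edges owned by some ray unit."""
    return (nx - 1) * ny * nz + nx * (ny - 1) * nz + nx * ny * (nz - 1)


def ray_unit_marks(values, i, j, k, isovalue):
    """Mark which of the three rays leaving point (i, j, k) cross the isosurface.

    `values[i][j][k]` is a nested list of samples. A ray that would leave
    the grid is never marked.
    """
    nx, ny, nz = len(values), len(values[0]), len(values[0][0])
    inside = values[i][j][k] < isovalue
    marks = []
    for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
        a, b, c = i + di, j + dj, k + dk
        if a < nx and b < ny and c < nz:
            marks.append((values[a][b][c] < isovalue) != inside)
        else:
            marks.append(False)
    return tuple(marks)


# 2 x 2 x 2 grid: one cube, only corner (0, 0, 0) is below the isovalue.
grid = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
grid[0][0][0] = 0.0
corner_marks = ray_unit_marks(grid, 0, 0, 0, isovalue=0.5)
far_marks = ray_unit_marks(grid, 1, 1, 1, isovalue=0.5)

# Redundancy on a 102 x 151 x 702 grid (the size of the original dataset).
ratio = cube_edge_visits(102, 151, 702) / unique_edges(102, 151, 702)
```

On a large grid the ratio approaches 4, which matches the statement that each interior edge is shared by four cubes; boundary edges are shared by fewer cubes, so the ratio stays slightly below 4.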


Figure 5. Ray-unit of rupMC.

This is very similar to the basic unit used in FE; see the cell axes $a_{ijk}$ in Figure 1 of the FE paper.


2.2.2. Parallelization on CPU/GPU architectures

rupMC calculates the edge intersections and surface triangles using the raw point values and marks generated in each ray unit. The computations of each ray unit are independent of those of the other units, which implies that the computational processes can be parallelized. rupMC takes advantage of the parallel computing capability of the CPU and GPU and uses multiple computational processes and threads to improve computing efficiency. The I/O processes of rupMC are parallelized based on the Message Passing Interface (MPI) using Open MPI, a high-performance message passing library for communication between CPU processes (https://www.open-mpi.org/). The computing processes of rupMC are parallelized with the Compute Unified Device Architecture (CUDA), a parallel computing programming model for general computing on GPUs (https://developer.nvidia.com/about-cuda).


The parallel-computing framework of rupMC on a CPU/GPU heterogeneous architecture is shown in Figure 4. First, the parameters of the data are input into the algorithm by CPU Process 0. rupMC adaptively divides the entire dataset into multiple sub-datasets based on the number of CPU processes, such that each sub-dataset can be accommodated and processed by one process. Each sub-dataset is then read and transmitted from Process 0 to the corresponding computing process.
Second, each process receives a sub-dataset from Process 0 and adaptively divides it into multiple subdomains based on the available memory space of the GPU. Subsequently, the data of each subdomain are transmitted from the CPU memory to the GPU memory for processing. Each GPU thread is responsible for the data input and all computations of one ray unit. Thus, multiple GPU threads generate lookup indices and marks for multiple points simultaneously. The edge intersections and surface triangle constructions are then calculated using GPU threads, and the results are transmitted from the GPU memory back to the CPU memory. Once all loops on the GPU are completed, the sub-results are transmitted from the corresponding processes to Process 1.
Finally, Process 1 receives the sub-results and writes them into the final files.
Figure 4. rupMC flowchart.
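
The two-level decomposition described above (dataset to sub-datasets per CPU process, then sub-dataset to subdomains per GPU memory budget) can be sketched roughly as follows. This is an illustration under assumptions: the excerpt does not say along which axis rupMC splits or how boundaries are handled, so the sketch splits along slices and lets neighbouring parts share one boundary slice; `split_slices` and `parts_for_memory` are hypothetical names.

```python
def split_slices(n_slices: int, n_parts: int):
    """Split slices [0, n_slices) into contiguous half-open ranges.

    Cubes live between consecutive slices, so the cube layers are divided
    evenly and each part also receives the slice that closes its last layer.
    Neighbouring parts therefore share exactly one boundary slice.
    """
    layers = n_slices - 1
    n_parts = max(1, min(n_parts, layers))
    base, extra = divmod(layers, n_parts)
    ranges, start = [], 0
    for part in range(n_parts):
        count = base + (1 if part < extra else 0)
        ranges.append((start, start + count + 1))
        start += count
    return ranges


def parts_for_memory(n_slices: int, slice_bytes: int, budget_bytes: int) -> int:
    """Smallest number of parts such that every part fits in the memory budget."""
    max_slices = budget_bytes // slice_bytes
    if max_slices < 2:
        raise ValueError("budget must hold at least two slices")
    layers = n_slices - 1
    return -(-layers // (max_slices - 1))  # ceiling division


# Level 1: 702 slices over 4 CPU processes.
process_ranges = split_slices(702, 4)

# Level 2: one process splits its slices for a GPU that can hold 50 slices
# of 102 x 151 samples at 4 bytes each (the sample size is an assumption).
slice_bytes = 102 * 151 * 4
first_start, first_stop = process_ranges[0]
n_sub = parts_for_memory(first_stop - first_start, slice_bytes, 50 * slice_bytes)
gpu_ranges = split_slices(first_stop - first_start, n_sub)
```

Sharing a boundary slice is one simple way to make sure that cubes straddling two parts are still processed; the actual implementation may handle boundaries differently.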


2.2.3. Index of intersections

The MC calculates the surface intersections and triangular compositions in each cube whose edges are shared by the four adjacent cubes. Thus, the locations of the intersections are repeated, and it is difficult to show the topological relationships between the surface triangles.
rupMC substitutes the ray unit for the cube and reduces the number of redundant computations. It marks whether intersections exist on the edges of a ray unit and records the number of surface triangles. Using these marks, rupMC calculates each intersection only on the edge of the ray unit that owns it, eliminating repeated computations and results. Surface triangles are composed of unique intersection indices; therefore, it is easy to determine the topological connections between surface triangles. For a detailed example, please refer to Figure 9 of our experiments.
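
A small sketch of why unique indices make adjacency easy to recover is given below. The edge numbering (`3 * point_index + axis`) and the triangle index values are made up for illustration; the excerpt does not state how rupMC actually numbers its intersections.

```python
from collections import defaultdict
from itertools import combinations


def ray_edge_id(i, j, k, axis, nx, ny):
    """Hypothetical unique id of the ray leaving point (i, j, k) along axis 0/1/2."""
    return 3 * (i + nx * (j + ny * k)) + axis


def edge_id_from_cube(cube, corner_offset, axis, nx, ny):
    """Id of a cube edge, named by the cube origin plus the offset of the edge start."""
    i, j, k = (c + o for c, o in zip(cube, corner_offset))
    return ray_edge_id(i, j, k, axis, nx, ny)


def shared_edges(triangles):
    """Map each undirected edge (pair of point indices) to the triangles that use it."""
    edge_to_tris = defaultdict(list)
    for t, tri in enumerate(triangles):
        for a, b in combinations(sorted(tri), 2):
            edge_to_tris[(a, b)].append(t)
    return {edge: tris for edge, tris in edge_to_tris.items() if len(tris) > 1}


# The same physical x-edge seen from two neighbouring cubes gets one id.
nx, ny = 4, 4
id_a = edge_id_from_cube((1, 1, 1), (0, 1, 0), 0, nx, ny)
id_b = edge_id_from_cube((1, 2, 1), (0, 0, 0), 0, nx, ny)

# Two triangles that reference the same two intersection indices are neighbours.
triangles = [(0, 1, 127), (0, 127, 126)]
adjacent = shared_edges(triangles)
```

With repeated intersections, as in a plain cube-by-cube MC, the same location would appear under several indices and this lookup would miss the shared edge unless the points were first merged by coordinate.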


3. Experiments

3.1. Datasets and test environment

A set of stratum data (Figure 6) was used in the experiments to assess the performance of rupMC, which provided 102 × 151 × 702 soil moisture content values at a 15 m × 15 m × 1 m spatial resolution. To assess the accuracy and computational performance of rupMC better, the original data were interpolated into four datasets of different sizes (204 × 302 × 1404, 306 × 453 × 2106, 408 × 604 × 2808, and 510 × 755 × 3510).
We compared rupMC to four other methods: the VTK-implemented MC, the VTK-implemented Flying Edges method (FE, Schroeder, Maynard, and Geveci 2015), a C++-implemented MC, and a parallel Dual MC (DMC, Grosso and Zint 2022) method. MC is typically implemented using the Visualization Toolkit (VTK), which is open-source software for image processing and visualization (https://vtk.org). We reimplemented MC as a serial program using C++ to better evaluate the computational performance. The FE method is a high-performance iso-contouring algorithm for structured data and is available in VTK. The parallel dual-MC approach generates meshes with better triangular and quadrilateral quality and was implemented with CUDA on a GPU.
The experiments were conducted on a single computer; the hardware and software environments are listed in Table 1. The VTK-implemented MC, VTK-implemented FE, C++-implemented MC, DMC, and rupMC were all tested on Computer 1.

Figure 6. Original stratum data, which provides 102 × 151 × 702 values at 15 m × 15 m × 1 m spatial resolution.
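
For a sense of scale, the short calculation below lists the number of samples in each of the five datasets and a rough memory footprint. The 4-byte sample size is an assumption; the excerpt does not state the storage type.

```python
BASE_DIMS = (102, 151, 702)
BYTES_PER_SAMPLE = 4  # assumed 32-bit values, not stated in the paper excerpt

sizes = {}
for factor in range(1, 6):
    dims = tuple(factor * n for n in BASE_DIMS)
    count = dims[0] * dims[1] * dims[2]
    sizes[factor] = (dims, count)
    gib = count * BYTES_PER_SAMPLE / 2**30
    print(f"{dims[0]} x {dims[1]} x {dims[2]}: {count:,} samples, about {gib:.2f} GiB")
```

Because each interpolation step scales all three axes, the sample count grows with the cube of the factor, so the largest dataset holds 125 times as many values as the original. This is the situation in which splitting the data by available GPU memory becomes necessary.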


The test data seem somewhat lacking in variety.


3.2. Overall accuracy and visualization

The stratum datasets were extracted using the VTK-implemented MC, VTK-implemented FE, C++-implemented MC, DMC, and rupMC. The visualizations of the isosurface extracted by these methods were compared to assess the accuracy. As shown in Figure 7, the visualizations of the VTK-implemented MC, VTK-implemented FE, C++-implemented MC, and rupMC were the same. DMC used the MC algorithm to extract the isosurface and simplified these triangles, leading to fewer triangles but a rougher surface. The isosurfaces extracted by DMC were closed; thus, only the outermost isosurface was visible.
The counts of the surface intersections (PTS) and triangles (TRA) obtained using these methods are shown in Figure 8. VTK is designed mainly for computer graphics and visualization, so the MC algorithm in VTK uses a simplified method to generate surface triangles for rapid visualization (Avila 2010). Consequently, the VTK-implemented MC generated fewer surface intersections and triangles, and with lower accuracy, than those generated by the MC. Compared with the C++-implemented MC, rupMC created the same surface triangles but preserved fewer intersections, mainly because of the reduction in redundant computations. When the lookup index serves as a pointer into the table that gives all edge intersections for a given cube configuration, rupMC marks whether there are intersections on the edges of the ray unit and records the count of surface triangles. Thus, rupMC calculated only the intersections on the edges of the corresponding ray unit, thereby reducing the number of repeated results. The results show that with these intersection and triangle marks, rupMC effectively reduced the number of repeated final products. The DMC used the face-coloring method to simplify quads that contain repeated intersections; thus, its number of surface intersections was low. However, the half-edge data structure in the DMC must be recomputed after the simplification step, which required many global memory accesses on the GPU and was time-consuming (Grosso and Zint 2022).
The surface triangle compositions and point indices generated by the C++-implemented MC and rupMC are shown in Figure 9. Compared to the C++-implemented MC, rupMC generated intersections with a unique index, leading to a reduction in repeated intersections. Owing to the unique index of each intersection, it was easy to find a common edge between two surface triangles and the corresponding topological adjacency relationships.
Figure 7. Visualizations of the isosurface extracted by five methods: (1) VTK-implemented MC; (2) VTK-implemented FE; (3) C++-implemented MC; (4) DMC; (5) rupMC.


So are there duplicate vertices in the output or not?

Figure 8. Counts of surface intersections (PTS) and triangles (TRA) by five methods.

Figure 9. Surface triangle compositions and intersections generated by the C++-implemented MC and rupMC. (1) Surface triangle compositions with repeated point indices generated by the C++-implemented MC; (2) surface triangle compositions with point indices and corresponding locations generated by the C++-implemented MC; (3) surface triangle compositions with unique point indices generated by rupMC; (4) surface triangle compositions with point indices and corresponding locations generated by rupMC. The topological adjacency relationships between surface triangles are shown.

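The table in this figure is a bit complicated and took me a while to read. In (3), the columns from left to right are the triangle index, the point indices, and the X, Y, Z coordinates of the points (most likely virtual coordinates), so each triangle corresponds to three points; the Y coordinate is omitted because it is the same for all points. The upper-right part of (4) embeds the MC indices for comparison; for example, point 1 in rupMC corresponds to point 0 (or 8 or 9) in MC. The lower-right part shows the topological relationships between triangles, that is, which points are shared; for example, triangle 0 and triangle 1 share point 0 and point 127.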

3.3. Computational performance of rupMC

The overall running times of the VTK-implemented MC, VTK-implemented FE, C++-implemented MC, DMC, and rupMC are listed in Table 2. The results showed that for all five datasets, rupMC on a CPU/GPU heterogeneous architecture achieved speedups of approximately 5 and 4 over the VTK-implemented MC and VTK-implemented FE, over 9 compared with the C++-implemented MC on a CPU, and over 2 compared with DMC on a GPU, indicating that rupMC significantly improved computational performance.
The speedup was consistent across all five datasets, suggesting that rupMC is scalable for varying data volumes. In addition to the use of multiple processes on the CPU and multithreading on the GPU, the reduction of redundant calculations, along with the adaptive domain decomposition technique, aided in improving the computational performance. Moreover, adaptive domain decomposition enables rupMC to manage datasets of any size and to operate on various types of CPU/GPUs.
The computation times without the I/O process and the corresponding speedups of the five methods are listed in Table 3. rupMC parallelized the computing processes through the multithreaded operation of the GPU, which significantly improved computational performance.


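The speedup is quite noticeable, several times faster than FE. However, VTK_MC and VTK_FE appear to perform about the same here; when I tested them myself earlier, the gap between the two was fairly obvious. Maybe it depends on the data?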



4. Conclusion and future work

This study proposes an enhanced MC (rupMC) on a CPU/GPU heterogeneous architecture to extract surface intersections and triangles, which has three key improvements. First, a ray unit is used in rupMC as the basic voxel for MC, consisting of a point and three rays along the three directions of the data composition. Second, rupMC uses multiple CPU processes for the I/O of data and multiple GPU threads for computing. This pipelined task distribution on a CPU/GPU heterogeneous architecture makes full use of the devices to process points concurrently. It automatically partitions the spatial domain of raw data into multiple subdomains for various GPUs with different computing capacities. Third, based on the ray unit, the unique indices of the intersections are preserved to compose the surface triangles, and the topological information of the surface is directly embedded in the triangular compositions.
In the experiments, the visualizations of the final products using rupMC were the same as those using the VTK-implemented MC, VTK-implemented FE, C++-implemented MC, and DMC. Compared with the C++-implemented MC, rupMC generated fewer intersections, each with a unique index, and created the same surface triangles. The results showed that rupMC effectively reduced the number of repeated final products and made it easier to reveal the topological connections between surface triangles.
In addition, rupMC achieved speedups of approximately 50, 160, 10, and 4 compared to the VTK-implemented MC, VTK-implemented FE, C++-implemented MC, and DMC, respectively, indicating significant improvements in computational performance. Through parallelization, rupMC significantly improved the computing performance and achieved speedups of hundreds of times for all five datasets over the C++-implemented MC. By using multi-processing and multi-threading on CPU/GPU heterogeneous architectures and changing the basic computing unit, rupMC reduced redundant computations and achieved a speedup of approximately two over the parallel DMC. In summary, the technique demonstrated high scalability and adaptability for varying CPUs/GPUs and datasets of various sizes. rupMC has remarkable capabilities for efficiently and feasibly extracting precise surface intersections and triangles, making it well-suited for large-scale and high-density applications. The source code of rupMC is publicly available at https://github.com/HPSCIL/rupMC.
In our future work, rupMC will be adopted for applications such as surface reconstruction from LIDAR point clouds and real-time visualization of geological models.
