CUDA sample volumeRender

最新推荐文章于 2024-07-10 14:33:23 发布

shchojj

最新推荐文章于 2024-07-10 14:33:23 发布

阅读量659

点赞数

分类专栏： cuda

原文链接：https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

版权

cuda 专栏收录该内容

12 篇文章 1 订阅

订阅专栏

这篇博客深入探讨了CUDA在三维体积渲染中的应用，详细解释了CUDA线程组织、纹理对象、纹理过滤模式、光栅化过程以及如何利用OpenGL进行结果展示。通过示例代码，展示了如何使用CUDA内核执行并行计算，将数据从主机传输到设备，以及如何设置和销毁纹理对象。此外，还讨论了如何在CUDA中处理光照、深度和颜色累积，以实现高质量的体积渲染效果。

摘要由CSDN通过智能技术生成

CUDA sample volumeRender

如果出现错误，记得判定一下OpenGL是不是在集成显卡上生成的，如果是则把集成显卡禁用掉就行了。

1、thread、block、grid

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)

https://nyu-cds.github.io/python-gpu/02-cuda/

https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://cvw.cac.cornell.edu/GPU/thread

https://zhuanlan.zhihu.com/p/123170285

https://www.cnblogs.com/1024incn/p/4541313.html

https://nichijou.co/cuda2-warp/

将problem划分成coarse sub-problems in parallel by blocks of threads。

将每个sub-problem划分到block中更细的threads中。

kernel也就是一个操作，或者说c++编写的一个函数。所以才会有核函数的概念，一个kernel由一个grid组成。由一系列线程并行执行：

kernel内的线程完成同一操作，也就是 all threads run the same code。
每个线程都会有thread ID，该ID就是用来计算存储地址及其他操作。

grid是由一系列的线程block组成。

毫无疑问block就是由thread组成的。

基本顺序就是program由一系列的function组成，而一个function表示一个kernel grid，一个grid由一系列的block组成，一个block由一系列thread组成。

所以每个kernel grid都有自己独立的工作区。

Automatic Scalability.

每个kernel grid中的block完全独立，不同的block被分配到不同的 Streaming Multiprocessors (SMs),但是同一个block不可以分配到不同的SM。

单个block内的threads可以：

synchronize、
share memory
cooperate(communicate)

来自不同blocks的threads都是完全独立的，位移共享的资源就是global memory。

thread级任务并行性——不同的thread执行不同的任务
block和grid级并行——不同的block或grid执行不同的任务
data并行性——不同的threads和blocks处理memory中不同部分的数据

thread ID：每个block和thread都有自己唯一的索引，只能从运行的kernel中访问。变量blockIdx和threadIdx分别包含被调用block和thread的索引。

The thread index (threadIdx)
The block index (blockIdx)
The size and shape of a block (blockDim)
The size and shape of a grid (gridDim)

Global Memory

Global memory所有thread以及host（CPU）均可访问此内存:

Global memory内存由host分配和释放
用于初始化GPU将要处理的数据

Shared Memory

Shared memory:每个线程块都有自己的shared memory:

只能由block内的threads访问
比local或global momory快得多
需要特殊处理才能获得最佳性能
只在block的生命周期内存在

Local Memory

Local memory每个线程都有自己的private local memory：

只在thread的生命周期内存在
通常由编译器自动处理

Constant and texture memory 这些是所有线程均可访问的只读内存空间:

Constant memory 用于缓存所有功能单元共享的值
texture memory是为硬件提供的texture 操作而优化的

一般CUDA程序的三个主要步骤：

host-to-device transfer也就是数据重host memory to device steps。
加载GPU程序并执行，将数据缓存在芯片上以提高性能。
将结果从device memory复制到host memory，也称为device-to-host transfer。

每个CUDA kernel都以__global__声明说明符开头。程序员使用内置变量为每个thread提供unique glokal ID。

CUDA架构限制了每个块的线程数(1024个线程每个块限制)。
线程块的维度可以通过内置的blockDim变量在内核中访问。
一个block中的所有threads都可以使用一个内部__syncthreads来同步。对于__syncthreads，线程块中的所有线程必须等待，然后才能继续执行。
在<<<...>>>中指定的每个块的thread数和每个网格的block数，可以是int或dim3。这些三尖括号标记了从主机代码到设备代码的调用。它也被称为kernel launch。

Registers—这些是每个thread专用的，这意味着分配给该线程的寄存器对其他线程不可见。编译器做出有关寄存器利用率的决策。
L1/Shared memory (SMEM)—每个SM都有一个快速的on-chip scratched存储器，可用作L1 cache和shared memory。CUDA block中的所有线程可以共享shared memory，并且在给定SM上运行的所有CUDA块可以共享SM提供的物理内存资源。
Read-only memory—每个SM都具instruction cache，constant memory，texture和RO cache，这对kernel代码是只读的。
L2 cache—L2 cache在所有SM之间共享，因此每个CUDA block中的每个线程都可以访问该内存。
Global memory—这是GPU和位于GPU中的DRAM的帧缓冲区大小。

CUDA使用kernel execution configuration<<< ... >>>来告诉CUDA运行时在GPU上启动多少个threads。 CUDA将线程组织到称为“thread block”的组中。kernel可以启动组织为“grid”结构的多个thread blocks。

<<< M , T >>>//表示kernel以M个thread blocks的grid启动。每个thread block具有T个并行线程。

很容一看出来，一个kernel是一个grid，所以需要声明该grid有多少个thread blocks，并且每个thread block有多少个thread。

SP(Streaming Processor):流处理器，是GPU最基本的处理单元，在fermi架构开始被叫做CUDA core。

SIMD SM

SM(Streaming MultiProcessor): 一个SM由多个CUDA core组成，SM还包括特殊运算单元(SFU)，共享内存(shared memory)，寄存器文件(Register File)和调度器(Warp Scheduler)等。

按照SIMD模型，SM最小执行单元为warp，一个warp中有多个线程。SM执行单元SPs共享单个指令fetch/dispatch，这些线程将同一个指令应用于不同数据。因此，一个warp中的所有线程将总是具有相同的执行时间。

warp

warp是SM的基本执行单元。一个warp包含32个并行thread，这32个thread执行于SIMT(Single-Instruction, Multiple-Thread，单指令多线程)模式。也就是说所有thread执行同一条指令，并且每个thread会使用各自的data执行该指令。

一个warp中的线程必然在同一个block中,同一个block不会再两个SM中，也就是block会调用多个warp，如果block的线程数不能整除warp线程数，则最后一个warp没有填满，没填满的warp中的thread是inactive。只要调用warp就会消耗SM资源，只不过有些warp中的线程是inactive。

一个warp中的thread执行相同的指令，有相同的执行时间，如果某一个thread阻塞了，同一个warp的其他thread都会阻塞，因此有了warp divergence。所以warp divergence只会出现在同一个warp中。

同一个warp中的thread执行同一操作，如果因为控制流语句（if）进入不同的分支才触发的warp divergence，那么只要避免控制流语句触发即可。也就是将同一分支放入同一warp。

也就引入了branch efficiency代码比较简单的时候，CUDA编译器自动优化代码。

warp的context包含三个部分：

Program counter
Register
Shared memory

当一个block或得到足够的资源时，就成为active block。block中的warp就称为active warp。active warp又可以被分为下面三类：

Selected warp 被选中的warp
Stalled warp 没准备好要执行的称为Stalled warp
Eligible warp 没被选中，但是已经做好准备被执行的称为Eligible warp

SM中warp调度器每个cycle会挑选active warp送去执行，

warp适合执行需要满足下面两个条件：

32个CUDA core有空
所有当前指令的参数都准备就绪

cudaError_t cudaDeviceSynchronize(void)可以用来停住CUP等待CUDA中的操作完成：

__device__ void __syncthreads(void)当该函数被调用，block中的每个thread都会等待所有其他thread执行到某个点来实现同步。


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;//获取一维的线程索引
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);//并行声明返回cuda状态
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);//选择cuda 0作为默认选项
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output)    .
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));//为c创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));//为a创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));//为b创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);//将a内存拷贝给显存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);//将b内存拷贝给显存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();//同步
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);//显存拷贝到内存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

volumeRender_kernel.cu

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

// Simple 3D volume renderer

#ifndef _VOLUMERENDER_KERNEL_CU_
#define _VOLUMERENDER_KERNEL_CU_

#include <helper_cuda.h>
#include <helper_math.h>

typedef unsigned int  uint;
typedef unsigned char uchar;

cudaArray *d_volumeArray = 0;
cudaArray *d_transferFuncArray;

typedef unsigned char VolumeType;
//typedef unsigned short VolumeType;

cudaTextureObject_t	texObject; // For 3D texture
cudaTextureObject_t transferTex; // For 1D transfer function texture

typedef struct
{
    float4 m[3];
} float3x4;

__constant__ float3x4 c_invViewMatrix;  // inverse view matrix

struct Ray
{
    float3 o;   // origin
    float3 d;   // direction
};

// intersect ray with a box
// http://www.siggraph.org/education/materials/HyperGraph/raytrace/rtinter3.htm

__device__
int intersectBox(Ray r, float3 boxmin, float3 boxmax, float *tnear, float *tfar)
{
    // compute intersection of ray with all six bbox planes
    float3 invR = make_float3(1.0f) / r.d;
    float3 tbot = invR * (boxmin - r.o);
    float3 ttop = invR * (boxmax - r.o);

    // re-order intersections to find smallest and largest on each axis
    float3 tmin = fminf(ttop, tbot);
    float3 tmax = fmaxf(ttop, tbot);

    // find the largest tmin and the smallest tmax
    float largest_tmin = fmaxf(fmaxf(tmin.x, tmin.y), fmaxf(tmin.x, tmin.z));
    float smallest_tmax = fminf(fminf(tmax.x, tmax.y), fminf(tmax.x, tmax.z));

    *tnear = largest_tmin;
    *tfar = smallest_tmax;

    return smallest_tmax > largest_tmin;
}

// transform vector by matrix (no translation)
__device__
float3 mul(const float3x4 &M, const float3 &v)
{
    float3 r;
    r.x = dot(v, make_float3(M.m[0]));
    r.y = dot(v, make_float3(M.m[1]));
    r.z = dot(v, make_float3(M.m[2]));
    return r;
}

// transform vector by matrix with translation
__device__
float4 mul(const float3x4 &M, const float4 &v)
{
    float4 r;
    r.x = dot(v, M.m[0]);
    r.y = dot(v, M.m[1]);
    r.z = dot(v, M.m[2]);
    r.w = 1.0f;
    return r;
}

__device__ uint rgbaFloatToInt(float4 rgba)
{
    rgba.x = __saturatef(rgba.x);   // clamp to [0.0, 1.0]
    rgba.y = __saturatef(rgba.y);
    rgba.z = __saturatef(rgba.z);
    rgba.w = __saturatef(rgba.w);
    return (uint(rgba.w*255)<<24) | (uint(rgba.z*255)<<16) | (uint(rgba.y*255)<<8) | uint(rgba.x*255);

}

__global__ void
d_render(uint *d_output, uint imageW, uint imageH,
         float density, float brightness,
         float transferOffset, float transferScale, cudaTextureObject_t	tex,
         cudaTextureObject_t	transferTex)
{
    const int maxSteps = 500;
    const float tstep = 0.01f;
    const float opacityThreshold = 0.95f;
    const float3 boxMin = make_float3(-1.0f, -1.0f, -1.0f);
    const float3 boxMax = make_float3(1.0f, 1.0f, 1.0f);

    uint x = blockIdx.x*blockDim.x + threadIdx.x;
    uint y = blockIdx.y*blockDim.y + threadIdx.y;

    if ((x >= imageW) || (y >= imageH)) return;

    float u = (x / (float) imageW)*2.0f-1.0f;
    float v = (y / (float) imageH)*2.0f-1.0f;

    // calculate eye ray in world space
    Ray eyeRay;
    eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f)));
    eyeRay.d = normalize(make_float3(u, v, -2.0f));
    eyeRay.d = mul(c_invViewMatrix, eyeRay.d);

    // find intersection with box
    float tnear, tfar;
    int hit = intersectBox(eyeRay, boxMin, boxMax, &tnear, &tfar);

    if (!hit) return;

    if (tnear < 0.0f) tnear = 0.0f;     // clamp to near plane

    // march along ray from front to back, accumulating color
    float4 sum = make_float4(0.0f);
    float t = tnear;
    float3 pos = eyeRay.o + eyeRay.d*tnear;
    float3 step = eyeRay.d*tstep;

    for (int i=0; i<maxSteps; i++)
    {
        // read from 3D texture
        // remap position to [0, 1] coordinates
        float sample = tex3D<float>(tex, pos.x*0.5f+0.5f, pos.y*0.5f+0.5f, pos.z*0.5f+0.5f);
        //sample *= 64.0f;    // scale for 10-bit data

        // lookup in transfer function texture
        float4 col = tex1D<float4>(transferTex, (sample-transferOffset)*transferScale);
        col.w *= density;

        // "under" operator for back-to-front blending
        //sum = lerp(sum, col, col.w);

        // pre-multiply alpha
        col.x *= col.w;
        col.y *= col.w;
        col.z *= col.w;
        // "over" operator for front-to-back blending
        sum = sum + col*(1.0f - sum.w);

        // exit early if opaque
        if (sum.w > opacityThreshold)
            break;

        t += tstep;

        if (t > tfar) break;

        pos += step;
    }

    sum *= brightness;

    // write output color
    d_output[y*imageW + x] = rgbaFloatToInt(sum);
}

extern "C"
void setTextureFilterMode(bool bLinearFilter)
{
    if (texObject)
    {
        checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁texture对象
    }
    cudaResourceDesc            texRes;//定义资源描述子
    memset(&texRes,0,sizeof(cudaResourceDesc));//将textRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_volumeArray;//数组数据

    cudaTextureDesc             texDescr;//定义纹理描述子
    memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true;//指示纹理读取是否标准化
    texDescr.filterMode       = bLinearFilter ? cudaFilterModeLinear : cudaFilterModePoint;//纹理过滤模式

    texDescr.addressMode[0] = cudaAddressModeWrap;//Wrap地址模式
    texDescr.addressMode[1] = cudaAddressModeWrap;//Wrap地址模式
    texDescr.addressMode[2] = cudaAddressModeWrap;//Wrap地址模式

    texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数

    checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象

}

extern "C"
void initCuda(void *h_volume, cudaExtent volumeSize)
{
    // create 3D array
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<VolumeType>();//创建CUDA通道格式描述子
    checkCudaErrors(cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize));//创建volume显存

    // copy data to 3D array
    cudaMemcpy3DParms copyParams = {0};//CUDA 3D显存复制参数
    copyParams.srcPtr   = make_cudaPitchedPtr(h_volume, volumeSize.width*sizeof(VolumeType), volumeSize.width, volumeSize.height);//编码源存储器地址
    copyParams.dstArray = d_volumeArray;//目的地显存地址
    copyParams.extent   = volumeSize;//请求内存拷贝大小
    copyParams.kind     = cudaMemcpyHostToDevice;//类型的转移
    checkCudaErrors(cudaMemcpy3D(&copyParams));//在3D对象之间复制数据

    cudaResourceDesc            texRes;//CUDA资源描述子
    memset(&texRes, 0, sizeof(cudaResourceDesc));//将textRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_volumeArray;//数组数据

    cudaTextureDesc             texDescr;//定义纹理描述子
    memset(&texDescr, 0, sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
    texDescr.filterMode       = cudaFilterModeLinear; // linear interpolation 线性差值

    texDescr.addressMode[0] = cudaAddressModeClamp;  // clamp texture coordinates 限制到边缘地址模式
    texDescr.addressMode[1] = cudaAddressModeClamp;
    texDescr.addressMode[2] = cudaAddressModeClamp;

    texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数

    checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象

    // create transfer function texture 创建传递函数纹理
    float4 transferFunc[] =
    {
        {  0.0, 0.0, 0.0, 0.0, },
        {  1.0, 0.0, 0.0, 1.0, },
        {  1.0, 0.5, 0.0, 1.0, },
        {  1.0, 1.0, 0.0, 1.0, },
        {  0.0, 1.0, 0.0, 1.0, },
        {  0.0, 1.0, 1.0, 1.0, },
        {  0.0, 0.0, 1.0, 1.0, },
        {  1.0, 0.0, 1.0, 1.0, },
        {  0.0, 0.0, 0.0, 0.0, },
    };

    cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc<float4>();//创建CUDA通道格式描述子
    cudaArray *d_transferFuncArray;//CUDA 数组
    checkCudaErrors(cudaMallocArray(&d_transferFuncArray, &channelDesc2, sizeof(transferFunc)/sizeof(float4), 1));//创建d_transferFuncArray显存
    checkCudaErrors(cudaMemcpyToArray(d_transferFuncArray, 0, 0, transferFunc, sizeof(transferFunc), cudaMemcpyHostToDevice));//Copies data between host and device

    memset(&texRes,0,sizeof(cudaResourceDesc));//将texRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_transferFuncArray;//CUDA 数组

    memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
    texDescr.filterMode       = cudaFilterModeLinear; // linear interpolation 线性差值

    texDescr.addressMode[0] = cudaAddressModeClamp; // wrap texture coordinates

    texDescr.readMode = cudaReadModeElementType;//读取指定元素类型的纹理

    checkCudaErrors(cudaCreateTextureObject(&transferTex, &texRes, &texDescr, NULL));//创建一个纹理对象
}

extern "C"
void freeCudaBuffers()
{
    checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁一个纹理对象
    checkCudaErrors(cudaDestroyTextureObject(transferTex));//  销毁一个纹理对象
    checkCudaErrors(cudaFreeArray(d_volumeArray));//Frees an array on the device
    checkCudaErrors(cudaFreeArray(d_transferFuncArray));//Frees an array on the device
}


extern "C"
void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
                   float density, float brightness, float transferOffset, float transferScale)
{
    d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, density,
                                      brightness, transferOffset, transferScale,
                                      texObject, transferTex);
}

extern "C"
void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix)
{
    checkCudaErrors(cudaMemcpyToSymbol(c_invViewMatrix, invViewMatrix, sizeofMatrix));//Copies data to the given symbol on the device
}


#endif // #ifndef _VOLUMERENDER_KERNEL_CU_

volumeRender.cpp

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
    Volume rendering sample

    This sample loads a 3D volume from disk and displays it using
    ray marching and 3D textures.

    Note - this is intended to be an example of using 3D textures
    in CUDA, not an optimized volume renderer.

    Changes
    sgg 22/3/2010
    - updated to use texture for display instead of glDrawPixels.
    - changed to render from front-to-back rather than back-to-front.
*/

// OpenGL Graphics includes
#include <helper_gl.h>
#if defined (__APPLE__) || defined(MACOSX)
  #pragma clang diagnostic ignored "-Wdeprecated-declarations"
  #include <GLUT/glut.h>
  #ifndef glutCloseFunc
  #define glutCloseFunc glutWMCloseFunc
  #endif
#else
#include <GL/freeglut.h>
#endif

// CUDA Runtime, Interop, and includes
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cuda_profiler_api.h>
#include <vector_types.h>
#include <vector_functions.h>
#include <driver_functions.h>

// CUDA utilities
#include <helper_cuda.h>

// Helper functions
#include <helper_cuda.h>
#include <helper_functions.h>
#include <helper_timer.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned char uchar;

#define MAX_EPSILON_ERROR 5.00f
#define THRESHOLD         0.30f

// Define the files that are to be save and the reference images for validation
const char *sOriginal[] =
{
    "volume.ppm",
    NULL
};

const char *sReference[] =
{
    "ref_volume.ppm",
    NULL
};

const char *sSDKsample = "CUDA 3D Volume Render";//

//const char *volumeFilename = "Bucky.raw";
//cudaExtent volumeSize = make_cudaExtent(32, 32, 32);
//typedef unsigned char VolumeType;

const char* volumeFilename = "img_itk.raw";
cudaExtent volumeSize = make_cudaExtent(512, 512, 143);//创建cuda显存
typedef unsigned char VolumeType;


//char *volumeFilename = "mrt16_angio.raw";
//cudaExtent volumeSize = make_cudaExtent(416, 512, 112);
//typedef unsigned short VolumeType;

uint width = 512, height = 512;//生成的OpenGL大小
//uint width = 1536, height = 1536;

dim3 blockSize(16, 16);//block大小,二维，16*16=256个线程
dim3 gridSize;//grid大小，即block个数

float3 viewRotation;
float3 viewTranslation = make_float3(0.0, 0.0, -4.0f);
float invViewMatrix[12];

float density = 0.05f;
float brightness = 1.0f;
float transferOffset = 0.0f;
float transferScale = 1.0f;
bool linearFiltering = true;

GLuint pbo = 0;     // OpenGL pixel buffer object
GLuint tex = 0;     // OpenGL texture object
struct cudaGraphicsResource *cuda_pbo_resource; // CUDA Graphics Resource (to transfer PBO) CUDA图形资源(用于传输PBO pixel buffer object)

StopWatchInterface *timer = 0; //定义 StopWatch Interface接口 

// Auto-Verification Code 自动验证码
const int frameCheckNumber = 2;
int fpsCount = 0;        // FPS count for averaging FPS计算平均值
int fpsLimit = 1;        // FPS limit for sampling 采样FPS极限
int g_Index = 0;
unsigned int frameCount = 0;

int *pArgc;
char **pArgv;

#ifndef MAX
#define MAX(a,b) ((a > b) ? a : b)
#endif

extern "C" void setTextureFilterMode(bool bLinearFilter);//创建一个3D texture
extern "C" void initCuda(void *h_volume, cudaExtent volumeSize);//create 3D array
extern "C" void freeCudaBuffers();//释放cuda对象
extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
                              float density, float brightness, float transferOffset, float transferScale);//render kernel
extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix);//逆矩阵

void initPixelBuffer();//初始化openGL

void computeFPS()
{
    frameCount++;
    fpsCount++;

    if (fpsCount == fpsLimit)
    {
        char fps[256];
        float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);//   1s/返回计时器执行的平均时间
        sprintf(fps, "Volume Render: %3.1f fps", ifps);

        glutSetWindowTitle(fps);//OpenGL标题
        fpsCount = 0;

        fpsLimit = (int)MAX(1.f, ifps);
        sdkResetTimer(&timer);//重置计时器的计数器。
    }
}

// render image using CUDA 使用CUDA渲染图像
void render()
{
    copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//拷贝逆矩阵

    // map PBO to get CUDA device pointer
    uint *d_output;//opengl指针
    // map PBO to get CUDA device pointer
    checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));//Map graphics resources for access by CUDA
    size_t num_bytes;
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes,
                                                         cuda_pbo_resource));//Get an device pointer through which to access a mapped graphics resource.
    //printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes);

    // clear image
    checkCudaErrors(cudaMemset(d_output, 0, width*height*4));//清除image

    // call CUDA kernel, writing results to PBO
    render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//开始渲染

    getLastCudaError("kernel failed");

    checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0));
}

// display results using OpenGL (called by GLUT)
void display()
{
    sdkStartTimer(&timer);

    // use OpenGL to build view matrix
    GLfloat modelView[16];
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();
    glLoadIdentity();
    glRotatef(-viewRotation.x, 1.0, 0.0, 0.0);
    glRotatef(-viewRotation.y, 0.0, 1.0, 0.0);
    glTranslatef(-viewTranslation.x, -viewTranslation.y, -viewTranslation.z);
    glGetFloatv(GL_MODELVIEW_MATRIX, modelView);
    glPopMatrix();

    invViewMatrix[0] = modelView[0];
    invViewMatrix[1] = modelView[4];
    invViewMatrix[2] = modelView[8];
    invViewMatrix[3] = modelView[12];
    invViewMatrix[4] = modelView[1];
    invViewMatrix[5] = modelView[5];
    invViewMatrix[6] = modelView[9];
    invViewMatrix[7] = modelView[13];
    invViewMatrix[8] = modelView[2];
    invViewMatrix[9] = modelView[6];
    invViewMatrix[10] = modelView[10];
    invViewMatrix[11] = modelView[14];
    clock_t start, end;
    start = clock();
    render();
    end = clock();
    printf("render time: %f \n", double(end - start) / CLOCKS_PER_SEC);

    // display results
    glClear(GL_COLOR_BUFFER_BIT);

    // draw image from PBO
    glDisable(GL_DEPTH_TEST);

    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
#if 0
    // draw using glDrawPixels (slower)
    glRasterPos2i(0, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
#else
    // draw using texture

    // copy from pbo to texture
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

    // draw textured quad
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0);
    glVertex2f(0, 0);
    glTexCoord2f(1, 0);
    glVertex2f(1, 0);
    glTexCoord2f(1, 1);
    glVertex2f(1, 1);
    glTexCoord2f(0, 1);
    glVertex2f(0, 1);
    glEnd();

    glDisable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, 0);
#endif

    glutSwapBuffers();
    glutReportErrors();

    sdkStopTimer(&timer);

    computeFPS();
}

void idle()
{
    glutPostRedisplay();
}

void keyboard(unsigned char key, int x, int y)
{
    switch (key)
    {
        case 27:
            #if defined (__APPLE__) || defined(MACOSX)
                exit(EXIT_SUCCESS);
            #else
                glutDestroyWindow(glutGetWindow());
                return;
            #endif
            break;

        case 'f':
            linearFiltering = !linearFiltering;
            setTextureFilterMode(linearFiltering);
            break;

        case '+':
            density += 0.01f;
            break;

        case '-':
            density -= 0.01f;
            break;

        case ']':
            brightness += 0.1f;
            break;

        case '[':
            brightness -= 0.1f;
            break;

        case ';':
            transferOffset += 0.01f;
            break;

        case '\'':
            transferOffset -= 0.01f;
            break;

        case '.':
            transferScale += 0.01f;
            break;

        case ',':
            transferScale -= 0.01f;
            break;

        default:
            break;
    }

    printf("density = %.2f, brightness = %.2f, transferOffset = %.2f, transferScale = %.2f\n", density, brightness, transferOffset, transferScale);
    glutPostRedisplay();
}

int ox, oy;
int buttonState = 0;

void mouse(int button, int state, int x, int y)
{
    if (state == GLUT_DOWN)
    {
        buttonState  |= 1<<button;
    }
    else if (state == GLUT_UP)
    {
        buttonState = 0;
    }

    ox = x;
    oy = y;
    glutPostRedisplay();
}

void motion(int x, int y)
{
    float dx, dy;
    dx = (float)(x - ox);
    dy = (float)(y - oy);

    if (buttonState == 4)
    {
        // right = zoom
        viewTranslation.z += dy / 100.0f;
    }
    else if (buttonState == 2)
    {
        // middle = translate
        viewTranslation.x += dx / 100.0f;
        viewTranslation.y -= dy / 100.0f;
    }
    else if (buttonState == 1)
    {
        // left = rotate
        viewRotation.x += dy / 5.0f;
        viewRotation.y += dx / 5.0f;
    }

    ox = x;
    oy = y;
    glutPostRedisplay();
}

int iDivUp(int a, int b)
{
    return (a % b != 0) ? (a / b + 1) : (a / b);
}

void reshape(int w, int h)
{
    width = w;
    height = h;
    initPixelBuffer();

    // calculate new grid size
    gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));

    glViewport(0, 0, w, h);

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
}

void cleanup()
{
    sdkDeleteTimer(&timer);

    freeCudaBuffers();

    if (pbo)
    {
        cudaGraphicsUnregisterResource(cuda_pbo_resource);
        glDeleteBuffers(1, &pbo);
        glDeleteTextures(1, &tex);
    }
    // Calling cudaProfilerStop causes all profile data to be
    // flushed before the application exits
    checkCudaErrors(cudaProfilerStop());
}

void initGL(int *argc, char **argv)
{
    // initialize GLUT callback functions
    glutInit(argc, argv);
    glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE);
    glutInitWindowSize(width, height);
    glutCreateWindow("CUDA volume rendering");

    if (!isGLVersionSupported(2,0) ||
        !areGLExtensionsSupported("GL_ARB_pixel_buffer_object"))
    {
        printf("Required OpenGL extensions are missing.");
        exit(EXIT_SUCCESS);
    }
}

void initPixelBuffer()
{
    if (pbo)
    {
        // unregister this buffer object from CUDA C ;从CUDA C中注销这个缓冲区对象
        checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));//从CUDA C中注销CUDA图形资源(用于传输PBO pixel buffer object)

        // delete old buffer
        glDeleteBuffers(1, &pbo);// delete old bufferobject
        glDeleteTextures(1, &tex);// delete old texture object
    }

    // create pixel buffer object for display;为显示创建像素缓冲对象
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW_ARB);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

    // register this buffer object with CUDA
    checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));//用CUDA注册这个缓冲区对象，CUDA只会写入该资源，不会读取该资源，从CUDA到openGL

    // create texture for display
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glBindTexture(GL_TEXTURE_2D, 0);
}

// Load raw data from disk
void *loadRawFile(char *filename, size_t size)
{
    FILE *fp = fopen(filename, "rb");

    if (!fp)
    {
        fprintf(stderr, "Error opening file '%s'\n", filename);
        return 0;
    }

    void *data = malloc(size);
    size_t read = fread(data, 1, size, fp);
    fclose(fp);

#if defined(_MSC_VER_)
    printf("Read '%s', %Iu bytes\n", filename, read);
#else
    printf("Read '%s', %zu bytes\n", filename, read);
#endif

    return data;
}

void runSingleTest(const char *ref_file, const char *exec_path)
{
    bool bTestResult = true;

    uint *d_output;
    checkCudaErrors(cudaMalloc((void **)&d_output, width*height*sizeof(uint)));//分配显存
    checkCudaErrors(cudaMemset(d_output, 0, width*height*sizeof(uint)));//初始化

    float modelView[16] =
    {
        1.0f, 0.0f, 0.0f, 0.0f,
        0.0f, 1.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 1.0f, 0.0f,
        0.0f, 0.0f, 4.0f, 1.0f
    };//变换矩阵

    invViewMatrix[0] = modelView[0];
    invViewMatrix[1] = modelView[4];
    invViewMatrix[2] = modelView[8];
    invViewMatrix[3] = modelView[12];
    invViewMatrix[4] = modelView[1];
    invViewMatrix[5] = modelView[5];
    invViewMatrix[6] = modelView[9];
    invViewMatrix[7] = modelView[13];
    invViewMatrix[8] = modelView[2];
    invViewMatrix[9] = modelView[6];
    invViewMatrix[10] = modelView[10];
    invViewMatrix[11] = modelView[14];

    // call CUDA kernel, writing results to PBO
    copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//

    // Start timer 0 and process n loops on the GPU
    int nIter = 10;

    for (int i = -1; i < nIter; i++)
    {
        if (i == 0)
        {
            cudaDeviceSynchronize();//等待GPU同步完成
            sdkStartTimer(&timer);//开启计时器
        }

        render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//渲染函数
    }

    cudaDeviceSynchronize();//等待GPU同步完成
    sdkStopTimer(&timer);//结束计时器
    // Get elapsed time and throughput, then log to sample and master logs 获取运行时间和吞吐量，然后记录到样本日志和主日志中
    double dAvgTime = sdkGetTimerValue(&timer)/(nIter * 1000.0);
    printf("volumeRender, Throughput = %.4f MTexels/s, Time = %.5f s, Size = %u Texels, NumDevsUsed = %u, Workgroup = %u\n",
           (1.0e-6 * width * height)/dAvgTime, dAvgTime, (width * height), 1, blockSize.x * blockSize.y);


    getLastCudaError("Error: render_kernel() execution FAILED");
    checkCudaErrors(cudaDeviceSynchronize());

    unsigned char *h_output = (unsigned char *)malloc(width*height*4);
    checkCudaErrors(cudaMemcpy(h_output, d_output, width*height*4, cudaMemcpyDeviceToHost));

    sdkSavePPM4ub("volume.ppm", h_output, width, height);
    bTestResult = sdkComparePPM("volume.ppm", sdkFindFilePath(ref_file, exec_path), MAX_EPSILON_ERROR, THRESHOLD, true);

    cudaFree(d_output);
    free(h_output);
    cleanup();

    exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE);
}


// Program main

int
main(int argc, char **argv)
{
    pArgc = &argc;
    pArgv = argv;

    char *ref_file = NULL;

#if defined(__linux__)
    setenv ("DISPLAY", ":0", 0);
#endif

    //start logs
    printf("%s Starting...\n\n", sSDKsample);

    if (checkCmdLineFlag(argc, (const char **)argv, "file"))
    {
        getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file);
        fpsLimit = frameCheckNumber;
    }

    if (ref_file)
    {
        findCudaDevice(argc, (const char **)argv);// Initialization code to find the best CUDA Device
    }
    else
    {
        // First initialize OpenGL context, so we can properly set the GL for CUDA.
        // This is necessary in order to achieve optimal performance with OpenGL/CUDA interop.
        initGL(&argc, argv);//首先初始化OpenGL上下文

        findCudaDevice(argc, (const char **)argv);//初始化CUDA
    }

    // parse arguments
    char *filename;

    if (getCmdLineArgumentString(argc, (const char **) argv, "volume", &filename))
    {
        volumeFilename = filename;
    }

    int n;

    if (checkCmdLineFlag(argc, (const char **) argv, "size"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "size");
        volumeSize.width = volumeSize.height = volumeSize.depth = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "xsize"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "xsize");
        volumeSize.width = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "ysize"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "ysize");
        volumeSize.height = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "zsize"))
    {
        n= getCmdLineArgumentInt(argc, (const char **) argv, "zsize");
        volumeSize.depth = n;
    }

    // load volume data
    char *path = sdkFindFilePath(volumeFilename, argv[0]);//查找文件

    if (path == 0)
    {
        printf("Error finding file '%s'\n", volumeFilename);
        exit(EXIT_FAILURE);
    }

    size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(VolumeType);//获取文件大小
    void *h_volume = loadRawFile(path, size);//读取raw数据

    initCuda(h_volume, volumeSize);//初始化显存
    free(h_volume);//释放内存数据

    sdkCreateTimer(&timer);//创建计时器

    printf("Press '+' and '-' to change density (0.01 increments)\n"
           "      ']' and '[' to change brightness\n"
           "      ';' and ''' to modify transfer function offset\n"
           "      '.' and ',' to modify transfer function scale\n\n");

    // calculate new grid size
    gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));//根据blocksize和图像的大小计算grid大小

    if (ref_file)
    {
        runSingleTest(ref_file, argv[0]);
    }
    else
    {
        // This is the normal rendering path for VolumeRender
        glutDisplayFunc(display);
        glutKeyboardFunc(keyboard);
        glutMouseFunc(mouse);
        glutMotionFunc(motion);
        glutReshapeFunc(reshape);
        glutIdleFunc(idle);

        initPixelBuffer();

#if defined (__APPLE__) || defined(MACOSX)
        atexit(cleanup);
#else
        glutCloseFunc(cleanup);
#endif

        glutMainLoop();
    }
}