CUDA sample volumeRender

这篇博客深入探讨了CUDA在三维体积渲染中的应用,详细解释了CUDA线程组织、纹理对象、纹理过滤模式、光栅化过程以及如何利用OpenGL进行结果展示。通过示例代码,展示了如何使用CUDA内核执行并行计算,将数据从主机传输到设备,以及如何设置和销毁纹理对象。此外,还讨论了如何在CUDA中处理光照、深度和颜色累积,以实现高质量的体积渲染效果。
摘要由CSDN通过智能技术生成

CUDA sample volumeRender

如果出现错误,记得判定一下OpenGL是不是在集成显卡上生成的,如果是则把集成显卡禁用掉就行了。

1、thread、block、grid

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)

https://nyu-cds.github.io/python-gpu/02-cuda/

https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://cvw.cac.cornell.edu/GPU/thread

https://zhuanlan.zhihu.com/p/123170285

https://www.cnblogs.com/1024incn/p/4541313.html

https://nichijou.co/cuda2-warp/

将problem划分成coarse sub-problems in parallel by blocks of threads。

将每个sub-problem划分到block中更细的threads中。

kernel也就是一个操作,或者说c++编写的一个函数。所以才会有核函数的概念,一个kernel由一个grid组成。由一系列线程并行执行:

  • kernel内的线程完成同一操作,也就是 all threads run the same code。
  • 每个线程都会有thread ID,该ID就是用来计算存储地址及其他操作。 

grid是由一系列的线程block组成。

毫无疑问block就是由thread组成的。

基本顺序就是program由一系列的function组成,而一个function表示一个kernel grid,一个grid由一系列的block组成,一个block由一系列thread组成。

所以每个kernel grid都有自己独立的工作区。

Automatic Scalability.

每个kernel grid中的block完全独立,不同的block被分配到不同的 Streaming Multiprocessors (SMs),但是同一个block不可以分配到不同的SM。

单个block内的threads可以:

  • synchronize、
  • share memory
  • cooperate(communicate)

来自不同blocks的threads都是完全独立的,位移共享的资源就是global memory。

  • thread级任务并行性——不同的thread执行不同的任务
  • block和grid级并行——不同的block或grid执行不同的任务
  • data并行性——不同的threads和blocks处理memory中不同部分的数据

thread ID:每个block和thread都有自己唯一的索引,只能从运行的kernel中访问。变量blockIdx和threadIdx分别包含被调用block和thread的索引。

  • The thread index (threadIdx)
  • The block index (blockIdx)
  • The size and shape of a block (blockDim)
  • The size and shape of a grid (gridDim)

Global Memory

Global memory所有thread以及host(CPU)均可访问此内存:

  • Global memory内存由host分配和释放
  • 用于初始化GPU将要处理的数据

Shared Memory

Shared memory:每个线程块都有自己的shared memory:

  • 只能由block内的threads访问
  • 比local或global momory快得多
  • 需要特殊处理才能获得最佳性能
  • 只在block的生命周期内存在

Local Memory

Local memory每个线程都有自己的private local memory:

  • 只在thread的生命周期内存在
  • 通常由编译器自动处理

Constant and texture memory 这些是所有线程均可访问的只读内存空间:

  • Constant memory 用于缓存所有功能单元共享的值
  • texture memory是为硬件提供的texture 操作而优化的

一般CUDA程序的三个主要步骤:

  •  host-to-device transfer也就是数据重host memory to device steps。
  •   加载GPU程序并执行,将数据缓存在芯片上以提高性能。
  •   将结果从device memory复制到host memory,也称为device-to-host transfer。

每个CUDA kernel都以__global__声明说明符开头。程序员使用内置变量为每个thread提供unique glokal ID。

  • CUDA架构限制了每个块的线程数(1024个线程每个块限制)。
  • 线程块的维度可以通过内置的blockDim变量在内核中访问。
  • 一个block中的所有threads都可以使用一个内部__syncthreads来同步。对于__syncthreads,线程块中的所有线程必须等待,然后才能继续执行。
  • 在<<<...>>>中指定的每个块的thread数和每个网格的block数,可以是int或dim3。这些三尖括号标记了从主机代码到设备代码的调用。它也被称为kernel launch。

  • Registers—这些是每个thread专用的,这意味着分配给该线程的寄存器对其他线程不可见。编译器做出有关寄存器利用率的决策。
  • L1/Shared memory (SMEM)—每个SM都有一个快速的on-chip scratched存储器,可用作L1 cache和shared memory。CUDA block中的所有线程可以共享shared memory,并且在给定SM上运行的所有CUDA块可以共享SM提供的物理内存资源。
  • Read-only memory—每个SM都具instruction cache,constant memory,texture和RO cache,这对kernel代码是只读的。
  • L2 cache—L2 cache在所有SM之间共享,因此每个CUDA block中的每个线程都可以访问该内存。
  • Global memory—这是GPU和位于GPU中的DRAM的帧缓冲区大小。

CUDA使用kernel execution configuration<<< ... >>>来告诉CUDA运行时在GPU上启动多少个threads。 CUDA将线程组织到称为“thread block”的组中。kernel可以启动组织为“grid”结构的多个thread blocks。

<<< M , T >>>//表示kernel以M个thread blocks的grid启动。每个thread block具有T个并行线程。

很容一看出来,一个kernel是一个grid,所以需要声明该grid有多少个thread blocks,并且每个thread block有多少个thread。

SP(Streaming Processor):流处理器, 是GPU最基本的处理单元,在fermi架构开始被叫做CUDA core

SIMD SM

SM(Streaming MultiProcessor): 一个SM由多个CUDA core组成,SM还包括特殊运算单元(SFU),共享内存(shared memory),寄存器文件(Register File)和调度器(Warp Scheduler)等。

按照SIMD模型,SM最小执行单元为warp,一个warp中有多个线程。SM执行单元SPs共享单个指令fetch/dispatch,这些线程将同一个指令应用于不同数据。因此,一个warp中的所有线程将总是具有相同的执行时间。

warp

warp是SM的基本执行单元。一个warp包含32个并行thread,这32个thread执行于SIMT(Single-Instruction, Multiple-Thread,单指令多线程)模式。也就是说所有thread执行同一条指令,并且每个thread会使用各自的data执行该指令。

一个warp中的线程必然在同一个block中,同一个block不会再两个SM中,也就是block会调用多个warp,如果block的线程数不能整除warp线程数,则最后一个warp没有填满,没填满的warp中的thread是inactive。只要调用warp就会消耗SM资源,只不过有些warp中的线程是inactive。

一个warp中的thread执行相同的指令,有相同的执行时间,如果某一个thread阻塞了,同一个warp的其他thread都会阻塞,因此有了warp divergence。所以warp divergence只会出现在同一个warp中。

同一个warp中的thread执行同一操作,如果因为控制流语句(if)进入不同的分支才触发的warp divergence,那么只要避免控制流语句触发即可。也就是将同一分支放入同一warp。

也就引入了branch efficiency代码比较简单的时候,CUDA编译器自动优化代码。

warp的context包含三个部分:

  1. Program counter
  2. Register
  3. Shared memory

当一个block或得到足够的资源时,就成为active block。block中的warp就称为active warp。active warp又可以被分为下面三类:

  1. Selected warp 被选中的warp
  2. Stalled warp 没准备好要执行的称为Stalled warp
  3. Eligible warp 没被选中,但是已经做好准备被执行的称为Eligible warp

SM中warp调度器每个cycle会挑选active warp送去执行,

warp适合执行需要满足下面两个条件:

  1. 32个CUDA core有空
  2. 所有当前指令的参数都准备就绪

cudaError_t cudaDeviceSynchronize(void)可以用来停住CUP等待CUDA中的操作完成:

__device__ void __syncthreads(void)当该函数被调用,block中的每个thread都会等待所有其他thread执行到某个点来实现同步。


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;//获取一维的线程索引
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);//并行声明返回cuda状态
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);//选择cuda 0作为默认选项
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output)    .
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));//为c创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));//为a创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));//为b创建cuda数据
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);//将a内存拷贝给显存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);//将b内存拷贝给显存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();//同步
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);//显存拷贝到内存
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

 

 

volumeRender_kernel.cu

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

// Simple 3D volume renderer

#ifndef _VOLUMERENDER_KERNEL_CU_
#define _VOLUMERENDER_KERNEL_CU_

#include <helper_cuda.h>
#include <helper_math.h>

typedef unsigned int  uint;
typedef unsigned char uchar;

cudaArray *d_volumeArray = 0;
cudaArray *d_transferFuncArray;

typedef unsigned char VolumeType;
//typedef unsigned short VolumeType;

cudaTextureObject_t	texObject; // For 3D texture
cudaTextureObject_t transferTex; // For 1D transfer function texture

typedef struct
{
    float4 m[3];
} float3x4;

__constant__ float3x4 c_invViewMatrix;  // inverse view matrix

struct Ray
{
    float3 o;   // origin
    float3 d;   // direction
};

// intersect ray with a box
// http://www.siggraph.org/education/materials/HyperGraph/raytrace/rtinter3.htm

__device__
int intersectBox(Ray r, float3 boxmin, float3 boxmax, float *tnear, float *tfar)
{
    // compute intersection of ray with all six bbox planes
    float3 invR = make_float3(1.0f) / r.d;
    float3 tbot = invR * (boxmin - r.o);
    float3 ttop = invR * (boxmax - r.o);

    // re-order intersections to find smallest and largest on each axis
    float3 tmin = fminf(ttop, tbot);
    float3 tmax = fmaxf(ttop, tbot);

    // find the largest tmin and the smallest tmax
    float largest_tmin = fmaxf(fmaxf(tmin.x, tmin.y), fmaxf(tmin.x, tmin.z));
    float smallest_tmax = fminf(fminf(tmax.x, tmax.y), fminf(tmax.x, tmax.z));

    *tnear = largest_tmin;
    *tfar = smallest_tmax;

    return smallest_tmax > largest_tmin;
}

// transform vector by matrix (no translation)
__device__
float3 mul(const float3x4 &M, const float3 &v)
{
    float3 r;
    r.x = dot(v, make_float3(M.m[0]));
    r.y = dot(v, make_float3(M.m[1]));
    r.z = dot(v, make_float3(M.m[2]));
    return r;
}

// transform vector by matrix with translation
__device__
float4 mul(const float3x4 &M, const float4 &v)
{
    float4 r;
    r.x = dot(v, M.m[0]);
    r.y = dot(v, M.m[1]);
    r.z = dot(v, M.m[2]);
    r.w = 1.0f;
    return r;
}

__device__ uint rgbaFloatToInt(float4 rgba)
{
    rgba.x = __saturatef(rgba.x);   // clamp to [0.0, 1.0]
    rgba.y = __saturatef(rgba.y);
    rgba.z = __saturatef(rgba.z);
    rgba.w = __saturatef(rgba.w);
    return (uint(rgba.w*255)<<24) | (uint(rgba.z*255)<<16) | (uint(rgba.y*255)<<8) | uint(rgba.x*255);

}

__global__ void
d_render(uint *d_output, uint imageW, uint imageH,
         float density, float brightness,
         float transferOffset, float transferScale, cudaTextureObject_t	tex,
         cudaTextureObject_t	transferTex)
{
    const int maxSteps = 500;
    const float tstep = 0.01f;
    const float opacityThreshold = 0.95f;
    const float3 boxMin = make_float3(-1.0f, -1.0f, -1.0f);
    const float3 boxMax = make_float3(1.0f, 1.0f, 1.0f);

    uint x = blockIdx.x*blockDim.x + threadIdx.x;
    uint y = blockIdx.y*blockDim.y + threadIdx.y;

    if ((x >= imageW) || (y >= imageH)) return;

    float u = (x / (float) imageW)*2.0f-1.0f;
    float v = (y / (float) imageH)*2.0f-1.0f;

    // calculate eye ray in world space
    Ray eyeRay;
    eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f)));
    eyeRay.d = normalize(make_float3(u, v, -2.0f));
    eyeRay.d = mul(c_invViewMatrix, eyeRay.d);

    // find intersection with box
    float tnear, tfar;
    int hit = intersectBox(eyeRay, boxMin, boxMax, &tnear, &tfar);

    if (!hit) return;

    if (tnear < 0.0f) tnear = 0.0f;     // clamp to near plane

    // march along ray from front to back, accumulating color
    float4 sum = make_float4(0.0f);
    float t = tnear;
    float3 pos = eyeRay.o + eyeRay.d*tnear;
    float3 step = eyeRay.d*tstep;

    for (int i=0; i<maxSteps; i++)
    {
        // read from 3D texture
        // remap position to [0, 1] coordinates
        float sample = tex3D<float>(tex, pos.x*0.5f+0.5f, pos.y*0.5f+0.5f, pos.z*0.5f+0.5f);
        //sample *= 64.0f;    // scale for 10-bit data

        // lookup in transfer function texture
        float4 col = tex1D<float4>(transferTex, (sample-transferOffset)*transferScale);
        col.w *= density;

        // "under" operator for back-to-front blending
        //sum = lerp(sum, col, col.w);

        // pre-multiply alpha
        col.x *= col.w;
        col.y *= col.w;
        col.z *= col.w;
        // "over" operator for front-to-back blending
        sum = sum + col*(1.0f - sum.w);

        // exit early if opaque
        if (sum.w > opacityThreshold)
            break;

        t += tstep;

        if (t > tfar) break;

        pos += step;
    }

    sum *= brightness;

    // write output color
    d_output[y*imageW + x] = rgbaFloatToInt(sum);
}

extern "C"
void setTextureFilterMode(bool bLinearFilter)
{
    if (texObject)
    {
        checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁texture对象
    }
    cudaResourceDesc            texRes;//定义资源描述子
    memset(&texRes,0,sizeof(cudaResourceDesc));//将textRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_volumeArray;//数组数据

    cudaTextureDesc             texDescr;//定义纹理描述子
    memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true;//指示纹理读取是否标准化
    texDescr.filterMode       = bLinearFilter ? cudaFilterModeLinear : cudaFilterModePoint;//纹理过滤模式

    texDescr.addressMode[0] = cudaAddressModeWrap;//Wrap地址模式
    texDescr.addressMode[1] = cudaAddressModeWrap;//Wrap地址模式
    texDescr.addressMode[2] = cudaAddressModeWrap;//Wrap地址模式

    texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数

    checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象

}

extern "C"
void initCuda(void *h_volume, cudaExtent volumeSize)
{
    // create 3D array
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<VolumeType>();//创建CUDA通道格式描述子
    checkCudaErrors(cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize));//创建volume显存

    // copy data to 3D array
    cudaMemcpy3DParms copyParams = {0};//CUDA 3D显存复制参数
    copyParams.srcPtr   = make_cudaPitchedPtr(h_volume, volumeSize.width*sizeof(VolumeType), volumeSize.width, volumeSize.height);//编码源存储器地址
    copyParams.dstArray = d_volumeArray;//目的地显存地址
    copyParams.extent   = volumeSize;//请求内存拷贝大小
    copyParams.kind     = cudaMemcpyHostToDevice;//类型的转移
    checkCudaErrors(cudaMemcpy3D(&copyParams));//在3D对象之间复制数据

    cudaResourceDesc            texRes;//CUDA资源描述子
    memset(&texRes, 0, sizeof(cudaResourceDesc));//将textRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_volumeArray;//数组数据

    cudaTextureDesc             texDescr;//定义纹理描述子
    memset(&texDescr, 0, sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
    texDescr.filterMode       = cudaFilterModeLinear; // linear interpolation 线性差值

    texDescr.addressMode[0] = cudaAddressModeClamp;  // clamp texture coordinates 限制到边缘地址模式
    texDescr.addressMode[1] = cudaAddressModeClamp;
    texDescr.addressMode[2] = cudaAddressModeClamp;

    texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数

    checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象

    // create transfer function texture 创建传递函数纹理
    float4 transferFunc[] =
    {
        {  0.0, 0.0, 0.0, 0.0, },
        {  1.0, 0.0, 0.0, 1.0, },
        {  1.0, 0.5, 0.0, 1.0, },
        {  1.0, 1.0, 0.0, 1.0, },
        {  0.0, 1.0, 0.0, 1.0, },
        {  0.0, 1.0, 1.0, 1.0, },
        {  0.0, 0.0, 1.0, 1.0, },
        {  1.0, 0.0, 1.0, 1.0, },
        {  0.0, 0.0, 0.0, 0.0, },
    };

    cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc<float4>();//创建CUDA通道格式描述子
    cudaArray *d_transferFuncArray;//CUDA 数组
    checkCudaErrors(cudaMallocArray(&d_transferFuncArray, &channelDesc2, sizeof(transferFunc)/sizeof(float4), 1));//创建d_transferFuncArray显存
    checkCudaErrors(cudaMemcpyToArray(d_transferFuncArray, 0, 0, transferFunc, sizeof(transferFunc), cudaMemcpyHostToDevice));//Copies data between host and device

    memset(&texRes,0,sizeof(cudaResourceDesc));//将texRes初始化为0

    texRes.resType            = cudaResourceTypeArray;//资源类型
    texRes.res.array.array    = d_transferFuncArray;//CUDA 数组

    memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0

    texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
    texDescr.filterMode       = cudaFilterModeLinear; // linear interpolation 线性差值

    texDescr.addressMode[0] = cudaAddressModeClamp; // wrap texture coordinates

    texDescr.readMode = cudaReadModeElementType;//读取指定元素类型的纹理

    checkCudaErrors(cudaCreateTextureObject(&transferTex, &texRes, &texDescr, NULL));//创建一个纹理对象
}

extern "C"
void freeCudaBuffers()
{
    checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁一个纹理对象
    checkCudaErrors(cudaDestroyTextureObject(transferTex));//  销毁一个纹理对象
    checkCudaErrors(cudaFreeArray(d_volumeArray));//Frees an array on the device
    checkCudaErrors(cudaFreeArray(d_transferFuncArray));//Frees an array on the device
}


extern "C"
void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
                   float density, float brightness, float transferOffset, float transferScale)
{
    d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, density,
                                      brightness, transferOffset, transferScale,
                                      texObject, transferTex);
}

extern "C"
void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix)
{
    checkCudaErrors(cudaMemcpyToSymbol(c_invViewMatrix, invViewMatrix, sizeofMatrix));//Copies data to the given symbol on the device
}


#endif // #ifndef _VOLUMERENDER_KERNEL_CU_

volumeRender.cpp

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
    Volume rendering sample

    This sample loads a 3D volume from disk and displays it using
    ray marching and 3D textures.

    Note - this is intended to be an example of using 3D textures
    in CUDA, not an optimized volume renderer.

    Changes
    sgg 22/3/2010
    - updated to use texture for display instead of glDrawPixels.
    - changed to render from front-to-back rather than back-to-front.
*/

// OpenGL Graphics includes
#include <helper_gl.h>
#if defined (__APPLE__) || defined(MACOSX)
  #pragma clang diagnostic ignored "-Wdeprecated-declarations"
  #include <GLUT/glut.h>
  #ifndef glutCloseFunc
  #define glutCloseFunc glutWMCloseFunc
  #endif
#else
#include <GL/freeglut.h>
#endif

// CUDA Runtime, Interop, and includes
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cuda_profiler_api.h>
#include <vector_types.h>
#include <vector_functions.h>
#include <driver_functions.h>

// CUDA utilities
#include <helper_cuda.h>

// Helper functions
#include <helper_cuda.h>
#include <helper_functions.h>
#include <helper_timer.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned char uchar;

#define MAX_EPSILON_ERROR 5.00f
#define THRESHOLD         0.30f

// Define the files that are to be save and the reference images for validation
const char *sOriginal[] =
{
    "volume.ppm",
    NULL
};

const char *sReference[] =
{
    "ref_volume.ppm",
    NULL
};

const char *sSDKsample = "CUDA 3D Volume Render";//

//const char *volumeFilename = "Bucky.raw";
//cudaExtent volumeSize = make_cudaExtent(32, 32, 32);
//typedef unsigned char VolumeType;

const char* volumeFilename = "img_itk.raw";
cudaExtent volumeSize = make_cudaExtent(512, 512, 143);//创建cuda显存
typedef unsigned char VolumeType;


//char *volumeFilename = "mrt16_angio.raw";
//cudaExtent volumeSize = make_cudaExtent(416, 512, 112);
//typedef unsigned short VolumeType;

uint width = 512, height = 512;//生成的OpenGL大小
//uint width = 1536, height = 1536;

dim3 blockSize(16, 16);//block大小,二维,16*16=256个线程
dim3 gridSize;//grid大小,即block个数

float3 viewRotation;
float3 viewTranslation = make_float3(0.0, 0.0, -4.0f);
float invViewMatrix[12];

float density = 0.05f;
float brightness = 1.0f;
float transferOffset = 0.0f;
float transferScale = 1.0f;
bool linearFiltering = true;

GLuint pbo = 0;     // OpenGL pixel buffer object
GLuint tex = 0;     // OpenGL texture object
struct cudaGraphicsResource *cuda_pbo_resource; // CUDA Graphics Resource (to transfer PBO) CUDA图形资源(用于传输PBO pixel buffer object)

StopWatchInterface *timer = 0; //定义 StopWatch Interface接口 

// Auto-Verification Code 自动验证码
const int frameCheckNumber = 2;
int fpsCount = 0;        // FPS count for averaging FPS计算平均值
int fpsLimit = 1;        // FPS limit for sampling 采样FPS极限
int g_Index = 0;
unsigned int frameCount = 0;

int *pArgc;
char **pArgv;

#ifndef MAX
#define MAX(a,b) ((a > b) ? a : b)
#endif

extern "C" void setTextureFilterMode(bool bLinearFilter);//创建一个3D texture
extern "C" void initCuda(void *h_volume, cudaExtent volumeSize);//create 3D array
extern "C" void freeCudaBuffers();//释放cuda对象
extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
                              float density, float brightness, float transferOffset, float transferScale);//render kernel
extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix);//逆矩阵

void initPixelBuffer();//初始化openGL

void computeFPS()
{
    frameCount++;
    fpsCount++;

    if (fpsCount == fpsLimit)
    {
        char fps[256];
        float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);//   1s/返回计时器执行的平均时间
        sprintf(fps, "Volume Render: %3.1f fps", ifps);

        glutSetWindowTitle(fps);//OpenGL标题
        fpsCount = 0;

        fpsLimit = (int)MAX(1.f, ifps);
        sdkResetTimer(&timer);//重置计时器的计数器。
    }
}

// render image using CUDA 使用CUDA渲染图像
void render()
{
    copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//拷贝逆矩阵

    // map PBO to get CUDA device pointer
    uint *d_output;//opengl指针
    // map PBO to get CUDA device pointer
    checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));//Map graphics resources for access by CUDA
    size_t num_bytes;
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes,
                                                         cuda_pbo_resource));//Get an device pointer through which to access a mapped graphics resource.
    //printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes);

    // clear image
    checkCudaErrors(cudaMemset(d_output, 0, width*height*4));//清除image

    // call CUDA kernel, writing results to PBO
    render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//开始渲染

    getLastCudaError("kernel failed");

    checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0));
}

// display results using OpenGL (called by GLUT)
void display()
{
    sdkStartTimer(&timer);

    // use OpenGL to build view matrix
    GLfloat modelView[16];
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();
    glLoadIdentity();
    glRotatef(-viewRotation.x, 1.0, 0.0, 0.0);
    glRotatef(-viewRotation.y, 0.0, 1.0, 0.0);
    glTranslatef(-viewTranslation.x, -viewTranslation.y, -viewTranslation.z);
    glGetFloatv(GL_MODELVIEW_MATRIX, modelView);
    glPopMatrix();

    invViewMatrix[0] = modelView[0];
    invViewMatrix[1] = modelView[4];
    invViewMatrix[2] = modelView[8];
    invViewMatrix[3] = modelView[12];
    invViewMatrix[4] = modelView[1];
    invViewMatrix[5] = modelView[5];
    invViewMatrix[6] = modelView[9];
    invViewMatrix[7] = modelView[13];
    invViewMatrix[8] = modelView[2];
    invViewMatrix[9] = modelView[6];
    invViewMatrix[10] = modelView[10];
    invViewMatrix[11] = modelView[14];
    clock_t start, end;
    start = clock();
    render();
    end = clock();
    printf("render time: %f \n", double(end - start) / CLOCKS_PER_SEC);

    // display results
    glClear(GL_COLOR_BUFFER_BIT);

    // draw image from PBO
    glDisable(GL_DEPTH_TEST);

    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
#if 0
    // draw using glDrawPixels (slower)
    glRasterPos2i(0, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
#else
    // draw using texture

    // copy from pbo to texture
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

    // draw textured quad
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0);
    glVertex2f(0, 0);
    glTexCoord2f(1, 0);
    glVertex2f(1, 0);
    glTexCoord2f(1, 1);
    glVertex2f(1, 1);
    glTexCoord2f(0, 1);
    glVertex2f(0, 1);
    glEnd();

    glDisable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, 0);
#endif

    glutSwapBuffers();
    glutReportErrors();

    sdkStopTimer(&timer);

    computeFPS();
}

void idle()
{
    glutPostRedisplay();
}

void keyboard(unsigned char key, int x, int y)
{
    switch (key)
    {
        case 27:
            #if defined (__APPLE__) || defined(MACOSX)
                exit(EXIT_SUCCESS);
            #else
                glutDestroyWindow(glutGetWindow());
                return;
            #endif
            break;

        case 'f':
            linearFiltering = !linearFiltering;
            setTextureFilterMode(linearFiltering);
            break;

        case '+':
            density += 0.01f;
            break;

        case '-':
            density -= 0.01f;
            break;

        case ']':
            brightness += 0.1f;
            break;

        case '[':
            brightness -= 0.1f;
            break;

        case ';':
            transferOffset += 0.01f;
            break;

        case '\'':
            transferOffset -= 0.01f;
            break;

        case '.':
            transferScale += 0.01f;
            break;

        case ',':
            transferScale -= 0.01f;
            break;

        default:
            break;
    }

    printf("density = %.2f, brightness = %.2f, transferOffset = %.2f, transferScale = %.2f\n", density, brightness, transferOffset, transferScale);
    glutPostRedisplay();
}

int ox, oy;
int buttonState = 0;

void mouse(int button, int state, int x, int y)
{
    if (state == GLUT_DOWN)
    {
        buttonState  |= 1<<button;
    }
    else if (state == GLUT_UP)
    {
        buttonState = 0;
    }

    ox = x;
    oy = y;
    glutPostRedisplay();
}

void motion(int x, int y)
{
    float dx, dy;
    dx = (float)(x - ox);
    dy = (float)(y - oy);

    if (buttonState == 4)
    {
        // right = zoom
        viewTranslation.z += dy / 100.0f;
    }
    else if (buttonState == 2)
    {
        // middle = translate
        viewTranslation.x += dx / 100.0f;
        viewTranslation.y -= dy / 100.0f;
    }
    else if (buttonState == 1)
    {
        // left = rotate
        viewRotation.x += dy / 5.0f;
        viewRotation.y += dx / 5.0f;
    }

    ox = x;
    oy = y;
    glutPostRedisplay();
}

int iDivUp(int a, int b)
{
    return (a % b != 0) ? (a / b + 1) : (a / b);
}

void reshape(int w, int h)
{
    width = w;
    height = h;
    initPixelBuffer();

    // calculate new grid size
    gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));

    glViewport(0, 0, w, h);

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
}

void cleanup()
{
    sdkDeleteTimer(&timer);

    freeCudaBuffers();

    if (pbo)
    {
        cudaGraphicsUnregisterResource(cuda_pbo_resource);
        glDeleteBuffers(1, &pbo);
        glDeleteTextures(1, &tex);
    }
    // Calling cudaProfilerStop causes all profile data to be
    // flushed before the application exits
    checkCudaErrors(cudaProfilerStop());
}

void initGL(int *argc, char **argv)
{
    // initialize GLUT callback functions
    glutInit(argc, argv);
    glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE);
    glutInitWindowSize(width, height);
    glutCreateWindow("CUDA volume rendering");

    if (!isGLVersionSupported(2,0) ||
        !areGLExtensionsSupported("GL_ARB_pixel_buffer_object"))
    {
        printf("Required OpenGL extensions are missing.");
        exit(EXIT_SUCCESS);
    }
}

void initPixelBuffer()
{
    if (pbo)
    {
        // unregister this buffer object from CUDA C ;从CUDA C中注销这个缓冲区对象
        checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));//从CUDA C中注销CUDA图形资源(用于传输PBO pixel buffer object)

        // delete old buffer
        glDeleteBuffers(1, &pbo);// delete old bufferobject
        glDeleteTextures(1, &tex);// delete old texture object
    }

    // create pixel buffer object for display;为显示创建像素缓冲对象
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW_ARB);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

    // register this buffer object with CUDA
    checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));//用CUDA注册这个缓冲区对象,CUDA只会写入该资源,不会读取该资源,从CUDA到openGL

    // create texture for display
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glBindTexture(GL_TEXTURE_2D, 0);
}

// Load raw data from disk
void *loadRawFile(char *filename, size_t size)
{
    FILE *fp = fopen(filename, "rb");

    if (!fp)
    {
        fprintf(stderr, "Error opening file '%s'\n", filename);
        return 0;
    }

    void *data = malloc(size);
    size_t read = fread(data, 1, size, fp);
    fclose(fp);

#if defined(_MSC_VER_)
    printf("Read '%s', %Iu bytes\n", filename, read);
#else
    printf("Read '%s', %zu bytes\n", filename, read);
#endif

    return data;
}

void runSingleTest(const char *ref_file, const char *exec_path)
{
    bool bTestResult = true;

    uint *d_output;
    checkCudaErrors(cudaMalloc((void **)&d_output, width*height*sizeof(uint)));//分配显存
    checkCudaErrors(cudaMemset(d_output, 0, width*height*sizeof(uint)));//初始化

    float modelView[16] =
    {
        1.0f, 0.0f, 0.0f, 0.0f,
        0.0f, 1.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 1.0f, 0.0f,
        0.0f, 0.0f, 4.0f, 1.0f
    };//变换矩阵

    invViewMatrix[0] = modelView[0];
    invViewMatrix[1] = modelView[4];
    invViewMatrix[2] = modelView[8];
    invViewMatrix[3] = modelView[12];
    invViewMatrix[4] = modelView[1];
    invViewMatrix[5] = modelView[5];
    invViewMatrix[6] = modelView[9];
    invViewMatrix[7] = modelView[13];
    invViewMatrix[8] = modelView[2];
    invViewMatrix[9] = modelView[6];
    invViewMatrix[10] = modelView[10];
    invViewMatrix[11] = modelView[14];

    // call CUDA kernel, writing results to PBO
    copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//

    // Start timer 0 and process n loops on the GPU
    int nIter = 10;

    for (int i = -1; i < nIter; i++)
    {
        if (i == 0)
        {
            cudaDeviceSynchronize();//等待GPU同步完成
            sdkStartTimer(&timer);//开启计时器
        }

        render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//渲染函数
    }

    cudaDeviceSynchronize();//等待GPU同步完成
    sdkStopTimer(&timer);//结束计时器
    // Get elapsed time and throughput, then log to sample and master logs 获取运行时间和吞吐量,然后记录到样本日志和主日志中
    double dAvgTime = sdkGetTimerValue(&timer)/(nIter * 1000.0);
    printf("volumeRender, Throughput = %.4f MTexels/s, Time = %.5f s, Size = %u Texels, NumDevsUsed = %u, Workgroup = %u\n",
           (1.0e-6 * width * height)/dAvgTime, dAvgTime, (width * height), 1, blockSize.x * blockSize.y);


    getLastCudaError("Error: render_kernel() execution FAILED");
    checkCudaErrors(cudaDeviceSynchronize());

    unsigned char *h_output = (unsigned char *)malloc(width*height*4);
    checkCudaErrors(cudaMemcpy(h_output, d_output, width*height*4, cudaMemcpyDeviceToHost));

    sdkSavePPM4ub("volume.ppm", h_output, width, height);
    bTestResult = sdkComparePPM("volume.ppm", sdkFindFilePath(ref_file, exec_path), MAX_EPSILON_ERROR, THRESHOLD, true);

    cudaFree(d_output);
    free(h_output);
    cleanup();

    exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE);
}


// Program main

int
main(int argc, char **argv)
{
    pArgc = &argc;
    pArgv = argv;

    char *ref_file = NULL;

#if defined(__linux__)
    setenv ("DISPLAY", ":0", 0);
#endif

    //start logs
    printf("%s Starting...\n\n", sSDKsample);

    if (checkCmdLineFlag(argc, (const char **)argv, "file"))
    {
        getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file);
        fpsLimit = frameCheckNumber;
    }

    if (ref_file)
    {
        findCudaDevice(argc, (const char **)argv);// Initialization code to find the best CUDA Device
    }
    else
    {
        // First initialize OpenGL context, so we can properly set the GL for CUDA.
        // This is necessary in order to achieve optimal performance with OpenGL/CUDA interop.
        initGL(&argc, argv);//首先初始化OpenGL上下文

        findCudaDevice(argc, (const char **)argv);//初始化CUDA
    }

    // parse arguments
    char *filename;

    if (getCmdLineArgumentString(argc, (const char **) argv, "volume", &filename))
    {
        volumeFilename = filename;
    }

    int n;

    if (checkCmdLineFlag(argc, (const char **) argv, "size"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "size");
        volumeSize.width = volumeSize.height = volumeSize.depth = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "xsize"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "xsize");
        volumeSize.width = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "ysize"))
    {
        n = getCmdLineArgumentInt(argc, (const char **) argv, "ysize");
        volumeSize.height = n;
    }

    if (checkCmdLineFlag(argc, (const char **) argv, "zsize"))
    {
        n= getCmdLineArgumentInt(argc, (const char **) argv, "zsize");
        volumeSize.depth = n;
    }

    // load volume data
    char *path = sdkFindFilePath(volumeFilename, argv[0]);//查找文件

    if (path == 0)
    {
        printf("Error finding file '%s'\n", volumeFilename);
        exit(EXIT_FAILURE);
    }

    size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(VolumeType);//获取文件大小
    void *h_volume = loadRawFile(path, size);//读取raw数据

    initCuda(h_volume, volumeSize);//初始化显存
    free(h_volume);//释放内存数据

    sdkCreateTimer(&timer);//创建计时器

    printf("Press '+' and '-' to change density (0.01 increments)\n"
           "      ']' and '[' to change brightness\n"
           "      ';' and ''' to modify transfer function offset\n"
           "      '.' and ',' to modify transfer function scale\n\n");

    // calculate new grid size
    gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));//根据blocksize和图像的大小计算grid大小

    if (ref_file)
    {
        runSingleTest(ref_file, argv[0]);
    }
    else
    {
        // This is the normal rendering path for VolumeRender
        glutDisplayFunc(display);
        glutKeyboardFunc(keyboard);
        glutMouseFunc(mouse);
        glutMotionFunc(motion);
        glutReshapeFunc(reshape);
        glutIdleFunc(idle);

        initPixelBuffer();

#if defined (__APPLE__) || defined(MACOSX)
        atexit(cleanup);
#else
        glutCloseFunc(cleanup);
#endif

        glutMainLoop();
    }
}

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值