CUDA sample volumeRender
如果出现错误,记得判定一下OpenGL是不是在集成显卡上生成的,如果是则把集成显卡禁用掉就行了。
1、thread、block、grid
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)
https://nyu-cds.github.io/python-gpu/02-cuda/
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
https://cvw.cac.cornell.edu/GPU/thread
https://zhuanlan.zhihu.com/p/123170285
https://www.cnblogs.com/1024incn/p/4541313.html
https://nichijou.co/cuda2-warp/
将problem划分成coarse sub-problems in parallel by blocks of threads。
将每个sub-problem划分到block中更细的threads中。
kernel也就是一个操作,或者说c++编写的一个函数。所以才会有核函数的概念,一个kernel由一个grid组成。由一系列线程并行执行:
- kernel内的线程完成同一操作,也就是 all threads run the same code。
- 每个线程都会有thread ID,该ID就是用来计算存储地址及其他操作。
grid是由一系列的线程block组成。
毫无疑问block就是由thread组成的。
基本顺序就是program由一系列的function组成,而一个function表示一个kernel grid,一个grid由一系列的block组成,一个block由一系列thread组成。
所以每个kernel grid都有自己独立的工作区。
每个kernel grid中的block完全独立,不同的block被分配到不同的 Streaming Multiprocessors (SMs),但是同一个block不可以分配到不同的SM。
单个block内的threads可以:
- synchronize、
- share memory
- cooperate(communicate)
来自不同blocks的threads都是完全独立的,位移共享的资源就是global memory。
- thread级任务并行性——不同的thread执行不同的任务
- block和grid级并行——不同的block或grid执行不同的任务
- data并行性——不同的threads和blocks处理memory中不同部分的数据
thread ID:每个block和thread都有自己唯一的索引,只能从运行的kernel中访问。变量blockIdx和threadIdx分别包含被调用block和thread的索引。
- The thread index (
threadIdx
) - The block index (
blockIdx
) - The size and shape of a block (
blockDim
) - The size and shape of a grid (
gridDim
)
Global memory所有thread以及host(CPU)均可访问此内存:
- Global memory内存由host分配和释放
- 用于初始化GPU将要处理的数据
Shared memory:每个线程块都有自己的shared memory:
- 只能由block内的threads访问
- 比local或global momory快得多
- 需要特殊处理才能获得最佳性能
- 只在block的生命周期内存在
Local memory每个线程都有自己的private local memory:
- 只在thread的生命周期内存在
- 通常由编译器自动处理
Constant and texture memory 这些是所有线程均可访问的只读内存空间:
- Constant memory 用于缓存所有功能单元共享的值
- texture memory是为硬件提供的texture 操作而优化的
一般CUDA程序的三个主要步骤:
- host-to-device transfer也就是数据重host memory to device steps。
- 加载GPU程序并执行,将数据缓存在芯片上以提高性能。
- 将结果从device memory复制到host memory,也称为device-to-host transfer。
每个CUDA kernel都以__global__声明说明符开头。程序员使用内置变量为每个thread提供unique glokal ID。
- CUDA架构限制了每个块的线程数(1024个线程每个块限制)。
- 线程块的维度可以通过内置的blockDim变量在内核中访问。
- 一个block中的所有threads都可以使用一个内部__syncthreads来同步。对于__syncthreads,线程块中的所有线程必须等待,然后才能继续执行。
- 在<<<...>>>中指定的每个块的thread数和每个网格的block数,可以是int或dim3。这些三尖括号标记了从主机代码到设备代码的调用。它也被称为kernel launch。
- Registers—这些是每个thread专用的,这意味着分配给该线程的寄存器对其他线程不可见。编译器做出有关寄存器利用率的决策。
- L1/Shared memory (SMEM)—每个SM都有一个快速的on-chip scratched存储器,可用作L1 cache和shared memory。CUDA block中的所有线程可以共享shared memory,并且在给定SM上运行的所有CUDA块可以共享SM提供的物理内存资源。
- Read-only memory—每个SM都具instruction cache,constant memory,texture和RO cache,这对kernel代码是只读的。
- L2 cache—L2 cache在所有SM之间共享,因此每个CUDA block中的每个线程都可以访问该内存。
- Global memory—这是GPU和位于GPU中的DRAM的帧缓冲区大小。
CUDA使用kernel execution configuration<<< ... >>>来告诉CUDA运行时在GPU上启动多少个threads。 CUDA将线程组织到称为“thread block”的组中。kernel可以启动组织为“grid”结构的多个thread blocks。
<<< M , T >>>//表示kernel以M个thread blocks的grid启动。每个thread block具有T个并行线程。
很容一看出来,一个kernel是一个grid,所以需要声明该grid有多少个thread blocks,并且每个thread block有多少个thread。
SP(Streaming Processor):流处理器, 是GPU最基本的处理单元,在fermi架构开始被叫做CUDA core。
SM(Streaming MultiProcessor): 一个SM由多个CUDA core组成,SM还包括特殊运算单元(SFU),共享内存(shared memory),寄存器文件(Register File)和调度器(Warp Scheduler)等。
按照SIMD模型,SM最小执行单元为warp,一个warp中有多个线程。SM执行单元SPs共享单个指令fetch/dispatch,这些线程将同一个指令应用于不同数据。因此,一个warp中的所有线程将总是具有相同的执行时间。
warp是SM的基本执行单元。一个warp包含32个并行thread,这32个thread执行于SIMT(Single-Instruction, Multiple-Thread,单指令多线程)模式。也就是说所有thread执行同一条指令,并且每个thread会使用各自的data执行该指令。
一个warp中的线程必然在同一个block中,同一个block不会再两个SM中,也就是block会调用多个warp,如果block的线程数不能整除warp线程数,则最后一个warp没有填满,没填满的warp中的thread是inactive。只要调用warp就会消耗SM资源,只不过有些warp中的线程是inactive。
一个warp中的thread执行相同的指令,有相同的执行时间,如果某一个thread阻塞了,同一个warp的其他thread都会阻塞,因此有了warp divergence。所以warp divergence只会出现在同一个warp中。
同一个warp中的thread执行同一操作,如果因为控制流语句(if)进入不同的分支才触发的warp divergence,那么只要避免控制流语句触发即可。也就是将同一分支放入同一warp。
也就引入了branch efficiency代码比较简单的时候,CUDA编译器自动优化代码。
warp的context包含三个部分:
- Program counter
- Register
- Shared memory
当一个block或得到足够的资源时,就成为active block。block中的warp就称为active warp。active warp又可以被分为下面三类:
- Selected warp 被选中的warp
- Stalled warp 没准备好要执行的称为Stalled warp
- Eligible warp 没被选中,但是已经做好准备被执行的称为Eligible warp
SM中warp调度器每个cycle会挑选active warp送去执行,
warp适合执行需要满足下面两个条件:
- 32个CUDA core有空
- 所有当前指令的参数都准备就绪
cudaError_t cudaDeviceSynchronize(void)可以用来停住CUP等待CUDA中的操作完成:
__device__ void __syncthreads(void)当该函数被调用,block中的每个thread都会等待所有其他thread执行到某个点来实现同步。
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;//获取一维的线程索引
c[i] = a[i] + b[i];
}
int main()
{
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };
// Add vectors in parallel.
cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);//并行声明返回cuda状态
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addWithCuda failed!");
return 1;
}
printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
c[0], c[1], c[2], c[3], c[4]);
// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Nsight and Visual Profiler to show complete traces.
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);//选择cuda 0作为默认选项
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
// Allocate GPU buffers for three vectors (two input, one output) .
cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));//为c创建cuda数据
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));//为a创建cuda数据
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));//为b创建cuda数据
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);//将a内存拷贝给显存
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);//将b内存拷贝给显存
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
// Launch a kernel on the GPU with one thread for each element.
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();//同步
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
goto Error;
}
// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);//显存拷贝到内存
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
}
volumeRender_kernel.cu
/*
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
// Simple 3D volume renderer
#ifndef _VOLUMERENDER_KERNEL_CU_
#define _VOLUMERENDER_KERNEL_CU_
#include <helper_cuda.h>
#include <helper_math.h>
typedef unsigned int uint;
typedef unsigned char uchar;
cudaArray *d_volumeArray = 0;
cudaArray *d_transferFuncArray;
typedef unsigned char VolumeType;
//typedef unsigned short VolumeType;
cudaTextureObject_t texObject; // For 3D texture
cudaTextureObject_t transferTex; // For 1D transfer function texture
typedef struct
{
float4 m[3];
} float3x4;
__constant__ float3x4 c_invViewMatrix; // inverse view matrix
struct Ray
{
float3 o; // origin
float3 d; // direction
};
// intersect ray with a box
// http://www.siggraph.org/education/materials/HyperGraph/raytrace/rtinter3.htm
__device__
int intersectBox(Ray r, float3 boxmin, float3 boxmax, float *tnear, float *tfar)
{
// compute intersection of ray with all six bbox planes
float3 invR = make_float3(1.0f) / r.d;
float3 tbot = invR * (boxmin - r.o);
float3 ttop = invR * (boxmax - r.o);
// re-order intersections to find smallest and largest on each axis
float3 tmin = fminf(ttop, tbot);
float3 tmax = fmaxf(ttop, tbot);
// find the largest tmin and the smallest tmax
float largest_tmin = fmaxf(fmaxf(tmin.x, tmin.y), fmaxf(tmin.x, tmin.z));
float smallest_tmax = fminf(fminf(tmax.x, tmax.y), fminf(tmax.x, tmax.z));
*tnear = largest_tmin;
*tfar = smallest_tmax;
return smallest_tmax > largest_tmin;
}
// transform vector by matrix (no translation)
__device__
float3 mul(const float3x4 &M, const float3 &v)
{
float3 r;
r.x = dot(v, make_float3(M.m[0]));
r.y = dot(v, make_float3(M.m[1]));
r.z = dot(v, make_float3(M.m[2]));
return r;
}
// transform vector by matrix with translation
__device__
float4 mul(const float3x4 &M, const float4 &v)
{
float4 r;
r.x = dot(v, M.m[0]);
r.y = dot(v, M.m[1]);
r.z = dot(v, M.m[2]);
r.w = 1.0f;
return r;
}
__device__ uint rgbaFloatToInt(float4 rgba)
{
rgba.x = __saturatef(rgba.x); // clamp to [0.0, 1.0]
rgba.y = __saturatef(rgba.y);
rgba.z = __saturatef(rgba.z);
rgba.w = __saturatef(rgba.w);
return (uint(rgba.w*255)<<24) | (uint(rgba.z*255)<<16) | (uint(rgba.y*255)<<8) | uint(rgba.x*255);
}
__global__ void
d_render(uint *d_output, uint imageW, uint imageH,
float density, float brightness,
float transferOffset, float transferScale, cudaTextureObject_t tex,
cudaTextureObject_t transferTex)
{
const int maxSteps = 500;
const float tstep = 0.01f;
const float opacityThreshold = 0.95f;
const float3 boxMin = make_float3(-1.0f, -1.0f, -1.0f);
const float3 boxMax = make_float3(1.0f, 1.0f, 1.0f);
uint x = blockIdx.x*blockDim.x + threadIdx.x;
uint y = blockIdx.y*blockDim.y + threadIdx.y;
if ((x >= imageW) || (y >= imageH)) return;
float u = (x / (float) imageW)*2.0f-1.0f;
float v = (y / (float) imageH)*2.0f-1.0f;
// calculate eye ray in world space
Ray eyeRay;
eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f)));
eyeRay.d = normalize(make_float3(u, v, -2.0f));
eyeRay.d = mul(c_invViewMatrix, eyeRay.d);
// find intersection with box
float tnear, tfar;
int hit = intersectBox(eyeRay, boxMin, boxMax, &tnear, &tfar);
if (!hit) return;
if (tnear < 0.0f) tnear = 0.0f; // clamp to near plane
// march along ray from front to back, accumulating color
float4 sum = make_float4(0.0f);
float t = tnear;
float3 pos = eyeRay.o + eyeRay.d*tnear;
float3 step = eyeRay.d*tstep;
for (int i=0; i<maxSteps; i++)
{
// read from 3D texture
// remap position to [0, 1] coordinates
float sample = tex3D<float>(tex, pos.x*0.5f+0.5f, pos.y*0.5f+0.5f, pos.z*0.5f+0.5f);
//sample *= 64.0f; // scale for 10-bit data
// lookup in transfer function texture
float4 col = tex1D<float4>(transferTex, (sample-transferOffset)*transferScale);
col.w *= density;
// "under" operator for back-to-front blending
//sum = lerp(sum, col, col.w);
// pre-multiply alpha
col.x *= col.w;
col.y *= col.w;
col.z *= col.w;
// "over" operator for front-to-back blending
sum = sum + col*(1.0f - sum.w);
// exit early if opaque
if (sum.w > opacityThreshold)
break;
t += tstep;
if (t > tfar) break;
pos += step;
}
sum *= brightness;
// write output color
d_output[y*imageW + x] = rgbaFloatToInt(sum);
}
extern "C"
void setTextureFilterMode(bool bLinearFilter)
{
if (texObject)
{
checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁texture对象
}
cudaResourceDesc texRes;//定义资源描述子
memset(&texRes,0,sizeof(cudaResourceDesc));//将textRes初始化为0
texRes.resType = cudaResourceTypeArray;//资源类型
texRes.res.array.array = d_volumeArray;//数组数据
cudaTextureDesc texDescr;//定义纹理描述子
memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0
texDescr.normalizedCoords = true;//指示纹理读取是否标准化
texDescr.filterMode = bLinearFilter ? cudaFilterModeLinear : cudaFilterModePoint;//纹理过滤模式
texDescr.addressMode[0] = cudaAddressModeWrap;//Wrap地址模式
texDescr.addressMode[1] = cudaAddressModeWrap;//Wrap地址模式
texDescr.addressMode[2] = cudaAddressModeWrap;//Wrap地址模式
texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数
checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象
}
extern "C"
void initCuda(void *h_volume, cudaExtent volumeSize)
{
// create 3D array
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<VolumeType>();//创建CUDA通道格式描述子
checkCudaErrors(cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize));//创建volume显存
// copy data to 3D array
cudaMemcpy3DParms copyParams = {0};//CUDA 3D显存复制参数
copyParams.srcPtr = make_cudaPitchedPtr(h_volume, volumeSize.width*sizeof(VolumeType), volumeSize.width, volumeSize.height);//编码源存储器地址
copyParams.dstArray = d_volumeArray;//目的地显存地址
copyParams.extent = volumeSize;//请求内存拷贝大小
copyParams.kind = cudaMemcpyHostToDevice;//类型的转移
checkCudaErrors(cudaMemcpy3D(©Params));//在3D对象之间复制数据
cudaResourceDesc texRes;//CUDA资源描述子
memset(&texRes, 0, sizeof(cudaResourceDesc));//将textRes初始化为0
texRes.resType = cudaResourceTypeArray;//资源类型
texRes.res.array.array = d_volumeArray;//数组数据
cudaTextureDesc texDescr;//定义纹理描述子
memset(&texDescr, 0, sizeof(cudaTextureDesc));//将texDescr初始化为0
texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
texDescr.filterMode = cudaFilterModeLinear; // linear interpolation 线性差值
texDescr.addressMode[0] = cudaAddressModeClamp; // clamp texture coordinates 限制到边缘地址模式
texDescr.addressMode[1] = cudaAddressModeClamp;
texDescr.addressMode[2] = cudaAddressModeClamp;
texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数
checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象
// create transfer function texture 创建传递函数纹理
float4 transferFunc[] =
{
{ 0.0, 0.0, 0.0, 0.0, },
{ 1.0, 0.0, 0.0, 1.0, },
{ 1.0, 0.5, 0.0, 1.0, },
{ 1.0, 1.0, 0.0, 1.0, },
{ 0.0, 1.0, 0.0, 1.0, },
{ 0.0, 1.0, 1.0, 1.0, },
{ 0.0, 0.0, 1.0, 1.0, },
{ 1.0, 0.0, 1.0, 1.0, },
{ 0.0, 0.0, 0.0, 0.0, },
};
cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc<float4>();//创建CUDA通道格式描述子
cudaArray *d_transferFuncArray;//CUDA 数组
checkCudaErrors(cudaMallocArray(&d_transferFuncArray, &channelDesc2, sizeof(transferFunc)/sizeof(float4), 1));//创建d_transferFuncArray显存
checkCudaErrors(cudaMemcpyToArray(d_transferFuncArray, 0, 0, transferFunc, sizeof(transferFunc), cudaMemcpyHostToDevice));//Copies data between host and device
memset(&texRes,0,sizeof(cudaResourceDesc));//将texRes初始化为0
texRes.resType = cudaResourceTypeArray;//资源类型
texRes.res.array.array = d_transferFuncArray;//CUDA 数组
memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0
texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问
texDescr.filterMode = cudaFilterModeLinear; // linear interpolation 线性差值
texDescr.addressMode[0] = cudaAddressModeClamp; // wrap texture coordinates
texDescr.readMode = cudaReadModeElementType;//读取指定元素类型的纹理
checkCudaErrors(cudaCreateTextureObject(&transferTex, &texRes, &texDescr, NULL));//创建一个纹理对象
}
extern "C"
void freeCudaBuffers()
{
checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁一个纹理对象
checkCudaErrors(cudaDestroyTextureObject(transferTex));// 销毁一个纹理对象
checkCudaErrors(cudaFreeArray(d_volumeArray));//Frees an array on the device
checkCudaErrors(cudaFreeArray(d_transferFuncArray));//Frees an array on the device
}
extern "C"
void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
float density, float brightness, float transferOffset, float transferScale)
{
d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, density,
brightness, transferOffset, transferScale,
texObject, transferTex);
}
extern "C"
void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix)
{
checkCudaErrors(cudaMemcpyToSymbol(c_invViewMatrix, invViewMatrix, sizeofMatrix));//Copies data to the given symbol on the device
}
#endif // #ifndef _VOLUMERENDER_KERNEL_CU_
volumeRender.cpp
/*
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/*
Volume rendering sample
This sample loads a 3D volume from disk and displays it using
ray marching and 3D textures.
Note - this is intended to be an example of using 3D textures
in CUDA, not an optimized volume renderer.
Changes
sgg 22/3/2010
- updated to use texture for display instead of glDrawPixels.
- changed to render from front-to-back rather than back-to-front.
*/
// OpenGL Graphics includes
#include <helper_gl.h>
#if defined (__APPLE__) || defined(MACOSX)
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
#include <GLUT/glut.h>
#ifndef glutCloseFunc
#define glutCloseFunc glutWMCloseFunc
#endif
#else
#include <GL/freeglut.h>
#endif
// CUDA Runtime, Interop, and includes
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cuda_profiler_api.h>
#include <vector_types.h>
#include <vector_functions.h>
#include <driver_functions.h>
// CUDA utilities
#include <helper_cuda.h>
// Helper functions
#include <helper_cuda.h>
#include <helper_functions.h>
#include <helper_timer.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned char uchar;
#define MAX_EPSILON_ERROR 5.00f
#define THRESHOLD 0.30f
// Define the files that are to be save and the reference images for validation
const char *sOriginal[] =
{
"volume.ppm",
NULL
};
const char *sReference[] =
{
"ref_volume.ppm",
NULL
};
const char *sSDKsample = "CUDA 3D Volume Render";//
//const char *volumeFilename = "Bucky.raw";
//cudaExtent volumeSize = make_cudaExtent(32, 32, 32);
//typedef unsigned char VolumeType;
const char* volumeFilename = "img_itk.raw";
cudaExtent volumeSize = make_cudaExtent(512, 512, 143);//创建cuda显存
typedef unsigned char VolumeType;
//char *volumeFilename = "mrt16_angio.raw";
//cudaExtent volumeSize = make_cudaExtent(416, 512, 112);
//typedef unsigned short VolumeType;
uint width = 512, height = 512;//生成的OpenGL大小
//uint width = 1536, height = 1536;
dim3 blockSize(16, 16);//block大小,二维,16*16=256个线程
dim3 gridSize;//grid大小,即block个数
float3 viewRotation;
float3 viewTranslation = make_float3(0.0, 0.0, -4.0f);
float invViewMatrix[12];
float density = 0.05f;
float brightness = 1.0f;
float transferOffset = 0.0f;
float transferScale = 1.0f;
bool linearFiltering = true;
GLuint pbo = 0; // OpenGL pixel buffer object
GLuint tex = 0; // OpenGL texture object
struct cudaGraphicsResource *cuda_pbo_resource; // CUDA Graphics Resource (to transfer PBO) CUDA图形资源(用于传输PBO pixel buffer object)
StopWatchInterface *timer = 0; //定义 StopWatch Interface接口
// Auto-Verification Code 自动验证码
const int frameCheckNumber = 2;
int fpsCount = 0; // FPS count for averaging FPS计算平均值
int fpsLimit = 1; // FPS limit for sampling 采样FPS极限
int g_Index = 0;
unsigned int frameCount = 0;
int *pArgc;
char **pArgv;
#ifndef MAX
#define MAX(a,b) ((a > b) ? a : b)
#endif
extern "C" void setTextureFilterMode(bool bLinearFilter);//创建一个3D texture
extern "C" void initCuda(void *h_volume, cudaExtent volumeSize);//create 3D array
extern "C" void freeCudaBuffers();//释放cuda对象
extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH,
float density, float brightness, float transferOffset, float transferScale);//render kernel
extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix);//逆矩阵
void initPixelBuffer();//初始化openGL
void computeFPS()
{
frameCount++;
fpsCount++;
if (fpsCount == fpsLimit)
{
char fps[256];
float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);// 1s/返回计时器执行的平均时间
sprintf(fps, "Volume Render: %3.1f fps", ifps);
glutSetWindowTitle(fps);//OpenGL标题
fpsCount = 0;
fpsLimit = (int)MAX(1.f, ifps);
sdkResetTimer(&timer);//重置计时器的计数器。
}
}
// render image using CUDA 使用CUDA渲染图像
void render()
{
copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//拷贝逆矩阵
// map PBO to get CUDA device pointer
uint *d_output;//opengl指针
// map PBO to get CUDA device pointer
checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));//Map graphics resources for access by CUDA
size_t num_bytes;
checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes,
cuda_pbo_resource));//Get an device pointer through which to access a mapped graphics resource.
//printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes);
// clear image
checkCudaErrors(cudaMemset(d_output, 0, width*height*4));//清除image
// call CUDA kernel, writing results to PBO
render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//开始渲染
getLastCudaError("kernel failed");
checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0));
}
// display results using OpenGL (called by GLUT)
void display()
{
sdkStartTimer(&timer);
// use OpenGL to build view matrix
GLfloat modelView[16];
glMatrixMode(GL_MODELVIEW);
glPushMatrix();
glLoadIdentity();
glRotatef(-viewRotation.x, 1.0, 0.0, 0.0);
glRotatef(-viewRotation.y, 0.0, 1.0, 0.0);
glTranslatef(-viewTranslation.x, -viewTranslation.y, -viewTranslation.z);
glGetFloatv(GL_MODELVIEW_MATRIX, modelView);
glPopMatrix();
invViewMatrix[0] = modelView[0];
invViewMatrix[1] = modelView[4];
invViewMatrix[2] = modelView[8];
invViewMatrix[3] = modelView[12];
invViewMatrix[4] = modelView[1];
invViewMatrix[5] = modelView[5];
invViewMatrix[6] = modelView[9];
invViewMatrix[7] = modelView[13];
invViewMatrix[8] = modelView[2];
invViewMatrix[9] = modelView[6];
invViewMatrix[10] = modelView[10];
invViewMatrix[11] = modelView[14];
clock_t start, end;
start = clock();
render();
end = clock();
printf("render time: %f \n", double(end - start) / CLOCKS_PER_SEC);
// display results
glClear(GL_COLOR_BUFFER_BIT);
// draw image from PBO
glDisable(GL_DEPTH_TEST);
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
#if 0
// draw using glDrawPixels (slower)
glRasterPos2i(0, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
#else
// draw using texture
// copy from pbo to texture
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
// draw textured quad
glEnable(GL_TEXTURE_2D);
glBegin(GL_QUADS);
glTexCoord2f(0, 0);
glVertex2f(0, 0);
glTexCoord2f(1, 0);
glVertex2f(1, 0);
glTexCoord2f(1, 1);
glVertex2f(1, 1);
glTexCoord2f(0, 1);
glVertex2f(0, 1);
glEnd();
glDisable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, 0);
#endif
glutSwapBuffers();
glutReportErrors();
sdkStopTimer(&timer);
computeFPS();
}
void idle()
{
glutPostRedisplay();
}
void keyboard(unsigned char key, int x, int y)
{
switch (key)
{
case 27:
#if defined (__APPLE__) || defined(MACOSX)
exit(EXIT_SUCCESS);
#else
glutDestroyWindow(glutGetWindow());
return;
#endif
break;
case 'f':
linearFiltering = !linearFiltering;
setTextureFilterMode(linearFiltering);
break;
case '+':
density += 0.01f;
break;
case '-':
density -= 0.01f;
break;
case ']':
brightness += 0.1f;
break;
case '[':
brightness -= 0.1f;
break;
case ';':
transferOffset += 0.01f;
break;
case '\'':
transferOffset -= 0.01f;
break;
case '.':
transferScale += 0.01f;
break;
case ',':
transferScale -= 0.01f;
break;
default:
break;
}
printf("density = %.2f, brightness = %.2f, transferOffset = %.2f, transferScale = %.2f\n", density, brightness, transferOffset, transferScale);
glutPostRedisplay();
}
int ox, oy;
int buttonState = 0;
void mouse(int button, int state, int x, int y)
{
if (state == GLUT_DOWN)
{
buttonState |= 1<<button;
}
else if (state == GLUT_UP)
{
buttonState = 0;
}
ox = x;
oy = y;
glutPostRedisplay();
}
void motion(int x, int y)
{
float dx, dy;
dx = (float)(x - ox);
dy = (float)(y - oy);
if (buttonState == 4)
{
// right = zoom
viewTranslation.z += dy / 100.0f;
}
else if (buttonState == 2)
{
// middle = translate
viewTranslation.x += dx / 100.0f;
viewTranslation.y -= dy / 100.0f;
}
else if (buttonState == 1)
{
// left = rotate
viewRotation.x += dy / 5.0f;
viewRotation.y += dx / 5.0f;
}
ox = x;
oy = y;
glutPostRedisplay();
}
int iDivUp(int a, int b)
{
return (a % b != 0) ? (a / b + 1) : (a / b);
}
void reshape(int w, int h)
{
width = w;
height = h;
initPixelBuffer();
// calculate new grid size
gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));
glViewport(0, 0, w, h);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
}
void cleanup()
{
sdkDeleteTimer(&timer);
freeCudaBuffers();
if (pbo)
{
cudaGraphicsUnregisterResource(cuda_pbo_resource);
glDeleteBuffers(1, &pbo);
glDeleteTextures(1, &tex);
}
// Calling cudaProfilerStop causes all profile data to be
// flushed before the application exits
checkCudaErrors(cudaProfilerStop());
}
void initGL(int *argc, char **argv)
{
// initialize GLUT callback functions
glutInit(argc, argv);
glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE);
glutInitWindowSize(width, height);
glutCreateWindow("CUDA volume rendering");
if (!isGLVersionSupported(2,0) ||
!areGLExtensionsSupported("GL_ARB_pixel_buffer_object"))
{
printf("Required OpenGL extensions are missing.");
exit(EXIT_SUCCESS);
}
}
void initPixelBuffer()
{
if (pbo)
{
// unregister this buffer object from CUDA C ;从CUDA C中注销这个缓冲区对象
checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));//从CUDA C中注销CUDA图形资源(用于传输PBO pixel buffer object)
// delete old buffer
glDeleteBuffers(1, &pbo);// delete old bufferobject
glDeleteTextures(1, &tex);// delete old texture object
}
// create pixel buffer object for display;为显示创建像素缓冲对象
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW_ARB);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
// register this buffer object with CUDA
checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));//用CUDA注册这个缓冲区对象,CUDA只会写入该资源,不会读取该资源,从CUDA到openGL
// create texture for display
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glBindTexture(GL_TEXTURE_2D, 0);
}
// Load raw data from disk
void *loadRawFile(char *filename, size_t size)
{
FILE *fp = fopen(filename, "rb");
if (!fp)
{
fprintf(stderr, "Error opening file '%s'\n", filename);
return 0;
}
void *data = malloc(size);
size_t read = fread(data, 1, size, fp);
fclose(fp);
#if defined(_MSC_VER_)
printf("Read '%s', %Iu bytes\n", filename, read);
#else
printf("Read '%s', %zu bytes\n", filename, read);
#endif
return data;
}
void runSingleTest(const char *ref_file, const char *exec_path)
{
bool bTestResult = true;
uint *d_output;
checkCudaErrors(cudaMalloc((void **)&d_output, width*height*sizeof(uint)));//分配显存
checkCudaErrors(cudaMemset(d_output, 0, width*height*sizeof(uint)));//初始化
float modelView[16] =
{
1.0f, 0.0f, 0.0f, 0.0f,
0.0f, 1.0f, 0.0f, 0.0f,
0.0f, 0.0f, 1.0f, 0.0f,
0.0f, 0.0f, 4.0f, 1.0f
};//变换矩阵
invViewMatrix[0] = modelView[0];
invViewMatrix[1] = modelView[4];
invViewMatrix[2] = modelView[8];
invViewMatrix[3] = modelView[12];
invViewMatrix[4] = modelView[1];
invViewMatrix[5] = modelView[5];
invViewMatrix[6] = modelView[9];
invViewMatrix[7] = modelView[13];
invViewMatrix[8] = modelView[2];
invViewMatrix[9] = modelView[6];
invViewMatrix[10] = modelView[10];
invViewMatrix[11] = modelView[14];
// call CUDA kernel, writing results to PBO
copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//
// Start timer 0 and process n loops on the GPU
int nIter = 10;
for (int i = -1; i < nIter; i++)
{
if (i == 0)
{
cudaDeviceSynchronize();//等待GPU同步完成
sdkStartTimer(&timer);//开启计时器
}
render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//渲染函数
}
cudaDeviceSynchronize();//等待GPU同步完成
sdkStopTimer(&timer);//结束计时器
// Get elapsed time and throughput, then log to sample and master logs 获取运行时间和吞吐量,然后记录到样本日志和主日志中
double dAvgTime = sdkGetTimerValue(&timer)/(nIter * 1000.0);
printf("volumeRender, Throughput = %.4f MTexels/s, Time = %.5f s, Size = %u Texels, NumDevsUsed = %u, Workgroup = %u\n",
(1.0e-6 * width * height)/dAvgTime, dAvgTime, (width * height), 1, blockSize.x * blockSize.y);
getLastCudaError("Error: render_kernel() execution FAILED");
checkCudaErrors(cudaDeviceSynchronize());
unsigned char *h_output = (unsigned char *)malloc(width*height*4);
checkCudaErrors(cudaMemcpy(h_output, d_output, width*height*4, cudaMemcpyDeviceToHost));
sdkSavePPM4ub("volume.ppm", h_output, width, height);
bTestResult = sdkComparePPM("volume.ppm", sdkFindFilePath(ref_file, exec_path), MAX_EPSILON_ERROR, THRESHOLD, true);
cudaFree(d_output);
free(h_output);
cleanup();
exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE);
}
// Program main
int
main(int argc, char **argv)
{
pArgc = &argc;
pArgv = argv;
char *ref_file = NULL;
#if defined(__linux__)
setenv ("DISPLAY", ":0", 0);
#endif
//start logs
printf("%s Starting...\n\n", sSDKsample);
if (checkCmdLineFlag(argc, (const char **)argv, "file"))
{
getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file);
fpsLimit = frameCheckNumber;
}
if (ref_file)
{
findCudaDevice(argc, (const char **)argv);// Initialization code to find the best CUDA Device
}
else
{
// First initialize OpenGL context, so we can properly set the GL for CUDA.
// This is necessary in order to achieve optimal performance with OpenGL/CUDA interop.
initGL(&argc, argv);//首先初始化OpenGL上下文
findCudaDevice(argc, (const char **)argv);//初始化CUDA
}
// parse arguments
char *filename;
if (getCmdLineArgumentString(argc, (const char **) argv, "volume", &filename))
{
volumeFilename = filename;
}
int n;
if (checkCmdLineFlag(argc, (const char **) argv, "size"))
{
n = getCmdLineArgumentInt(argc, (const char **) argv, "size");
volumeSize.width = volumeSize.height = volumeSize.depth = n;
}
if (checkCmdLineFlag(argc, (const char **) argv, "xsize"))
{
n = getCmdLineArgumentInt(argc, (const char **) argv, "xsize");
volumeSize.width = n;
}
if (checkCmdLineFlag(argc, (const char **) argv, "ysize"))
{
n = getCmdLineArgumentInt(argc, (const char **) argv, "ysize");
volumeSize.height = n;
}
if (checkCmdLineFlag(argc, (const char **) argv, "zsize"))
{
n= getCmdLineArgumentInt(argc, (const char **) argv, "zsize");
volumeSize.depth = n;
}
// load volume data
char *path = sdkFindFilePath(volumeFilename, argv[0]);//查找文件
if (path == 0)
{
printf("Error finding file '%s'\n", volumeFilename);
exit(EXIT_FAILURE);
}
size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(VolumeType);//获取文件大小
void *h_volume = loadRawFile(path, size);//读取raw数据
initCuda(h_volume, volumeSize);//初始化显存
free(h_volume);//释放内存数据
sdkCreateTimer(&timer);//创建计时器
printf("Press '+' and '-' to change density (0.01 increments)\n"
" ']' and '[' to change brightness\n"
" ';' and ''' to modify transfer function offset\n"
" '.' and ',' to modify transfer function scale\n\n");
// calculate new grid size
gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));//根据blocksize和图像的大小计算grid大小
if (ref_file)
{
runSingleTest(ref_file, argv[0]);
}
else
{
// This is the normal rendering path for VolumeRender
glutDisplayFunc(display);
glutKeyboardFunc(keyboard);
glutMouseFunc(mouse);
glutMotionFunc(motion);
glutReshapeFunc(reshape);
glutIdleFunc(idle);
initPixelBuffer();
#if defined (__APPLE__) || defined(MACOSX)
atexit(cleanup);
#else
glutCloseFunc(cleanup);
#endif
glutMainLoop();
}
}