CUDA编程之CUDA Sample-3_CUDA_Features-bindlessTexture

本文链接：https://blog.csdn.net/zhujingwen_tongji/article/details/141607910

CUDA 中的 bindlessTexture 演示了如何在 CUDA 中使用"无绑定"的纹理访问方式。

纹理采样背景:
- 在传统的 CUDA 纹理采样中,我们需要先将纹理绑定到一个特定的纹理单元,然后在内核函数中使用 tex1D() 或 tex2D() 等函数进行采样。
- 这种方式要求在内核函数调用之前进行繁琐的纹理绑定操作。
无绑定纹理采样:
- 无绑定纹理采样允许在内核函数中直接访问纹理,而无需进行任何纹理绑定操作。
- 这种方式使用 CUDA 的"无绑定纹理"特性,通过使用特殊的纹理句柄来直接访问纹理数据。
- 这样可以减少 CPU 与 GPU 之间的同步开销,提高整体性能。
示例程序解析:
- 该示例程序首先在 CPU 端创建一个纹理对象,并将其注册为无绑定纹理。
- 然后在 CUDA 内核函数中,使用无绑定纹理句柄直接访问纹理数据,执行采样操作。
- 在 OpenGL 渲染流程中,也使用相同的无绑定纹理句柄来绑定纹理,进行最终的图像渲染。
性能优势:
- 与传统的纹理绑定方式相比,无绑定纹理采样可以减少 CPU 与 GPU 之间的同步开销,从而提高整体性能。
- 这种方式特别适用于需要频繁切换纹理的场景,例如实时渲染、流畅动画等。

这个sample解释了cudaSurfaceObject, cudaTextureObject, MipMap 在CUDA里的支持。

MipMap

Mipmaps 是一个预先计算的纹理图像序列,其中每一个级别的图像大小是前一级别的一半。这种方式可以避免在纹理缩小时出现的别名失真问题。

具体来说:

什么是 Mipmaps:
- Mipmaps 是一组由较高分辨率的原始纹理逐步滤波生成的较低分辨率纹理图像序列。
- 每一个 Mipmap 级别的图像大小都是前一级别的一半,形成了一个金字塔形的纹理图像集合。
Mipmaps 的作用:
- 在纹理被缩小显示时,使用较低分辨率的 Mipmap 级别可以避免出现别名失真现象。
- 别名失真是由于太少的纹理像素被用于渲染而造成的锯齿、闪烁等视觉瑕疵。
- 使用 Mipmaps 可以根据纹理在屏幕上的实际大小,选择合适的分辨率 Mipmap 级别进行采样,从而获得更平滑的纹理效果。
Mipmaps 的生成和使用:
- 通常在纹理加载时,GPU 会自动生成对应的 Mipmaps 序列。
- 在纹理采样时,GPU 会根据纹理在屏幕上的实际大小,选择合适的 Mipmap 级别进行采样。
- 这种自动的 Mipmap 选择和采样,可以大大提升渲染质量,减少别名失真。

Mipmaps 是一种用于避免纹理缩小时别名失真的有效技术,广泛应用于实时图形渲染领域。通过预计算并存储不同分辨率的纹理图像序列,可以在不同尺度下获得最佳的纹理采样效果。

考虑一个情景：当物体在场景中离观察者很远，最终只用一个屏幕像素来显示时，这个像素该如何通过纹素确定呢？如果使用最近邻滤波来获取这个纹素，那么显示效果并不理想。需要使用纹素的均值来反映物体在场景中离我们很远这个效果，对于一个 256×256的纹理，计算平均值是一个耗时工作，不能实时计算，因此可以通过提前计算一组这样的纹理用来满足这种需求。这组提前计算的按比例缩小的纹理就是Mipmaps。Mipmaps纹理大小每级是前一等级的一半，按大小递减顺序排列为：

原始纹理 256×256
Mip 1 = 128×128
Mip 2 = 64×64
Mip 3 = 32×32
Mip 4 = 16×16
Mip 5 = 8×8
Mip 6 = 4×4
Mip 7 = 2×2
Mip 8 = 1×1

OpenGL会根据物体离观察者的距离选择使用合适大小的Mipmap纹理。Mipmap纹理示意图如下所示（来自wiki Mipmap）：
Mipmap

代码流程

加载图像，从磁盘加载三张图片到app里
将这些数据从host端copy到device端
映射
基于level0生成 Mipmaps,然后创建无绑定纹理
生成一个 4x4 的 2D 纹理,并存储对应的无绑定纹理引用
从 2D 纹理中读取数据,并解码纹理对象。根据纹理级别从 2D 纹理中读取数据,然后将图像数据写入 PBO(像素缓冲区)
把PBO里的数据复制到Frambuffer，Famebuffer送到屏幕进行展示。

Kernel代码

__global__ void d_render(uchar4 *d_output, uint imageW, uint imageH, float lod,
                         cudaTextureObject_t atlasTexture) {
  uint x = blockIdx.x * blockDim.x + threadIdx.x;
  uint y = blockIdx.y * blockDim.y + threadIdx.y;

  float u = x / (float)imageW;
  float v = y / (float)imageH;

  if ((x < imageW) && (y < imageH)) {
    // read from 2D atlas texture and decode texture object
    uint2 texCoded = tex2D<uint2>(atlasTexture, u, v);
    cudaTextureObject_t tex = decodeTextureObject(texCoded);

    // read from cuda texture object, use template to specify what data will be
    // returned. tex2DLod allows us to pass the lod (mip map level) directly.
    // There is other functions with CUDA 5, e.g. tex2DGrad, that allow you
    // to pass derivatives to perform automatic mipmap/anisotropic filtering.
    float4 color = tex2DLod<float4>(tex, u, 1 - v, lod);
    // In our sample tex is always valid, but for something like your own
    // sparse texturing you would need to make sure to handle the zero case.

    // write output color
    uint i = y * imageW + x;
    d_output[i] = to_uchar4(color * 255.0);
  }
}

extern "C" void renderAtlasImage(dim3 gridSize, dim3 blockSize,
                                 uchar4 *d_output, uint imageW, uint imageH,
                                 float lod) {
  // psuedo animate lod
  lod = fmodf(lod, highestLod * 2);
  lod = highestLod - fabs(lod - highestLod);

#ifdef SHOW_MIPMAPS
  lod = 0.0f;
#endif

  d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, lod,
                                    atlasImage.textureObject);

  checkCudaErrors(cudaGetLastError());
}

//
// MipMap Generation

//  A key benefit of using the new surface objects is that we don't need any
//  global binding points anymore. We can directly pass them as function
//  arguments.

__global__ void d_mipmap(cudaSurfaceObject_t mipOutput,
                         cudaTextureObject_t mipInput, uint imageW,
                         uint imageH) {
  uint x = blockIdx.x * blockDim.x + threadIdx.x;
  uint y = blockIdx.y * blockDim.y + threadIdx.y;

  float px = 1.0 / float(imageW);
  float py = 1.0 / float(imageH);

  if ((x < imageW) && (y < imageH)) {
    // take the average of 4 samples

    // we are using the normalized access to make sure non-power-of-two textures
    // behave well when downsized.
    float4 color = (tex2D<float4>(mipInput, (x + 0) * px, (y + 0) * py)) +
                   (tex2D<float4>(mipInput, (x + 1) * px, (y + 0) * py)) +
                   (tex2D<float4>(mipInput, (x + 1) * px, (y + 1) * py)) +
                   (tex2D<float4>(mipInput, (x + 0) * px, (y + 1) * py));

    color /= 4.0;
    color *= 255.0;
    color = fminf(color, make_float4(255.0));

    surf2Dwrite(to_uchar4(color), mipOutput, x * sizeof(uchar4), y);
  }
}

void generateMipMaps(cudaMipmappedArray_t mipmapArray, cudaExtent size) {
  size_t width = size.width;
  size_t height = size.height;

#ifdef SHOW_MIPMAPS
  cudaArray_t levelFirst;
  checkCudaErrors(cudaGetMipmappedArrayLevel(&levelFirst, mipmapArray, 0));
#endif

  uint level = 0;

  while (width != 1 || height != 1) {
    width /= 2;
    width = MAX((size_t)1, width);
    height /= 2;
    height = MAX((size_t)1, height);

    cudaArray_t levelFrom;
    checkCudaErrors(cudaGetMipmappedArrayLevel(&levelFrom, mipmapArray, level));
    cudaArray_t levelTo;
    checkCudaErrors(
        cudaGetMipmappedArrayLevel(&levelTo, mipmapArray, level + 1));

    cudaExtent levelToSize;
    checkCudaErrors(cudaArrayGetInfo(NULL, &levelToSize, NULL, levelTo));
    checkHost(levelToSize.width == width);
    checkHost(levelToSize.height == height);
    checkHost(levelToSize.depth == 0);

    // generate texture object for reading
    cudaTextureObject_t texInput;
    cudaResourceDesc texRes;
    memset(&texRes, 0, sizeof(cudaResourceDesc));

    texRes.resType = cudaResourceTypeArray;
    texRes.res.array.array = levelFrom;

    cudaTextureDesc texDescr;
    memset(&texDescr, 0, sizeof(cudaTextureDesc));

    texDescr.normalizedCoords = 1;
    texDescr.filterMode = cudaFilterModeLinear;

    texDescr.addressMode[0] = cudaAddressModeClamp;
    texDescr.addressMode[1] = cudaAddressModeClamp;
    texDescr.addressMode[2] = cudaAddressModeClamp;

    texDescr.readMode = cudaReadModeNormalizedFloat;

    checkCudaErrors(
        cudaCreateTextureObject(&texInput, &texRes, &texDescr, NULL));

    // generate surface object for writing

    cudaSurfaceObject_t surfOutput;
    cudaResourceDesc surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = levelTo;

    checkCudaErrors(cudaCreateSurfaceObject(&surfOutput, &surfRes));

    // run mipmap kernel
    dim3 blockSize(16, 16, 1);
    dim3 gridSize(((uint)width + blockSize.x - 1) / blockSize.x,
                  ((uint)height + blockSize.y - 1) / blockSize.y, 1);

    d_mipmap<<<gridSize, blockSize>>>(surfOutput, texInput, (uint)width,
                                      (uint)height);

    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaGetLastError());

    checkCudaErrors(cudaDestroySurfaceObject(surfOutput));

    checkCudaErrors(cudaDestroyTextureObject(texInput));

#ifdef SHOW_MIPMAPS
    // we blit the current mipmap back into first level
    cudaMemcpy3DParms copyParams = {0};
    copyParams.dstArray = levelFirst;
    copyParams.srcArray = levelTo;
    copyParams.extent = make_cudaExtent(width, height, 1);
    copyParams.kind = cudaMemcpyDeviceToDevice;
    checkCudaErrors(cudaMemcpy3D(&copyParams));
#endif

    level++;
  }
}

这段代码是 CUDA 渲染引擎的一部分,主要包含两个内核函数:d_render 和 d_mipmap。

d_render 内核函数:
- 这个函数主要用于从纹理图集(atlas texture)中读取并渲染输出颜色。
- 它首先计算当前线程的 2D 坐标(x, y),并将其转换为 UV 坐标(u, v)。
- 然后它使用 tex2D 函数从 CUDA 纹理对象(atlasTexture)中读取编码的纹理数据,并解码为一个新的纹理对象(tex)。
- 接下来,它使用 tex2DLod 函数从解码后的纹理对象中读取颜色数据,并应用 LOD(Level of Detail)值。
- 最后,它将读取的颜色数据转换为 uchar4 格式,并写入到输出数组(d_output)中。
renderAtlasImage 函数:
- 这个函数是一个 CPU 端的包装函数,用于在 GPU 上启动 d_render 内核函数。
- 它首先计算一个 "pseudo-animated" LOD 值,用于控制纹理的细节级别。
- 然后它调用 d_render 内核函数,并传入网格大小、块大小、输出数组、图像大小和 LOD 值等参数。
- 最后,它检查 CUDA 操作是否成功。
d_mipmap 内核函数:
- 这个函数用于生成纹理的 mipmap 级别。
- 它首先计算当前线程的 2D 坐标(x, y),并将其转换为归一化的 UV 坐标。
- 然后它使用 tex2D 函数从输入纹理对象(mipInput)中读取 4 个相邻像素的颜色值,并计算它们的平均值。
- 最后,它将计算出的平均颜色值写入到输出的 CUDA 表面对象(mipOutput)中。
generateMipMaps 函数:
- 这个函数负责生成完整的 mipmap 金字塔。
- 它遍历 mipmap 的各个级别,并依次调用 d_mipmap 内核函数来生成下一个级别的 mipmap。
- 在每个级别,它首先创建输入的纹理对象和输出的表面对象,然后启动 d_mipmap 内核函数来计算下一级 mipmap。
- 最后,它清理创建的纹理和表面对象。

这段代码实现了一个基于 CUDA 的渲染引擎,能够从一个纹理图集中读取数据并生成多级 mipmap。这种方法可以提高渲染性能,特别是在处理大型纹理时。