Using cudaMallocPitch() for 2D arrays


东坡先生

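Name: cudaMallocPitch - allocates memory on the GPU

Synopsis: cudaError_t cudaMallocPitch( void** devPtr, size_t* pitch, size_t widthInBytes, size_t height )

Description: Allocates at least widthInBytes*height bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr. The function may pad the allocation so that the starting address of each row still meets the alignment requirement as the address moves from one row to the next. cudaMallocPitch() returns the pitch in *pitch, that is, the width of the allocation in bytes, padding included. The pitch is meant to be used as a separate parameter of the allocation when computing addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as: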

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
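The address formula can be tried on the host without a GPU. The following is a minimal sketch, not part of the CUDA API: `elementAt` and `pitchedRoundTrip` are hypothetical helper names, the buffer is an ordinary host array, and the pitch is whatever value the caller assumes (a real pitch must come from cudaMallocPitch()).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CUDA API): returns a pointer to element
// (row, column) of a pitched 2D array, using the formula from the text.
// pitchBytes is the distance in bytes between the starts of two rows.
template <typename T>
T* elementAt(void* baseAddress, std::size_t pitchBytes,
             std::size_t row, std::size_t column)
{
    assert(baseAddress != nullptr);
    // Step over rows in bytes (char*), then index columns in units of T.
    return reinterpret_cast<T*>(static_cast<char*>(baseAddress) + row * pitchBytes) + column;
}

// Fills a host buffer laid out like a pitched allocation and reads one
// element back. The pitch is an assumption supplied by the caller.
inline float pitchedRoundTrip(std::size_t width, std::size_t height,
                              std::size_t pitchBytes,
                              std::size_t row, std::size_t column)
{
    assert(pitchBytes >= width * sizeof(float));
    assert(pitchBytes % sizeof(float) == 0);
    assert(row < height && column < width);

    // Each row occupies pitchBytes bytes; the tail of each row is padding.
    std::vector<float> buffer(pitchBytes / sizeof(float) * height, 0.0f);
    for (std::size_t r = 0; r < height; ++r)
        for (std::size_t c = 0; c < width; ++c)
            *elementAt<float>(buffer.data(), pitchBytes, r, c) =
                static_cast<float>(r * width + c);

    return *elementAt<float>(buffer.data(), pitchBytes, row, column);
}
```

For instance, with 10 floats per row and an assumed pitch of 64 bytes, `pitchedRoundTrip(10, 10, 64, 3, 7)` reads back the value stored at row 3, column 7, which the fill loop set to 37.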

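For allocations of 2D arrays, programmers are advised to consider using cudaMallocPitch() to perform pitched allocations. Because the hardware imposes pitch alignment restrictions, this is especially useful if the application will perform 2D memory copies between different regions of device memory (whether linear memory or CUDA arrays).

Example: built for EmuDebug
The CUDA Programming Guide originally declared pitch as int, which at run time does not match the type expected by cudaMallocPitch().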
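What a 2D copy between buffers of different pitch has to do can be sketched on the host. The code below is a hypothetical host-only analogue, not the CUDA API: `copy2D` and `roundTripThroughPitch` are illustrative names. The runtime's own cudaMemcpy2D() takes the same destination pitch, source pitch, width in bytes and height, plus a copy-kind argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side analogue of a 2D copy (not the CUDA API):
// copies `height` rows of `widthInBytes` bytes each, where consecutive
// rows are `srcPitch` bytes apart in the source and `dstPitch` bytes
// apart in the destination.
inline void copy2D(void* dst, std::size_t dstPitch,
                   const void* src, std::size_t srcPitch,
                   std::size_t widthInBytes, std::size_t height)
{
    assert(dstPitch >= widthInBytes && srcPitch >= widthInBytes);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t r = 0; r < height; ++r)
        std::memcpy(d + r * dstPitch, s + r * srcPitch, widthInBytes);
}

// Packs a tightly stored width x height float matrix into a padded buffer
// and unpacks it again; returns true if the data survives unchanged.
inline bool roundTripThroughPitch(std::size_t width, std::size_t height,
                                  std::size_t pitchBytes)
{
    const std::size_t rowBytes = width * sizeof(float);
    assert(pitchBytes >= rowBytes && pitchBytes % sizeof(float) == 0);

    std::vector<float> tight(width * height);
    std::vector<float> back(width * height, -1.0f);
    for (std::size_t i = 0; i < tight.size(); ++i)
        tight[i] = static_cast<float>(i);

    // Padded buffer: every row occupies pitchBytes bytes.
    std::vector<float> padded(pitchBytes / sizeof(float) * height, 0.0f);

    copy2D(padded.data(), pitchBytes, tight.data(), rowBytes, rowBytes, height);
    copy2D(back.data(), rowBytes, padded.data(), pitchBytes, rowBytes, height);
    return tight == back;
}
```

The point of the sketch is that only the two pitches differ between the calls; the width in bytes and the height describe the payload and stay the same in both directions.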

   
 /************************************************************************/
 /* An example CUDA program using cudaMallocPitch().                     */
 /************************************************************************/

 #include <stdio.h>
 #include <stdlib.h>
 #include <cuda_runtime.h>

 /************************************************************************/
 /* myKernel: walk the pitched 2D array row by row and print each value  */
 /************************************************************************/
 __global__ void myKernel(float* devPtr, int height, int width, size_t pitch)
 {
     for (int r = 0; r < height; ++r)
     {
         // pitch is in bytes, so step from row to row through a char* cast
         float* row = (float*)((char*)devPtr + r * pitch);
         for (int c = 0; c < width; ++c)
         {
             float element = row[c];
             printf("%f\n", element); // originally run in emulation mode
         }
     }
 }

 /************************************************************************/
 /* Main CUDA                                                            */
 /************************************************************************/
 int main(int argc, char* argv[])
 {
     size_t width = 10;
     size_t height = 10;

     float* devPtr;
     // pitch must be size_t; an int does not match the function parameter
     size_t pitch;
     cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
     // zero the allocation so the kernel prints defined values
     cudaMemset2D(devPtr, pitch, 0, width * sizeof(float), height);
     myKernel<<<1, 1>>>(devPtr, (int)height, (int)width, pitch);
     cudaDeviceSynchronize(); // wait for the kernel and flush its printf output
     cudaFree(devPtr);

     printf("%zu\n", pitch);

     return 0;
 }

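Understanding pitch:

When C allocates a 2D array, the rows are normally stored contiguously: a[y][x] lives at byte offset y*widthofx*sizeof(element) + x*sizeof(element). In CUDA global memory, however, contiguous accesses that start at an aligned address are the most efficient; this post uses 256-byte alignment (addr = 0, 256, 512, ...) as its example. cudaMallocPitch exists to improve memory access efficiency in this way.

In memory allocated by cudaMallocPitch, the first element of every row is guaranteed to start at an aligned address. Because the number of elements per row is arbitrary, widthofx*sizeof(element) is not necessarily a multiple of 256. To keep the start of every row aligned, cudaMallocPitch therefore allocates some extra bytes per row, so that widthofx*sizeof(element) + extra bytes is a multiple of 256 (aligned). As a result, y*widthofx*sizeof(element) + x*sizeof(element) no longer gives the correct address of a[y][x].
The correct offset is y*[widthofx*sizeof(element) + extra bytes] + x*sizeof(element).
The pitch returned by the function is exactly widthofx*sizeof(element) + extra bytes. The actual alignment is chosen by the runtime for the device, so always use the returned pitch instead of assuming 256.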
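The padding arithmetic can be written out as a small sketch. The function names below are hypothetical, and the alignment is an input chosen by the caller purely for illustration: the real pitch is decided by cudaMallocPitch() and must be read from its output parameter, never computed this way.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: rounds a row width in bytes up to the next multiple
// of `alignment`. This models the idea of padding each row; it is not how
// a program should obtain the pitch of a real allocation.
inline std::size_t paddedPitch(std::size_t widthInBytes, std::size_t alignment)
{
    assert(alignment > 0);
    return (widthInBytes + alignment - 1) / alignment * alignment;
}

// Byte offset of a[y][x] in a tightly packed array (plain C layout).
inline std::size_t packedOffset(std::size_t y, std::size_t x,
                                std::size_t widthOfX, std::size_t elemSize)
{
    return y * widthOfX * elemSize + x * elemSize;
}

// Byte offset of a[y][x] in a pitched array: rows are `pitch` bytes apart.
inline std::size_t pitchedOffset(std::size_t y, std::size_t x,
                                 std::size_t pitch, std::size_t elemSize)
{
    return y * pitch + x * elemSize;
}
```

For rows of 10 floats (40 bytes) and an assumed alignment of 256, `paddedPitch(40, 256)` gives 256, so a[2][3] sits at byte offset 524 in the pitched layout, compared with 92 in the packed layout.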
