Here is code where the problem can occur:
template<int COL_NUM>
__global__ void kCopy(int row_size, float **col_valuess, float *row_values) {
    int row_index = blockIdx.x * blockDim.x + threadIdx.x;
    if (row_index >= row_size) return;
    // Cache the column base pointers in shared memory.
    volatile __shared__ float* shared_col_valuess[COL_NUM];
    int col_index = threadIdx.x;
    if (col_index < COL_NUM) shared_col_valuess[col_index] = col_valuess[col_index];
    __syncthreads();
    // Sum this row's value across all columns.
    float row_value = 0.0f;
    #pragma unroll
    for (int col_index = 0; col_index < COL_NUM; col_index++) {
        row_value += shared_col_valuess[col_index][row_index];
    }
    row_values[row_index] = row_value;
}
Can you spot the problem?
The issue is the order of the steps:
1. Check whether row_index is valid.
2. Cache the col_valuess pointers in shared memory.
3. Compute the value for each row.
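In step 1, a thread can have threadIdx.x < COL_NUM while its blockIdx.x * blockDim.x + threadIdx.x >= row_size.
That thread returns before step 2, so the pointer it was responsible for is never written to shared memory. This happens, for example, in the last block when fewer than COL_NUM of its rows are valid.
When step 3 runs, the remaining threads of that block read an uninitialized pointer for that column, and that is the bug.
The fix is to cache the pointers in shared memory first: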
template<int COL_NUM>
__global__ void kCopy(int row_size, float **col_valuess, float *row_values) {
    // Cache the column base pointers first, while every thread is still running.
    volatile __shared__ float* shared_col_valuess[COL_NUM];
    int col_index = threadIdx.x;
    if (col_index < COL_NUM) shared_col_valuess[col_index] = col_valuess[col_index];
    __syncthreads();
    // Only now drop the threads that have no row to process.
    int row_index = blockIdx.x * blockDim.x + threadIdx.x;
    if (row_index >= row_size) return;
    float row_value = 0.0f;
    #pragma unroll
    for (int col_index = 0; col_index < COL_NUM; col_index++) {
        row_value += shared_col_valuess[col_index][row_index];
    }
    row_values[row_index] = row_value;
}
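To see the corrected ordering handle the problematic tail block, a host-side driver along these lines could be used. This is only a sketch: the helper name sum_columns is hypothetical and not from the original code, error checking on the CUDA calls is omitted for brevity, and the kernel is repeated so that the block compiles on its own.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Corrected kernel from the text: pointers are cached before any thread returns.
template <int COL_NUM>
__global__ void kCopy(int row_size, float **col_valuess, float *row_values) {
    volatile __shared__ float *shared_col_valuess[COL_NUM];
    int col_index = threadIdx.x;
    if (col_index < COL_NUM) shared_col_valuess[col_index] = col_valuess[col_index];
    __syncthreads();
    int row_index = blockIdx.x * blockDim.x + threadIdx.x;
    if (row_index >= row_size) return;
    float row_value = 0.0f;
    #pragma unroll
    for (int c = 0; c < COL_NUM; c++) {
        row_value += shared_col_valuess[c][row_index];
    }
    row_values[row_index] = row_value;
}

// Hypothetical host helper (an assumption for illustration): copies COL_NUM
// equally sized columns to the device, launches kCopy, and returns the
// per-row sums. CUDA return codes are not checked in this sketch.
template <int COL_NUM>
std::vector<float> sum_columns(const std::vector<std::vector<float>> &cols,
                               int block_size) {
    int row_size = static_cast<int>(cols[0].size());
    size_t col_bytes = row_size * sizeof(float);

    float *d_cols[COL_NUM];  // device pointers, held on the host for now
    for (int c = 0; c < COL_NUM; c++) {
        cudaMalloc(&d_cols[c], col_bytes);
        cudaMemcpy(d_cols[c], cols[c].data(), col_bytes, cudaMemcpyHostToDevice);
    }

    float **d_col_ptrs = nullptr;  // device array holding the column pointers
    cudaMalloc(&d_col_ptrs, COL_NUM * sizeof(float *));
    cudaMemcpy(d_col_ptrs, d_cols, COL_NUM * sizeof(float *), cudaMemcpyHostToDevice);

    float *d_out = nullptr;
    cudaMalloc(&d_out, col_bytes);

    int grid = (row_size + block_size - 1) / block_size;  // round up
    kCopy<COL_NUM><<<grid, block_size>>>(row_size, d_col_ptrs, d_out);

    std::vector<float> out(row_size);
    cudaMemcpy(out.data(), d_out, col_bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel is done

    for (int c = 0; c < COL_NUM; c++) cudaFree(d_cols[c]);
    cudaFree(d_col_ptrs);
    cudaFree(d_out);
    return out;
}
```

Calling sum_columns<4>(cols, 128) with 130 rows per column hits exactly the case described above: the second block has only two valid rows, so threads 2 and 3 of that block are out of range even though they are responsible for caching the pointers of columns 2 and 3.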
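So, my recommendation:
When using shared memory,
make sure that no thread can exit early, because of some condition, before the shared memory has been populated.
If some threads do have to exit early, move the code that caches the pointers into shared memory ahead of that exit.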
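Another way to follow this advice is to avoid the early return altogether and guard the per-row work with a condition instead, so that every thread of the block reaches __syncthreads(). The sketch below is one possible variant, not the original author's code: the names kSumGuarded and check_guarded are hypothetical, the strided loading loop is an added assumption to cover the case COL_NUM > blockDim.x, and managed memory is used only to keep the host side short.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of the kernel: no thread returns before the barrier.
template <int COL_NUM>
__global__ void kSumGuarded(int row_size, float **col_valuess, float *row_values) {
    __shared__ float *shared_col_valuess[COL_NUM];

    // Strided load: each thread caches every blockDim.x-th pointer, so all
    // COL_NUM slots are filled even when COL_NUM exceeds the block size.
    for (int c = threadIdx.x; c < COL_NUM; c += blockDim.x) {
        shared_col_valuess[c] = col_valuess[c];
    }
    __syncthreads();  // reached by every thread in the block

    int row_index = blockIdx.x * blockDim.x + threadIdx.x;
    if (row_index < row_size) {  // guard the work instead of returning early
        float row_value = 0.0f;
        #pragma unroll
        for (int c = 0; c < COL_NUM; c++) {
            row_value += shared_col_valuess[c][row_index];
        }
        row_values[row_index] = row_value;
    }
}

// Hypothetical host check (an assumption for illustration): fills every column
// with 1.0f and verifies that each row sums to COL_NUM. Error checking on the
// CUDA calls is omitted in this sketch.
template <int COL_NUM>
bool check_guarded(int row_size, int block_size) {
    float **cols = nullptr;
    float *out = nullptr;
    cudaMallocManaged(&cols, COL_NUM * sizeof(float *));
    cudaMallocManaged(&out, row_size * sizeof(float));
    for (int c = 0; c < COL_NUM; c++) {
        cudaMallocManaged(&cols[c], row_size * sizeof(float));
        for (int r = 0; r < row_size; r++) cols[c][r] = 1.0f;
    }

    int grid = (row_size + block_size - 1) / block_size;  // round up
    kSumGuarded<COL_NUM><<<grid, block_size>>>(row_size, cols, out);
    cudaDeviceSynchronize();  // wait before reading managed memory on the host

    bool ok = true;
    for (int r = 0; r < row_size; r++) ok = ok && (out[r] == float(COL_NUM));

    for (int c = 0; c < COL_NUM; c++) cudaFree(cols[c]);
    cudaFree(cols);
    cudaFree(out);
    return ok;
}
```

Guarding the work rather than returning keeps the barrier safe even if more __syncthreads() calls are added later in the kernel, since the CUDA programming guide only permits __syncthreads() in conditional code when the condition evaluates identically across the whole block. A call such as check_guarded<8>(10, 4) exercises both the tail block and the case where there are more columns than threads per block.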