CUDA Study Notes (79)

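Notes on __host__ __device__ lambdas:
Unlike __device__ lambdas, __host__ __device__ lambdas can be called from host code. As described earlier, the CUDA compiler replaces an extended lambda expression defined in host code with an instance of a named placeholder type. The placeholder type for an extended __host__ __device__ lambda invokes the original lambda's operator() through an indirect function call.
Because of this indirect function call, the host compiler may optimize an extended __host__ __device__ lambda less well than a lambda that is implicitly or explicitly __host__ only. In the latter case, the host compiler can easily inline the lambda body into the calling context. With an extended __host__ __device__ lambda, however, the host compiler encounters the indirect function call and may not be able to inline the original lambda body easily.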
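To make the inlining point concrete, here is a minimal host-only C++ sketch. It does not reproduce the real nvcc placeholder type: `indirect_wrapper_t`, `call_direct`, and `call_indirect` are hypothetical names invented for illustration, and the only assumption is that the wrapper forwards through a stored function pointer, which is one way an indirect call can arise.

```cpp
#include <cassert>

// Hypothetical stand-in for a placeholder type (NOT the real nvcc type):
// it stores a plain function pointer and forwards operator() through it,
// so every call is an indirect function call.
template <typename Ret, typename... Args>
struct indirect_wrapper_t {
    Ret (*fn)(Args...);
    Ret operator()(Args... args) const { return fn(args...); }
};

// Direct path: the closure's operator() is visible at the call site,
// so the compiler can inline the lambda body without extra analysis.
inline int call_direct(int x) {
    auto lam = [](int v) { return v + 1; };
    return lam(x);
}

// Indirect path: the captureless lambda is converted to a function pointer
// and the call goes through that pointer inside the wrapper.
inline int call_indirect(int x) {
    int (*fp)(int) = [](int v) { return v + 1; };
    indirect_wrapper_t<int, int> w{ fp };
    return w(x);
}

// Both paths compute the same value; only the shape of the call differs.
inline void check_same_result() {
    assert(call_direct(1) == call_indirect(1));
}
```

Whether a particular compiler manages to see through the pointer in a case this small depends on its optimizer. The sketch only shows why the indirect form gives it more work to do.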
*this Capture By Value:
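When a lambda is defined within a non-static class member function and its body refers to a class member variable, C++11/C++14 rules require that the this pointer of the class be captured by value, rather than the referenced member variable. If the lambda is an extended __device__ or __host__ __device__ lambda defined in a host function and it is executed on the GPU, accessing the referenced member variable on the GPU causes a run-time error if the this pointer points to host memory.
Example: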

#include <cstdio>
template <typename T>
__global__ void foo(T in) { printf("\n value = %d", in()); }
struct S1_t {
    int xxx;
    __host__ __device__ S1_t(void) : xxx(10) { };
    void doit(void) {
        auto lam1 = [=] __device__{
            // reference to "xxx" causes
            // the 'this' pointer (S1_t*) to be captured by value
            return xxx + 1;
        };
        // Kernel launch fails at run time because 'this->xxx'
        // is not accessible from the GPU
        foo<<<1, 1>>>(lam1);
        cudaDeviceSynchronize();
    }
};
int main(void) {
    S1_t s1;
    s1.doit();
}

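C++17 solves this problem by adding a new "*this" capture mode. In this mode, the compiler makes a copy of the object denoted by "*this" instead of capturing the pointer this by value. The "*this" capture mode is described in more detail here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0018r3.html
When the --expt-extended-lambda nvcc flag is used, the CUDA compiler supports the "*this" capture mode for lambdas defined within __device__ and __global__ functions and for extended __device__ lambdas defined in host code.
The example above, modified to use the "*this" capture mode: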

#include <cstdio>
template <typename T>
__global__ void foo(T in) { printf("\n value = %d", in()); }
struct S1_t {
    int xxx;
    __host__ __device__ S1_t(void) : xxx(10) { };
    void doit(void) {
        // note the "*this" capture specification
        auto lam1 = [=, *this] __device__{
            // reference to "xxx" causes
            // the object denoted by '*this' to be captured by
            // value, and the GPU code will access copy_of_star_this->xxx
            return xxx + 1;
        };
        // Kernel launch succeeds
        foo<<<1, 1>>>(lam1);
        cudaDeviceSynchronize();
    }
};
int main(void) {
    S1_t s1;
    s1.doit();
}
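
The difference between capturing the this pointer and capturing a copy of the object can also be observed without a GPU. The following is a minimal sketch in plain C++17, compiled with an ordinary host compiler rather than nvcc, so the CUDA-specific restrictions discussed in this post do not come into play. `counter_t` and the helper functions are hypothetical names used only for illustration.

```cpp
#include <cassert>

struct counter_t {
    int xxx;
    explicit counter_t(int v) : xxx(v) { }

    // Captures the 'this' pointer: the closure reads the live object.
    auto by_pointer() { return [this] { return xxx + 1; }; }

    // C++17 "*this" capture: the closure holds its own copy of the object.
    auto by_copy() { return [=, *this] { return xxx + 1; }; }
};

// The closure follows the pointer, so it sees the update made after capture.
inline int read_after_update_pointer() {
    counter_t c(10);
    auto lam = c.by_pointer();
    c.xxx = 20;
    return lam();  // 20 + 1
}

// The closure reads its private copy, taken while xxx was still 10.
inline int read_after_update_copy() {
    counter_t c(10);
    auto lam = c.by_copy();
    c.xxx = 20;
    return lam();  // 10 + 1
}

inline void check_capture_modes() {
    assert(read_after_update_pointer() == 21);
    assert(read_after_update_copy() == 11);
}
```

The same distinction explains the GPU behavior above: a closure that only carries a pointer depends on the memory behind that pointer being reachable, while a closure that carries a copy of the object does not.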

The "*this" capture mode is not allowed for unannotated lambdas defined in host code, or for extended __host__ __device__ lambdas. Examples of supported and unsupported usage:

struct S1_t {
    int xxx;
    __host__ __device__ S1_t(void) : xxx(10) { };
    void host_func(void) {
        // OK: use in an extended __device__ lambda
        auto lam1 = [=, *this] __device__{ return xxx; };
        // Error: use in an extended __host__ __device__ lambda
        auto lam2 = [=, *this] __host__ __device__{ return xxx; };
        // Error: use in an unannotated lambda in host function
        auto lam3 = [=, *this]{ return xxx; };
    }
    __device__ void device_func(void) {
        // OK: use in a lambda defined in a __device__ function
        auto lam1 = [=, *this] __device__{ return xxx; };
        // OK: use in a lambda defined in a __device__ function
        auto lam2 = [=, *this] __host__ __device__{ return xxx; };
        // OK: use in a lambda defined in a __device__ function
        auto lam3 = [=, *this]{ return xxx; };
    }
    __host__ __device__ void host_device_func(void) {
        // OK: use in an extended __device__ lambda
        auto lam1 = [=, *this] __device__{ return xxx; };
        // Error: use in an extended __host__ __device__ lambda
        auto lam2 = [=, *this] __host__ __device__{ return xxx; };
        // Error: use in an unannotated lambda in a __host__ __device__ function
        auto lam3 = [=, *this]{ return xxx; };
    }
};

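Additional notes:
ADL lookup: As described earlier, the CUDA compiler replaces an extended lambda expression with an instance of a placeholder type before invoking the host compiler. One template argument of the placeholder type uses the address of the function enclosing the original lambda expression. For any host function call whose argument types involve the closure type of an extended lambda expression, this may cause additional namespaces to participate in argument-dependent lookup (ADL). As a result, the host compiler may select an incorrect function.
Example: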

namespace N1 {
    struct S1_t { };
    template <typename T> void foo(T);
};
namespace N2 {
    template <typename T> int foo(T);
    template <typename T> void doit(T in) { foo(in); }
}
void bar(N1::S1_t in)
{
    /* extended __device__ lambda. In the code sent to the host compiler,
    this is replaced with the placeholder type instantiation expression
    ' __nv_dl_wrapper_t< __nv_dl_tag<void (*)(N1::S1_t in),(&bar),1> > { }'
    As a result, the namespace 'N1' participates in ADL lookup of the
    call to "foo" in the body of N2::doit, causing ambiguity.
    */
    auto lam1 = [=] __device__{};
    N2::doit(lam1);
}

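In the example above, the CUDA compiler replaced the extended lambda with a placeholder type that involves the N1 namespace. As a result, namespace N1 participates in the ADL lookup for foo(in) in the body of N2::doit, and host compilation fails because multiple overload candidates, N1::foo and N2::foo, are found.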
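
The ADL effect can be reproduced with an ordinary host compiler. The sketch below is an assumption-laden imitation: `tag_t` and `wrapper_t` are hypothetical stand-ins that only mimic the shape of the placeholder type named in the comment above, and `N3`, `doit_adl`, and `doit_qualified` are invented for illustration. To keep the code compilable, the unqualified call is placed in a namespace that has no `foo` of its own, so the call resolves to `N1::foo` purely through ADL instead of producing an ambiguity.

```cpp
#include <cassert>
#include <string>

namespace N1 {
    struct S1_t { };
    template <typename T> std::string foo(T) { return "N1::foo"; }
}

// Hypothetical stand-ins that only mimic the shape of the placeholder type.
template <typename FnPtr, FnPtr fn, unsigned id> struct tag_t { };
template <typename Tag> struct wrapper_t { };

namespace N2 {
    template <typename T> std::string foo(T) { return "N2::foo"; }
    // Qualified call: ADL is not performed, so only N2::foo is considered.
    template <typename T> std::string doit_qualified(T in) { return N2::foo(in); }
}

namespace N3 {
    // No 'foo' is declared here, so this unqualified call can only be
    // resolved through ADL when the template is instantiated.
    template <typename T> std::string doit_adl(T in) { return foo(in); }
}

void bar(N1::S1_t) { }

// N1 appears only inside the function pointer type used as a template
// argument, yet that is enough to make N1 an associated namespace.
using placeholder_t = wrapper_t<tag_t<void (*)(N1::S1_t), &bar, 1>>;

inline std::string adl_result() { return N3::doit_adl(placeholder_t{}); }
inline std::string qualified_result() { return N2::doit_qualified(placeholder_t{}); }

inline void check_lookup() {
    assert(adl_result() == "N1::foo");
    assert(qualified_result() == "N2::foo");
}
```

Qualifying the call, as `doit_qualified` does, is one standard C++ way to switch ADL off for that call. Whether it is an appropriate remedy for a given code base is a separate question that this sketch does not settle.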