AMD ROCM软件栈 -clr项目分析

最新推荐文章于 2024-05-11 09:54:29 发布

光明顶魔镜工程师

最新推荐文章于 2024-05-11 09:54:29 发布

阅读量1k

点赞数 10

文章标签：深度学习 AMD rocm clr

本文链接：https://blog.csdn.net/a1171936767/article/details/137911399

版权

CLR

项目层级结构

CLR 编译依赖项目：

comgr ：Comgr 库提供用于编译和检查 AMDGPU 代码对象的应用程序接口

rocclr/device/comgrctx.hpp  #gpu代码编译支持

CLR项目目录结构

hipamd ： hip编程模型api实现
opencl：opencl 2.0 编程模型api实现
rocclr：对 hip编程模型、opencl编程模型的支持，对设备的操作进行了抽象

opencl/amdocl

1. clGetPlatformIDs

核心代码如下

  if (!amd::Runtime::initialized()) {
    amd::Runtime::init();
  }

在amd::Runtime::init();中执行的多个init，其中roc::Device::init进行了设备扫描，并将设备注册保存到一个列表，供后续代码使用

在Device::init()的实现 rocclr/device/device.cpp +470中我们主要关注HSA部分，核心代码如下

  if (HSA_STATUS_SUCCESS != hsa_iterate_agents(iterateAgentCallback, nullptr)) {
    return false;
  }

本段代码调用了iterateAgentCallback枚举了agents，在iterateAgentCallback中调用了hsa_amd_agent_iterate_memory_pools进行具体实现。agent是HSA api中的一个概念，相当于一个handle用于表示一个参与了HSA内存模型的设备，通过agent可以提交AQL包执行。

// IMPORTANT: Note that we are initialiing HSA stack first and then
// GPU stack. The order of initialization is signiicant and if changed
// amd::Device::registerDevice() must be accordingly modified.
#if defined(WITH_HSA_DEVICE)
  if ((GPU_ENABLE_PAL != 1) || flagIsDefault(GPU_ENABLE_PAL)) {
    // Return value of roc::Device::init()
    // If returned false, error initializing HSA stack.
    // If returned true, either HSA not installed or HSA stack
    //                   successfully initialized.
    ret = roc::Device::init();

    if (!amd::IS_HIP) {   //OPENCL
      ret |= roc::NullDevice::init();
    }
  }
#endif  // WITH_HSA_DEVICE

对于设备的初始化根据编程模型的不同分为了HIP和OPENCL

继续查看 roc::Device::init();具体的设备初始化 rocclr/…/rocm/rocdevice.cpp +441，这里调用了HSA的api获取设备信息进行解析。roc_device的创建需要传入agent，用于实现具体的设备操作。

  for (auto agent : gpu_agents_) {
    std::unique_ptr<Device> roc_device(new Device(agent));
    if (!roc_device) {
      LogError("Error creating new instance of Device on then heap.");
      continue;
    }
    if (!roc_device->create()) {  //调用hsa接口执行设备创建
      LogError("Error creating new instance of Device.");
      continue;
    }
    // Check to see if a global CU mask is requested
    if (amd::IS_HIP && ROC_GLOBAL_CU_MASK[0] != '\0') {
      roc_device->getGlobalCUMask(ROC_GLOBAL_CU_MASK);
    }

    roc_device.release()->registerDevice();
  }

在roc_device->create()中会调用hsa abi获取设备的isa对并为其创建一些基本对象，最后调用roc_device.release()->registerDevice() 进行注册。

2. clCreateCommandQueueWithProperties

核心代码如下 opencl/amdocl/cl_command.cpp:146

  amd::CommandQueue* queue = NULL;
  {
    amd::ScopedLock lock(amdContext.lock());

    // Check if the app creates a host queue
    if (!(properties & CL_QUEUE_ON_DEVICE)) {
      queue = new amd::HostQueue(amdContext, amdDevice, properties, queueRTCUs, priority);
    } else {
      // Is it a device default queue
      if (properties & CL_QUEUE_ON_DEVICE_DEFAULT) {
        queue = amdContext.defDeviceQueue(amdDevice);
        // If current context has one already then return it
        if (NULL != queue) {
          queue->retain();
          *not_null(errcode_ret) = CL_SUCCESS;
          return as_cl(queue);
        }
      }
      // Check if runtime can allocate a new device queue on this context
      if (amdContext.isDevQueuePossible(amdDevice)) {
        queue = new amd::DeviceQueue(amdContext, amdDevice, properties, queueSize);
      }
    }

      if (amd::Agent::shouldPostCommandQueueEvents()) {
        amd::Agent::postCommandQueueCreate(as_cl(queue->asCommandQueue()));
      }

      *not_null(errcode_ret) = CL_SUCCESS;
      return as_cl(queue);
  }

解析期望的queue properties，例如队列大小，以及amd特有的特性支持
判断在host还是GPU上创建队列
执行队列创建 new amd::HostQueue、new amd::DeviceQueue

创建的host队列用一个线程承载，每个host队列有一个loop处理队列的事务

    //! The command queue thread entry point.
    void run(void* data) {
      HostQueue* queue = static_cast<HostQueue*>(data);
      virtualDevice_ = queue->device().createVirtualDevice(queue);
      if (virtualDevice_ != nullptr) {
        queue->loop(virtualDevice_);
        Release();
      } else {
        acceptingCommands_ = false;
        queue->flush();
      }
    }

queue->loop(virtualDevice_); 实现在rocclr/platform/commandqueue.cpp +168，是一个while(true)结构，他不断的从队列中取出command处理，处理分为两部分，第一部分为处理命令的事件(+193 - 208)、第二部分为命令的执行(+223 - 233)

处理命令的事件 cclr/platform/commandqueue.cpp +182

    // Process the command's event wait list.
    const Command::EventWaitList& events = command->eventWaitList();
    bool dependencyFailed = false;
    ClPrint(LOG_DEBUG, LOG_CMD, "Command (%s) processing: %p ,events.size(): %d",
            getOclCommandKindString(command->type()), command, events.size());
    for (const auto& it : events) { //枚举每个事件
      // Only wait if the command is enqueued into another queue.
      if (it->command().queue() != this) {
        // Runtime has to flush the current batch only if the dependent wait is blocking
        if (it->command().status() != CL_COMPLETE) {
          ClPrint(LOG_DEBUG, LOG_CMD, "Command (%s) %p awaiting event: %p", getOclCommandKindString(command->type()), command, it);
          virtualDevice->flush(head, true); //硬件相关
          tail = head = NULL;
          dependencyFailed |= !it->awaitCompletion(); //等待事件
        }
      }
    }

命令执行 cclr/platform/commandqueue.cpp +217
```
    // Submit to the device queue.
    command->submit(*virtualDevice);
```
这里命令的执行其实就是把命令提交到设备的队列上去，amd::Command的submit定义是一个纯虚函数，其实现在具体的Command中，例如WriteMemoryCommand、ReadMemoryCommand，WriteMemoryCommand类中submit实现如下：
```
virtual void submit(device::VirtualDevice& device) { device.submitWriteMemory(*this); }
```
device.submitWriteMemory的定义也是个纯虚函数，由继承它的类实现，例如VirtualGPU::submitWriteMemory 实现在rocclr/device/rocm/rocvirtual.cpp +1658

在命令队列创建完毕后，后续接口基本都依赖于命令队列，基本流程为创建一个命令例如WriteMemoryCommand、ReadMemoryCommand、NDRangeKernelCommand，然后调用enqueue方法，放到队列中，然后被队列的loop部分取出处理

postCommandQueueCreate中的agent->callbacks_.CommandQueueCreate来源于Agent::setCallbacks rocclr/platform/agent.cpp +81

static int32_t CL_API_CALL SetCallbacks(vdi_agent* agent, const vdi_agent_callbacks* callbacks,
                                       size_t size) {
  return Agent::get(agent)->setCallbacks(callbacks, size);
}

int32_t Agent::setCallbacks(const vdi_agent_callbacks* callbacks, size_t size) {
  // FIXME_lmoriche: check size
  memcpy(&callbacks_, callbacks, size);
  return CL_SUCCESS;
}

SetCallbacks是对外提供的一个API，callback函数列表似乎是外部设置的,vdi_agent_callbacks定义在rocclr/include/vdi_agent_amd.h +97

typedef struct _vdi_agent_callbacks {
  /* Context Callbacks */
  acContextCreate_fn ContextCreate;
  acContextFree_fn ContextFree;

  /* Command Queue Callbacks */
  acCommandQueueCreate_fn CommandQueueCreate;
  acCommandQueueFree_fn CommandQueueFree;

  /* Event Callbacks */
  acEventCreate_fn EventCreate;
  acEventFree_fn EventFree;
  acEventStatusChanged_fn EventStatusChanged;

  /* Memory Object Callbacks */
  acMemObjectCreate_fn MemObjectCreate;
  acMemObjectFree_fn MemObjectFree;
  acMemObjectAcquired_fn MemObjectAcquired;

  /* Sampler Callbacks */
  acSamplerCreate_fn SamplerCreate;
  acSamplerFree_fn SamplerFree;

  /* Program Callbacks */
  acProgramCreate_fn ProgramCreate;
  acProgramFree_fn ProgramFree;
  acProgramBuild_fn ProgramBuild;

  /* Kernel Callbacks */
  acKernelCreate_fn KernelCreate;
  acKernelFree_fn KernelFree;
  acKernelSetArg_fn KernelSetArg;

} vdi_agent_callbacks;

3. clGetCommandQueueInfo

代码实现入口在opencl/amdocl/cl_command.cpp +317

代码根据param_name在switch中对各种情况进行处理，最终都调用amd::clGetInfo()实现，此函数在opencl/amdocl/cl_common.hpp +43实现

template <typename T>
static inline cl_int
clGetInfo(
    T& field,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret)
{
    const void *valuePtr;
    size_t valueSize;

    std::tie(valuePtr, valueSize)		//解绑元组信息到valuePtr、valueSize变量
        = detail::ParamInfo<typename std::remove_const<T>::type>::get(field);
    /*......*/
        if (param_value != NULL) {  //信息拷贝
        ::memcpy(param_value, valuePtr, valueSize);
        if (param_value_size > valueSize) {
            ::memset(static_cast<address>(param_value) + valueSize,
                '\0', param_value_size - valueSize);
        }
    }
}

信息存储在detail类中，通过其get方法可以获取到field对应的数据，数据以元组方式保存，包含数据地址和大小，最终使用memcpy拷贝到param_value指向的内存，完成信息的获取

4. clRetainCommandQueue、clReleaseCommandQueue

clRetainCommandQueue主要功能是增加队列的引用计数，代码实现入口在opencl/amdocl/cl_command.cpp +259

RUNTIME_ENTRY(cl_int, clRetainCommandQueue, (cl_command_queue command_queue)) {
  if (!is_valid(command_queue)) {
    return CL_INVALID_COMMAND_QUEUE;
  }
  as_amd(command_queue)->retain();
  return CL_SUCCESS;
}
RUNTIME_EXIT

as_amd(command_queue)->retain(); 实现在rocclr/platform/runtime.cpp +109

uint ReferenceCountedObject::retain() {
    //referenceCount_ 为std::atomic类型
  return referenceCount_.fetch_add(1, std::memory_order_relaxed) + 1;
}

以原子方式将referenceCount_加一，对于clReleaseCommandQueue则是将引用计数减一

5. clSetDefaultDeviceCommandQueue

6. clCreateBuffer

核心代码如下

  // check if the ptr is in the svm space, if yes, we need return SVM buffer
  amd::Memory* svmMem = amd::MemObjMap::FindMemObj(host_ptr);
  if ((NULL != svmMem) && (flags & CL_MEM_USE_HOST_PTR)) {
    size_t svmSize = svmMem->getSize();
    size_t offset = static_cast<address>(host_ptr) - static_cast<address>(svmMem->getSvmPtr());
    if (size + offset > svmSize) {
      LogWarning("invalid parameter \"size\"");
      return (cl_mem)0;
    }
    mem = new (amdContext) amd::Buffer(*svmMem, flags, offset, size);
    svmMem->setHostMem(host_ptr);
  } else {
    mem = new (amdContext) amd::Buffer(amdContext, flags, size);
  }

 mem->create(host_ptr)

代码会检查在host还是GPU上创建buffer ，向amd::Buffer传入不同的context，最后调用mem的create方法

amd::Buffer继承关系如下

7. clEnqueueNDRangeKernel

核心代码如下

amd::NDRangeKernelCommand* command = new amd::NDRangeKernelCommand(hostQueue, eventWaitList, *as_amd(kernel), ndrange);
command->enqueue();

NDRangeKernelCommand继承关系如下

8. clEnqueueWriteBuffer

核心代码如下

  amd::CopyMetadata copyMetadata(!blocking_write, amd::CopyMetadata::CopyEnginePreference::SDMA);
  amd::WriteMemoryCommand* command = new amd::WriteMemoryCommand(
      hostQueue, CL_COMMAND_WRITE_BUFFER, eventWaitList, *dstBuffer, dstOffset, dstSize,
      ptr, 0, 0, copyMetadata);
  command->enqueue();

先创建了一个amd::WriteMemoryCommand，然后调用它的enqueue方法。

WriteMemoryCommand继承关系如下

9. clEnqueueReadBuffer

核心代码如下

 amd::ReadMemoryCommand* command =
      new amd::ReadMemoryCommand(hostQueue, CL_COMMAND_READ_BUFFER_RECT, eventWaitList, *srcBuffer,
                                 srcStart, size, ptr, bufRect, hostRect, copyMetadata);
 command->enqueue();

ReadMemoryCommand继承关系如下

通过上面对queue的分析可知，ReadMemoryCommand的具体执行是在它的submit()方法中，rocclr/platform/command.hpp+476

class ReadMemoryCommand : public OneMemoryArgCommand {
    ...
	virtual void submit(device::VirtualDevice& device) { device.submitReadMemory(*this); }
    ...
}

VirtualGPU::submitReadMemory
	-->DmaBlitManager::readBuffer
		-->kb->copyBuffer()
		-->hsaCopyStaged()

readBuffer的实现在rocclr/…/rocm/rocblit.cpp +64

基于BlitManager派生了3个类，用于处理不同类型的readBuffer操作

最终的实现是调用hsa_amd_memory_copy_xxx的api。

10. clCreateUserEvent

核心代码如下：

RUNTIME_ENTRY_RET(cl_event, clCreateUserEvent, (cl_context context, cl_int* errcode_ret)) {
  if (!is_valid(context)) {
    *not_null(errcode_ret) = CL_INVALID_CONTEXT;
    return (cl_event)0;
  }

  amd::Event* event = new amd::UserEvent(*as_amd(context));
  if (event == NULL) {
    *not_null(errcode_ret) = CL_OUT_OF_HOST_MEMORY;
    return (cl_event)0;
  }

  event->retain(); //增加引用计数
  *not_null(errcode_ret) = CL_SUCCESS;
  return as_cl(event);
}
RUNTIME_EXIT

主要依赖amd::UserEvent实现，代码先new了一个amd::UserEvent，然后对引用计数加一，amd::UserEvent构造代码在platform/command.hpp+369，构造代码主要实例化了Command，context_并将Event状态设置为CL_SUBMITTED

  UserEvent(Context& context) : Command(CL_COMMAND_USER), context_(context) {
    setStatus(CL_SUBMITTED);
  }

setStatus实现代码在platform/command.cpp +112，主要进行了以下操作：

性能统计处理
处理状态变化的回调函数
调用compare_exchange_strong() 设置currentStatus
根据Agent::shouldPostEventEvents() 调用Agent的postEventStatusChanged
检查status，如果已经完成则进行释放相关的操作

amd::UserEvent继承关系如下

11. clSetUserEventStatus

clSetUserEventStatus中的setStatus与clCreateUserEvent中的一样，均是amd::Event中的方法

12. clWaitForEvents

clWaitForEvents可以等待多个事件，这些事件必须在同一个context。

awaitCompletion实现如下

bool Event::awaitCompletion() {
  if (status() > CL_COMPLETE) {
    // Notifies the current command queue about waiting
    if (!notifyCmdQueue(kCpuWait)) {
      return false;
    }

    ClPrint(LOG_DEBUG, LOG_WAIT, "Waiting for event %p to complete, current status %d",
      this, status());
    auto* queue = command().queue();
    if ((queue != nullptr) && queue->vdev()->ActiveWait()) {
      while (status() > CL_COMPLETE) { //等待状态完成
        amd::Os::yield();
      }
    } else {
      ScopedLock lock(lock_);

      // Wait until the status becomes CL_COMPLETE or negative.
      while (status() > CL_COMPLETE) {
        lock_.wait();
      }
    }
    ClPrint(LOG_DEBUG, LOG_WAIT, "Event %p wait completed", this);
  }

  return status() == CL_COMPLETE;
}

在notifyCmdQueue中主要是提交了Marker命令，后续根据queue->vdev()->ActiveWait()的不同执行了不同方式的等待。等待就是等待status发生变化，status由Event::setStatus() 设置，在device/rocm/rocvirtual.cpp、platform/command.cpp、platform/commandqueue.cpp中有比较多的地方调用了此接口。

13. clGetEventInfo

此API实现与clGetCommandQueueInfo相同

14. clSetEventCallback

用于注册一个回调函数，以便在特定事件状态发生变化时异步地通知应用程序

具体实现位于rocclr/platform/command.cpp +119

bool Event::setCallback(int32_t status, Event::CallBackFunction callback, void* data) {
  assert(status >= CL_COMPLETE && status <= CL_QUEUED && "invalid status");

  CallBackEntry* entry = new CallBackEntry(status, callback, data);
  if (entry == NULL) {
    return false;
  }

  entry->next_ = callbacks_; //将entry加入链表头部
  while (!callbacks_.compare_exchange_weak(entry->next_, entry))
    ;  // Someone else is also updating the head of the linked list! reload.

  // Check if the event has already reached 'status'
  if (this->status() <= status && entry->callback_ != CallBackFunction(0)) {
    if (entry->callback_.exchange(NULL) != NULL) {
      callback(as_cl(this), status, entry->data_);
    }
  }

  return true;
}

回调函数是在一个链表中，代码先new CallBackEntry对象，然后将entry加入到链表头部，最后通过compare_exchange_weak来更新回调函数的entry。

15. clRetainEvent 、clReleaseEvent

这两者实现与clRetainCommandQueue、clReleaseCommandQueue相同，都是基于ReferenceCountedObject实现，对引用计数进行加减

clReleaseEvent同理

16. clEnqueueMarkerWithWaitList

clEnqueueMarkerWithWaitList函数用于在命令队列中创建一个标记事件，并可以指定等待的事件列表。它的实现位于opencl/amdocl/cl_execute.cpp +632。

在amd::clSetEventWaitList中主要是将cl_event转换为了amd::Event 并放到eventWaitList中，后续在将eventWaitList作为amd::Marker的构造参数传入，最终执行命令的入队操作

Marker在命令队列中排队的命令，可用于标记命令队列中在标记命令之前排队的所有命令。标记命令会返回一个事件，应用程序可利用该事件对标记事件进行排队等待，即等待标记命令之前排队的所有命令完成。

在 OpenCL 中，Marker 是一种特殊的事件对象，用于标记命令队列中的特定点。Marker 本身不执行任何操作，但可以用于在命令队列中创建一个点，以便稍后在该点之前或之后插入其他命令，并通过其他事件等待 Marker。Marker 的主要作用是帮助管理命令队列中的命令执行顺序，以及为异步操作建立同步点。

17. clEnqueueBarrierWithWaitList

clEnqueueBarrierWithWaitList 函数用于在命令队列中创建一个屏障，用于确保在此屏障之前提交的命令完成后才执行屏障之后的命令，并可以指定等待的事件列表。

此函数与clEnqueueMarkerWithWaitList实现相似，只是在new amd::Marker传入的构造参数不同

18. clGetEventProfilingInfo

clGetEventProfilingInfo函数用于获取与事件相关的性能分析信息，例如事件的执行时间。

实现位于opencl/amdocl/cl_execute.cpp +882

它根据传入的param_name不同，在switch中返回对应的地址指针

19. clFlush

clFlush 函数用于将命令队列中的所有未执行的命令提交到设备中进行执行。它确保在之后的命令中可以看到之前提交的命令的效果。但是，它并不等待命令执行完成。

20. clFinish

clFinish 函数用于等待命令队列中的所有命令执行完成，然后再继续执行程序。它会阻塞当前线程，直到所有在指定命令队列中排队的命令都已执行完毕。

finish实现位于rocclr/platform/commandqueue.cpp +116，在finish()中主要进行了以下操作：

提交了一个Marker命令
根据cpu_wait等条件调用awaitCompletion() 等待命令执行完毕
调用release()减少command的引用计数

光明顶魔镜工程师

关注

10
点赞
踩
9

收藏

觉得还不错? 一键收藏
2
评论
AMD ROCM软件栈 -clr项目分析

项目层级结构fill:#333;color:#333;color:#333;fill:none;AMD CLR：Comgr 库提供用于编译和检查 AMDGPU 代码对象的应用程序接口CLR项目目录结构。
复制链接

扫一扫